How to demo Vowpal Wabbit's contextual bandits in real online mode? [closed] - vowpalwabbit

Following the available docs and resources, it is not really clear how to accomplish a simple getting-started flow where you launch Vowpal Wabbit as a daemon (possibly even without any pre-learnt model) and have it learn and explore online ― I'm looking for a flow where I'd feed in a context, get back a recommendation, and feed back a cost/reward.
So let me skip the technical descriptions of what's been tried and simply ask for a clear demonstration regarding what I might consider essential in this vein ―
How can I demonstrate, through a daemon, that learning is taking place ― not in offline mode from batch data but purely from online interaction? Any good suggestions?
How do I report back a cost/reward following a selected action, in daemon mode? Once per action? In bulk? And either way, how?
Somewhat related ― would you recommend a live system to use the daemon for contextual bandits, or rather one of the language APIs?
Can you alternatively point at where the server code sits inside the gigantic code base? It could be a good place to start systematically exploring from.
I typically get a distribution (of size equal to the number of allowed actions) as a reply for every input sent ― typically the same distribution regardless of what I sent in. Maybe it takes a whole learning epoch with the default --cb_explore algorithm; I wouldn't know, and I'm not sure the epoch duration can be set from outside.
I understand that a lot has been put into enabling learning from past interactions and from cbfied data. However, I think there should also be some available explanation covering the more-or-less pragmatic essentials above.
Thanks so much!

Here it goes. This flow only requires a subset of the Vowpal Wabbit input format. First, after a successful installation, we start a Vowpal Wabbit daemon:
vw --cb_explore 2 --daemon --port 26542 --save_resume
In the above, we tell VW to start a Contextual Bandit model-serving daemon, without any upfront training having been provided through old policy data. The model will be VW's default contextual bandits model, and, as specified above, it will assume just two actions to choose from. Vowpal will initially suggest actions at random, and will over time approach the optimal policy.
Let's just check the daemon is up: pgrep 'vw.*' should return a list of processes.
At any time later if we wanted to stop the daemon and start it again we would simply pkill -9 -f 'vw.*--port 26542'.
Now let us simulate decision points and the costs obtained for the actions taken. In the following I dispatch messages to the daemon from the terminal, but you can exercise this with a tool like Postman or from your own code:
echo " | a b " | netcat localhost 26542
Here we just told Vowpal to suggest what action we should take for a context comprising the feature set (a, b).
Vowpal succinctly replies not with a chosen action, but with a probability distribution over the two actions our model was instructed to choose from:
0.975000 0.025000
These are of course only the result of initialization, as the model hasn't seen any costs yet! Our application using Vowpal is now expected to sample an action at random according to this distribution ― this part is not implemented by Vowpal but left to application code. The Contextual Bandits model relies on us sampling from this distribution when choosing the action to be played against the environment ― if we do not follow this expectation, the algorithm may not accomplish its learning.
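As an illustration, here is a minimal Python sketch of that sampling step, assuming numpy is available (the PMF values are the ones returned above):

import numpy as np

pmf = [0.975, 0.025]                             # distribution returned by the daemon, one entry per action
actions = [1, 2]                                 # VW numbers actions starting at 1
action = int(np.random.choice(actions, p=pmf))   # sample an action according to the distribution
prob = pmf[action - 1]                           # keep the probability of the sampled action for the feedback step
print(action, prob)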
So imagine we sampled from this distribution and got action 1, then executed that action in the real-world environment (for the same context a b we asked Vowpal to recommend for). Imagine we got back a cost of 0.7 this time. We have to communicate this cost back to Vowpal as feedback:
echo " 1:0.7:1 | a b " | netcat localhost 26542
Vowpal got our feedback, and gives us back its updated prediction for this context:
0.975000 0.025000
We don't care about it right now unless we wish to get a recommendation for the exact same context again, but we get its updated recommendation anyway.
Obviously it's the same recommendation as before, as our single piece of feedback so far isn't enough for the model to learn anything. Repeat this process many times, for many different contexts, and the predictions returned from Vowpal will adapt and change as the model shifts towards what it has learned.
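To make the repetition concrete, below is a hedged Python sketch of the whole online cycle against the daemon started above. The socket handling, the get_cost stub and the context strings are placeholders for your own application logic, and the label sent back follows the action:cost:probability format, using the probability with which the sampled action was chosen:

import socket
import numpy as np

HOST, PORT = "localhost", 26542

def ask_vw(line):
    # send one VW-format line to the daemon and return its one-line reply
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall((line + "\n").encode())
        return s.makefile().readline().strip()

def get_cost(context, action):
    # placeholder: play the action against the real environment and observe a cost
    return 0.7

for context in ["a b", "a c", "b c"]:                # your real decision points go here
    pmf = [float(p) for p in ask_vw("| " + context).split()]
    actions = list(range(1, len(pmf) + 1))
    action = int(np.random.choice(actions, p=pmf))   # sample, don't just take the argmax
    cost = get_cost(context, action)
    prob = pmf[action - 1]
    ask_vw(f"{action}:{cost}:{prob} | {context}")    # report the observed cost back as feedback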
Note I mention costs and not rewards here because, unlike much of the literature on the algorithms implemented in Vowpal, the command-line version at least takes costs as feedback, not rewards.

Related

How does `vw --audit` internally compute the weights of the features?

In vowpalwabbit there is an option --audit that prints the weights of the features.
If we have a vw contextual bandit model with four arms, how is this feature weight created?
From what I understand, vowpalwabbit tries to fit one linear model to each arm.
So if weights were calculated using an average across all the arms, then they would correlate with getting a reward generally, instead of showing which features make the model pick one variant over another.
I am interested in finding out how they are calculated so that I can interpret the results obtained. I tried searching its Github repository but could not find anything meaningful.
I am interested in finding out how they are calculated so that I can interpret the results obtained.
Unfortunately knowing the first does not lead to knowing the second.
Your question is concerned with contextual bandits, but it is important to note that interpreting model parameters is an issue that also occurs in supervised learning. Machine learning has made progress recently (i.e., my lifetime) largely by focusing concern on quality of predictions rather than meaningfulness of model parameters. In a blog post, Phoebe Wong outlines the issue while being entertaining.
The bottom line is that our models are not causal, so you simply cannot conclude that "the weight of feature X for arm A is large" means that if I were to intervene in the system and increase this feature value, then I would get more reward for playing arm A.
We are currently working on tools for model inspection that leverage techniques such as permutation importance that will help you answer questions like "if I were to stop using a particular feature how would the frequency of playing each arm change for the trained policy". We're hoping that is helpful information.
Having said all that, let me try to answer your original question ...
In vowpalwabbit there is an option --audit that prints the weights of the features.
If we have a vw contextual bandit model with four arms, how is this feature weight created?
The format is documented here. Assuming you are using --cb (not --cb_adf) then there are a fixed number of arms and so the offset field will increment over the arms. So for an example like
1:2:0.4 |foo bar
with --cb 4 you'll get an audit output with namespace of foo, feature of bar, and offset of 0, 1, 2, and 3.
Interpreting the output when using --cb_adf is possible but difficult to explain succinctly.
From what I understand, vowpalwabbit tries to fit one linear model to each arm.
Shorter answer: With --cb_type dm, essentially VW independently tries to predict the average reward for each arm using only examples where the policy played that arm. So the weight you get from audit at a particular offset N is analogous to what you would get from a supervised learning model trained to predict reward on a subset of the historical data consisting solely of times the historical policy played arm N. With other --cb_type settings the interpretation is more complicated.
Longer answer: "Linear model" refers to the representation being used. VW can incorporate nonlinearities into the model, but let's ignore that for now. "Fit" is where some important details are. VW takes the partial feedback information of a CB problem (partial feedback = "for this example you don't know the reward of the arms not pulled") and reduces it to a full feedback supervised learning problem (full feedback = "for this example you do know the reward of all arms"). The --cb_type argument selects the reduction strategy. There are several papers on the topic; a good place to start is Dudik et al. and then look for papers that cite this paper. In terms of code, ultimately things are grounded here, but the code is written more for performance than intelligibility.
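To give some intuition for the --cb_type dm case described above ― and this is only a toy analogy in scikit-learn, not VW's actual code or reduction machinery ― one could picture something like this:

import numpy as np
from sklearn.linear_model import SGDRegressor

# logged bandit data: context features, arm actually played, observed cost
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
arm = np.array([0, 1, 0, 1])
cost = np.array([0.2, 0.9, 0.1, 0.5])

# one regressor per arm, each trained only on the examples where that arm was played
models = {}
for a in (0, 1):
    mask = arm == a
    models[a] = SGDRegressor(max_iter=1000).fit(X[mask], cost[mask])

# the resulting policy picks the arm with the lowest predicted cost for a new context
new_x = np.array([[1.0, 0.5]])
best_arm = min(models, key=lambda a: models[a].predict(new_x)[0])
print(best_arm)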

Contextual Bandit using Vowpal Wabbit

In this case, one of the inputs is the probability of choosing an arm/action but how do we find that probability?
Isn't finding that probability itself a big task in hand?
Supplying the probability means you are handling a scenario where you are feeding in actions taken historically, e.g. from a log, rather than performing the real online scenario. This is useful because (at least some of) Vowpal's Contextual Bandits models can be bootstrapped from historical data ― meaning a Contextual Bandits policy learnt over historical data can outperform one that learns online from scratch, something you can only do if you have historical data relevant to your online scenario.
The Wiki page has been recently edited to better reflect that this format generalizes for this case.
Another (contrived) use case for including probabilities might be that you are acting against multiple environments, but in any event to the best of my understanding the probability here can be interpreted as a mere frequency.
As such, my understanding is you do not have to supply the probability part in your input when not feeding in historical interaction data. Just skip it, as in the example here.
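As a hedged illustration (the feature names are made up), the first line below is a historical example carrying the action:cost:probability label, while the second is a plain online prediction request for the same context, with no label at all:

1:0.5:0.3 | user_age=25 genre=rock
| user_age=25 genre=rock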

What is tuning in machine learning? [closed]

I am a novice learner of machine learning and am confused by tuning.
What is the purpose of tuning in machine learning? To select the best parameters for an algorithm?
How does tuning work?
Without getting into a technical demonstration that would seem appropriate for Stack Overflow, here are some general thoughts. Essentially, one can argue that the ultimate goal of machine learning is to make a machine system that can automatically build models from data without requiring tedious and time-consuming human involvement. As you recognize, one of the difficulties is that learning algorithms (e.g. decision trees, random forests, clustering techniques, etc.) require you to set parameters before you use the models (or at least to set constraints on those parameters). How you set those parameters can depend on a whole host of factors. That said, your goal is usually to set those parameters to optimal values that enable you to complete a learning task in the best way possible. Thus, tuning an algorithm or machine learning technique can simply be thought of as the process one goes through to optimize the parameters that impact the model, so that the algorithm performs its best (once, of course, you have defined what "best" actually is).
To make it more concrete, here are a few examples. If you take a clustering algorithm like k-means, you will note that you, as the programmer, must specify the number of clusters K (or centroids) to be used. How do you do this? You tune the model. There are many ways that you can do this. One of these can be trying many different values of K for a model and looking at how the inter- and intra-group error changes as you vary the number of K's in your model.
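A minimal sketch of that idea, assuming scikit-learn (the toy data and the range of K values are arbitrary stand-ins for your own feature matrix):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # stand-in data

# try several values of K and record the within-cluster error (inertia)
for k in range(2, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, model.inertia_)
# the "elbow" where the error stops dropping sharply is a common choice for K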
As another example, let us consider support vector machine (SVM) classification. SVM classification requires an initial learning phase in which the training data are used to adjust the classification parameters. This really refers to an initial parameter tuning phase where you, as the programmer, might try to "tune" the models in order to achieve high quality results.
Now, you might be thinking that this process can be difficult, and you are right. In fact, because of the difficulty of determining what optimal model parameters are, some researchers use complex learning algorithms before experimenting adequately with simpler alternatives with better tuned parameters.
In the abstract sense of machine learning, tuning is working with / "learning from" variable data based on some parameters which have been identified to affect system performance as evaluated by some appropriate¹ metric. Improved performance reveals which parameter settings are more favorable (tuned) or less favorable (untuned).
Translating this into common sense, tuning is essentially selecting the best parameters for an algorithm to optimize its performance given a working environment such as hardware, specific workloads, etc. And tuning in machine learning is an automated process for doing this.
For example, there is no such thing as a "perfect set" of optimizations for all deployments of an Apache web server. A sysadmin learns from the data "on the job" so to speak and optimizes their own Apache web server configuration as appropriate for its specific environment. Now imagine an automated process for doing this same thing, i.e., a system that can learn from data on its own, which is the definition of machine learning. A system that tunes its own parameters in such a data-based fashion would be an instance of tuning in machine learning.
¹ System performance as mentioned here can be many things, and is much more general than the computers themselves. Performance can be measured by minimizing the number of adjustments needed for a self-driving car to parallel park, or the number of false predictions in autocomplete; or it could be maximizing the time an average visitor spends on a website based on advertisement dimensions, or the number of in-app purchases in Candy Crush.
Cleverly defining what "performance" means in a way that is both meaningful and measurable is key in a successful machine learning system.
A little pedantic but just want to clarify that a parameter is something that is internal to the model (you do not set it). What you are referring to is a hyperparameter.
Different machine learning algorithms have a set of hyperparameters that can be adjusted to improve performance (or make it worse). The most common and maybe simplest way to find the best hyperparameter is through what's known as a grid search (searching across a set of values).
Some examples of hyperparameters include the number of trees for a random forest algorithm, or a value for regularization.
Important note: hyperparameters must be tuned on a validation set held out from the training data, not on the test set. Lots of folks new to machine learning will modify the hyperparameters until they see the best performance on the test dataset ― you are essentially overfitting the hyperparameters to the test set by doing this.
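A minimal sketch of a grid search that respects that split, assuming scikit-learn (the dataset and hyperparameter grid are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# hold out a test set that is never touched while tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# grid search with cross-validation on the training data only
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)              # the tuned hyperparameters
print(grid.score(X_test, y_test))     # evaluated once, on the untouched test set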

Beyond item-to-item recommendations

Simple item-to-item recommendation systems are well-known and frequently implemented. An example is the Slope One algorithm. This is fine if the user hasn't rated many items yet, but once they have, I want to offer more finely-grained recommendations. Let's take a music recommendation system as an example, since they are quite popular. If a user is viewing a piece by Mozart, a suggestion for another Mozart piece or Beethoven might be given. But if the user has made many ratings on classical music, we might be able to make a correlation between the items and see that the user dislikes vocals or certain instruments. I'm assuming this would be a two-part process: the first part is to find correlations between each user's ratings, the second is to build the recommendation matrix from these extra data. So the question is, are there any open-source implementations or papers that can be used for each of these steps?
Taste may have something useful. It's moved to the Mahout project:
http://taste.sourceforge.net/
In general, the idea is that given a user's past preferences, you want to predict what they'll select next and recommend it. You build a machine-learning model in which the inputs are what a user has picked in the past and the attributes of each pick. The output is the item(s) they'll pick. You create training data by holding back some of their choices, and using their history to predict the data you held back.
Lots of different machine learning models you can use. Decision trees are common.
One answer is that any recommender system ought to have some of the properties you describe. Initially, recommendations aren't so good and are all over the place. As it learns tastes, the recommendations will come from the area the user likes.
But, the collaborative filtering process you describe is fundamentally not trying to solve the problem you are trying to solve. It is based on user ratings, and two songs aren't rated similarly because they are similar songs -- they're rated similarly just because similar people like them.
What you really need is to define your notion of song-song similarity. Is it based on how the song sounds? the composer? Because it sounds like the notion is not based on ratings, actually. That is 80% of the problem you are trying to solve.
I think the question you are really asking is: what items are most similar to a given item? Given your item similarity, that's an easier problem than recommendation.
Mahout can help with all of these things, except song-song similarity based on its audio -- or at least provide a start and framework for your solution.
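If the similarity notion is based on item attributes rather than ratings, a minimal sketch might look like this (the feature columns are made-up song attributes, and scikit-learn is assumed):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# one row per song: [has_vocals, beats_per_minute (scaled), is_minor_key]
songs = np.array([
    [0, 0.60, 1],   # Mozart piece
    [0, 0.55, 1],   # Beethoven piece
    [1, 0.90, 0],   # pop song
])

sim = cosine_similarity(songs)              # song-to-song similarity matrix
most_similar = sim[0].argsort()[::-1][1]    # most similar song to song 0 (index 0 is itself)
print(most_similar)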
There are two techniques that I can think of:
Train a feed-forward artificial neural net using Backpropagation or one of its successors (e.g. Resilient Propagation); a minimal sketch appears at the end of this answer.
Use version space learning. This starts with the most general and the most specific hypotheses about what the user likes and narrows them down when new examples are integrated. You can use a hierarchy of terms to describe concepts.
Common characteristics of these methods are:
You need a different function for each user. This pretty much rules out efficient database queries when searching for recommendations.
The function can be updated on the fly when the user votes for an item.
The dimensions along which you classify the input data (e.g. has vocals, beats per minute, musical scales, whatever) are very critical to the quality of the classification.
Please note that these suggestions come from university courses in knowledge based systems and artificial neural nets, not from practical experience.
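As promised above, here is a minimal sketch of the first technique, assuming each item is described by a feature vector and each user gets their own small model (scikit-learn's MLPRegressor stands in for a hand-rolled backpropagation net):

import numpy as np
from sklearn.neural_network import MLPRegressor

# item features for songs this user has rated: [has_vocals, beats_per_minute (scaled), is_classical]
items_rated = np.array([[0, 0.50, 1], [1, 0.90, 0], [0, 0.60, 1]])
ratings = np.array([5.0, 1.0, 4.0])      # this user's past ratings

# one model per user, refit (or updated) as new ratings arrive
model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(items_rated, ratings)

candidate = np.array([[0, 0.55, 1]])     # an unrated item
print(model.predict(candidate))          # predicted rating; recommend the highest-scoring items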

Effort Estimation based on Use Case Points [closed]

As of now I have done effort estimation based on experience and recently using function points.
I am now exploring UCP; I read this article http://www.codeproject.com/KB/architecture/usecasep.aspx and then checked various other articles based on Use Case Points (UCP). I am not able to find out how exactly it works and whether it is correct.
For example, I have a login functionality where user provides userid and password and I check against a table in database to allow or deny login. I define a user actor and Login as a Use Case.
As per UCP I categorize Login use case as Simple and the GUI interface as Complex. As per UCP factor table I get 5 and 3 so the total is 15. After applying the technical factor and environmental factor adjustment it becomes 7. If I take productivity factor as 20 then I am getting 140 hours. But I know it will take at most 30 hrs along with documentation and testing efforts.
Am I doing something wrong in defining the Use Case here? UCP says if the interface is GUI then it's complex, but here the GUI is easy enough, so should I downgrade that factor? Also, the factor for simple is 5; should I define another level as Very Simple? But then am I not complicating the matter here?
Ironically, the prototypical two box logon form is much more complicated than a 2 box CRUD form because the logon form needs to be secure and the CRUD form only needs to save to a database table (and read and update and delete).
A logon form needs to decide where to redirect to, how to cryptographically secure an authentication token, whether and how to cache roles, and whether and how to deal with dictionary attacks.
I don't know what this converts to in UCP points; I just know that the logon screen in my app has consumed much more time than a form with a similar number of buttons and boxes.
Last time I was encouraged to count function points, it was a farce because no one had the time to set up a "function points court" to get rulings on hard to measure things, especially ones that didn't fall neatly into the model that function point counting assumes.
Here's an article talking about Use Case Points - via Normalized Use Case. I think the one factor overlooked in your approach is the productivity, which is supposed to be based on past projects. 20 seems to be the average; HOWEVER, if you are VERY productive (there's a known 10 to 1 ratio of moderate to good programmers), the productivity could be 5, bringing the UCP estimate close to what you think it should be. I would suggest looking at past projects, calculating the UCP, getting the total hours and determining what your productivity really is. Productivity, being a key factor, needs to be calculated for individuals and teams to be used in the estimation effectively.
Part of the issue may be how you're counting transactions. According to the author of UCP, transactions are a "round trip" from the user to the system back to the user; a transaction is finished when the system awaits a new input stimulus. In this case, unless the system is responding...a logon is probably just 1 transaction unless there are several round trips to and from the system.
Check out this link for more info...
http://www.ibm.com/developerworks/rational/library/edge/09/mar09/collaris_dekker/index.html
First, note that in a previous work Ribu stated that the effort for 1 UCP ranges from 15 to 30 hrs (see: http://thoughtoogle-en.blogspot.com/2011/08/software-quotation.html for some details);
Second, it is clear that this kind of estimation, like Function Points, is more accurate when there are many use cases and not just one. You are not considering, for example, project startup, project management, creation of environments, etc., which are all packed into the 20 hours.
I think there is something wrong in your computation: "I get 5 and 3 so the total is 15". UAW and UUCW must be added, not multiplied.

Resources