cb_explore input format: purpose of providing the probability value in training - vowpalwabbit

The cb_explore input format requires specifying action:cost:action_probability for each example.
However, the cb algorithms are already trying to learn the optimal policy, i.e. the probability of each action, from the data. Why, then, does the input need the probability of each action? Is it just for initialization?
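For reference, a single logged training example in this format might look like the line below (the feature names are invented for illustration): the chosen action carries the action:cost:probability label, followed by the features after the | separator.

1:2:0.4 | feature_a feature_b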

If I understand correctly, you are asking why the label associated with cb_explore is a set of action/probability pairs.
The probability of the label action is used as an importance weight for training. This has the effect of amplifying the updates for actions that are played less frequently, making them less likely to be drowned out by actions played more frequently.
This type of label is also very useful at predict time, because it generates a log that can be used to perform unbiased counterfactual analyses. In other words, by logging the probability of playing each of the actions before sampling (see cb_sample, which implements how a single action/probability vector is sampled, as for example in the ccb reduction: https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/cb_sample.cc#L37), we can then use the log to train another policy and compare how it performs against the original.
See the "A Multi-World Testing Decision Service" paper for a description of how to do unbiased offline experimentation with logged data: https://arxiv.org/pdf/1606.03966v1.pdf
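As a rough illustration of why the logged probability matters, here is a minimal sketch (plain Python, not VW code; the field names and helper policy are assumptions for illustration) of the inverse-propensity-scoring estimate such logs enable: each logged round's reward is re-weighted by 1/probability whenever the new policy would have chosen the logged action.

def ips_value(logged_rounds, new_policy):
    """Estimate the average reward new_policy would have obtained,
    using only rounds logged by the old (exploration) policy.

    logged_rounds: iterable of (context, action, cost, probability) tuples,
                   where probability is the chance the old policy played action.
    new_policy:    function mapping a context to an action.
    """
    total, n = 0.0, 0
    for context, action, cost, prob in logged_rounds:
        n += 1
        if new_policy(context) == action:
            # count the reward (-cost) only when the policies agree,
            # re-weighted by the inverse of the logged probability
            total += -cost / prob
    return total / n if n else 0.0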

Related

Assessing doc2vec accuracy

I am trying to assess a doc2vec model based on the code from here. Basically, I want to know the percentage of inferred documents that are found to be most similar to themselves. This is my current code:
import collections

docs = 0
ranks = []
for doc_id, doc in enumerate(cur.execute('SELECT Text FROM Patents')):
    docs += 1
    doc = clean_text(doc)
    inferred_vector = model.infer_vector(doc)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    # rank 0 means the document's own trained vector is the nearest neighbour
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
counter = collections.Counter(ranks)
accuracy = counter[0] / docs
This code works perfectly with smaller datasets. However, since I have a huge file with millions of documents, this code becomes too slow; it would take months to compute. I profiled my code and most of the time is consumed by the following line: sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs)).
If I am not mistaken, this is having to compare each document to every other document. I think computation time might be massively reduced if I change this to topn=1 instead, since the only thing I want to know is whether the most similar document is itself or not. Doing this would basically take each doc (i.e., inferred_vector), find its most similar document (i.e., topn=1), and then I would just see whether it is itself or not. How could I implement this? Any help or idea is welcome.
To have most_similar() return only the single most-similar document, that is as simple as specifying topn=1.
However, to know which one document of the millions is the most-similar to a single target vector, the similarities to all the candidates must be calculated & sorted. (If even one document was left out, it might have been the top-ranked one!)
Making sure absolutely no virtual-memory swapping is happening will help ensure that brute-force comparison happens as fast as possible, all in RAM – but with millions of docs, it will still be time-consuming.
What you're attempting is a fairly simple "self-check" as to whether training led to a self-consistent model: whether re-inferring a document creates a vector "very close to" the doc-vector for that same document left over from bulk training. Failing that would indicate some big problems in doc-prep or training, but it's not a true measure of the model's "accuracy" for any real task, and the model's value is best evaluated against your intended use.
Also, because this "re-inference self-check" is just a crude sanity check, there's no real need to do it for every document. Picking a thousand (or ten thousand, or whatever) random documents will give you a representative idea of whether most of the re-inferred vectors have this quality, or not.
Similarly, you could simply check the similarity of the re-inferred vector against the single in-model vector for that same document-ID, and check whether they are "similar enough". (This will be much faster, but could also be done on just a random sample of docs.) There's no magic proper threshold for "similar enough"; you'd have to pick one that seems to match your other goals. For example, using scikit-learn's cosine_similarity() to compare the two vectors:
from sklearn.metrics.pairwise import cosine_similarity
# ...
inferred_vector = model.infer_vector(doc_words)
looked_up_vector = model.dv[doc_id]
self_similarity = cosine_similarity([inferred_vector], [looked_up_vector])[0]
# then check that value against some threshold
(You have to wrap the single vectors in lists as arguments to cosine_similarity(), then access the 0th element of the return value, because it is designed to usually work on larger lists of vectors.)
With this calculation, you wouldn't know if, for example, some of the other stored-doc-vectors are a little closer to your inferred target - but that may not be that important, anyway. The docs might be really similar! And while the original "closest to itself" self-check will fail miserably if there were major defects in training, even a well-trained model will likely have some cases where natural model jitter prevents a "closest to itself" for every document. (With more documents inside the same number of dimensions, or certain corpuses with lots of very-similar documents, this would become more common... but not be a concerning indicator of any model problems.)
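Putting those two suggestions together, here is a minimal sketch of the sampled self-similarity check. It assumes a gensim 4.x model whose per-document vectors live in model.dv, documents tagged with their integer position, and a hypothetical get_document_text(doc_id) helper plus your own clean_text() for fetching and tokenizing the raw text.

import random
from sklearn.metrics.pairwise import cosine_similarity

SAMPLE_SIZE = 1000   # a random subset is enough for a sanity check
THRESHOLD = 0.8      # "similar enough" cutoff; pick one that matches your goals

sample_ids = random.sample(range(len(model.dv)), SAMPLE_SIZE)
passed = 0
for doc_id in sample_ids:
    doc_words = clean_text(get_document_text(doc_id))   # hypothetical text lookup
    inferred_vector = model.infer_vector(doc_words)
    looked_up_vector = model.dv[doc_id]
    sim = cosine_similarity([inferred_vector], [looked_up_vector])[0][0]
    if sim >= THRESHOLD:
        passed += 1

print(f"{passed / SAMPLE_SIZE:.1%} of sampled docs re-infer close to their trained vectors")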

How many simulations do I need to do?

Hello, my problem is more related to the validation of a model. I have written a program in NetLogo that I am going to use in a report for my thesis, but now the question is: how many repetitions (simulations) do I need to do to justify my results? I have already read some methods using a statistical approach, and my colleagues have suggested some nice mathematical operations, but I also want to know from people who work with computational models what kind of statistical test or mathematical method they use for that.
There are two aspects to this: (1) how many parameter combinations, and (2) how many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well-known Schelling segregation model as an example: you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete values), or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely determined by how finely you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else? If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2, then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
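As a minimal sketch of that calculation in plain Python (the example run results are made up):

import math
import statistics

def confidence_interval_95(results):
    """Return (mean, half_width) of a 95% confidence interval for the mean
    outcome, given the results of repeated runs at one parameter setting."""
    mean = statistics.mean(results)
    sem = statistics.stdev(results) / math.sqrt(len(results))   # standard error of the mean
    return mean, 1.96 * sem

runs = [124.9, 125.1, 124.8, 125.2, 125.0, 124.9, 125.1, 125.0, 124.8, 125.2]
mean, half_width = confidence_interval_95(runs)
print(f"{mean:.2f} +/- {half_width:.2f}")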
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books by Hastie and Tibshirani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
especially the sections on resampling methods (cross-validation and bootstrap).
They also have a shorter book that covers the methods possibly relevant to your case, along with the R commands to run them. However, this book, as far as I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, you could perturb the initial conditions to see whether the outcome changes after small perturbations of the initial conditions or parameters. On a larger scale, you can sometimes partition the parameter space according to the final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variation Cv = s / u, where s and u are the standard deviation and mean of the result, respectively. It is explained in detail in this paper: Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide rigorous analysis methods and refer to other papers which may be relevant to your question and your research.
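As a minimal sketch of point (1) above (it assumes a hypothetical run_model(params) function that performs one stochastic simulation and returns the output of interest), you could keep adding runs until the coefficient of variation drops below a tolerance of your choosing:

import statistics

def runs_until_stable(run_model, params, cv_tolerance=0.05, min_runs=10, max_runs=1000):
    """Repeat the simulation until the coefficient of variation (stdev / mean)
    of the collected results falls below cv_tolerance."""
    results = []
    while len(results) < max_runs:
        results.append(run_model(params))
        if len(results) >= min_runs:
            mean = statistics.mean(results)
            cv = statistics.stdev(results) / abs(mean) if mean != 0 else float("inf")
            if cv < cv_tolerance:
                break
    return results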

Understanding Perceptrons

I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights if a point is misclassified until none are misclassified anymore. I guess what I'm having trouble understanding is:
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the training data?
Think of a Perceptron as a young child. You want to teach the child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shapes) while telling it what it sees at every turn ("this is an apple, this is an orange"). Assuming the child has perfect memory, it will learn to understand what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shape) without you actually telling it. This is what a Perceptron does. After you have shown it all the examples, you start again from the beginning; this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new: a green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as before, from training? Because the child has perfect memory, it will only tell you what you told it. You won't see how well it generalizes from known to unseen data unless you test it on data it never saw during training. If the child performs horribly on the test data but scores 100% on the training data, you will know that it has learned nothing: it simply memorized your examples without understanding what makes an apple an apple, because you trained it too long and gave it too many details. This is called overfitting. To prevent your Perceptron from only (!) recognizing training data, you'll have to stop training at a reasonable time and find a good balance between the size of the training and testing set.
How do I know if a point is misclassified?
If it's different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading into Single/Multi-Layer Perceptrons and how Neural Networks of multiple Perceptrons work). The network will take your input. How it's coded is irrelevant for this; let's say the input is a string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1).....}. Since you know the class beforehand, the network should output 0 for the input "apple1". If it instead outputs 1, you compute (targetValue - actualValue) = (0 - 1) = -1; a non-zero value means the network gave a wrong output. Compare this to the delta rule and you will understand that this small difference is part of the larger update equation. Whenever you get a non-zero value you perform a weight update. If the target and actual value are the same, you will always get 0, and you know that the network didn't misclassify.
How do I go about choosing test points, training points, threshold or a bias?
Practically the bias and threshold isn't "chosen" per se. The bias is trained like any other unit using a simple "trick", namely using the bias as an additional input unit with value 1 - this means the actual bias value is encoded in this additional unit's weight and the algorithm we use will make sure it learns the bias for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron with a binary output (between 0 and 1), it's a good start to put the threshold at 0.5, since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: this is quite difficult, and you do it by experience. Where you're at, you start off by implementing simple logical functions like AND, OR, XOR etc. There it's trivial: you put everything in your training set and test with the same values as your training set (since for x XOR y etc. there are only 4 possible inputs: 00, 10, 01, 11). For complex data like images, audio etc. you'll have to try and tweak your data and features until you feel like the network can work with them as well as you want it to.
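To make the update rule and the bias-as-extra-input trick concrete, here is a minimal sketch of a single perceptron with a step activation (one common formulation under the assumptions above; it is not the required homework solution):

import random

def train_perceptron(data, learning_rate=0.1, max_epochs=100):
    """data: list of (features, target) pairs with target 0 or 1.
    The bias is handled as an extra input fixed at 1, so its value is
    learned as just another weight."""
    n = len(data[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]   # last weight = bias
    updates = 0
    for epoch in range(max_epochs):
        errors = 0
        for features, target in data:
            x = list(features) + [1.0]                            # append the bias input
            activation = sum(w * xi for w, xi in zip(weights, x))
            output = 1 if activation >= 0 else 0                  # step activation
            delta = target - output                               # 0 when classified correctly
            if delta != 0:
                errors += 1
                updates += 1
                weights = [w + learning_rate * delta * xi for w, xi in zip(weights, x)]
        if errors == 0:                                           # converged: data was separable
            break
    return weights, updates, epoch + 1

# Example: AND is linearly separable, so this converges quickly
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, n_updates, n_epochs = train_perceptron(and_data)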
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the training data and split it around 80/20. You train on the 80% and test against the remaining 20%.
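For example, a minimal sketch of that split with scikit-learn's train_test_split (the points and labels below are made up):

from sklearn.model_selection import train_test_split

X = [(0.5, 1.2), (2.3, 0.7), (1.1, 3.4), (4.0, 2.2), (0.3, 0.9),
     (3.1, 1.8), (2.9, 3.3), (1.7, 0.4), (3.6, 2.9), (0.8, 2.1)]   # 2-D points
y = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]                                 # their class labels

# train on 80% of the labelled points, test on the remaining 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)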

Gesture detection algorithm based on discrete points

I am trying to solve the problem of matching a human generated gesture with a known gesture. The human generated gesture will be represented by a sequence of points that will need to be interpolated into a path and compared to the existing path. The image below shows what I am trying to compare
Can you please help point me in the right direction with resources or concepts that I can read into to build an algorithm to match these two paths? I have no experience in doing this before so any insights will be appreciated.
Receiving input
Measure input on some interval. Every xx milliseconds, measure the coordinates of the user's hand/finger/stylus.
Storing patterns and input
Patterns (expected input)
Modify the pattern. It's currently a continuous "function," but measuring input as such is difficult. Use discrete points at some interval. This interval can be very short, depending on how accurate you require gestures to be. In fact, it should be very short; the more points to compare against, the better (I'll explain this a little better in the next section).
Input (received from user)
When input is measured, the input-measurement interval needs to be short enough that each received consecutive pair of input points is close enough to compare to the expected points.
Imagine that the user performs some gesture very quickly (and completes it in the time your input-reader reads only three frames). The pattern and input cannot be reliably compared:
To avoid this, your input-reader must have a relatively short interval. However, this probably isn't a huge concern, since most hardware can read even the fastest human gestures.
Back to patterns: they should always be detailed enough to include more points than any possible input. More expected points allow for better accuracy. If a user moves slowly, the input will have more points; if they move quickly, the input will have fewer.
Consider this: completing a single gesture gives you half as many input frames as the pattern includes. The user has moved at a "normal" speed, so, to simplify the algorithm, you can "dumb down" your pattern by a factor of 2, then compare input coordinates to pattern coordinates directly.
This method is easier than the alternative that comes to mind (see next section).
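As a minimal sketch of that "dumbing down" step (the function names are my own, treating a gesture path as a list of (x, y) tuples):

def downsample(points, factor):
    """Keep every factor-th point of a gesture path."""
    return points[::factor]

def match_lengths(pattern, received):
    """Reduce the denser pattern so it has roughly as many points as the input."""
    factor = max(1, round(len(pattern) / len(received)))
    return downsample(pattern, factor)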
Pattern "density" (coordinate frequency)
If you have a small number of expected points, you'll have to make approximations to match input.
Here's an "extreme" example, but it proves the concept. Given this pattern and input:
Point 3r can't be reliably compared with point 2 or point 3, so you'd have to use some function of points 2, 3, and 3r, to determine if 3r is on the correct path. Now consider the same input, but where the pattern has higher density:
Now, you don't have to compromise, since 3r is essentially definitely on the gesture's pattern. A slight reduction in the pattern's density causes it to match input quite well.
Positioning
Relative positioning
Instead of comparing absolute positions (such as on a touchscreen), you probably want the gesture to be allowed anywhere in some plane of space. To that end, you must relate the start point of the input to some coordinate system.
Normalization
To be user-friendly, allow gestures to be done in a range of "sizes". You don't want to compare raw data, because chances are the size of the plane of the input doesn't match the size of the plane of the pattern.
Normalize the input in the x- and y-direction to match the size of your pattern. Do not maintain aspect ratio.
Relate the input to a coordinate system, as per previous bullet
Find the largest horizontal and vertical distance between any two input points (call them RecMaxH and RecMaxV)
Find the largest horizontal and vertical distance between any two pattern points (call them ExpMaxH and ExpMaxV)
Multiply all input points' x-coordinates by ExpMaxH/RecMaxH
Multiply all input points' y-coordinates by ExpMaxV/RecMaxV
You now have two more-similar sets of points that can be compared. Normalization can be much more detailed than this; for instance, you could normalize sets of 3 points at a time to get incredibly similar images (but you would probably have to do this for each pattern, then compare the sum of all differences to find the most likely matching pattern).
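A minimal sketch of those normalization steps, again treating a gesture as a list of (x, y) tuples (the function names are my own):

def bounding_ranges(points):
    """Largest horizontal and vertical distance between any two points."""
    xs = [x for x, y in points]
    ys = [y for x, y in points]
    return max(xs) - min(xs), max(ys) - min(ys)

def normalize_to_pattern(received, pattern):
    """Translate the input so it starts at the origin, then scale it
    independently in x and y to match the pattern's extents."""
    x0, y0 = received[0]                                  # relate input to a coordinate system
    rec_max_h, rec_max_v = bounding_ranges(received)
    exp_max_h, exp_max_v = bounding_ranges(pattern)
    sx = exp_max_h / rec_max_h if rec_max_h else 1.0      # ExpMaxH / RecMaxH
    sy = exp_max_v / rec_max_v if rec_max_v else 1.0      # ExpMaxV / RecMaxV
    return [((x - x0) * sx, (y - y0) * sy) for x, y in received]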
I suggest storing all gestures' patterns as graphs of the same size; that reduces computation when measuring closeness of input to possible pattern matches.
When to measure input
User-driven
Imagine a button that, when clicked/activated, causes your program to begin measuring inputs. This would be similar to Google's Voice Search, which doesn't constantly record and search; instead, you say "Ok Jarvis" or click the handy microphone icon and begin speaking your query.
Benefits:
Simplifies algorithm
Prevents user from unintentionally triggering an event. Imagine if every word you spoke was analyzed and sent to Google as part of a search query. Sometimes you just don't mean to do anything.
Drawbacks:
Less user-friendly. User must go out of his/her way to trigger recording for gestures.
If you're writing, for instance, a gesture-search (ridiculous example), this is probably the better method to implement. Nobody wants every move they make interpreted as an action in your application. However, if you're writing a Kinect-style or gesture-based game, you probably want to be constantly recording and looking for gestures.
Constant
Your program constantly records gesture coordinates at the specified interval (this could be reduced to "record if there's movement, otherwise don't store coordinates"). You must make a decision: how many "frames" will you record until deciding that the currently-stored motion is not a recognized gesture?
Store coordinates in a buffer: a queue 1.5 or 2 (to be cautious) times as long as the largest number of frames you're willing to record.
Once you determine that there exists in this buffer a sequence of frames that match a pattern, execute that gesture's result, and clear the queue.
If there's the possibility that the next gesture is an "option" for the most-recent gesture, record the application state as "currently waiting on option for ____ gesture," and wait for the option to appear.
If it's determined that the first x frames in the buffer cannot possibly match a pattern (because of their sequence or positioning), remove them from the queue.
Benefits:
Allows for more dynamic handling of gestures
User input recognized automatically
Drawbacks:
More complex algorithm
Heavier computation
If you're writing a game that runs based on real-time input, this is probably the right choice.
Algorithm
If you're using user-driven recognition:
Record all input in the allowed timeframe (or until the user signifies that they're done)
To evaluate the input, reduce the density of your pattern to match that of the input
Relate the input to a coordinate system
Normalize input
Use a method of function comparison (looseness of this calculation is up to you: standard deviation, variance, total difference in values, etc.), and choose the least-different possibility.
If no possibility is similar enough to meet your required threshold (you must decide this), don't accept the input.
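A minimal sketch of that comparison step for the user-driven case, reusing the match_lengths and normalize_to_pattern helpers sketched earlier and using mean squared point-to-point distance as the "difference" measure (any of the measures above would do):

def difference(a, b):
    """Mean squared distance between two equally dense (x, y) sequences."""
    n = min(len(a), len(b))
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(a[:n], b[:n])) / n

def recognize(received, patterns, threshold):
    """patterns: dict mapping gesture name -> list of (x, y) points.
    Returns the best-matching gesture name, or None if nothing is close enough."""
    best_name, best_score = None, float("inf")
    for name, pattern in patterns.items():
        reduced = match_lengths(pattern, received)        # match pattern density to input
        candidate = normalize_to_pattern(received, reduced)
        score = difference(candidate, reduced)
        if score < best_score:
            best_name, best_score = name, score
    return best_name if best_score <= threshold else None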
If you're using constant measuring:
In your buffer, treat the sequence of max_sequence_size (you decide) beginning at every multiple of frame_multiples (you decide) as a possible gesture. For instance, if all of my possible gestures are at most 20 frames long, and I believe every 5 frames a new gesture could be starting (and I won't lose any critical data in those 5 frames), I'll compare each portions of the buffer to all possible gestures (portions from 0-19, 5-24, 10-29, etc.). This is heavier computing when frame_multiples decreases. For perfect measurement, frame_multiples is 1 (but this is likely not reasonable).
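A minimal sketch of that sliding-window scan over the buffer, reusing the hypothetical recognize() helper above (the constants are placeholders you would tune):

from collections import deque

MAX_SEQUENCE_SIZE = 20    # longest gesture, in frames
FRAME_MULTIPLES = 5       # how often a new gesture may start
buffer = deque(maxlen=2 * MAX_SEQUENCE_SIZE)

def on_new_frame(point, patterns, threshold):
    """Call once per measured (x, y) coordinate; returns a gesture name or None."""
    buffer.append(point)
    frames = list(buffer)
    # treat each window starting at a multiple of FRAME_MULTIPLES as a candidate gesture
    for start in range(0, max(1, len(frames) - MAX_SEQUENCE_SIZE + 1), FRAME_MULTIPLES):
        window = frames[start:start + MAX_SEQUENCE_SIZE]
        if len(window) < MAX_SEQUENCE_SIZE:
            break
        gesture = recognize(window, patterns, threshold)
        if gesture is not None:
            buffer.clear()                                # gesture found: execute and reset
            return gesture
    return None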
Hope you've enjoyed reading this answer as much as I enjoyed writing it. I've never done this before, but you've piqued my interest in a way that doesn't often happen. Please edit and improve my answer! If there's a portion that seems incomplete, add to it. I'm very curious in (particularly, more-experienced) responses and criticism.

Binary values in Collaborative Filtering

Can the values in the User-Item matrix be binary values like 0 and 1, indicating "didn't buy" vs. "bought"?
And if I apply a latent factor model to the matrix, can the predicted value (for example 0.8) stand for the probability of the user's behavior (i.e. didn't buy or bought)?
Yes, it is quite common to have implicit feedback represent ratings. One slight pitfall with the suggestion you made is that a 0 is ambiguous: it could mean the user saw the item but chose not to buy it, or that the user never even saw the item (i.e. gave no feedback).
Typically the value output from your recommendation algorithm isn't a probability of a purchase, but rather a numerical score used to rank that item versus all other potential items. This way you can identify the top X items to recommend to a user.
You can use standard collaborative filtering on the type of data you discussed, as well as factorisation techniques.
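As a minimal illustration of the "score, then rank" idea (a toy latent factor model with random made-up factors, not a fitted recommender):

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 6, 3

# in practice these factors come from factorising the binary user-item matrix
# (e.g. an implicit-feedback matrix factorisation); here they are just random
user_factors = rng.normal(size=(n_users, n_factors))
item_factors = rng.normal(size=(n_items, n_factors))

scores = user_factors @ item_factors.T                   # one score per (user, item) pair
top_k = 3
for user in range(n_users):
    ranked_items = np.argsort(-scores[user])[:top_k]     # highest-scoring items first
    print(f"user {user}: recommend items {ranked_items.tolist()}")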
