Using Softmax Explorer (--cb_explore_adf) for Ranking in VowpalWabbit - vowpalwabbit

I am trying to use VW to perform ranking using the contextual bandit framework, specifically using --cb_explore_adf --softmax --lambda X. The choice of softmax is because, according to VW's docs: "This is a different explorer, which uses the policy not only to predict an action but also predict a score indicating the quality of each action." This quality-related score is what I would like to use for ranking.
The scenario is this: I have a list of items [A, B, C, D], and I would like to sort it in an order that maximizes a pre-defined metric (e.g., CTR). One of the problems, as I see, is that we cannot evaluate the items individually because we can't know for sure which item made the user click or not.
To test some approaches, I've created a dummy dataset. As a way to try and solve the above problem, I am using the entire ordered list as a way to evaluate if a click happens or not (e.g., given the context for user X, he will click if the items are [C, A, B, D]). Then, I reward the items individually according to their position on the list, i.e., reward = 1/P for 0 < P < len(list). Here, the reward for C, A, B, D is 1, 0.5, and 0.25, 0.125, respectively. If there's no click, the reward is zero for all items. The reasoning behind this is that more important items will stabilize on top and less important on the bottom.
Also, one of the difficulties I found was defining a sampling function for this approach. Typically, we're interested in selecting only one option, but here I have to sample multiple times (4 in the example). Because of that, it's not very clear how I should incorporate exploration when sampling items. I have a few ideas:
Copy the probability mass function and assign it to copy_pmf. Draw a random number between 0 and max(copy_pmf) and for each probability value in copy_pmf, increment the sum_prob variable (very similar to the tutorial here:https://vowpalwabbit.org/tutorials/cb_simulation.html). When sum_prob > draw, we add the current item/prob to a list. Then, we remove this probability from copy_pmf, set sum_prob = 0, and draw a new number again between 0 and max(copy_pmf) (which might change or not).
Another option is drawing a random number and, if the maximum probability, i.e., max(pmf) is greater than this number, we exploit. If it isn't, we shuffle the list and return this (explore). This approach requires tuning the lambda parameter, which controls the output pmf (I have seen cases where the max prob is > 0.99, which would mean around a 1% chance of exploring. I have also seen instances where max prob is ~0.5, which is around 50% exploration.
I would like to know if there are any suggestions regarding this problem, specifically sampling and the reward function. Also, if there are any things I might be missing here.
Thank you!

That sounds like something that can be solved by conditional contextual bandits
For demo scenario that you are mentioning each example should have 4 slots.
You can use any exploration algorithm in this case and it is going to be done independently per each slot. Learning objective is average loss over all slots, but decisions are made sequentially from the first slot to the last, so you'll effectively learn the ranking even in case of binary reward here.

Related

Jira's Lexorank algorithm for new stories

I am looking to create a large list of items that allows for easy insertion of new items and for easily changing the position of items within that list. When updating the position of an item, I want to change as few fields as possible regarding the order of items.
After some research, I found that Jira's Lexorank algorithm fulfills all of these needs. Each story in Jira has a 'rank-field' containing a string which is built up of 3 parts: <bucket>|<rank>:<sub-rank>. (I don't know whether these parts have actual names, this is what I will call them for ease of reference)
Examples of valid rank-fields:
0|vmis7l:hl4
0|i000w8:
0|003fhy:zzzzzzzzzzzw68bj
When dragging a card above 0|vmis7l:hl4, the new card will receive rank 0|vmis7l:hl2, which means that only the rank-field for this new card needs to be updated while the entire list can always be sorted on this rank-field. This is rather clever, and I can't imagine that Lexorank is the only algorithm to use this.
Is there a name for this method of sorting used in the sub-rank?
My question is related to the creation of new cards in Jira. Each new card starts with an empty sub-rank, and the rank is always chosen such that the new card is located at the bottom of the list. I've created a bunch of new stories just to see how the rank would change, and it seems that the rank is always incremented by 8 (in base-36).
Does anyone know more specifically how the rank for new cards is generated? Why is it incremented by 8?
I can only imagine that after some time (270 million cards) there are no more ranks to generate, and the system needs to recalculate the rank-field of all cards to make room for additional ranks.
Are there other triggers that require recalculation of all rank-fields?
I suppose the bucket plays a role in this recalculation. I would like to know how?
We are talking about a special kind of indexing here. This is not sorting; it is just preparing items to end up in a certain order in case someone happens to sort them (by whatever sorting algorithm). I know that variants of this kind of indexing have been used in libraries for decades, maybe centuries, to ensure that books belonging together but lacking a common title end up next to each other in the shelves, but I have never heard of a name for it.
The 8 is probably chosen wisely as a compromise, maybe even by analyzing typical use cases. Consider this: If you choose a small increment, e. g. 1, then all tickets will have ranks like [a, b, c, …]. This will be great if you create a lot of tickets (up to 26) in the correct order because then your rank fields keep small (one letter). But as soon as you move a ticket between two other tickets, you will have to add a letter: [a, b] plus a new ticket between them: [a, an, b]. If you expect to have this a lot, you better leave gaps between the ranks: [a, i, q, …], then an additional ticket can get a single letter as well: [a, e, i, q, …]. But of course if you now create lots of tickets in the correct order right in the beginning, you quickly run out of letters: [a, i, q, y, z, za, zi, zq, …]. The 8 probably is a good value which allows for enough gaps between the tickets without increasing the need for many letters too soon. Keep in mind that other scenarios (maybe not Jira tickets which are created manually) might make other values more reasonable.
You are right, the rank fields get recalculated now and then, Lexorank calls this "balancing". Basically, balancing takes place in one of three occasions: ① The ranks are exhausted (largest value reached), ② the ranks are due to user-reranking of tickets too close together ([a, b, i] and something is supposed to go in between a and b), and ③ a balancing is triggered manually in the management page. (Actually, according to the presentation, Lexorank allows for up to three letter ranks, so "too close together" can be something like aaa and aab but the idea is the same.)
The <bucket> part of the rank is increased during balancing, so a messy [0|a, 0|an, 0|b] can become a nice and clean [1|a, 1|i, 1|q] again. The brownbag presentation about Lexorank (as linked by #dandoen in the comments) mentions a round-robin use of <buckets>, so instead of a constant increment (0→1→2→3→…) a 2 is increased modulo 3, so it will turn back to 0 after the 2 (0→1→2→0→…). When comparing the ranks, the sorting algorithm can consider a 0 "greater" than a 2 (it will not be purely lexicographical then, admitted). If now the balancing algorithm works backwards (reorder the last ticket first), this will keep the sorting order intact all the time. (This is just a side aspect, that's why I keep the explanation small, but if this is interesting, ask, and I will elaborate on this.)
Sidenote: Lexorank also keeps track of minimum and maximum values of the ranks. For the functioning of the algorithm itself, this is not necessary.

Matching data based on parameters and constraints

I've been looking into the k nearest neighbors algorithm as I might be developing an application that matches fighters (boxers) in the near future.
The reason for my question, is to figure out which would be the best approach/algorithm to use when matching fighters based on multiple parameters and constraints depending on the rule-set.
The relevant properties of each fighter are the following:
Age (Fighters will be assigned to an agegroup (15, 17, 19, elite)
Weight
Amount of fights
Now there are some rulesets for what can be allowed when matching fighters:
A maximum of 2 years in between the fighters (unless it's elite)
A maximum of 3 kilo's difference in weight
Now obviously the perfect match, would be one where all the attendees gets matched with another boxer that fits within the ruleset.
And the main priority is to match as many fighters with each other as possible.
Is K-nn the way to go or is there a better approach?
If so which?
This is too long for a comment.
For best results with K-nn, I would suggest principal components. These allow you to use many more dimensions and do a pretty good job of spreading the data through the space, to get a good neighborhood.
As for incorporating existing rules, you have two choices. Probably, the best way is to build it into you distance function. Alternatively, you can take a large neighborhood and build it into the combination function.
I would go with k-Nearest Neighbor search. Since your dataset is in a low dimensional space (i.e. 3), I would use CGAL, in order to perform the task.
Now, the only thing you have to do, is to create a distance function like this:
float boxers_dist(Boxer a, Boxer b) {
if(abs(a.year - b.year) > 2 || abs(a.weight - b.weight) > e)
return inf;
// think how you should use the 3 dimensions you have, to compute distance
}
And you are done...now go fight!

Binary values in Collaborative Filtering

Can the values in User-Item matrix be binary values like 0 and 1 which indicate “didn’t buy”-vs-“bought”?
And if apply latent factor model on the matrix, can the predicted value (for example 0.8) stand for the probability of user's behavior(i.e. didn’t buy or bought)?
Yes, it is quite common to have implicit feedback to represent ratings. One slight pitfall with the suggestion you made would be if 0 means the user saw the item but chose not to buy it, or the user never even saw the item (i.e gave no feedback.)
Typically the value output from your recommendation algorithm isn't a probability of a purchase, but rather a numerical score used to rank that item versus all other potential items. This way you can identify the top X items to recommend to a user.
You can use standard collaborative filtering on the type of data you discussed, and also using factorisation techniques.

Algorithm to get probability of reaching a goal

Okay, I'm gonna be as detailed as possible here.
Imagine the user goes through a set of 'options' he can choose. Every time he chooses, he get, say, 4 different options. There are many more options that can appear in those 4 'slots'. Each of those has a certain definite and known probability of appearing. Not all options are equally probable to appear, and some options require others to have already been selected previously - in a complex interdependence tree. (this I have already defined)
When the user chooses one of the 4, he is presented another choice of 4 options. The pool of options is defined again and can depend on what the user has chosen previously.
Among all possible 'options' that can ever appear, there are a certain select few which are special, call them KEY options.
When the program starts, the user is presented the first 4 options. For every one of those 4, the program needs to compute the total probability that the user will 'achieve' all the KEY options in a period of (variable) N choices.
e.g. if there are 4 options altogether the probability of achieving any one of them is exactly 1 since all of them appear right at the beginning.
If anyone can advise me as to what logic i should start with, I'd be very grateful.
I was thinking of counting all possible choice sequences, and counting the ones resulting in KEY options being chosen within N 'steps', but the problem is the probability is not uniform for all of them to appear, and also the pool of options changes as the user chooses and accumulates his options.
I'm having difficulty implementing the well defined probabilities and dependencies of the options into an algorithm that can give sensible total probability. So the user knows each time which of the 4 puts him in the best position to eventually acquire the KEY options.
Any ideas?
EDIT:
here's an example:
say there are 7 options in the pool. option1, ..., option7
option7 requires option6; option6 requires option4 and option5;
option1 thru 5 dont require anything and can appear immediately, with respective probabilities option1.p, ..., option5.p;
the KEY option is, say, option7;
user gets 4 randomly (but weighted) chosen options among 1-5, and the program needs to say something like:
"if you choose (first), you have ##% chance of getting option7 in at most N tries." analogous for the other 3 options.
naturally, for some low N it is impossible to get option7, and for some large N it is certain. N can be chosen but is fixed.
EDIT: So, the point here is NOT the user chooses randomly. Point is - the program suggests which option to choose, as to maximize the probability that eventually, after N steps, the user will be offered all key options.
For the above example; say we choose N = 4. so the program needs to tell us which of the first 4 options that appeared (any 4 among option1-5), which one, when chosen, yields the best chance of obtaining option7. since for option7 you need option6, and for that you need option4 and option5, it is clear that you MUST select either option4 or option5 on the first set of choices. one of them is certain to appear, of course.
Let's say we get this for the first choice {option3, option5, option2, option4}. The program then says:
if you chose option3, you'll never get option7 in 4 steps. p = 0;
if you chose option5, you might get option7, p=....;
... option2, p = 0;
... option4, p = ...;
Whatever we choose, for the next 4 options, the p's are re calculated. Clearly, if we chose option3 or option2, every further choice has exactly 0 probability of getting us to option7. But for option4 and option5, p > 0;
Is it clearer now? I don't know how to getting these probabilities p.
This sounds like a moderately fiddly Markov chain type problem. Create a node for every state; a state has no history, and is just dependent on the possible paths out of it (each weighted with some probability). You put a probability on each node, the chance that the user is in that state, so, for the first step, there will be a 1 his starting node, 0 everywhere else. Then, according to which nodes are adjacent and the chances of getting to them, you iterate to the next step by updating the probabilities on each vertex. So, you can calculate easily which states the user could land on in, say, 15 steps, and the associated probabilities. If you are interested in asymptotic behaviour (what would happen if he could play forever), you make a big pile of linear simultaneous equations and just solve them directly or using some tricks if your tree or graph has a neat form. You often end up with cyclical solutions, where the user could get stuck in a loop, and so on.
If you think the user selects the options at random, and he is always presented the same distribution of options at a node, you model this as a random walk on a graph. There was a recent nice post on calculating terminating probabilities of a particular random walks on the mathematica blog.

Algorithm to calculate a page importance based on its views / comments

I need an algorithm that allows me to determine an appropriate <priority> field for my website's sitemap based on the page's views and comments count.
For those of you unfamiliar with sitemaps, the priority field is used to signal the importance of a page relative to the others on the same website. It must be a decimal number between 0 and 1.
The algorithm will accept two parameters, viewCount and commentCount, and will return the priority value. For example:
GetPriority(100000, 100000); // Damn, a lot of views/comments! The returned value will be very close to 1, for example 0.995
GetPriority(3, 2); // Ok not many users are interested in this page, so for example it will return 0.082
You mentioned doing this in an SQL query, so I'll give samples in that.
If you have a table/view Pages, something like this
Pages
-----
page_id:int
views:int - indexed
comments:int - indexed
Then you can order them by writing
SELECT * FROM Pages
ORDER BY
(0.3+LOG10(10+views)/LOG10(10+(SELECT MAX(views) FROM Pages))) +
(0.7+LOG10(10+comments)/LOG10(10+(SELECT MAX(comments) FROM Pages)))
I've deliberately chosen unequal weighting between views and comments. A problem that can arise with keeping an equal weighting with views/comments is that the ranking becomes a self-fulfilling prophecy - a page is returned at the top of the list, so it's visited more often, and thus gets more points, so it's shown at the stop of the list, and it's visited more often, and it gets more points.... Putting more weight on on the comments reflects that these take real effort and show real interest.
The above formula will give you ranking based on all-time statistics. So an article that amassed the same number of views/comments in the last week as another article amassed in the last year will be given the same priority. It may make sense to repeat the formula, each time specifying a range of dates, and favoring pages with higher activity, e.g.
0.3*(score for views/comments today) - live data
0.3*(score for views/comments in the last week)
0.25*(score for views/comments in the last month)
0.15*(score for all views/comments, all time)
This will ensure that "hot" pages are given higher priority than similarly scored pages that haven't seen much action lately. All values apart from today's scores can be persisted in tables by scheduled stored procedures so that the database isn't having to aggregate many many comments/view stats. Only today's stats are computed "live". Taking it one step further, the ranking formula itself can be computed and stored for historical data by a stored procedure run daily.
EDIT: To get a strict range from 0.1 to 1.0, you would motify the formula like this. But I stress - this will only add overhead and is unecessary - the absolute values of priority are not important - only their relative values to other urls. The search engine uses these to answer the question, is URL A more important/relevant than URL B? It does this by comparing their priorities - which one is greatest - not their absolute values.
// unnormalized - x is some page id
un(x) = 0.3*log(views(x)+10)/log(10+maxViews()) +
0.7*log(comments(x)+10)/log(10+maxComments())
// the original formula (now in pseudo code)
The maximum will be 1.0, the minimum will start at 1.0 and move downwards as more views/comments are made.
we define un(0) as the minimum value, i.e. (where views(x) and comments(x) are both 0 in the above formula)
To get a normalized formula from 0.1 to 1.0, you then compute n(x), the normalized priority for page x
(1.0-un(x)) * (un(0)-0.1)
n(x) = un(x) - ------------------------- when un(0) != 1.0
1.0-un(0)
= 0.1 otherwise.
Priority = W1 * views / maxViewsOfAllArticles + W2 * comments / maxCommentsOfAllArticles
with W1+W2=1
Although IMHO, just use 0.5*log_10(10+views)/log_10(10+maxViews) + 0.5*log_10(10+comments)/log_10(10+maxComments)
What you're looking for here is not an algorithm, but a formula.
Unfortunately, you haven't really specified the details of what you want, so there's no way we can provide the formula to you.
Instead, let's try to walk through the problem together.
You've got two incoming parameters, the viewCount and the commentCount. You want to return a single number, Priority. So far, so good.
You say that Priority should range between 0 and 1, but this isn't really important. If we were to come up with a formula we liked, but resulted in values between 0 and N, we could just divide the results by N-- so this constraint isn't really relevant.
Now, the first thing we need to decide is the relative weight of Comments vs Views.
If page A has 100 comments and 10 views, and page B has 10 comments and 100 views, which should have a higher priority? Or, should it be the same priority? You need to decide what's right for your definition of Priority.
If you decide, for example, that comments are 5 times more valuable than views, then we can begin with a formula like
Priority = 5 * Comments + Views
Obviously, this can be generalized to
Priority = A * Comments + B * Views
Where A and B are relative weights.
But, sometimes we want our weights to be exponential instead of linear, like
Priority = Comment ^ A + Views ^ B
which will give a very different curve than the earlier formula.
Similarly,
Priority = Comment ^ A * Views ^ B
will give higher value to a page with 20 comments and 20 views than one with 1 comment and 40 views, if the weights are equal.
So, to summarize:
You really ought to make a spreadsheet with sample values for Views and Comments, and then play around with various formulas until you get one that has the distribution that you are hoping for.
We can't do it for you, because we don't know how you want to value things.
I know it has been a while since this was asked, but I encountered a similar problem and had a different solution.
When you want to have a way to rank something, and there are multiple factors that you're using to perform that ranking, you're doing something called multi-criteria decision analysis. (MCDA). See: http://en.wikipedia.org/wiki/Multi-criteria_decision_analysis
There are several ways to handle this. In your case, your criteria have different "units". One is in units of comments, the other is in units of views. Futhermore, you may want to give different weight to these criteria based on whatever business rules you come up with.
In that case, the best solution is something called a weighted product model. See: http://en.wikipedia.org/wiki/Weighted_product_model
The gist is that you take each of your criteria and turn it into a percentage (as was previously suggested), then you take that percentage and raise it to the power of X, where X is a number between 0 and 1. This number represents your weight. Your total weights should add up to one.
Lastly, you multiple each of the results together to come up with a rank. If the rank is greater than 1, than the numerator page has a higher rank than the denominator page.
Each page would be compared against every other page by doing something like:
p1C = page 1 comments
p1V = page 1 view
p2C = page 2 comments
p2V = page 2 views
wC = comment weight
wV = view weight
rank = (p1C/p2C)^(wC) * (p1V/p2V)^(wV)
The end result is a sorted list of pages according to their rank.
I've implemented this in C# by performing a sort on a collection of objects implementing IComparable.
What several posters have essentially advocated without conceptual clarification is that you use linear regression to determine a weighting function of webpage view and comment counts to establish priority.
This technique is pretty easy to implement for your problem, and the basic concept is described well in this Wikipedia article on linear regression models.
A quick summary of how to apply it to your problem is:
Determine the parameters of the line which best fits the view and comment count data for all your site's webpages, i.e., use linear regression.
Use the line parameters to derive your priority function for the view/count parameters.
Code examples for basic linear regression should not be hard to track down if you don't want to implement it from scratch from basic math formulas (use the web, Numerical Recipes, etc.). Also, any general math software package like Matlab, R, etc., comes with linear regression functions.
The most naive approach would be the following:
Let v[i] the views of page i, c[i] the number of comments for page i, then define the relative view weight for page i to be
r_v(i) = v[i]/(sum_j v[j])
where sum_j v[j] is the total of the v[.] over all pages. Similarly define the relative comment weight for page i to be
r_c(i) = c[i]/(sum_j c[j]).
Now you want some constant parameter p: 0 < p < 1 which indicates the importance of views over comments: p = 0 means only comments are significant, p = 1 means only views are significant, and p = 0.5 gives equal weight.
Then set the priority to be
p*r_v(i) + (1-p)*r_c(i)
This might be over-simplistic but its probably the best starting point.

Resources