Time-related function for calculating a rating like reddit's - ranking

As we know, reddit has its own ranking algorithm, as does Stack Overflow.
I want to develop a ranking algorithm for dynamically ranking content that is waiting to be audited. Users audit the content with digg/bury. The ranking algorithm should make the oldest content, and the content with the fewest actions, get audited first.
Any ideas?

Here are the classical formulas: http://www.seomoz.org/blog/reddit-stumbleupon-delicious-and-hacker-news-algorithms-exposed
Formula:
(p - 1) / (t + 2)^1.5
Description:
Votes divided by age factor
p = votes (points) from users.
t = time since submission in hours.
1 is subtracted from p to negate the submitter's vote.
The age factor is (time since submission in hours plus two) to the power of 1.5.
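As a minimal sketch, the formula quoted above can be written as a small Python function. The name `rank_score` and the `gravity` parameter (the exponent 1.5) are illustrative choices, not part of the linked article.

```python
def rank_score(points: int, age_hours: float, gravity: float = 1.5) -> float:
    """Votes divided by an age factor: (p - 1) / (t + 2) ** gravity."""
    return (points - 1) / (age_hours + 2) ** gravity


# A fresh item with few votes can outrank an old item with many votes.
fresh = rank_score(points=11, age_hours=2)    # 10 / 4 ** 1.5
stale = rank_score(points=101, age_hours=98)  # 100 / 100 ** 1.5
```

Items would then be sorted by this score, recomputed as `age_hours` grows.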

Related

Equivalent of Bayesian average for unary rating system

I would really like to implement a Bayesian average rating system for a site I'm developing. I have run into a problem, though: all of the examples I can find on the net are for multi-value rating systems, with the smallest being binary, i.e. likes / dislikes (Apply Bayesian average in a NON 5-star rating system).
I cannot seem to understand how I could apply the binary Bayesian average to a unary rating system.
I have no dislikes, I have only likes.
Given the algorithm:
(n / (n + C)) * j + (C / (n + C)) * m
C is the average number of ratings an item receives
m is the average rating across all items
n is the number of ratings of the current item
j is the average rating for the current item
I get stuck on m, the average rating across all items. The average rating is 1 for everything.
How do I tweak this formula for a unary rating system?
Maybe there are other, better-suited equivalents of the Bayesian average for such a task?
Number of likes is a one-dimensional input, so it's hard to do anything interesting without another input. Two possibilities are how old the item is and how many users have viewed it.
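One possible reading of this suggestion, assuming view counts are available: treat every view as a binary trial (liked or not), so that j becomes likes / views and the formula from the question applies again. The names `prior_rate` (standing in for m) and `prior_weight` (standing in for C) are hypothetical, and the numbers are made up.

```python
def smoothed_like_rate(likes: int, views: int,
                       prior_rate: float, prior_weight: float) -> float:
    """(n / (n + C)) * j + (C / (n + C)) * m with n = views and j = likes / views.

    Algebraically this simplifies to (likes + C * m) / (views + C),
    which also works when views == 0.
    """
    return (likes + prior_weight * prior_rate) / (views + prior_weight)


# Assumed site-wide like rate of 10%, worth 20 "virtual" views.
one_lucky_view = smoothed_like_rate(likes=1, views=1, prior_rate=0.1, prior_weight=20)
well_tested = smoothed_like_rate(likes=50, views=200, prior_rate=0.1, prior_weight=20)
```

With these assumed values, an item liked on its only view stays close to the prior, while an item with 50 likes from 200 views ranks higher.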

Open-ended tournament pairing algorithm

I'm developing a tournament model for a virtual city commerce game (Urbien.com) and would love to get some algorithm suggestions. Here's the scenario and current "basic" implementation:
Scenario
Entries are paired up duel-style, like on the original Facemash or Pixoto.com.
The "player" is a judge, who gets a stream of dueling pairs and must choose a winner for each pair.
Tournaments never end: people can submit new entries at any time, and winners of the day/week/month/millennium are chosen based on the data at that date.
Problems to be solved
Rating algorithm - how to rate tournament entries and how to adjust their ratings after each match?
Pairing algorithm - how to choose the next pair to feed the player?
Current solution
Rating algorithm - the Elo rating system currently used in chess and other tournaments.
Pairing algorithm - our current algorithm recognizes two imperatives:
Give more duels to entries that have had fewer duels so far
Match people with similar ratings with higher probability
Given:
N = total number of entries in the tournament
D = total number of duels played in the tournament so far by all players
Dx = how many duels player x has had so far
To choose players x and y to duel, we first choose player x with probability:
p(x) = (1 - (Dx / D)) / N
Then choose player y the following way:
Sort the players by rating
Let the probability of choosing player j at index jIdx in the sorted list be:
p(j) = 0, if j == x
p(j) = n * r^abs(jIdx - xIdx), otherwise
where 0 < r < 1 is a coefficient to be chosen, and n is a normalization factor.
Basically, the probabilities in either direction from x form a geometric series, normalized so they sum to 1.
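The pairing scheme described above could be sketched in Python as follows. This is only an illustration: the function name, argument layout and sample data are assumptions. The weights for x are passed to `random.choices`, which normalizes them, so they are used as relative weights rather than as probabilities that already sum to 1.

```python
import random


def choose_pair(ratings, duels, total_duels, r=0.5, rng=random):
    """ratings, duels: dicts keyed by entry id. Returns the ids (x, y) of the next duel."""
    ids = list(ratings)
    # Weight of x is 1 - Dx / D: entries with fewer duels are picked more often.
    if total_duels > 0:
        weights_x = [1 - duels[i] / total_duels for i in ids]
    else:
        weights_x = [1.0] * len(ids)
    if sum(weights_x) <= 0:  # e.g. only two entries that have always met each other
        weights_x = [1.0] * len(ids)
    x = rng.choices(ids, weights=weights_x)[0]

    # Weight of y decays geometrically with its distance from x in the rating order.
    ranked = sorted(ids, key=lambda i: ratings[i])
    x_idx = ranked.index(x)
    weights_y = [0.0 if j == x else r ** abs(j_idx - x_idx)
                 for j_idx, j in enumerate(ranked)]
    y = rng.choices(ranked, weights=weights_y)[0]
    return x, y


ratings = {"a": 1500, "b": 1520, "c": 1400, "d": 1610}
duels = {"a": 3, "b": 1, "c": 2, "d": 0}
pair = choose_pair(ratings, duels, total_duels=3, rng=random.Random(0))
```

Sorting on every call is the simplest option; with many entries, keeping the list sorted between calls would avoid the repeated sort.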
Concerns
Maximize informational value of a duel - pairing the lowest rated entry against the highest rated entry is very unlikely to give you any useful information.
Speed - we don't want to do massive amounts of calculations just to choose one pair. One alternative is to use something like the Swiss pairing system and pair up all entries at once, instead of choosing new duels one at a time. This has the drawback (?) that all entries submitted in a given timeframe will experience roughly the same amount of duels, which may or may not be desirable.
Equilibrium - Pixoto's ImageDuel algorithm detects when entries are unlikely to further improve their rating and gives them fewer duels from then on. The benefits of such detection are debatable. On the one hand, you can save on computation if you "pause" half the entries. On the other hand, entries with established ratings may be the perfect matches for new entries, to establish the newbies' ratings.
Number of entries - if there are just a few entries, say 10, perhaps a simpler algorithm should be used.
Wins/Losses - how does the player's win/loss ratio affect the next pairing, if at all?
Storage - what to store about each entry and about the tournament itself? Currently stored:
Tournament Entry: # duels so far, # wins, # losses, rating
Tournament: # duels so far, # entries
Instead of throwing in Elo and ad-hoc probability formulae, you could use a standard approach based on the maximum likelihood method.
The maximum likelihood method is a method for parameter estimation and it works like this (example). Every contestant (player) is assigned a parameter s[i] (1 <= i <= N where N is total number of contestants) that measures the strength or skill of that player. You pick a formula that maps the strengths of two players into a probability that the first player wins. For example,
P(i, j) = 1/(1 + exp(s[j] - s[i]))
which is the logistic curve (see http://en.wikipedia.org/wiki/Sigmoid_function). Once you have a table that shows the actual results between the contestants, you use numerical optimization (e.g. gradient descent) to find those strength parameters s[1] .. s[N] that maximize the probability of the actually observed match results. E.g. if you have three contestants and have observed two results:
Player 1 won over Player 2
Player 2 won over Player 3
then you find parameters s[1], s[2], s[3] that maximize the value of the product
P(1, 2) * P(2, 3)
Incidentally, it can be easier to maximize
log P(1, 2) + log P(2, 3)
Note that if you use something like the logistic curve, it is only the difference of the strength parameters that matters, so you need to anchor the values somewhere, e.g. choose arbitrarily
s[1] = 0
In order to have more recent matches "weigh" more, you can adjust the importance of the match results based on their age. If t measures the time since a match took place (in some time units), you can maximize the value of the sum (using the example)
e^-t log P(1, 2) + e^-t' log P(2, 3)
where t and t' are the ages of the matches 1-2 and 2-3, so that those games that occurred more recently weigh more.
The interesting thing in this approach is that when the strength parameters have values, the P(...) formula can be used immediately to calculate the win/lose probability for any future match. To pair contestants, you can pair those where the P(...) value is close to 0.5, and then prefer those contestants whose time-adjusted number of matches (sum of e^-t1 + e^-t2 + ...) for match ages t1, t2, ... is low. The best thing would be to calculate the total impact of a win or loss between two players globally and then prefer those matches that have the largest expected impact on the ratings, but that could require lots of calculations.
You don't need to run the maximum likelihood estimation / global optimization algorithm all the time; you can run it e.g. once a day as a batch run and use the results for the next day for matching people together. The time-adjusted match masses can be updated real time anyway.
On the algorithm side, you can sort the players by their s parameter after the maximum likelihood run, so it's very easy to find equal-strength players quickly.
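A rough sketch of the estimation step described above, using plain gradient ascent on the time-weighted log-likelihood. The function names, the learning rate, the step count and the sample matches are all assumptions made for illustration. Players are indexed from 0 here, so the anchor is s[0] = 0. Note that if some player has never lost, the true maximum lies at infinity; the fixed number of steps only keeps the numbers finite in that case.

```python
import math


def fit_strengths(n_players, matches, lr=0.1, steps=2000):
    """matches: list of (winner, loser, age). Returns strengths with s[0] anchored at 0."""
    s = [0.0] * n_players
    for _ in range(steps):
        grad = [0.0] * n_players
        for winner, loser, age in matches:
            p_win = 1.0 / (1.0 + math.exp(s[loser] - s[winner]))  # P(winner, loser)
            weight = math.exp(-age)  # recent matches weigh more
            # d/ds of weight * log P(winner, loser)
            grad[winner] += weight * (1.0 - p_win)
            grad[loser] -= weight * (1.0 - p_win)
        s = [si + lr * gi for si, gi in zip(s, grad)]
    return [si - s[0] for si in s]  # only differences matter, so shift s[0] to 0


def win_probability(s, i, j):
    """P(i, j) = 1 / (1 + exp(s[j] - s[i]))"""
    return 1.0 / (1.0 + math.exp(s[j] - s[i]))


# Made-up results: (winner, loser, age); all ages are 0 for simplicity.
matches = [(0, 1, 0), (0, 1, 0), (1, 0, 0),
           (1, 2, 0), (1, 2, 0), (2, 1, 0),
           (0, 2, 0), (0, 2, 0), (0, 2, 0), (2, 0, 0)]
strengths = fit_strengths(3, matches)
```

For pairing, `win_probability` can then be evaluated for candidate pairs, preferring those close to 0.5.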

Shopping cart minimization algorithm

I have a list of products, each with a list of the shops that sell it.
{
'Book A': [ShopA, ShopB, ShopC],
'Book B': [ShopC, ShopD],
'Movie C': [ShopA, ShopB, ShopD, ShopE],
...
}
(Price differs between the shops)
Each shop also has a shipping cost. It's a "per-order" shipping cost: it doesn't matter how many items are in my cart. It differs between the shops too.
Ex: if I buy "Book A" from ShopA, "Book B" from ShopC and "Movie C" from ShopA, the resulting price is: Book A price in ShopA + Book B price in ShopC + Movie C price in ShopA + ShopC shipping cost + ShopA shipping cost
If the shipping cost were zero, or if it were constant and charged per item, then I would just sort the offer lists by the price+shipping field and fetch the first result from each set.
I need to buy every item exactly once and find the minimal total price and the resulting set of purchases.
I'm not very good with optimization algorithms and dynamic programming, so I need a solution or just a nudge in the right direction.
This problem is NP Hard.
We will show a reduction from the Hitting Set problem.
Hitting Set problem: Given sets S1,S2,...,Sn and a number k: choose a set S of size k, such that for every Si there is an element s in S such that s is in Si. [alternative definition: the intersection between each Si and S is not empty].
Reduction:
Given an instance of hitting set, in the form of (S1,...,Sn,k) create an instance of this problem:
All books cost nothing. In order to buy from each store you pay 1.
Book i is sold by each store in Si, and the minimal price for this instance is k.
proof:
Hitting Set -> This problem: Assume there is a minimal hitting set in (S1,...,Sn) of size k. Let this hitting set be S. By buying from each store in S, we can buy all our books at cost k, since the books cost nothing [in our construction], and we bought all books, and we paid for the ordering from stores exactly k, thus the total price was k.
This problem -> Hitting set: Assume there is a pricing of k for the problem at the question. Then, from the building of the problem, and since the books cost nothing, we need to buy in k different stores to get all books. Let these stores be S. From the construction of the problem, S is a hitting set for (S1,...,Sn)
Q.E.D.
Conclusion:
Thus, this problem is "not easier than" the Hitting Set problem, and there is no known polynomial solution for it. So your best shot, if you want an optimal solution, is probably an exponential one, such as backtracking [check all possibilities, and return the minimal solution].
With so few items, I have a solution. It uses dynamic programming.
We will process the shops one at a time. At every step we store, for every subset of items, the best price found so far for covering that subset. In the beginning all of them have infinite price, except for the empty subset, which has price 0. Note that there are 2^Num_products subsets, but in your case that is only about 1000.
Now, how do we process the next shop? For every subset of the products, consider covering part of it with this shop (I mean a part that the shop can actually provide) and the rest with the shops already processed, thus improving the minimal cost of covering every subset. This step takes 2^Num_products * 2^Num_products = 4^Num_products operations, still about a million, which is bearable. You do this for every shop, and at the end the answer is the cost of covering all the elements. The overall complexity of the proposed solution is 4^Num_products * num_shops, which is about 50 million, which is good to go.
Note that this is still exponential and this is not surprising. Thank you to amit for his incredible proof of NP hard.
EDIT Adding further explanation of the algorithm in pseudocode:
init:
    costs[subset] = infinity   (for every subset)
    costs[{}] = 0
for shop in shops
    new_prices = costs.dup()
    for set : subsets
        for covered_set : all_subsets(set)
            price = covered_set == {} ? 0 : delivery[shop]
            remaining = copy of set
            for element : covered_set
                if shop does not sell element
                    break for, choose next covered_set
                price += el_price[element]
                remaining.remove(element)
            price += costs[remaining]
            new_prices[set] = min(new_prices[set], price)
    costs = new_prices
return costs[all]
Note that here I use sets as indices. This is because I actually use the bitmask representation of the subsets, e.g. 1101 is a subset containing the 1st, 2nd and the fourth element. Thus an iteration over all sets is for (int i = 0; i < (1 << n); i++).
There is also one more thing: if you want to cycle through all the subsets of a subset S, you can actually do it faster than iterating over all the subsets of the initial set and checking whether each one is a subset of S. If S is also represented with a bitmask bit_mask, this for loop does the job: for (int i = bit_mask; i > 0; i = (i - 1) & bit_mask). Using this approach you decrease the complexity of the algorithm to 3^Num_products * num_shops. However, this is a bit harder to understand, and you will probably need to work through one example by hand to make sure the loop I wrote actually cycles through all the non-empty subsets of S. About the complexity - just trust me.
EDIT2 Edited the break condition. also let me elaborate on the set remaining and its calculation: as dmzkrsk pointed out the pseudocode mentions removal from the set, but you can actually just assign remaining = set ^ covered_set (again bit operation) in case of using bitmasks to represent the subsets.
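For illustration, here is one way the bitmask version of this dynamic programme might look in Python. It is a sketch under assumptions: the function name, the data layout ({product: {shop: price}}) and all prices and shipping costs are made up, and it returns only the minimal total, not the chosen shops.

```python
def min_total_cost(prices, shipping):
    """prices: {product: {shop: price}}; shipping: {shop: per-order cost}."""
    products = list(prices)
    n = len(products)
    full = (1 << n) - 1
    inf = float("inf")
    cost = [inf] * (1 << n)  # cost[mask] = cheapest way found so far to buy `mask`
    cost[0] = 0

    for shop, delivery in shipping.items():
        item = [prices[p].get(shop, inf) for p in products]
        sells = sum(1 << i for i in range(n) if item[i] != inf)
        # shop_sum[mask] = sum of this shop's item prices for the products in mask
        shop_sum = [0] * (1 << n)
        for mask in range(1, 1 << n):
            low = mask & -mask
            shop_sum[mask] = shop_sum[mask ^ low] + item[low.bit_length() - 1]

        new_cost = cost[:]
        for s in range(1, 1 << n):
            available = s & sells
            covered = available
            while covered:  # every non-empty subset of s that this shop sells
                price = delivery + shop_sum[covered] + cost[s ^ covered]
                new_cost[s] = min(new_cost[s], price)
                covered = (covered - 1) & available
        cost = new_cost
    return cost[full]


prices = {
    "Book A": {"ShopA": 10, "ShopB": 9, "ShopC": 11},
    "Book B": {"ShopC": 5, "ShopD": 4},
    "Movie C": {"ShopA": 8, "ShopB": 7, "ShopD": 9},
}
shipping = {"ShopA": 3, "ShopB": 5, "ShopC": 2, "ShopD": 6}
best = min_total_cost(prices, shipping)
```

The result is infinity when some product is not sold by any shop. Recovering the actual shopping list would require storing, for each mask, which shop and which covered subset produced the minimum.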
I have dealt with this exact problem once. I didn't come up with any other solution than just testing every possible combination of shops but there is an easy way to filter out many of the shops in every product.
1. Calculate the lowest price (shipping cost included) of every product, let's call it best_price.
2. For every product, retain only the shops whose price (without shipping cost) <= best_price (with shipping cost)
3. Test every possible combination of shops for the cheapest.
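Steps 1 and 2 of this filter could be sketched as below. The function name and the data layout ({product: {shop: price}}) are assumptions, and the numbers are invented.

```python
def prune_offers(prices, shipping):
    """Drop, per product, every shop whose bare price exceeds best_price."""
    pruned = {}
    for product, offers in prices.items():
        # Step 1: lowest price for this product with shipping cost included.
        best_price = min(p + shipping[shop] for shop, p in offers.items())
        # Step 2: keep a shop only if its price without shipping is <= best_price.
        pruned[product] = {shop: p for shop, p in offers.items() if p <= best_price}
    return pruned


offers = {"Book A": {"ShopA": 10, "ShopB": 9, "ShopC": 20}}
shipping_costs = {"ShopA": 3, "ShopB": 5, "ShopC": 2}
kept = prune_offers(offers, shipping_costs)
```

Here best_price is 13 (ShopA with shipping), so ShopC at 20 can never be the right place to buy Book A and is removed before the combinations are tested.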
A good heuristic can be ant colony optimization. I use it to solve the travelling salesman problem. You can find a working example in the google tsp solver. It's a JavaScript library that also uses a brute force and a dynamic programming solution. ACO is used when you have more cities to compute than the current limit of 20 cities. I believe you can use the library to solve your problem, and it just needs a little rewrite. With 20 cities the program has to check 20! possibilities. In your case it's a bit lighter, but maybe only by an order of magnitude.

How to provide most relevant results with Multiple Factor Weighted Sorting

I need to provide a weighted sort on 2+ factors, ordered by "relevancy". However, the factors aren't completely isolated, in that I want one or more of the factors to affect the "urgency" (weight) of the others.
Example: contributed content (articles) can be up-/down-voted, and thus have a rating; they have a post date, and they're also tagged with categories. Users write the articles and can vote, and may or may not have some kind of ranking themselves (expert, etc). Probably similar to StackOverflow, right?
I want to provide each user with a list of articles grouped by tag but sorted by "relevancy", where relevancy is calculated based on the rating and age of the article, and possibly affected by the ranking of the author. I.E. a highly ranked article that was written several years ago may not necessarily be as relevant as a medium ranked article written yesterday. And maybe if an article was written by an expert it would be treated as more relevant than one written by "Joe Schmoe".
Another good example would be assigning hotels a "meta score" comprised of price, rating, and attractions.
My question is, what is the best algorithm for multiple factor sorting? This may be a duplicate of that question, but I'm interested in a generic algorithm for any number of factors (a more reasonable expectation is 2 - 4 factors), preferably a "fully-automatic" function that I don't have to tweak or require user input, and I can't parse linear algebra and eigenvector wackiness.
Possibilities I've found so far:
Note: S is the "sorting score"
"Linearly weighted" - use a function like: S = (w1 * F1) + (w2 * F2) + (w3 * F3), where wx are arbitrarily assigned weights, and Fx are the values of the factors. You'd also want to normalize F (i.e. Fx_n = Fx / Fmax). I think this is kinda how Lucene search works.
"Base-N weighted" - more like grouping than weighting, it's just a linear weighting where weights are increasing multiples of base-10 (a similar principle to CSS selector specificity), so that more important factors are significantly higher: S = 1000 * F1 + 100 * F2 + 10 * F3 ....
Estimated True Value (ETV) - this is apparently what Google Analytics introduced in their reporting, where the value of one factor influences (weights) another factor - the consequence being to sort on more "statistically significant" values. The link explains it pretty well, so here's just the equation: S = (F2 / F2_max * F1) + ((1 - (F2 / F2_max)) * F1_avg), where F1 is the "more important" factor ("bounce rate" in the article), and F2 is the "significance modifying" factor ("visits" in the article).
Bayesian Estimate - looks really similar to ETV, this is how IMDb calculates their rating. See this StackOverflow post for explanation; equation: S = (F2 / (F2+F2_lim)) * F1 + (F2_lim / (F2+F2_lim)) * F1_avg, where Fx are the same as in #3, and F2_lim is the minimum threshold limit for the "significance" factor (i.e. any value less than X shouldn't be considered).
Options #3 or #4 look really promising, since you don't really have to choose an arbitrary weighting scheme like you do in #1 and #2, but the problem is how do you do this for more than two factors?
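To make the listed formulas concrete, here is a small Python sketch of options #1, #3 and #4. The function and parameter names are illustrative and simply mirror the symbols used above.

```python
def linear_score(factors, weights):
    """Option #1: S = w1 * F1 + w2 * F2 + ... (factors assumed already normalized)."""
    return sum(w * f for w, f in zip(weights, factors))


def etv_score(f1, f2, f1_avg, f2_max):
    """Option #3: S = (F2 / F2_max * F1) + ((1 - F2 / F2_max) * F1_avg)."""
    share = f2 / f2_max
    return share * f1 + (1 - share) * f1_avg


def bayesian_score(f1, f2, f1_avg, f2_lim):
    """Option #4: S = (F2 / (F2 + F2_lim)) * F1 + (F2_lim / (F2 + F2_lim)) * F1_avg."""
    return (f2 / (f2 + f2_lim)) * f1 + (f2_lim / (f2 + f2_lim)) * f1_avg
```

In both #3 and #4, a record with little "significance" (small F2) is pulled toward the average F1_avg, while a record with a lot of it keeps its own F1.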
I also came across the SQL implementation for a two-factor weighting algorithm, which is basically what I'll need to write eventually.
As mentioned in the comments, I would suggest what's called the 'compromise solution' to anyone with a similar problem who is more concerned with not having to set weights than with making one criterion more heavily weighted than the others.
Basically, you consider each of your criteria as a coordinate (after normalization, of course). Based on your judgement, you choose the absolute optimal point, e.g. in this case, the highest rank author, the newest article, etc. Once you choose the optimal solution, each other 'solution' is rated based on its distance from that optimal. A sample formula would be the inverse of the Euclidean distance for each article's score: S = 1/(sqrt((rank - rank_ideal)^2 + (age - age_ideal)^2 + ... + (xn - xn_ideal)^2)).
This treats all criteria as equal, so keep that in mind.
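A minimal sketch of this idea in Python, assuming every criterion has already been normalized to the 0..1 range. It sorts by the distance itself rather than by its inverse, which gives the same order while avoiding a division by zero for an item that sits exactly on the ideal point. The criterion names and values are made up.

```python
import math


def distance_to_ideal(item, ideal):
    """Euclidean distance between an item's criteria and the ideal point."""
    return math.sqrt(sum((item[k] - ideal[k]) ** 2 for k in ideal))


def rank_by_compromise(items, ideal):
    """Items closest to the ideal point come first."""
    return sorted(items, key=lambda item: distance_to_ideal(item, ideal))


ideal = {"rank": 1.0, "freshness": 1.0}
articles = [
    {"name": "a", "rank": 0.9, "freshness": 0.2},
    {"name": "b", "rank": 0.6, "freshness": 0.9},
    {"name": "c", "rank": 0.3, "freshness": 0.3},
]
ordered = [article["name"] for article in rank_by_compromise(articles, ideal)]
```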
Consider chaining of the weights. E.g. you have 3 factors: X, Y and Z.
You can calculate ETVyz as W = (Z/Zmax * Y) + (1 - Z/Zmax) * Yavg for each record and then calculate ETVxw as S = (W/Wmax * X) + (1 - W/Wmax) * Xavg.
You can chain more factors similarly.
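The chaining described above might look like this in Python. This is a sketch with hypothetical names; it takes one list per factor, with one value per record, and assumes the largest modifier value is positive.

```python
def etv(primary, modifier):
    """ETV of `primary` weighted by `modifier`; both are lists with one value per record."""
    avg = sum(primary) / len(primary)
    top = max(modifier)  # assumed to be > 0
    return [(m / top) * p + (1 - m / top) * avg for p, m in zip(primary, modifier)]


def chained_etv(x, y, z):
    w = etv(y, z)     # ETVyz: W = (Z/Zmax * Y) + (1 - Z/Zmax) * Yavg
    return etv(x, w)  # ETVxw: S = (W/Wmax * X) + (1 - W/Wmax) * Xavg


scores = chained_etv(x=[1.0, 2.0, 3.0], y=[4.0, 2.0, 3.0], z=[10.0, 5.0, 0.0])
```

Adding a fourth factor would mean wrapping the result in one more `etv` call in the same way.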
The solution briefly pointed out by @gankoji is a simplification of the TOPSIS method.
In TOPSIS the compromise solution can be regarded as choosing the solution with the shortest Euclidean distance from the ideal solution and the farthest Euclidean distance from the negative ideal solution.
This class of problems falls under the term MCDM - Multiple Criteria Decision Making.
Python packages scikit-criteria and mcdm provide implementations of most popular methods. The package docs link to the respective algorithm papers.

How to balance number of ratings versus the ratings themselves?

For a school project, we'll have to implement a ranking system. However, we figured that a dumb rank average would suck: something that one user ranked 5 stars would have a better average than something 188 users ranked 4 stars, and that's just stupid.
So I'm wondering if any of you have an example algorithm of "smart" ranking. It only needs to take in account the rankings given and the number of rankings.
Thanks!
You can use a method inspired by Bayesian probability. The gist of the approach is to have an initial belief about the true rating of an item, and use users' ratings to update your belief.
This approach requires two parameters:
What do you think is the true "default" rating of an item, if you have no ratings at all for the item? Call this number R, the "initial belief".
How much weight do you give to the initial belief, compared to the user ratings? Call this W, where the initial belief is "worth" W user ratings of that value.
With the parameters R and W, computing the new rating is simple: assume you have W ratings of value R along with any user ratings, and compute the average. For example, if R = 2 and W = 3, we compute the final score for various scenarios below:
100 (user) ratings of 4: (3*2 + 100*4) / (3 + 100) = 3.94
3 ratings of 5 and 1 rating of 4: (3*2 + 3*5 + 1*4) / (3 + 3 + 1) = 3.57
10 ratings of 4: (3*2 + 10*4) / (3 + 10) = 3.54
1 rating of 5: (3*2 + 1*5) / (3 + 1) = 2.75
No user ratings: (3*2 + 0) / (3 + 0) = 2
1 rating of 1: (3*2 + 1*1) / (3 + 1) = 1.75
This computation takes into consideration the number of user ratings, and the values of those ratings. As a result, the final score roughly corresponds to how happy one can expect to be about a particular item, given the data.
Choosing R
When you choose R, think about what value you would be comfortable assuming for an item with no ratings. Is the typical no-rating item actually 2.4 out of 5, if you were to instantly have everyone rate it? If so, R = 2.4 would be a reasonable choice.
You should not use the minimum value on the rating scale for this parameter, since an item rated extremely poorly by users should end up "worse" than a default item with no ratings.
If you want to pick R using data rather than just intuition, you can use the following method:
Consider all items with at least some threshold of user ratings (so you can be confident that the average user rating is reasonably accurate).
For each item, assume its "true score" is the average user rating.
Choose R to be the median of those scores.
If you want to be slightly more optimistic or pessimistic about a no-rating item, you can choose R to be a different percentile of the scores, for instance the 60th percentile (optimistic) or 40th percentile (pessimistic).
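The data-driven choice of R described above can be sketched with the standard library. The function name, the `min_ratings` threshold of 20 and the sample data are assumptions.

```python
from statistics import median


def estimate_r(ratings_by_item, min_ratings=20):
    """Median of the average rating over items with at least `min_ratings` ratings."""
    averages = [sum(r) / len(r)
                for r in ratings_by_item.values()
                if len(r) >= min_ratings]
    return median(averages)  # raises StatisticsError if no item passes the threshold


sample = {
    "a": [4] * 20,
    "b": [2] * 25,
    "c": [3] * 30,
    "d": [5],  # too few ratings, ignored
}
r = estimate_r(sample)
```

For a more optimistic or pessimistic R, the median could be swapped for another percentile of `averages`.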
Choosing W
The choice of W should depend on how many ratings a typical item has, and how consistent ratings are. W can be higher if items naturally obtain many ratings, and W should be higher if you have less confidence in user ratings (e.g., if you have high spammer activity). Note that W does not have to be an integer, and can be less than 1.
Choosing W is a more subjective matter than choosing R. However, here are some guidelines:
If a typical item obtains C ratings, then W should not exceed C, or else the final score will be more dependent on R than on the actual user ratings. Instead, W should be close to a fraction of C, perhaps between C/20 and C/5 (depending on how noisy or "spammy" ratings are).
If historical ratings are usually consistent (for an individual item), then W should be relatively small. On the other hand, if ratings for an item vary wildly, then W should be relatively large. You can think of this algorithm as "absorbing" W ratings that are abnormally high or low, turning those ratings into more moderate ones.
In the extreme, setting W = 0 is equivalent to using only the average of user ratings. Setting W = infinity is equivalent to proclaiming that every item has a true rating of R, regardless of the user ratings. Clearly, neither of these extremes is appropriate.
Setting W too large can have the effect of favoring an item with many moderately-high ratings over an item with slightly fewer exceptionally-high ratings.
I appreciated the top answer at the time of posting, so here it is codified as JavaScript:
const defaultR = 2;
// Should not exceed the typical number of ratings per answer; 0 is equivalent to using only the average of ratings.
const defaultW = 3;
function getSortAlgoValue(ratings) {
  const allRatings = ratings.reduce((sum, r) => sum + r, 0);
  return (defaultR * defaultW + allRatings) / (defaultW + ratings.length);
}
Only listed as a separate answer because the formatting of the code block as a reply wasn't very
Since you've stated that the machine would only be given the rankings and the number of rankings, I would argue that it may be negligent to attempt a calculated weighting method.
First, there are too many unknowns to confirm the proposition that, in enough circumstances, a larger quantity of ratings is a better indication of quality than a smaller number of ratings. One example is how long rankings have been given. Has there been equal collection duration (equal attention) given to different items ranked with this same method? Others are: which markets have had access to this item and, of course, who specifically ranked it?
Secondly, you've stated in a comment below the question that this is not for front-end use but rather "the ratings are generated by machines, for machines," as a response to my comment that "it's not necessarily only statistical. One person might consider 50 ratings enough, where that might not be enough for another. And some raters' profiles might look more reliable to one person than to another. When that's transparent, it lets the user make a more informed assessment."
Why would that be any different for machines? :)
In any case, if this is about machine-to-machine rankings, the question needs greater detail in order for us to understand how different machines might generate and use the rankings.
Can a ranking generated by a machine be flawed (so as to suggest that more rankings may somehow compensate for those "flawed" rankings)? What does that even mean - is it a machine error? Or is it because the item has no use to this particular machine, for example? There are many issues here we might first want to unpack, including this one: if we have access to how the machines are generating the ranking, on some level we may already know the meaning this item may have for this machine, making the aggregated ranking superfluous.
What you can find on different platforms is the blanking of ratings without enough votes: "This item does not have enough votes".
The problem is that you can't express this in a simple formula for calculating a ranking.
I would suggest hiding the ranking of items with fewer than a minimum number of votes, but calculating a moving average internally. I always prefer a moving average over a total average, as it favors recent votes over very old votes, which might have been given under totally different circumstances.
Additionally, you do not need to keep a list of all votes. You just keep the calculated average, and the next vote simply updates this value.
newAverage = weight * newVoting + (1-weight) * oldAverage
with a weight of about 0.05 for a preference of the last 20 values (just experiment with this weight).
Additionally, I would start with these conditions:
no votes = medium range value (1-5 stars => start with 3 stars)
the average will not be shown if fewer than 10 votes were given.
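A small sketch of this moving-average scheme in Python. The class and attribute names are hypothetical; the defaults follow the values suggested above (start at 3 stars, weight 0.05, hide below 10 votes).

```python
class MovingRating:
    """Keeps only a running average and a vote count, not the individual votes."""

    def __init__(self, start=3.0, weight=0.05, min_votes=10):
        self.average = start
        self.weight = weight
        self.min_votes = min_votes
        self.votes = 0

    def add_vote(self, vote):
        # newAverage = weight * newVoting + (1 - weight) * oldAverage
        self.average = self.weight * vote + (1 - self.weight) * self.average
        self.votes += 1

    def displayed(self):
        """The average, or None while there are too few votes to show it."""
        return self.average if self.votes >= self.min_votes else None


rating = MovingRating()
rating.add_vote(5)
```

After this single 5-star vote the internal average has moved from 3 to 3.1, but `displayed()` still returns None until ten votes have arrived.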
A simple solution might be a plain average:
sum(votes) / number_of_votes
That way, 3 people voting 1 star, and one person voting 5 would give an average of (1+1+1+5)/4 = 2 stars.
Simple, effective, and probably sufficient for your purposes.

Resources