Voting algorithm: how to calculate rank? - ranking

I am trying to figure our a way to calculate rank. Right now it simply takes ratio of wins / losses of each individual entry, so e.g. one won 99 times out of a 100, it has 99% winning rank. BUT if an entry won 1 out of total 1 votes, it will have a 100% winning rank, but definitely it can't be higher that of the one that won 99 times. What would be a better way to do this?

Try something like this:
votes = wins + losses
score = votes * ( wins / votes )
That way, something with 50% wins, but a million votes would still be ahead of something with 100% wins but only one vote.
You can add in an extra weight based on age (in days in this example), too, something like
if age < 5:
score = score + ((highest real score on site) * ((5 - age) / 5)
This will put brand new entries right at the top of the first page, and then they will move slowly down the list over the course of the next 5 days (I'm assuming age is a fractional number, not just an integer). After the 5 days are up, they will be put in the list based solely on the score from the previous bit of pseudo-code.

Depending on how complicated you want to make it, the Elo system chess uses (or something similar) may be what you want: http://en.wikipedia.org/wiki/Elo_rating_system
Even if a person has won 1/1 matches, his rating would be far below someone who has won/lost hundreds of matches against tough opponents, for instance.

You could always use a point system rather than win/loss ratio. Winning would always give points and then you could play around with either removing points for losing, not awarding points at all for losing, or awarding less points for losing. It all depends on exactly how you want people to be ranked. For example you may want to give 2 points for winning and 1 point for losing if you want to favor people who participate over those who do not (which sounds kind of like what you were talking about in your example of the person playing 100 games vs 1 game). The NHL uses a similar technique for rankings (2 points for a win, 1 point for an overtime loss, 0 points for a regular loss). That might give you some more flexibility.

if i understand the question correctly, then whoever gets more votes has the higher rank.

Would it make sense to add more rank to winning entry if losing entry originally had a much higher rank, e.g. much stronger competitor?

Related

Algorithm to Make a Set of Random Outcomes Approach a Specific Percentage

Currently, I have a pool of basketball players where I have a projected total of points for each player. Additionally, I have a normal distribution function that gives me a random drawing from a normal distribution for each player. Currently, I have an algorithm that calculates n unique random lineups of 8 players based on some constraints. Between each lineup, the normal distribution function runs again to produce new predictions for each player. Then the best lineup is produced for that specific set of predictions.
I would like to tweak this algorithm in the following way. I would like to have 4 tiers of maximum and minimum percentages where each player is assigned a tier. Within the number of lineups generated, I would like each specific player to occur with that frequency. So for example if I wanted to generate 10 lineups and player 1 is in tier 1 which requires the player to be between 50-60%, then the player would occur in 5-6 lineups ideally.
I'm struggling with how to modify my current algorithm to include this stipulation. Any thoughts would be greatly appreciated! I just don't know how to force each player within a specific range of percentages.
There are a lot of ways to do it.
Here is an easy approach. Keep a current relative odds of being picked for each player. The actual probability is the relative odds divided by the sum of the odds. Each person starts with the expected number of times be selected. Whenever someone is selected, their relative odds is reduced by 1. If it goes below 0, that person is out of the pool.
This approach guarantees that each player will not be in more than a maximum number of teams. It makes it unlikely, but not impossible, that any given player will be in fewer teams than you want.
An easy way to solve that is to randomly round people's desired frequencies up and down to get the right integer count. And now everything has to come even.
There is yet another problem, though. Which is that it is possible that you'll not succeed in assignment to fill all the teams. But if you go from the most popular player to the least, the odds of such mistakes should be acceptably low. Doubly so if you widen the ranges slightly by populating a few extra teams, then throwing away ones that didn't work out.
First draft
So if I understand correctly, you have N players that might appear in the first
position of the string. But you want them to be selected not at random, but according
to some percentage.
Now the first step is to normalize those percentages:
Alice 20%
Bob 40%
Charlie 10%
Doug 60%
Eric 30%
The sum is 160%, so you generate a random number from 1 to 160; say it's 97.
97 is more than 20, so subtract 20 and ignore Alice.
77 is more than 40, so subtract 40 and ignore Bob.
37 is more than 10, so subtract 10 and ignore Charlie.
27 is less than 60: Doug it is.
You can also pre-populate a 160-element array with 20 "Alice" indexes, 60 "Doug" indexes etc., and your player is players[array[random(160)]].

How to give players a score on a ranking/prediction task?

I have a website built with php/mysql, and I am looking for help in communicating to a Programmer what I want him to do with a Poll/Prediction game that I am trying to create.
For purposes of discussion, assume a game where perhaps 100 players try to predict the top 5 finishers in a Golf Tournament of perhaps 9 Golfers.
I am looking for help in how to create and assign a score based upon the accuracy of prediction.
The players provide a rank ordering using a drag and drop function to order the players from 1 through 5. This ordering has already been coded, and the ranks are stored somehow in the DB (I do not know how).
My initial thinking is to ask the coder to create a script which will assign a score from 1 to 5 for each Golfer that the player nominated to be in the Top 5.
So, a player who predicted perfectly would be awarded a perfect score of 12345.
His first golfer received a 1 for finishing first, second a 2 for finishing second, third golfer receives a 3 for finishing third, and so on.
Anybody less than perfect would have a score higher than 12345.
Players who got the first four positions correct would have to be differentiated on the basis of the finish of their fifth Golfer.
So, one might score 12347 and the other 12348 and the player with the highest score (12348) would be the loser in a matchup of the two players.
A player who did poorly, might have a score of 53419.
Question:
Is this a viable way of creating a score which the players of my game can be ranked upon?
Is it possible to instead simply have something like a Spearman Rank-Order Correlation calculated comparing the Actual Finish Positions with the Predicted Finish Positions for each player,
and then rank players on the basis of the correlation coefficients for their rankings?
Thanks for any help in clarifying how to conceptualize this before approaching a programmer who gets annoyed when I don't really know what I want him to do ahead of time.
It's a quite interesting problem.
It seems that there are three components that need to be considered in the scoring: the number of correct predictions, the order of correct predictions, and the weight of correct predictions.
For example, assume the truth is:
1,5,10,15,20
Here are some predictions:
1,6,7,8,9 : only predicted first one
2,1,10,21,30 : 1 and 10, but the order of 1 is incorrect
20,15,1,5,30 : hit four in the top 5, but the orders are incorrect
It depends on what you value most. You may first check how many in the top 5 the user has predicted, add a value, and then penalize wrong orders. The weight for each position should also be different, this way
1,5,10,15,20 will rank higher than 1,5,10,20,15 and higher than 1,10,5,20,15
Spearman may be working, but I feel it could be too coarse for your purpose.
This is actually a very similar problem that search engines have. EG, in search engine evaluation, the actual outcomes are preferred results provided by humans, and the predicted outcomes are the results delivered by the search engine. In both your task and for search engines, I'd guess you care a lot more about the accuracy of the winner than the accuracy of the 5th place finisher. If that is the case, then the mean average precision is probably a good measure.

Feedback on ranking algorithm options for my website

I am currently working on writing an algorithm for my new site I plan to launch soon. The index page will display the "hottest" posts at the moment.
Variables to consider are:
Number of votes
How controversial the post is (# between 0-1)
Time since post
I have come up with two possible algorithms, the first and most simple is:
controversial * (numVotesThisHour / (numVotesTotal - numVotesThisHour)
Denom = numVotesTuisHour if numVotesTotal - numVotesThisHour == 0
Highest number is hottest
My other option is to use an algorithm similar to Reddit's (except that the score decreases as time goes by):
[controversial * log(x)] - (TimePassed / interval)
x = { numVotesTotal if numVotesTotal >= 10, 10 if numVotesTotal < 10
Highest number is hottest
The first algorithm would allow older posts to become "hot" again in the future while the second one wouldn't.
So my question is, which one of these two algorithms do you think is more effective? Which one do you think will display the truly "hot" topics at the moment? Can you think of any advantages or disadvantages to using one over the other? I just want to make sure I don't overlook anything so that I can ensure the content is as relevant as possible. Any feedback would be great! Thanks!
Am I missing something. In the first formula you have numVotesTotal in the denominator. So higher number of votes all time will mean it will never be so hot even if it is not so old.
For example if I have two posts - P1 and P2 (both equally controversial). Say P1 has numVotesTotal = 20, and P2 has numVotesTotal = 1000. Now in the last one hour P1 gets numVotesThisHour = 10 and P2 gets numVotesThisHour = 200.
According to the algorithm, P1 is more famous than P2. It doesn't make sense to me.
I think the first algorithm relies too heavily on instantaneous trend. Think of NASCAR, the current leader could be going 0 m.p.h. because he's at a pit stop. The second one uses the notion of average trend. I think both have their uses.
So for two posts with the same total votes and controversial rating, but where posts one receives 20 votes in the first hour and zero in the second, while the other receives 10 in each hour. The first post will be buried by the first algorithm but the second algorithm will rank them equally.
YMMV, but I think the 'hotness' is entirely dependent on the time frame, and not at all on the total votes unless your time frame is 'all time'. Also, it seems to me that the proportion of all votes in the relevant time frame, rather than the absolute number of them, is the important figure.
You might have several categories of hot:
Hottest this hour
Hottest this week
Hottest since your last visit
Hottest all time
So, 'Hottest in the last [whatever]' could be calculated like this:
votes_for_topic_in_timeframe / all_votes_in_timeframe
if you especially want a number between 0 and 1, (useful for comparing across categories) or, if you only want the ones in a specific timeframe, just take the votes_for_topic_in_timeframe values and sort into descending order.
If you don't want the user explicitly choosing the time frame, you may want to calculate all (say) four versions (or perhaps just the top 3), assign a multiplier to each category to give each category a relative importance, and calculate total values for each topic to take the top n. This has the advantage of potentially hiding from the user that no-one at all has voted in the last hour ;)

What is the correct way to write a sweepstakes algorithm?

For example, if I wanted to ensure that I had one winner every four hours, and I expected to have 125 plays per hour, what is the best way to provide for the highest chance of having a winner and the lowest chance of having no winners at the end of the four hour period?
The gameplay is like a slot-machine, not a daily number. i.e. the entrant enters the game and gets notified right away if they have won or lost.
Sounds like a homework problem, I know, but it's not :)
Thanks.
There's really only so much you can do to keep things fair (i.e. someone who enters at the beginning of a four hour period has the same odds of winning as someone who enters at the end) if you want to enforce this constraint. In general, the best you can do while remaining legal is to take a guess at how many entrants you're going to have and set the probability accordingly (and if there's no winner at the end of a given period, give it to a random entrant from that period).
Here's what I'd do to adjust your sweepstakes probability as you go (setting aside the legal ramifications of doing so):
For each period, start the probability at 1 / (number of expected entries * 2)
At any time, if you get a winner, the probability goes to 0 for the rest of that period.
Every thirty minutes, if you're still without a winner, set the probability at 1 / ((number of expected entries * (1 - percentage of period complete)) * 2). So here, the percentage of period complete is the number of hours elapsed in that current period / number of total hours in the period (4). Basically as you go, the probability will scale upwards.
Practical example: expected entries is 200.
Starting probability = 1 / 400 = 0.0025.
After first half hour, we don't have a winner, so we reevaluate probability:
probability = 1 / ((200 * (1 - 0.125) * 2) = 1 / (200 * 2 * 0.875) = 1/350
This goes down all the way until the probability is a maximum of 1/50, assuming no winner occurs before then.
You can adjust these parameters if you want to maximize the acceleration or whatever. But I'd be remiss if I didn't emphasize that I don't believe running a sweepstakes like this is legal. I've done a few sweepstakes for my company and am somewhat familiar with the various laws and regulations, and the general rule of thumb, as I understand it, is that no one entrant should have an advantage over any other entrant that the other entrant doesn't know about. I'm no expert, but consult with your lawyer before running a sweepstakes like this. That said, the solution above will help you maximize your odds of giving away a prize.
If you're wanting a winner for every drawing, you'd simply pick a random winner from your entrants.
If you're doing it like a lottery, where you don't have to have a winner for every drawing, the odds are as high or low as you care to make them based on your selection scheme. For instance, if you have 125 entries per hour, and you're picking every four hours, that's 500 entries per contest. If you generate a random number between 1 and 1000, there's a 50% chance that someone will win, 1 and 750 is a 75% chance that someone will win, and so forth. Then you just pick the entry that corresponds to the random number generated.
There's a million different ways to implement selecting a winner, in the end you just need to pick one and use it consistently.

How to rank stories based on "controversy"?

I'd like to rank my stories based on "controversy" quotient. For example, reddit.com currently has "controversial" section: http://www.reddit.com/controversial/
When a story has a lot of up and a lot of down votes, it's controversial even though the total score is 0 (for example). How should I calculate this quotient score so that when there's a lot of people voting up and down, I can capture this somehow.
Thanks!!!
Nick
I would recommend using the standard deviation of the votes.
A controversial vote that's 100% polarised would have equal numbers of -1 and +1 votes, so the mean would be 0 and the stddev would be around 1.0
Conversely a completely consistent set of votes (with no votes in the opposite direction) would have a mean of 1 or -1 and a stddev of 0.0.
Votes that aren't either completely consistent or completely polarised will produce a standard deviation figure between 0 and ~1.0 where that value will indicate the degree of controversy in the vote.
The easiest method is to count the number of upvote/downvote pairings for a given comment within the timeframe (e.g. 1 week, 48 hours etc), and have comments with the most parings appear first. Anything more complex requires trial-and-error or experimentation on the best algorithm - as always, it varies on the content of the site and how you want it weighted.
Overall, it's not much different than a hotness algorithm, which works by detecting the most upvotes or views within a timeframe.
What about simply getting the smaller of the two values (up or down) of a point in time? If it goes up a lot and goes down a little, or the other way around it, is not controversial.
If for example the items has 10 ups and 5 downs, the "controversiality level" is 5, since there is 5 people disagreeing about liking it or not. On the other hand if it has either 10 ups or 10 downs, the "controversiality level" is 0, since no one is disagreeing.
So in the end the smaller of both items in this case defines the "hotness" or the "controversiality". Does this make sense?
// figure out if up or down is winning - doesn't matter which
if (up_votes > down_votes)
{
win_votes = up_votes;
lose_votes = down_votes;
}
else
{
win_votes = down_votes;
lose_votes = up_votes;
}
// losewin_ratio is always <= 1, near 0 if win_votes >> lose_votes
losewin_ratio = lose_votes / win_votes;
total_votes = up_votes + down_votes;
controversy_score = total_votes * losewin_ratio; // large means controversial
This formula will produce high scores for stories that have a lot of votes and a near 50/50 voting split, and low scores for stories that have either few votes or many votes for one choice.

Resources