How to choose matchups in an ELO ratings system as matchups accumulate

How to choose matchups in an ELO ratings system as matchups accumulate - sorting

I'm working on a crowdsourced app that will pit about 64 fictional strongmen/strongwomen from different franchises against one another and try and determine who the strongest is. (Think "Batman vs. Spiderman" writ large). Users will choose the winner of any given matchup between two at a time.
After researching many sorting algorithms, I found this fantastic SO post outlining the ELO rating system, which seems absolutely perfect. I've read up on the system and understand both how to award/subtract points in a matchup and how to calculate the performance rating between any two characters based on past results.
What I can't seem to find is any efficient and sensible way to determine which two characters to pit against one another at a given time. Naturally it will start off randomly, but quickly points will accumulate or degrade. We can expect a lot of disagreement but also, if I design this correctly, a large amount of user participation.
So imagine you arrive at this feature after 50,000 votes have been cast. Given that we can expect all sorts of non-transitive results under the hood, and a fair amount of deviance from the performance ratings, is there a way to calculate which matchups I most need more data on? It doesn't seem as simple as choosing two adjacent characters in a sorted list with the closest scores, or just focusing at the top of the list.
With 64 entrants (and yes, I did consider and reject a bracket!), I'm not worried about recomputing the performance ratings after every matchup. I just don't know how to choose the next one, seeing as we'll be ignorant of each voter's biases and favorite characters.

The amazing variation that you experience with multiplayer games is that different people with different ratings "queue up" at different times.
By the ELO system, ideally all players should be matched up with an available player with the closest score to them. Since, if I understand correctly, the 64 "players" in your game are always available, this combination leads to lack of variety, as optimal match ups will always be, well, optimal.
To resolve this, I suggest implementing a priority queue, based on when your "players" feel like playing again. For example, if one wants to take a long break, they may receive a low priority and be placed towards the end of the queue, meaning it will be a while before you see them again. If one wants to take a short break, maybe after about 10 matches, you'll see them in a match again.
This "desire" can be done randomly, and you can assign different characteristics to each character to skew this behaviour, such as, "winning against a higher ELO player will make it more likely that this player will play again sooner". From a game design perspective, these personalities would make the characters seem more interesting to me, making me want to stick around.
So here you have an ordered list of players who want to play. I can think of three approaches you might take for the actual matchmaking:
Peek at the first 5 players in the queue and pick the best match up
Match the first player with their best match in the next 4 players in the queue (presumably waited the longest so should be queued immediately, regardless of the fairness of the match up)
A combination of both, where if the person at the head of the list doesn't get picked, they'll increase in "entropy", which affects the ELO calculation making them more likely to get matched up
Edit
On an implementation perspective, I'd recommend using a delta list instead of an actual priority queue since players should be "promoted" as they wait.

To avoid obvious winner vs looser situation you group the players in tiers.
Obviously, initially everybody will be in the same tier [0 - N1].
Then within the tier you make a rotational schedule so each two parties can "match" at least once.
However if you don't want to maintain schedule ...then always match with the party who participated in the least amount of "matches". If there are multiple of those make a random pick.
This way you ensure that everybody participates fairly the same amount of "matches".

Related

Prevent Genetic algorithm in zero-sum game from cooperating

I have a specific game, which is not literally zero-sum, because points are awarded by the game during a match, but close to it, in the sense that the number of total points have a clear upper limit, so the more points you score, the less points are available for your opponents.
The game is played by 5 players, with no teams whatsoever.
I'm making a genetic algorithm play rounds against itself with pseudo-random "mutations" between generations.
But after a couple hundred generations, a pattern always emerges. The algorithm ends up strongly favoring a specific player (example: the player who plays first). Since the mutations giving the "best results" serve as a base for the next generation, this seems to move towards some version of "If you are the first player, play this way (the way being a very specific yet pretty random technique that gives bad, or at best average, results), and if not, then play in this specific way that indirectly but strongly favors the first player".
Then, for the next generations, the player whose turn is strongly favored starts mutating totally randomly because it wins every round no matter what it does, as long as the part of the algorithm that favors that player is still intact.
I'm looking for a way to prevent this specific evolution route, but I can't figure out how to possibly "reward" victory by your own strategy more than victory because you were helped a lot.

I think this happens because only the winner of the round robbin tournament gets promoted and mutated on each generation. At first players more or less win randomly, but then a strategy comes up that favors a position. Now I guess that slightly diverting from that strategy (pseudo-random mutations) makes you only lose the games where you are in the favoured position but not win any of the others, so you will never divert from that strategy, something like a local Nash equilibrium.
You could try to keep more than one individual per generation and generate mutations from them. But I doubt this will help and at best delay the effect. Because soon the code of the best individual will spread to all. This seems to be the root cause of the problem.
Therefore my suggestion would be to have t tribes where each tribe has x/t individuals. Now instead of playing a round robbin tournament each individual plays only against the individuals of other tribes. Then you keep the best individual per tribe, mutate and proceed with the next generation. So that the tribes never mix genes.

To me, it seems like there is an easy fix: play multiple games each evaluation.
Instead of each generation only testing one game, strongly favouring the starting player, play 5 games and distribute who starts first equally ( so every player starts first at least once ).
I suppose your population is larger than 5, right? So how are you testing the genomes against each other? You should definitely not have let them play only one game, because maybe you have paired up a medium player against 4 easy players, making it seem like the medium player is better.

Assign people to monthly groups without repeating

I have a problem for which I'm not sure there is a simple answer...
I have some number of captains (right now it's 11, but could change) that organize monthly meetings of members of an organization. The members of the organization are categorized into 1 of ~14 networks.
I'm tasked with writing a scheduler of sorts that fits members into groups with 1 captain such that:
No 2 members of the same network is ever in a group
No 2 members participate in the same meeting for 3 months.
I need to minimize the number of unassigned members, as they will be assigned manually as the admin sees fit.
Right now I'm trying a randomized brute force method that essentially generates the groups randomly, checks for repeats over last month and the month prior, and kicks out the bad matrix if it finds one and regenerates. I made it iterate over several thousand times and it's pretty much impossible for the randomized brute force method to work in any sort of reasonable amount of time.
I'd like some sort of algorithm that could fail gracefully in that if it's mathematically impossible for the rules to be followed 100%, it still generates a matrix with as few "errors" as possible.
I've look at the Round-Robin Algorithm, but it appears that's only good for groups of 2.
I think the thing that makes this more difficult is there's a small number of participants (67) compared to the number of captains and networks.
I'm not necessarily asking for code to do this, but more the types of algorithms that could handle something like this.

A way to do ranked match making in an ELO system

I'm trying to create a ranked match making algorithm for my iOS game, and I'm not sure how to start.
The game is very chess-like, thus the ranking system is done with ELO. I am able to receive a (potentially large) list of all queued users in string form, but that is all that I have. I've been concatenating the elo+timestamp of the user as he/she joins the queue (this creates like 5-6 extra steps of parsing to sort the array + objects). Another option is to store custom user data which I thought was pointless since I cannot know the elo of the player until I request the pull on that piece of information on that user. Since sorting the entire list is probably out of the option based on the length of this list, I've thinking about this method:
In an elo system, if I just get a random (100) amount of users from the list... I assume that the range of elo will be a portion around the median of the ELO distribution. This may be fine for players with ELO near the median, but for higher or lower ELO players this sub array would have to be bigger and thus bringing me back to my original problem.
My question: Is there any documented methods of random match making? I'm really just asking for some well known ways to do this since I cannot come up with a method that sounds feasible.
This seems like an issue that involves a lot of graph theory and research etc so it's probably above me to figure out from scratch. I tried to do very thorough google searches but it seems everyone just wants to rage at LOL's matchmaking :(

Player rating for game with random teams

I am working on an algorithm to score individual players in a team-based game. The problem is that no fixed teams exist - every time 10 players want to play, they are divided into two (somewhat) even teams and play each other. For this reason, it makes no sense to score the teams, and instead we need to rely on individual player ratings.
There are a number of problems that I wish to take into account:
New players need some sort of provisional ranking to reach their "real" rating, before their rating counts the same as seasoned players.
The system needs to take into account that a team may consist of a mix of player skill levels - eg. one really good, one good, two mediocre, and one really poor. Therefore a simple "average" of player ratings probably won't suffice and it probably needs to be weighted in some way.
Ratings are adjusted after every game and as such the algorithm needs to be based on a per-game basis, not per "rating period". This might change if a good solution comes up (I am aware that Glicko uses a rating period).
Note that cheating is not an issue for this algorithm, since we have other measures of validating players.
I have looked at TrueSkill, Glicko and ELO (which is what we're currently using). I like the idea of TrueSkill/Glicko where you have a deviation that is used to determine how precise a rating is, but none of the algorithms take the random teams perspective into account and seem to be mostly based on 1v1 or FFA games.
It was suggested somewhere that you rate players as if each player from the winning team had beaten all the players on the losing team (25 "duels"), but I am unsure if that is the right approach, since it might wildly inflate the rating when a really poor player is on the winning team and gets a win vs. a very good player on the losing team.
Any and all suggestions are welcome!
EDIT: I am looking for an algorithm for established players + some way to rank newbies, not the two combined. Sorry for the confusion.
There is no AI and players only play each other. Games are determined by win/loss (there is no draw).

Provisional ranking systems are always imperfect, but the better ones (such as Elo) are designed to adjust provisional ratings more quickly than for ratings of established players. This acknowledges that trying to establish an ability rating off of just a few games with other players will inherently be error-prone.
I think you should use the average rating of all players on the opposing team as the input for establishing the provisional rating of the novice player, but handle it as just one game, not as N games vs. N players. Each game is really just one data sample, and the Elo system handles accumulation of these games to improve the ranking estimate for an individual player over time before switching over to the normal ranking system.
For simplicity, I would also not distinguish between established and provisional ratings for members of the opposing team when calculating a new provision rating for some member of the other team (unless Elo requires this). All of these ratings have implied error, so there is no point in adding unnecessary complications of probably little value in improving ranking estimates.

First off: It is very very unlikely that you will find a perfect system. Every system will have a flaw somewhere.
And to answer your question: Perhaps the ideas here will help: Lehman Rating on OkBridge.
This rating system is in use (since 1993!) on the internet bridge site called OKBridge. Bridge is a partnership game and is usually played with a team of 2 opposing another team of 2. The rating system was devised to rate the individual players and caters to the fact that many people play with different partners.

Without any background in this area, it seems to me a ranking systems is basically a statistical model. A good model will converge to a consistent ranking over time, and the goal would be to converge as quickly as possible. Several thoughts occur to me, several of which have been touched upon in other postings:
Clearly, established players have a track record and new players don't. So the uncertainty is probably greater for new players, although for inconsistent players it could be very high. Also, this probably depends on whether the game primarily uses innate skills or acquired skills. I would think that you would want a "variance" parameter for each player. The variance could be made up of two parts: a true variance and a "temperature". The temperature is like in simulated annealing, where you have a temperature that cools over time. Presumably, the temperature would cool to zero after enough games have been played.
Are there multiple aspects that come in to play? Like in soccer, you may have good shooters, good passers, guys who have good ball control, etc. Basically, these would be the degrees of freedom in you system (in my soccer analogy, they may or may not be truly independent). It seems like an accurate model would take these into account, of course you could have a black box model that implicitly handles these. However, I would expect understanding the number of degrees of freedom in you system would be helpful in choosing the black box.
How do you divide teams? Your teaming algorithm implies a model of what makes equal teams. Maybe you could use this model to create a weighting for each player and/or an expected performance level. If there are different aspects of player skills, maybe you could give extra points for players whose performance in one aspect is significantly better than expected.
Is the game truly win or lose, or could the score differential come in to play? Since you said no ties this probably doesn't apply, but at the very least a close score may imply a higher uncertainty in the outcome.
If you're creating a model from scratch, I would design with the intent to change. At a minimum, I would expect there may be a number of parameters that would be tunable, and might even be auto tuning. For example, as you have more players and more games, the initial temperature and initial ratings values will be better known (assuming you are tracking the statistics). But I would certainly anticipate that the more games have been played the better the model you could build.
Just a bunch of random thoughts, but it sounds like a fun problem.

There was an article in Game Developer Magazine a few years back by some guys from the TrueSkill team at Microsoft, explaining some of their reasoning behind the decisions there. It definitely mentioned teams games for Xbox Live, so it should be at least somewhat relevant. I don't have a direct link to the article, but you can order the back issue here: http://www.gdmag.com/archive/oct06.htm
One specific point that I remember from the article was scoring the team as a whole, instead of e.g. giving more points to the player that got the most kills. That was to encourage people to help the team win instead of just trying to maximize their own score.
I believe there was also some discussion on tweaking the parameters to try to accelerate convergence to an accurate evaluation of the player skill, which sounds like what you're interested in.
Hope that helps...

how is the 'scoring' settled?,
if a team would score 25 points in total (scores of all players in the team) you could divide the players score by the total team score * 100 to get the percentage of how much that player did for the team (or all points with both teams).
You could calculate a score with this data,
and if the percentage is lower than i.e 90% of the team members (or members of both teams):
treat the player as a novice and calculate the score with a different weighing factor.
sometimes an easier concept works out better.

The first question has a very 'gamey' solution. you can either create a newbie lobby for the first couple of games where the players can't see their score yet until they finish a certain amount of games that give you enough data for accurate rating.
Another option is a variation on the first but simpler-give them a single match vs AI that will be used to determine beginning score (look at quake live for an example).

For anyone who stumbles in here years after it was posted: TrueSkill now supports teams made up of multiple players and changing configurations.

Every time 10 players want to play,
they are divided into two (somewhat)
even teams and play each other.
This is interesting, as it implies both that the average skill level on each team is equal (and thus unimportant) and that each team has an equal chance of winning. If you assume this constraint to hold true, a simple count of wins vs losses for each individual player should be as good a measure as any.

How to rank a million images with a crowdsourced sort

I'd like to rank a collection of landscape images by making a game whereby site visitors can rate them, in order to find out which images people find the most appealing.
What would be a good method of doing that?
Hot-or-Not style? I.e. show a single image, ask the user to rank it from 1-10. As I see it, this allows me to average the scores, and I would just need to ensure that I get an even distribution of votes across all the images. Fairly simple to implement.
Pick A-or-B? I.e. show two images, ask user to pick the better one. This is appealing as there is no numerical ranking, it's just a comparison. But how would I implement it? My first thought was to do it as a quicksort, with the comparison operations being provided by humans, and once completed, simply repeat the sort ad-infinitum.
How would you do it?
If you need numbers, I'm talking about one million images, on a site with 20,000 daily visits. I'd imagine a small proportion might play the game, for the sake of argument, lets say I can generate 2,000 human sort operations a day! It's a non-profit website, and the terminally curious will find it through my profile :)

As others have said, ranking 1-10 does not work that well because people have different levels.
The problem with the Pick A-or-B method is that its not guaranteed for the system to be transitive (A can beat B, but B beats C, and C beats A). Having nontransitive comparison operators breaks sorting algorithms. With quicksort, against this example, the letters not chosen as the pivot will be incorrectly ranked against each other.
At any given time, you want an absolute ranking of all the pictures (even if some/all of them are tied). You also want your ranking not to change unless someone votes.
I would use the Pick A-or-B (or tie) method, but determine ranking similar to the Elo ratings system which is used for rankings in 2 player games (originally chess):
The Elo player-rating
system compares players’ match records
against their opponents’ match records
and determines the probability of the
player winning the matchup. This
probability factor determines how many
points a players’ rating goes up or
down based on the results of each
match. When a player defeats an
opponent with a higher rating, the
player’s rating goes up more than if
he or she defeated a player with a
lower rating (since players should
defeat opponents who have lower
ratings).
The Elo System:
All new players start out with a base rating of 1600
WinProbability = 1/(10^(( Opponent’s Current Rating–Player’s Current Rating)/400) + 1)
ScoringPt = 1 point if they win the match, 0 if they lose, and 0.5 for a draw.
Player’s New Rating = Player’s Old Rating + (K-Value * (ScoringPt–Player’s Win Probability))
Replace "players" with pictures and you have a simple way of adjusting both pictures' rating based on a formula. You can then perform a ranking using those numeric scores. (K-Value here is the "Level" of the tournament. It's 8-16 for small local tournaments and 24-32 for larger invitationals/regionals. You can just use a constant like 20).
With this method, you only need to keep one number for each picture which is a lot less memory intensive than keeping the individual ranks of each picture to each other picture.
EDIT: Added a little more meat based on comments.

Most naive approaches to the problem have some serious issues. The worst is how bash.org and qdb.us displays quotes - users can vote a quote up (+1) or down (-1), and the list of best quotes is sorted by the total net score. This suffers from a horrible time bias - older quotes have accumulated huge numbers of positive votes via simple longevity even if they're only marginally humorous. This algorithm might make sense if jokes got funnier as they got older but - trust me - they don't.
There are various attempts to fix this - looking at the number of positive votes per time period, weighting more recent votes, implementing a decay system for older votes, calculating the ratio of positive to negative votes, etc. Most suffer from other flaws.
The best solution - I think - is the one that the websites The Funniest The Cutest, The Fairest, and Best Thing use - a modified Condorcet voting system:
The system gives each one a number based on, out of the things that it has faced, what percentage of them it usually beats. So each one gets the percentage score NumberOfThingsIBeat / (NumberOfThingsIBeat + NumberOfThingsThatBeatMe). Also, things are barred from the top list until they've been compared to a reasonable percentage of the set.
If there's a Condorcet winner in the set, this method will find it. Since that's unlikely, given the statistical nature, it finds the one that's the "closest" to being a Condorcet winner.
For more information on implementing such systems the Wikipedia page on Ranked Pairs should be helpful.
The algorithm requires people to compare two objects (your Pick-A-or-B option), but frankly, that's a good thing. I believe it's very well accepted in decision theory that humans are vastly better at comparing two objects than they are at abstract ranking. Millions of years of evolution make us good at picking the best apple off the tree, but terrible at deciding how closely the apple we picked hews to the true Platonic Form of appleness. (This is, by the way, why the Analytic Hierarchy Process is so nifty...but that's getting a bit off topic.)
One final point to make is that SO uses an algorithm to find the best answers which is very similar to bash.org's algorithm to find the best quote. It works well here, but fails terribly there - in large part because an old, highly rated, but now outdated answer here is likely to be edited. bash.org doesn't allow editing, and it's not clear how you'd even go about editing decade-old jokes about now-dated internet memes even if you could... In any case, my point is that the right algorithm usually depends on the details of your problem. :-)

I know this question is quite old but I thought I'd contribute
I'd look at the TrueSkill system developed at Microsoft Research. It's like ELO but has a much faster convergence time (looks exponential compared to linear), so you get more out of each vote. It is, however, more complex mathematically.
http://en.wikipedia.org/wiki/TrueSkill

I don't like the Hot-or-Not style. Different people would pick different numbers even if they all liked the image exactly the same. Also I hate rating things out of 10, I never know which number to choose.
Pick A-or-B is much simpler and funner. You get to see two images, and comparisons are made between the images on the site.

These equations from Wikipedia makes it simpler/more effective to calculate Elo ratings, the algorithm for images A and B would be simple:
Get Ne, mA, mB and ratings RA,RB from your database.
Calculate KA ,KB, QA, QB by using the number of comparisons performed (Ne) and the number of times that image was compared (m) and current ratings :
Calculate EA and EB.
Score the winner's S : the winner as 1, loser as 0, and if you have a draw as 0.5,
Calculate the new ratings for both using:
Update the new ratings RA,RB and counts mA,mB in the database.

You may want to go with a combination.
First phase:
Hot-or-not style (although I would go with a 3 option vote: Sucks, Meh/OK. Cool!)
Once you've sorted the set into the 3 buckets, then I would select two images from the same bucket and go with the "Which is nicer"
You could then use an English Soccer system of promotion and demotion to move the top few "Sucks" into the Meh/OK region, in order to refine the edge cases.

Ranking 1-10 won't work, everyone has different levels. Someone who always gives 3-7 ratings would have his rankings eclipsed by people who always give 1 or 10.
a-or-b is more workable.

Wow, I'm late in the game.
I like the ELO system very much so, but like Owen says it seems to me that you'd be slow building up any significant results.
I believe humans have much greater capacity than just comparing two images, but you want to keep interactions to the bare minimum.
So how about you show n images (n being any number you can visibly display on a screen, this may be 10, 20, 30 depending on user's preference maybe) and get them to pick which they think is best in that lot. Now back to ELO. You need to modify you ratings system, but keep the same spirit. You have in fact compared one image to n-1 others. So you do your ELO rating n-1 times, but you should divide the change of rating by n-1 to match (so that results with different values of n are coherent with one another).
You're done. You've now got the best of all worlds. A simple rating system working with many images in one click.

If you prefer using the Pick A or B strategy I would recommend this paper: http://research.microsoft.com/en-us/um/people/horvitz/crowd_pairwise.pdf
Chen, X., Bennett, P. N., Collins-Thompson, K., & Horvitz, E. (2013,
February). Pairwise ranking aggregation in a crowdsourced setting. In
Proceedings of the sixth ACM international conference on Web search
and data mining (pp. 193-202). ACM.
The paper tells about the Crowd-BT model which extends the famous Bradley-Terry pairwise comparison model into crowdsource setting. It also gives an adaptive learning algorithm to enhance the time and space efficiency of the model. You can find a Matlab implementation of the algorithm on Github (but I'm not sure if it works).

The defunct web site whatsbetter.com used an Elo style method. You can read about the method in their FAQ on the Internet Archive.

Pick A-or-B its the simplest and less prone to bias, however at each human interaction it gives you substantially less information. I think because of the bias reduction, Pick is superior and in the limit it provides you with the same information.
A very simple scoring scheme is to have a count for each picture. When someone gives a positive comparison increment the count, when someone gives a negative comparison, decrement the count.
Sorting a 1-million integer list is very quick and will take less than a second on a modern computer.
That said, the problem is rather ill-posed - It will take you 50 days to show each image only once.
I bet though you are more interested in the most highly ranked images? So, you probably want to bias your image retrieval by predicted rank - so you are more likely to show images that have already achieved a few positive comparisons. This way you will more quickly just start showing 'interesting' images.

I like the quick-sort option but I'd make a few tweeks:
Keep the "comparison" results in a DB and then average them.
Get more than one comparison per view by giving the user 4-6 images and having them sort them.
Select what images to display by running qsort and recording and trimming anything that you don't have enough data on. Then when you have enough items recorded, spit out a page.
The other fun option would be to use the crowd to teach a neural-net.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio