Player rating for game with random teams


I am working on an algorithm to score individual players in a team-based game. The problem is that no fixed teams exist - every time 10 players want to play, they are divided into two (somewhat) even teams and play each other. For this reason, it makes no sense to score the teams, and instead we need to rely on individual player ratings.
There are a number of problems that I wish to take into account:
New players need some sort of provisional ranking to reach their "real" rating, before their rating counts the same as seasoned players.
The system needs to take into account that a team may consist of a mix of player skill levels, e.g. one really good, one good, two mediocre, and one really poor. Therefore a simple "average" of player ratings probably won't suffice; it probably needs to be weighted in some way.
Ratings are adjusted after every game and as such the algorithm needs to be based on a per-game basis, not per "rating period". This might change if a good solution comes up (I am aware that Glicko uses a rating period).
Note that cheating is not an issue for this algorithm, since we have other measures of validating players.
I have looked at TrueSkill, Glicko and ELO (which is what we're currently using). I like the idea of TrueSkill/Glicko where you have a deviation that is used to determine how precise a rating is, but none of the algorithms take the random teams perspective into account and seem to be mostly based on 1v1 or FFA games.
It was suggested somewhere that you rate players as if each player from the winning team had beaten all the players on the losing team (25 "duels"), but I am unsure if that is the right approach, since it might wildly inflate the rating when a really poor player is on the winning team and gets a win vs. a very good player on the losing team.
Any and all suggestions are welcome!
EDIT: I am looking for an algorithm for established players + some way to rank newbies, not the two combined. Sorry for the confusion.
There is no AI and players only play each other. Games are determined by win/loss (there is no draw).

Provisional ranking systems are always imperfect, but the better ones (such as Elo) are designed to adjust provisional ratings more quickly than for ratings of established players. This acknowledges that trying to establish an ability rating off of just a few games with other players will inherently be error-prone.
I think you should use the average rating of all players on the opposing team as the input for establishing the provisional rating of the novice player, but handle it as just one game, not as N games vs. N players. Each game is really just one data sample, and the Elo system handles accumulation of these games to improve the ranking estimate for an individual player over time before switching over to the normal ranking system.
For simplicity, I would also not distinguish between established and provisional ratings for members of the opposing team when calculating a new provisional rating for some member of the other team (unless Elo requires this). All of these ratings have implied error, so there is no point in adding unnecessary complications of probably little value in improving ranking estimates.
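A minimal sketch of this idea, assuming a standard Elo expected-score formula: each player is rated once per game against the opposing team's average rating, and novices get a larger K-factor. The function names, the K values (20 and 40), and the 10-game provisional cutoff are illustrative assumptions, not values from the question or from any official Elo variant.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo win probability for `rating` against `opponent_rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update_team_game(winners, losers, games_played,
                     k_established=20, k_provisional=40, provisional_games=10):
    """Return a dict of new ratings after one team game.

    winners / losers: dict of player name -> current rating.
    games_played: dict of player name -> games completed before this one.
    Each player is rated once, against the opposing team's average rating.
    The K values and the provisional cutoff are assumptions for illustration.
    """
    avg_win = sum(winners.values()) / len(winners)
    avg_lose = sum(losers.values()) / len(losers)
    new_ratings = {}
    for team, opp_avg, score in ((winners, avg_lose, 1.0), (losers, avg_win, 0.0)):
        for name, rating in team.items():
            is_novice = games_played.get(name, 0) < provisional_games
            k = k_provisional if is_novice else k_established
            new_ratings[name] = rating + k * (score - expected_score(rating, opp_avg))
    return new_ratings
```

Because the whole game counts as one sample per player, a weak player on the winning team gains at most K points, rather than the sum of five separate "duel" gains.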

First off: It is very very unlikely that you will find a perfect system. Every system will have a flaw somewhere.
And to answer your question: Perhaps the ideas here will help: Lehman Rating on OkBridge.
This rating system is in use (since 1993!) on the internet bridge site called OKBridge. Bridge is a partnership game and is usually played with a team of 2 opposing another team of 2. The rating system was devised to rate the individual players and caters to the fact that many people play with different partners.

Without any background in this area, it seems to me a ranking system is basically a statistical model. A good model will converge to a consistent ranking over time, and the goal would be to converge as quickly as possible. Several thoughts occur to me, several of which have been touched upon in other postings:
Clearly, established players have a track record and new players don't. So the uncertainty is probably greater for new players, although for inconsistent players it could be very high. Also, this probably depends on whether the game primarily uses innate skills or acquired skills. I would think that you would want a "variance" parameter for each player. The variance could be made up of two parts: a true variance and a "temperature". The temperature is like in simulated annealing, where you have a temperature that cools over time. Presumably, the temperature would cool to zero after enough games have been played.
Are there multiple aspects that come into play? Like in soccer, you may have good shooters, good passers, guys who have good ball control, etc. Basically, these would be the degrees of freedom in your system (in my soccer analogy, they may or may not be truly independent). It seems like an accurate model would take these into account, although of course you could have a black box model that implicitly handles these. However, I would expect understanding the number of degrees of freedom in your system would be helpful in choosing the black box.
How do you divide teams? Your teaming algorithm implies a model of what makes equal teams. Maybe you could use this model to create a weighting for each player and/or an expected performance level. If there are different aspects of player skills, maybe you could give extra points for players whose performance in one aspect is significantly better than expected.
Is the game truly win or lose, or could the score differential come into play? Since you said no ties this probably doesn't apply, but at the very least a close score may imply a higher uncertainty in the outcome.
If you're creating a model from scratch, I would design with the intent to change. At a minimum, I would expect there may be a number of parameters that would be tunable, and might even be auto tuning. For example, as you have more players and more games, the initial temperature and initial ratings values will be better known (assuming you are tracking the statistics). But I would certainly anticipate that the more games have been played the better the model you could build.
Just a bunch of random thoughts, but it sounds like a fun problem.
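One way to picture the "temperature" idea is as an extra step size that decays as a player accumulates games. The sketch below is only an illustration under assumed numbers: `base_k`, `initial_temperature`, and `half_life` are hypothetical tunable parameters, and the exponential cooling schedule is one choice among many (it approaches zero rather than reaching it).

```python
def effective_k(games_played, base_k=16.0, initial_temperature=32.0, half_life=10.0):
    """Rating step size: a fixed part plus a 'temperature' that cools with experience.

    All default values are assumptions for illustration; they would need tuning
    against real game data.
    """
    # The temperature halves every `half_life` games.
    temperature = initial_temperature * 0.5 ** (games_played / half_life)
    return base_k + temperature
```

A brand-new player would then move by up to `base_k + initial_temperature` points per game, while a veteran moves by roughly `base_k`.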

There was an article in Game Developer Magazine a few years back by some guys from the TrueSkill team at Microsoft, explaining some of their reasoning behind the decisions there. It definitely mentioned team games for Xbox Live, so it should be at least somewhat relevant. I don't have a direct link to the article, but you can order the back issue here: http://www.gdmag.com/archive/oct06.htm
One specific point that I remember from the article was scoring the team as a whole, instead of e.g. giving more points to the player that got the most kills. That was to encourage people to help the team win instead of just trying to maximize their own score.
I believe there was also some discussion on tweaking the parameters to try to accelerate convergence to an accurate evaluation of the player skill, which sounds like what you're interested in.
Hope that helps...

How is the scoring settled?
If a team scores 25 points in total (the sum of the scores of all players on the team), you could divide each player's score by the total team score and multiply by 100 to get the percentage that player contributed to the team (or use the total points of both teams).
You could calculate a score with this data,
and if the percentage is lower than that of, e.g., 90% of the team members (or members of both teams),
treat the player as a novice and calculate the score with a different weighting factor.
Sometimes an easier concept works out better.

The first question has a very 'gamey' solution. You can create a newbie lobby for the first couple of games, where players can't see their score until they have finished enough games to give you the data for an accurate rating.
Another option is a simpler variation on the first: give them a single match vs. AI that will be used to determine their beginning score (look at Quake Live for an example).

For anyone who stumbles in here years after it was posted: TrueSkill now supports teams made up of multiple players and changing configurations.

Every time 10 players want to play,
they are divided into two (somewhat)
even teams and play each other.
This is interesting, as it implies both that the average skill level on each team is equal (and thus unimportant) and that each team has an equal chance of winning. If you assume this constraint to hold true, a simple count of wins vs losses for each individual player should be as good a measure as any.

Related

How to choose matchups in an ELO ratings system as matchups accumulate

I'm working on a crowdsourced app that will pit about 64 fictional strongmen/strongwomen from different franchises against one another and try and determine who the strongest is. (Think "Batman vs. Spiderman" writ large). Users will choose the winner of any given matchup between two at a time.
After researching many sorting algorithms, I found this fantastic SO post outlining the ELO rating system, which seems absolutely perfect. I've read up on the system and understand both how to award/subtract points in a matchup and how to calculate the performance rating between any two characters based on past results.
What I can't seem to find is any efficient and sensible way to determine which two characters to pit against one another at a given time. Naturally it will start off randomly, but quickly points will accumulate or degrade. We can expect a lot of disagreement but also, if I design this correctly, a large amount of user participation.
So imagine you arrive at this feature after 50,000 votes have been cast. Given that we can expect all sorts of non-transitive results under the hood, and a fair amount of deviance from the performance ratings, is there a way to calculate which matchups I most need more data on? It doesn't seem as simple as choosing two adjacent characters in a sorted list with the closest scores, or just focusing at the top of the list.
With 64 entrants (and yes, I did consider and reject a bracket!), I'm not worried about recomputing the performance ratings after every matchup. I just don't know how to choose the next one, seeing as we'll be ignorant of each voter's biases and favorite characters.

The amazing variation that you experience with multiplayer games is that different people with different ratings "queue up" at different times.
By the ELO system, ideally all players should be matched up with an available player with the closest score to them. Since, if I understand correctly, the 64 "players" in your game are always available, this combination leads to lack of variety, as optimal match ups will always be, well, optimal.
To resolve this, I suggest implementing a priority queue, based on when your "players" feel like playing again. For example, if one wants to take a long break, they may receive a low priority and be placed towards the end of the queue, meaning it will be a while before you see them again. If one wants to take a short break, maybe after about 10 matches, you'll see them in a match again.
This "desire" can be done randomly, and you can assign different characteristics to each character to skew this behaviour, such as, "winning against a higher ELO player will make it more likely that this player will play again sooner". From a game design perspective, these personalities would make the characters seem more interesting to me, making me want to stick around.
So here you have an ordered list of players who want to play. I can think of three approaches you might take for the actual matchmaking:
Peek at the first 5 players in the queue and pick the best match up
Match the first player with their best match in the next 4 players in the queue (presumably waited the longest so should be queued immediately, regardless of the fairness of the match up)
A combination of both, where if the person at the head of the list doesn't get picked, they'll increase in "entropy", which affects the ELO calculation making them more likely to get matched up
Edit
From an implementation perspective, I'd recommend using a delta list instead of an actual priority queue since players should be "promoted" as they wait.
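The first two matchmaking approaches listed above can be sketched roughly as follows. This assumes the queue is a plain list ordered by waiting time and that "best match" means the smallest rating gap; both function names are hypothetical.

```python
from itertools import combinations


def pick_match(queue, ratings, window=5):
    """Approach 1: among the first `window` queued players, return the closest-rated pair."""
    candidates = queue[:window]
    return min(combinations(candidates, 2),
               key=lambda pair: abs(ratings[pair[0]] - ratings[pair[1]]))


def pick_match_for_head(queue, ratings, window=4):
    """Approach 2: pair the head of the queue with the closest-rated of the next `window` players."""
    head, rest = queue[0], queue[1:1 + window]
    opponent = min(rest, key=lambda p: abs(ratings[p] - ratings[head]))
    return head, opponent
```

The first version favours fair matches; the second guarantees that whoever has waited longest always plays next.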

To avoid obvious winner vs. loser situations, group the players in tiers.
Obviously, initially everybody will be in the same tier [0 - N1].
Then within the tier you make a rotational schedule so that every two parties can "match" at least once.
However, if you don't want to maintain a schedule, always match with the party who has participated in the fewest "matches". If there are several of those, make a random pick.
This way you ensure that everybody participates in roughly the same number of "matches".
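A small sketch of the schedule-free variant, assuming match counts are kept in a dict for a single tier (tiering itself is not shown, and `next_pair` is a hypothetical name):

```python
import random


def next_pair(match_counts, rng=random):
    """Pick the two entrants with the fewest matches so far, breaking ties at random."""
    names = list(match_counts)
    rng.shuffle(names)                          # random order first ...
    names.sort(key=lambda n: match_counts[n])   # ... then a stable sort keeps ties shuffled
    return names[0], names[1]
```

After each vote you would increment the count of both entrants, so the least-seen characters keep bubbling to the front.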

Viable use of genetic algorithms to train neural nets in a poker bot?

I am designing a bot to play Texas Hold'Em Poker on tables of up to ten players, and the design includes a few feed forward neural networks (FFNN). These neural nets each have 8 to 12 inputs, 2 to 6 outputs, and 1 or 2 hidden layers, so there are a few hundred weights that I have to optimize. My main issue with training through back propagation is getting enough training data. I play poker in my spare time, but not enough to gather data on my own. I have looked into purchasing a few million hands off of a poker site, but I don't think my wallet will be very happy with me if I do... So, I have decided on approaching this by designing a genetic algorithm. I have seen examples of FFNNs being trained to play games like Super Mario and Tetris using genetic algorithms, but never for a game like poker, so I want to know if this is a viable approach to training my bot.
First, let me give a little background information (this may be confusing if you are unfamiliar with poker). I have a system in place that allows the bot to put its opponents on a specific range of hands so that it can make intelligent decisions accordingly, but it relies entirely on accurate output from three different neural networks:
NN_1) This determines how likely it is that an opponent is a) playing the actual value of his hand, b) bluffing, or c) playing a hand with the potential to become stronger later on.
NN_2) This assumes the opponent is playing the actual value of his hand and outputs the likely strength. It represents option (a) from the first neural net.
NN_3) This does the same thing as NN_2 but instead assumes the opponent is bluffing, representing option (b).
Then I have an algorithm for option (c) that does not use a FFNN. The outputs for (a), (b), and (c) are then combined based on the output from NN_1 to update my opponent's range.
Whenever the bot is faced with a decision (i.e. should it fold, call, or raise?), it calculates which is most profitable based on its opponents' hand ranges and how they are likely to respond to different bet sizes. This is where the fourth and final neural net comes in. It takes inputs based on properties unique to each player and the state of the table, and it outputs the likelihood of the opponent folding, calling, or raising.
The bot will also have a value for aggression (how likely it is to raise instead of call) and its opening range (which hands to play pre-flop). These four neural networks and two values will define each generation of bots in my genetic algorithm.
Here is my plan for training:
I will be simulating multiple large tournaments with 10n initial bots each with random values for everything. For the first few dozen tournaments, they will all be placed on tables of 10. They will play until either one bot is left or they play, say, 1,000 hands. If they reach that hand limit, the remaining bots will instantly go all-in every hand until one is left. After each table has completed, the most accurate FFNNs will be placed in the winning bot that will move on to the next round (even if the bot containing the best FFNN was not the winner). The winning bot will retain its aggression and opening range values. The tournament ends when only 100 bots remain, and random variations on those bots will generate the players for the next tournament. I'm assuming the first few tournaments will be complete chaos, so I don't want to narrow down my options too much early on.
If by some miracle, the bots actually develop a profitable, or at least somewhat coherent, strategy (I will check for this periodically), I will begin decreasing the amount of variation between bots. Anyone who plays poker could tell you that there are different types of players each with different strategies. I want to make sure that I am allowing enough room for different strategies to develop throughout this process. Then I may develop some sort of "super bot" that can switch between those different strategies if one is failing.
So, are there any glaring issues with this approach? If so, how would you recommend fixing them? Do you have any advice for speeding up this process or increasing my chances of success? I just want to make sure I'm not about to waste hundreds of hours on something doomed to fail. Also, if this site is not the correct place to be asking this question, please refer me to another website before flagging this. I would really appreciate it. Thanks all!

It will be difficult to use an ANN for a poker bot. It is better to think in terms of an expert system. You can use an odds calculator to get a numerical evaluation of hand strength and then an expert system for money management (risk management). ANNs are better suited to other problems.

Elo rating system without order of game played

I am looking for a rating system similar to the elo rating system in chess.
The problem I have is that the ELO system depends on the order games were played.
eg.
Player A starting Elo 1000
Player B starting Elo 1000
If Player B wins over A, he will have, let's say, 1015 points and A 985.
If A keeps on playing and wins against other people, he will have a higher ranking than B, if B stops playing.
I don't want that. B should still be stronger than A.
How can I realise that?

From this link:
Whole-History Rating (WHR) is a new method to estimate the
time-varying strengths of players involved in paired comparisons. Like
many variations of the Elo rating system, the whole-history approach
is based on the dynamic Bradley-Terry model. But, instead of using
incremental approximations, WHR directly computes the exact maximum a
posteriori over the whole rating history of all players.
It's a rating system without order of game played, like the one you asked, but it doesn't solve your issue with Elo.
On the other hand, many post-Elo ranking systems, such as Glicko (for chess), TrueSkill (for Xbox games), or rankade (our multipurpose ranking system), have some 'activity dynamics' feature to discourage the 'parking the bus' approach (a player reaches a high level in the ranking, then stops playing).

There are a number of schemes which amount to writing down the win/lose/draw record as a matrix and then typically calculating the largest eigenvalue of some matrix related to this. One summary is at http://java.dzone.com/articles/ranking-systems-what-ive, which points to more technical papers including https://umdrive.memphis.edu/ccrousse/public/MATH%207375/PERRON.pdf - "The Perron-Frobenius Theorem and the Ranking of Football Teams".
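As a rough illustration of the matrix idea, the sketch below builds a smoothed score matrix from a win table and finds its dominant eigenvector by power iteration. The smoothing `(wins + 1) / (games + 2)` is one common choice that keeps every off-diagonal entry positive; it is an assumption here, and `eigen_rank` is a hypothetical name. Note that the result depends only on the win totals, not on the order in which games were played.

```python
def eigen_rank(wins, iterations=200):
    """wins[i][j] = number of times player i beat player j.

    Returns a strength vector (summing to 1) approximating the dominant
    eigenvector of a smoothed score matrix. Intended for three or more players.
    """
    n = len(wins)
    # Entry (i, j) is i's smoothed share of the results against j; the diagonal is 0.
    a = [[0.0 if i == j else (wins[i][j] + 1) / (wins[i][j] + wins[j][i] + 2)
          for j in range(n)] for i in range(n)]
    r = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(a[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        r = [x / total for x in nxt]   # renormalise so the vector sums to 1
    return r
```

A player's strength is then proportional to the strengths of the opponents they scored against, which is what makes wins over strong players count for more.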
If you can get more information out of the game than just win/lose/draw you might do better by using this. Some work on soccer has used the number of goals for and against at each match to try and work out the strengths of each team's offense and defense separately (and I do realise that soccer doesn't have separate offensive and defensive teams). In soccer it is reasonable to model the number of goals scored as a Poisson process. One deduction from this, by the way, is that soccer is inherently a pretty uncertain game, and that predicting score draws, as required in some gambles, is especially uncertain. I try and remember the inevitable uncertainty every time England play a game :-).

Calculating scores from incomplete league tables

When I was in high school and learning about matrices, we were shown a technique that would help in a situation like this:
There are a number of chess players in a league, and they need to determine a ranking for all of them, but don't have enough time for every player to play every other person. If it ends up that Player A beats Player B, and Player B beats Player C, you can say with some level of certainty that Player A is better than Player C and therefore award some points to player A in lieu of them actually playing each other.
As I said, this was a little while ago and I can't remember how to actually perform the algorithm, but I think it was called something like a "domination matrix". Searching the web for that has been fruitless and scary at times, so I don't think that's right.
Can anyone give me some help? Ideally an algorithm I can use for this program I'm working on, but even just a pointer to some more information about the procedure.

It sounds like you are remembering a presentation of the Perron-Frobenius theorem - which is at least a safer search term :-). One such is at
http://www.math.utah.edu/~keener/lectures/rankings.pdf
Chess players use the Elo system, described at http://en.wikipedia.org/wiki/Elo_rating_system and http://www.chesselo.com/, which would be easier to implement. It is possible that there is no good ranking even if you know everything - see http://en.wikipedia.org/wiki/Nontransitive_dice. People modelling soccer games usually keep track of defensive and offensive strengths separately.

What it sounds like you are describing is a Swiss-system tournament or a very similar variation, all described in the linked Wikipedia entry. However, rather than calculating ratings from an incomplete tournament, it is a way to organize a tournament that pairs the best chess players with the best and the worst with the worst, to determine a ranking without the need for everyone to play everyone else.

Maybe some type of PageRank algorithm might work for you.
Imagine every person has a webpage in which they hyperlink to every person who defeated them.
Running the PageRank algorithm on this data would give you the steady state of your link matrix, which might indicate the relative importance of each person (I guess).
For example a person who played only one game but, in that, defeated someone who defeated lots of people might have a higher page rank than somebody who defeated 10 people who in turn have not won a single game.
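A minimal sketch of that idea, assuming results are given as (winner, loser) pairs and using the usual damping factor of 0.85. Undefeated players have no outgoing links, so this version spreads their rank evenly over everyone, which is one common convention and an assumption here.

```python
def pagerank(defeats, damping=0.85, iterations=100):
    """defeats: list of (winner, loser) pairs; each loser 'links' to whoever beat them.

    Returns a dict of player -> rank, with ranks summing to 1.
    """
    players = sorted({p for pair in defeats for p in pair})
    out_links = {p: [] for p in players}
    for winner, loser in defeats:
        out_links[loser].append(winner)
    n = len(players)
    rank = {p: 1.0 / n for p in players}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / n for p in players}
        for p in players:
            # Undefeated players have no links: spread their rank over all players.
            targets = out_links[p] or players
            share = damping * rank[p] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank
```

With this convention, a single win over a player who has beaten many others can indeed outweigh a long list of wins over players who never won a game.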

perhaps the min-max algorithm ?

How to rank a million images with a crowdsourced sort

I'd like to rank a collection of landscape images by making a game whereby site visitors can rate them, in order to find out which images people find the most appealing.
What would be a good method of doing that?
Hot-or-Not style? I.e. show a single image, ask the user to rank it from 1-10. As I see it, this allows me to average the scores, and I would just need to ensure that I get an even distribution of votes across all the images. Fairly simple to implement.
Pick A-or-B? I.e. show two images, ask user to pick the better one. This is appealing as there is no numerical ranking, it's just a comparison. But how would I implement it? My first thought was to do it as a quicksort, with the comparison operations being provided by humans, and once completed, simply repeat the sort ad-infinitum.
How would you do it?
If you need numbers, I'm talking about one million images, on a site with 20,000 daily visits. I'd imagine a small proportion might play the game; for the sake of argument, let's say I can generate 2,000 human sort operations a day! It's a non-profit website, and the terminally curious will find it through my profile :)

As others have said, ranking 1-10 does not work that well because people have different levels.
The problem with the Pick A-or-B method is that the results are not guaranteed to be transitive (A can beat B, B can beat C, and C can beat A). Having nontransitive comparison operators breaks sorting algorithms. With quicksort, against this example, the letters not chosen as the pivot will be incorrectly ranked against each other.
At any given time, you want an absolute ranking of all the pictures (even if some/all of them are tied). You also want your ranking not to change unless someone votes.
I would use the Pick A-or-B (or tie) method, but determine ranking similar to the Elo ratings system which is used for rankings in 2 player games (originally chess):
The Elo player-rating
system compares players’ match records
against their opponents’ match records
and determines the probability of the
player winning the matchup. This
probability factor determines how many
points a players’ rating goes up or
down based on the results of each
match. When a player defeats an
opponent with a higher rating, the
player’s rating goes up more than if
he or she defeated a player with a
lower rating (since players should
defeat opponents who have lower
ratings).
The Elo System:
All new players start out with a base rating of 1600
WinProbability = 1/(10^(( Opponent’s Current Rating–Player’s Current Rating)/400) + 1)
ScoringPt = 1 point if they win the match, 0 if they lose, and 0.5 for a draw.
Player’s New Rating = Player’s Old Rating + (K-Value * (ScoringPt–Player’s Win Probability))
Replace "players" with pictures and you have a simple way of adjusting both pictures' rating based on a formula. You can then perform a ranking using those numeric scores. (K-Value here is the "Level" of the tournament. It's 8-16 for small local tournaments and 24-32 for larger invitationals/regionals. You can just use a constant like 20).
With this method, you only need to keep one number for each picture which is a lot less memory intensive than keeping the individual ranks of each picture to each other picture.
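The formulas above translate almost directly into code. This is a sketch using the constant K of 20 suggested above; the function names are my own.

```python
def win_probability(player_rating, opponent_rating):
    """WinProbability from the formula above."""
    return 1.0 / (10 ** ((opponent_rating - player_rating) / 400.0) + 1.0)


def update_ratings(rating_a, rating_b, score_a, k=20):
    """score_a: 1 if A wins, 0 if A loses, 0.5 for a draw. Returns (new_a, new_b)."""
    expected_a = win_probability(rating_a, rating_b)
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b
```

For two pictures both at the base rating of 1600, a win moves the winner to 1610 and the loser to 1590; the total number of points in the system stays constant.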
EDIT: Added a little more meat based on comments.

Most naive approaches to the problem have some serious issues. The worst is how bash.org and qdb.us display quotes - users can vote a quote up (+1) or down (-1), and the list of best quotes is sorted by the total net score. This suffers from a horrible time bias - older quotes have accumulated huge numbers of positive votes via simple longevity even if they're only marginally humorous. This algorithm might make sense if jokes got funnier as they got older but - trust me - they don't.
There are various attempts to fix this - looking at the number of positive votes per time period, weighting more recent votes, implementing a decay system for older votes, calculating the ratio of positive to negative votes, etc. Most suffer from other flaws.
The best solution - I think - is the one that the websites The Funniest, The Cutest, The Fairest, and Best Thing use - a modified Condorcet voting system:
The system gives each one a number based on, out of the things that it has faced, what percentage of them it usually beats. So each one gets the percentage score NumberOfThingsIBeat / (NumberOfThingsIBeat + NumberOfThingsThatBeatMe). Also, things are barred from the top list until they've been compared to a reasonable percentage of the set.
If there's a Condorcet winner in the set, this method will find it. Since that's unlikely, given the statistical nature, it finds the one that's the "closest" to being a Condorcet winner.
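A rough sketch of that scoring rule. It assumes "things I beat" means distinct opponents, and that an item qualifies for the top list once it has faced at least `min_coverage` of the other items; both are my interpretation, and the 0.5 default is an arbitrary placeholder.

```python
from collections import defaultdict


def top_list(results, total_items, min_coverage=0.5):
    """results: iterable of (winner, loser) pairs. Returns [(item, score)], best first."""
    beat = defaultdict(set)        # item -> distinct items it has beaten
    beaten_by = defaultdict(set)   # item -> distinct items that have beaten it
    for winner, loser in results:
        beat[winner].add(loser)
        beaten_by[loser].add(winner)
    scores = []
    for item in set(beat) | set(beaten_by):
        faced = beat[item] | beaten_by[item]
        if len(faced) < min_coverage * (total_items - 1):
            continue  # barred: not yet compared against enough of the set
        score = len(beat[item]) / (len(beat[item]) + len(beaten_by[item]))
        scores.append((item, score))
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

The coverage check is what stops an item with a single lucky win from jumping straight to the top with a 100% score.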
For more information on implementing such systems the Wikipedia page on Ranked Pairs should be helpful.
The algorithm requires people to compare two objects (your Pick-A-or-B option), but frankly, that's a good thing. I believe it's very well accepted in decision theory that humans are vastly better at comparing two objects than they are at abstract ranking. Millions of years of evolution make us good at picking the best apple off the tree, but terrible at deciding how closely the apple we picked hews to the true Platonic Form of appleness. (This is, by the way, why the Analytic Hierarchy Process is so nifty...but that's getting a bit off topic.)
One final point to make is that SO uses an algorithm to find the best answers which is very similar to bash.org's algorithm to find the best quote. It works well here, but fails terribly there - in large part because an old, highly rated, but now outdated answer here is likely to be edited. bash.org doesn't allow editing, and it's not clear how you'd even go about editing decade-old jokes about now-dated internet memes even if you could... In any case, my point is that the right algorithm usually depends on the details of your problem. :-)

I know this question is quite old but I thought I'd contribute
I'd look at the TrueSkill system developed at Microsoft Research. It's like ELO but has a much faster convergence time (looks exponential compared to linear), so you get more out of each vote. It is, however, more complex mathematically.
http://en.wikipedia.org/wiki/TrueSkill

I don't like the Hot-or-Not style. Different people would pick different numbers even if they all liked the image exactly the same. Also I hate rating things out of 10, I never know which number to choose.
Pick A-or-B is much simpler and funner. You get to see two images, and comparisons are made between the images on the site.

These equations from Wikipedia make it simpler and more effective to calculate Elo ratings. The algorithm for images A and B would be simple:
Get Ne, mA, mB and ratings RA, RB from your database.
Calculate KA, KB from the number of comparisons performed (Ne) and the number of times that image was compared (m), and QA, QB from the current ratings: QA = 10^(RA/400), QB = 10^(RB/400).
Calculate EA = QA/(QA + QB) and EB = QB/(QA + QB).
Score the winner's S: the winner as 1, the loser as 0, and 0.5 each if you have a draw.
Calculate the new ratings for both using: new RA = RA + KA * (SA - EA), new RB = RB + KB * (SB - EB).
Update the new ratings RA, RB and counts mA, mB in the database.

You may want to go with a combination.
First phase:
Hot-or-not style (although I would go with a 3-option vote: Sucks, Meh/OK, Cool!)
Once you've sorted the set into the 3 buckets, then I would select two images from the same bucket and go with the "Which is nicer"
You could then use an English Soccer system of promotion and demotion to move the top few "Sucks" into the Meh/OK region, in order to refine the edge cases.

Ranking 1-10 won't work, everyone has different levels. Someone who always gives 3-7 ratings would have his rankings eclipsed by people who always give 1 or 10.
a-or-b is more workable.

Wow, I'm late in the game.
I like the ELO system very much so, but like Owen says it seems to me that you'd be slow building up any significant results.
I believe humans have much greater capacity than just comparing two images, but you want to keep interactions to the bare minimum.
So how about you show n images (n being any number you can visibly display on a screen; this may be 10, 20, or 30, depending on the user's preference maybe) and get them to pick which they think is best in that lot. Now back to ELO. You need to modify your rating system, but keep the same spirit. You have in fact compared one image to n-1 others. So you do your ELO rating n-1 times, but you should divide the change of rating by n-1 to match (so that results with different values of n are coherent with one another).
You're done. You've now got the best of all worlds. A simple rating system working with many images in one click.
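Here is roughly how that could look, assuming the standard Elo expected-score formula and a K of 20 (both assumptions; `update_pick_best` is a made-up name). Every pairing uses the ratings from before the vote, so the order of the other images doesn't matter.

```python
def expected(r_a, r_b):
    """Standard Elo probability that the image rated r_a beats the one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_pick_best(ratings, shown, chosen, k=20):
    """Update `ratings` in place after a user picked `chosen` out of the list `shown`."""
    others = [img for img in shown if img != chosen]
    scale = k / len(others)              # divide each rating change by n - 1
    winner_rating = ratings[chosen]      # pre-vote rating, used for every pairing
    winner_delta = 0.0
    for img in others:
        gain = scale * (1.0 - expected(winner_rating, ratings[img]))
        winner_delta += gain
        ratings[img] -= gain             # the loser gives up what the winner gains
    ratings[chosen] += winner_delta
    return ratings
```

With n = 2 this reduces to an ordinary Elo update, which is the coherence property mentioned above.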

If you prefer using the Pick A or B strategy I would recommend this paper: http://research.microsoft.com/en-us/um/people/horvitz/crowd_pairwise.pdf
Chen, X., Bennett, P. N., Collins-Thompson, K., & Horvitz, E. (2013,
February). Pairwise ranking aggregation in a crowdsourced setting. In
Proceedings of the sixth ACM international conference on Web search
and data mining (pp. 193-202). ACM.
The paper tells about the Crowd-BT model which extends the famous Bradley-Terry pairwise comparison model into crowdsource setting. It also gives an adaptive learning algorithm to enhance the time and space efficiency of the model. You can find a Matlab implementation of the algorithm on Github (but I'm not sure if it works).

The defunct web site whatsbetter.com used an Elo style method. You can read about the method in their FAQ on the Internet Archive.

Pick A-or-B is the simplest and least prone to bias; however, each human interaction gives you substantially less information. I think that, because of the bias reduction, Pick A-or-B is superior, and in the limit it provides you with the same information.
A very simple scoring scheme is to have a count for each picture. When someone gives a positive comparison increment the count, when someone gives a negative comparison, decrement the count.
Sorting a 1-million integer list is very quick and will take less than a second on a modern computer.
That said, the problem is rather ill-posed - It will take you 50 days to show each image only once.
I bet though you are more interested in the most highly ranked images? So, you probably want to bias your image retrieval by predicted rank - so you are more likely to show images that have already achieved a few positive comparisons. This way you will more quickly just start showing 'interesting' images.
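A toy sketch of the counting scheme with biased retrieval. The weighting used here (count minus the lowest count, plus one) is just one arbitrary way to favour images with positive comparisons, and the linear scan would need something smarter for a million images; the function names are hypothetical.

```python
import random


def record_vote(counts, winner, loser):
    """Increment the preferred image's count and decrement the other's."""
    counts[winner] = counts.get(winner, 0) + 1
    counts[loser] = counts.get(loser, 0) - 1


def pick_images(counts, images, k=2, rng=random):
    """Sample k distinct images, favouring those with higher counts."""
    lowest = min(counts.get(img, 0) for img in images)
    pool = list(images)
    chosen = []
    for _ in range(k):
        # Shift weights so the lowest-scored image still has weight 1.
        weights = [counts.get(img, 0) - lowest + 1 for img in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen
```

Unseen images keep a non-zero weight, so they still get shown occasionally instead of being starved by early favourites.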

I like the quick-sort option but I'd make a few tweaks:
Keep the "comparison" results in a DB and then average them.
Get more than one comparison per view by giving the user 4-6 images and having them sort them.
Select what images to display by running qsort and recording and trimming anything that you don't have enough data on. Then when you have enough items recorded, spit out a page.
The other fun option would be to use the crowd to teach a neural-net.
