Ranked Feed Algorithm


I'm building a sports newsfeed for an app and I'd like it to be sorted by popularity as well as chronologically. I've implemented the sorting using the open-source reddit algorithm (my app has likes for each post in the newsfeed). So far I've tested it and it seems to be working well, but there's one main problem I've encountered: news about popular sports always shows up above news from other sports. Example: my app has 100,000 basketball fans and 1,000 soccer fans. A big soccer story comes out. It'll still have fewer likes than the regular daily basketball news. How can I resolve this issue? One possible solution I considered is feeding the reddit algorithm the % of all fans that liked a certain post.

I suggest that you normalize the percentage across your fan base. "Popularity" should measure not only percentage of up-votes, but relative percentage within the fan base.
For each article, count the up-votes. Next, convert this to a Z-score: how many standard deviations above/below the mean this article was rated, within the fan base for that sport. Use this in place of the quantity of votes.
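A minimal sketch of that Z-score step, assuming each post is a dict with hypothetical `id`, `sport` and `likes` fields (these names are not from the question). It only computes the normalized value; how it is plugged into the reddit ranking is left open.

```python
from statistics import mean, pstdev

def like_z_scores(posts):
    """Return {post_id: z-score of the like count within that post's sport}."""
    by_sport = {}
    for post in posts:
        by_sport.setdefault(post["sport"], []).append(post["likes"])

    scores = {}
    for post in posts:
        likes = by_sport[post["sport"]]
        mu = mean(likes)
        sigma = pstdev(likes)
        # If every post in a sport has the same count, no post stands out.
        scores[post["id"]] = 0.0 if sigma == 0 else (post["likes"] - mu) / sigma
    return scores

posts = [
    {"id": "b1", "sport": "basketball", "likes": 5000},
    {"id": "b2", "sport": "basketball", "likes": 5200},
    {"id": "b3", "sport": "basketball", "likes": 4800},
    {"id": "s1", "sport": "soccer", "likes": 40},
    {"id": "s2", "sport": "soccer", "likes": 50},
    {"id": "s3", "sport": "soccer", "likes": 300},
]
z = like_z_scores(posts)
```

In this made-up data the big soccer story `s3` has far fewer likes than any basketball post, yet its Z-score is higher than that of the best basketball post `b2`, because it is compared only against other soccer posts.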

Related

How to implement personalized feed ranking?

I have an app that aggregates various sports content (news articles, videos, discussions from users, tweets) and I'm currently working on having it so that it'll display relevant content to the users. Each post has a like button so I'm using that to determine what's popular. I'm using the reddit algorithm to have it sorted on popularity but also factor in time. However, my problem is that I want to make it more personalized for each user. Each user should see more content based on what they like. I have several factors I'm measuring:
- How much of each content type do they watch/click on? Ex: 60% videos and 40% articles
- What teams/players do they like? If a news item is about a team they like, it should be weighted more heavily
- What sport they like more? Users can follow several sports
What I'm currently doing:
For each of the factors listed above, I increase an article's popularity score by X. Ex: a user likes videos 70% more than other content, so I increase the score of videos by 70%.
I'm looking to see if there are better ways to do this. I've been told machine learning would be a good way, but I wanted to see if there are any alternatives out there.

It sounds like what you're doing is a great place to start with personalizing your users' feeds.
Ranking based on popularity metrics (likes, comments, etc.), recency, and in your case content type is the basis of the EdgeRank algorithm that Facebook used to use.
There are a lot of metrics that you can apply to try and boost engagement. Something like: the user liked posts from team x, y times, so boost an activity in the feed by log(y) if the post is from team x, boost the activity if it's newer, boost the activity if it's popular, etc. You can start to see that these EdgeRank algorithms can get a bit unwieldy rather quickly the more metrics you track. Also, all the hyper-parameters that you set tend to be fixed for each user, which won't end up with the ideal ranking algorithm for every user. This is where machine learning techniques can come into play.
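A rough sketch of such a hand-tuned score, to make the idea concrete. The function name, the multiplicative combination and the 24-hour half-life are illustrative assumptions, not Facebook's actual EdgeRank formula.

```python
import math

def feed_score(likes, age_hours, user_likes_for_team, half_life_hours=24.0):
    """Combine popularity, per-user team affinity and recency into one score."""
    # Popularity: log-damped so 10x more likes is not 10x the score.
    popularity = math.log10(1 + likes)
    # Affinity: grows with the log of how often this user liked the team's posts.
    affinity = 1.0 + math.log1p(user_likes_for_team)
    # Recency: the score halves every `half_life_hours` (a made-up constant).
    recency = 0.5 ** (age_hours / half_life_hours)
    return popularity * affinity * recency

fresh_followed = feed_score(likes=100, age_hours=1, user_likes_for_team=10)
fresh_unfollowed = feed_score(likes=100, age_hours=1, user_likes_for_team=0)
stale_followed = feed_score(likes=100, age_hours=48, user_likes_for_team=10)
```

Every constant here is a hyper-parameter shared by all users, which is exactly the limitation described above.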
The main class of algorithms that deal with this sort of thing is often called Learning to Rank, and at a high level it can be generalized into 3 categories: collaborative filtering techniques, content-based techniques, and hybrid techniques (a blend of the first two).
In your case, with a feed that most likely gets updated fairly frequently with new items, I would take a look at content-based methods. Typically these algorithms are optimized around engagement metrics such as the likelihood that the user is going to click, view, comment on, or like an activity within their feed.
A little bit of self-promotion: I wrote a couple blog posts that cover some of this that you may find interesting.
https://getstream.io/blog/instagram-discovery-engine-tutorial/
https://getstream.io/blog/beyond-edgerank-personalized-news-feeds/
This can be a lot to take on, so you could also take a look at using a 3rd party service like Stream (disclaimer: I do work there), which helps developers build scalable, personalized feeds.

Recommender: Log user actions & datamine it – good solution [closed]

I am planning to log all user actions like viewed page, tag etc.
What would be a good lean solution to data-mine this data to get recommendations?
Say like:
- Figure out all the interests from the viewed URLs (assuming I know the associated tags)
- Find people who have similar interests. E.g. John & Jane viewed URLs related to cars etc.
Edit:
It’s really my lack of knowledge in this domain that’s a limiting factor to get started.
Let me rephrase.
Let's take a site like stackoverflow or Quora. All my browsing history across different questions is recorded, and Quora does a data mining job of looking through it and populating my stream with related questions. I go through questions relating to parenting, and the next time I log in I see streams of questions about parenting. Ditto with Amazon shopping. I browse watches & mixers, and two days later they send me a mail with related shopping items that I might be interested in.
My question is, how do they efficiently store these data and then data mine it to show the next relevant set of data.

Datamining is a method that needs really enormous amounts of space for storage and also enormous amounts of computing power.
I give you an example:
Imagine, you are the boss of a big chain of supermarkets like Wal-Mart, and you want to find out how to place your products in your market so that consumers spend lots of money when they enter your shops.
First of all, you need an idea. Your idea is to find products from different product groups that are often bought together. If you have such a pair of products, you should place them as far apart as possible. A customer who wants to buy both has to walk through your whole shop, and along the way you place other products that go well with one of the pair but are not sold as often. Some customers will see such a product and buy it, and the revenue from this additional product is the revenue of your datamining process.
So you need lots of data. You have to store the data from every purchase by every customer in all your shops. When a person buys a bottle of milk, a sausage and some bread, you need to store which goods were sold, in what amount, and at what price. Every purchase needs its own ID if you want to be able to tell that the milk and the sausage were bought together.
So you have a huge amount of purchase data. And you have a lot of different products. Let's say you are selling 10,000 different products in your shops. Every product can be paired with every other. This makes roughly 10,000 * 10,000 / 2 = 50,000,000 (50 million) pairs. And for each of these possible pairs you have to find out whether it is contained in a purchase. But maybe you think that you have different customers on a Saturday afternoon than on a Wednesday late morning. So you have to store the time of purchase too. Maybe you define 20 time slices along a week. This makes 50M * 20 = 1 billion records. And because people in Memphis might buy different things than people in Beverly Hills, you need the place in your data too. Let's say you define 50 regions, so you get 50 billion records in your database.
And then you process all your data. If a customer bought 20 products in one purchase, you have 20 * 19 / 2 = 190 pairs. For each of these pairs you increase the counter for the time and the place of this purchase in your database. But by how much should you increase the counter? Just by 1? Or by the amount of the bought products? But you have a pair of two products. Should you take the sum of both? Or the maximum? Better to use more than one counter so you can count it in all the ways you can think of.
And you have to do something else: customers buy much more milk and bread than champagne and caviar. So if they chose products arbitrarily, the pair milk-bread would of course have a higher count than the pair champagne-caviar. So when you analyze your data, you must account for such effects too.
Then, when you have done all this, you run your datamining query. You select the pair with the highest ratio of actual count to expected count. You select it from a database table with many billions of records. This might need some hours to process. So think carefully about whether your query is really what you want to know before you submit it!
You might find out that in rural environments, people on a Saturday afternoon buy much more beer together with diapers than you expected. So you just have to place beer at one end of the shop and diapers at the other end, and this makes lots of people walk through your whole shop, where they see (and hopefully buy) many other things they wouldn't have seen (and bought) if beer and diapers were placed close together.
And remember: the costs of your datamining process are covered only by the additional purchases of your customers!
Conclusion:
- You must store pairs, triples or even bigger tuples of items, which will need a lot of space. Because you don't know what you will find at the end, you have to store every possible combination!
- You must count those tuples
- You must compare the counted values with the estimated values
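The counting and comparison steps can be sketched for pairs as follows. This is a toy version under simplifying assumptions: no time slices or regions, each basket counts once per pair, and only pairs that actually occur are stored. The "estimated" count is what you would expect if the two items were bought independently; the ratio of actual to expected is commonly called lift. The function name and data are made up.

```python
from collections import Counter
from itertools import combinations

def pair_lift(baskets):
    """Return {(item_a, item_b): actual pair count / expected pair count}."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # one count per item per basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    lift = {}
    for (a, b), together in pair_counts.items():
        # Expected number of baskets containing both, if a and b were independent.
        expected = item_counts[a] * item_counts[b] / n
        lift[(a, b)] = together / expected
    return lift

baskets = [
    ["milk", "bread"],
    ["milk", "bread", "beer"],
    ["beer", "diapers"],
    ["beer", "diapers", "milk"],
    ["milk", "bread"],
    ["bread"],
]
lift = pair_lift(baskets)
```

Here milk and bread appear together more often (3 baskets) than beer and diapers (2 baskets), but beer-diapers has the higher lift because both items are rarer, which is the effect the milk/champagne remark is about.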

Store each transaction as a vector of tags (i.e. visited pages containing these tags). Then do association analysis on this data (I can recommend Weka) to find associations using the "Associate" algorithms available. Effectiveness depends on a lot of different things, of course.
One thing that a guy at my uni told me was that often you can simply create a vector of all the products that one person has bought, compare this with other people's vectors, and get decent recommendations. That is, represent users by the products they buy or the pages they visit and do e.g. Jaccard similarity calculations. If the "people" are similar, then look at products they bought that this person didn't (probably those that are the most common in the population of similar people).
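A small sketch of that idea, assuming each user's history is simply a set of tags or product names. The function names, `top_k` and the sample data are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, histories, top_k=2):
    """Suggest items the top_k most similar users have that `target` lacks."""
    seen = set(histories[target])
    neighbours = sorted(
        (user for user in histories if user != target),
        key=lambda user: jaccard(seen, histories[user]),
        reverse=True,
    )[:top_k]

    counts = {}
    for user in neighbours:
        for item in set(histories[user]) - seen:
            counts[item] = counts.get(item, 0) + 1
    # Most common among the similar users first; ties broken alphabetically.
    return sorted(counts, key=lambda item: (-counts[item], item))

histories = {
    "john": {"cars", "tyres", "engines"},
    "jane": {"cars", "tyres", "road-trips"},
    "ann": {"cars", "engines", "road-trips", "maps"},
    "bob": {"knitting", "yarn"},
}
suggestions = recommend("john", histories)
```

This compares the target against every other user, which is fine for a sketch but would need an index or sampling at scale.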
Storage is a whole different ballgame; there are many good indices for vector data, such as KD trees, implemented in different RDBMSs.
Take a course in datamining :) or just read one of the excellent textbooks available (I have read Introduction to Data Mining by Pang-Ning Tan et al. and it's good).
And regarding storing all the pairs of products etc, of course this is not done and more efficient algorithms based on support and confidence are used to prune the search space.

I would say recommendation is a machine learning problem.
How to store the data depends on which algorithm you choose.

Player rating for game with random teams

I am working on an algorithm to score individual players in a team-based game. The problem is that no fixed teams exist - every time 10 players want to play, they are divided into two (somewhat) even teams and play each other. For this reason, it makes no sense to score the teams, and instead we need to rely on individual player ratings.
There are a number of problems that I wish to take into account:
- New players need some sort of provisional ranking to reach their "real" rating, before their rating counts the same as seasoned players.
- The system needs to take into account that a team may consist of a mix of player skill levels, e.g. one really good, one good, two mediocre, and one really poor. Therefore a simple "average" of player ratings probably won't suffice and it probably needs to be weighted in some way.
- Ratings are adjusted after every game and as such the algorithm needs to be based on a per-game basis, not per "rating period". This might change if a good solution comes up (I am aware that Glicko uses a rating period).
Note that cheating is not an issue for this algorithm, since we have other measures of validating players.
I have looked at TrueSkill, Glicko and ELO (which is what we're currently using). I like the idea of TrueSkill/Glicko where you have a deviation that is used to determine how precise a rating is, but none of the algorithms take the random teams perspective into account and seem to be mostly based on 1v1 or FFA games.
It was suggested somewhere that you rate players as if each player from the winning team had beaten all the players on the losing team (25 "duels"), but I am unsure if that is the right approach, since it might wildly inflate the rating when a really poor player is on the winning team and gets a win vs. a very good player on the losing team.
Any and all suggestions are welcome!
EDIT: I am looking for an algorithm for established players + some way to rank newbies, not the two combined. Sorry for the confusion.
There is no AI and players only play each other. Games are determined by win/loss (there is no draw).

Provisional ranking systems are always imperfect, but the better ones (such as Elo) are designed to adjust provisional ratings more quickly than for ratings of established players. This acknowledges that trying to establish an ability rating off of just a few games with other players will inherently be error-prone.
I think you should use the average rating of all players on the opposing team as the input for establishing the provisional rating of the novice player, but handle it as just one game, not as N games vs. N players. Each game is really just one data sample, and the Elo system handles accumulation of these games to improve the ranking estimate for an individual player over time before switching over to the normal ranking system.
For simplicity, I would also not distinguish between established and provisional ratings for members of the opposing team when calculating a new provisional rating for some member of the other team (unless Elo requires this). All of these ratings have implied error, so there is no point in adding unnecessary complications of probably little value in improving ranking estimates.
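A minimal sketch of this suggestion: each player is updated once per game against the average rating of the opposing team, and players with few games get a larger K so their rating moves faster. The K values (40 and 16) and the 10-game provisional window are assumptions for illustration, not values prescribed by Elo.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expected score of `rating` against `opponent_rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))

def update_team_game(ratings, games_played, winners, losers,
                     k_established=16, k_provisional=40, provisional_games=10):
    """Return new ratings after one game; each player counts it as ONE result."""
    new_ratings = dict(ratings)
    for team, opponents, score in ((winners, losers, 1.0), (losers, winners, 0.0)):
        # Everyone on a team faces the same "opponent": the other team's average.
        opp_avg = sum(ratings[p] for p in opponents) / len(opponents)
        for p in team:
            is_new = games_played.get(p, 0) < provisional_games
            k = k_provisional if is_new else k_established
            new_ratings[p] = ratings[p] + k * (score - expected_score(ratings[p], opp_avg))
    return new_ratings

ratings = {"n": 1500.0, "w1": 1500.0, "l1": 1500.0, "l2": 1500.0}
games_played = {"n": 0, "w1": 50, "l1": 50, "l2": 50}
new = update_team_game(ratings, games_played, winners=["n", "w1"], losers=["l1", "l2"])
```

With equal ratings the expected score is 0.5, so the newcomer `n` gains 20 points while the established teammate `w1` gains 8 for the same win.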

First off: It is very very unlikely that you will find a perfect system. Every system will have a flaw somewhere.
And to answer your question: Perhaps the ideas here will help: Lehman Rating on OkBridge.
This rating system is in use (since 1993!) on the internet bridge site called OKBridge. Bridge is a partnership game and is usually played with a team of 2 opposing another team of 2. The rating system was devised to rate the individual players and caters to the fact that many people play with different partners.

Without any background in this area, it seems to me a ranking system is basically a statistical model. A good model will converge to a consistent ranking over time, and the goal would be to converge as quickly as possible. Several thoughts occur to me, some of which have been touched upon in other postings:
Clearly, established players have a track record and new players don't. So the uncertainty is probably greater for new players, although for inconsistent players it could be very high. Also, this probably depends on whether the game primarily uses innate skills or acquired skills. I would think that you would want a "variance" parameter for each player. The variance could be made up of two parts: a true variance and a "temperature". The temperature is like in simulated annealing, where you have a temperature that cools over time. Presumably, the temperature would cool to zero after enough games have been played.
Are there multiple aspects that come into play? Like in soccer, you may have good shooters, good passers, guys who have good ball control, etc. Basically, these would be the degrees of freedom in your system (in my soccer analogy, they may or may not be truly independent). It seems like an accurate model would take these into account; of course, you could have a black box model that implicitly handles these. However, I would expect that understanding the number of degrees of freedom in your system would be helpful in choosing the black box.
How do you divide teams? Your teaming algorithm implies a model of what makes equal teams. Maybe you could use this model to create a weighting for each player and/or an expected performance level. If there are different aspects of player skills, maybe you could give extra points for players whose performance in one aspect is significantly better than expected.
Is the game truly win or lose, or could the score differential come in to play? Since you said no ties this probably doesn't apply, but at the very least a close score may imply a higher uncertainty in the outcome.
If you're creating a model from scratch, I would design with the intent to change. At a minimum, I would expect there may be a number of parameters that would be tunable, and might even be auto tuning. For example, as you have more players and more games, the initial temperature and initial ratings values will be better known (assuming you are tracking the statistics). But I would certainly anticipate that the more games have been played the better the model you could build.
Just a bunch of random thoughts, but it sounds like a fun problem.

There was an article in Game Developer Magazine a few years back by some guys from the TrueSkill team at Microsoft, explaining some of their reasoning behind the decisions there. It definitely mentioned team games for Xbox Live, so it should be at least somewhat relevant. I don't have a direct link to the article, but you can order the back issue here: http://www.gdmag.com/archive/oct06.htm
One specific point that I remember from the article was scoring the team as a whole, instead of e.g. giving more points to the player that got the most kills. That was to encourage people to help the team win instead of just trying to maximize their own score.
I believe there was also some discussion on tweaking the parameters to try to accelerate convergence to an accurate evaluation of the player skill, which sounds like what you're interested in.
Hope that helps...

How is the 'scoring' settled?
If a team scores 25 points in total (the scores of all players in the team), you could divide a player's score by the total team score and multiply by 100 to get the percentage of how much that player did for the team (or use the points of both teams combined).
You could calculate a score with this data,
and if the percentage is lower than that of e.g. 90% of the team members (or members of both teams),
treat the player as a novice and calculate the score with a different weighting factor.
Sometimes an easier concept works out better.

The first question has a very 'gamey' solution. You can create a newbie lobby for the first couple of games, where players can't see their score until they have finished enough games to give you the data for an accurate rating.
Another option is a simpler variation on the first: give them a single match vs AI that will be used to determine their beginning score (look at Quake Live for an example).

For anyone who stumbles in here years after it was posted: TrueSkill now supports teams made up of multiple players and changing configurations.

"Every time 10 players want to play, they are divided into two (somewhat) even teams and play each other."
This is interesting, as it implies both that the average skill level on each team is equal (and thus unimportant) and that each team has an equal chance of winning. If you assume this constraint to hold true, a simple count of wins vs losses for each individual player should be as good a measure as any.

Algorithms to find stuff a user would like based on other users likes

I'm thinking of writing an app to classify movies in an HTPC based on what the family members like.
I don't know statistics or AI, but the stuff here looks very juicy. I just wouldn't know where to start.
Here's what I want to accomplish:
Compose a set of samples from each users likes, rating each sample attribute separately. For example, maybe a user likes western movies a lot, so the western genre would carry a bit more weight for that user (and so on for other attributes, like actors, director, etc).
A user can get suggestions based on the likes of the other users. For example, if both user A and B like Spielberg (connection between the users), and user B loves Batman Begins, but user A loathes Katie Holmes, weigh the movie for user A accordingly (again, each attribute separately, for example, maybe user A doesn't like action movies so much, so bring the rating down a bit, and since Katie Holmes isn't the main star, don't take that into account as much as the other attributes).
Basically, comparing sets from user A similar to sets from user B, and come up with a rating for user A.
I have a crude idea about how to implement this, but I'm certain some bright minds have already thought of a far better solution, so... any suggestions?
Actually, after a quick research, it seems a Bayesian filter would work. If so, would this be the better approach? Would it be as simple as just "normalizing" movie data, training a classifier for each user, and then just classify each movie?
If your suggestion includes some brain-melting concepts (I'm not experienced in these subjects, especially in AI), I'd appreciate it if you also included a list of some basics for me to research before diving into the meaty stuff.
Thanks!

Matthew Podwysocki had some interesting articles on this stuff
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/03/30/functional-programming-and-collective-intelligence.aspx
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/04/01/functional-programming-and-collective-intelligence-ii.aspx
http://weblogs.asp.net/podwysocki/archive/2009/04/07/functional-programming-and-collective-intelligence-iii.aspx

This is similar to this question where the OP wanted to build a recommendation system. In a nutshell, we are given a set of training data consisting of users' ratings of movies (a 1-5 star rating, for example) and a set of attributes for each movie (year, genre, actors, ...). We want to build a recommender that will output a predicted rating for unseen movies. So the input data looks like:
user  movie  year  genre   ... | rating
----------------------------------------
1     1      2006  action      | 5
3     2      2008  drama       | 3.5
...
and for an unrated movie X:
10    20     2009  drama       | ?
we want to predict a rating. Doing this for all unseen movies then sorting by predicted movie rating and outputting the top 10 gives you a recommendation system.
The simplest approach is to use a k-nearest neighbor algorithm. Among the rated movies, search for the "closest" ones to movie X, and combine their ratings to produce a prediction.
This approach has the advantage of being very simple and easy to implement from scratch.
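A bare-bones sketch of the nearest-neighbor idea, assuming `rated` holds the movies one user has already rated. The distance function is a made-up choice (year gap in decades plus a penalty of 1 for a different genre); any sensible distance over the movie attributes would do.

```python
def distance(a, b):
    """Hypothetical distance between two movies: year gap plus genre mismatch."""
    year_gap = abs(a["year"] - b["year"]) / 10.0
    genre_gap = 0.0 if a["genre"] == b["genre"] else 1.0
    return year_gap + genre_gap

def predict_rating(movie, rated, k=3):
    """Average the ratings of the k rated movies closest to `movie`."""
    if not rated:
        raise ValueError("need at least one rated movie")
    nearest = sorted(rated, key=lambda r: distance(movie, r))[:k]
    return sum(r["rating"] for r in nearest) / len(nearest)

rated = [
    {"year": 2006, "genre": "action", "rating": 5.0},
    {"year": 2008, "genre": "drama", "rating": 3.5},
    {"year": 2007, "genre": "drama", "rating": 4.0},
    {"year": 1995, "genre": "comedy", "rating": 2.0},
]
unseen = {"year": 2009, "genre": "drama"}
prediction = predict_rating(unseen, rated, k=2)
```

The two nearest neighbors of the unseen 2009 drama are the two rated dramas, so the prediction is the average of 3.5 and 4.0. Predicting for every unseen movie and sorting by the result gives the top-10 list described above.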
Other, more sophisticated approaches exist. For example, you can build a decision tree or fit a set of rules on the training data. You can also use Bayesian networks, artificial neural networks, or support vector machines, among many others... Going through each of these won't be easy for someone without the proper background.
Still, I expect you would be using an external tool/library. Now, you seem to be familiar with Bayesian networks, so a simple naive Bayes net could in fact be very powerful. One advantage is that it allows for prediction with missing data.
The main idea would be somewhat the same; take the input data you have, train a model, then use it to predict the class of new instances.
If you want to play around with different algorithms in a simple, intuitive package which requires no programming, I suggest you take a look at Weka (my 1st choice), Orange, or RapidMiner. The most difficult part would be preparing the dataset in the required format. The rest is as easy as choosing an algorithm and applying it (all in a few clicks!)
For someone not looking to go into too much detail, I would recommend going with the nearest neighbor method, as it is intuitive and easy to implement. Still, the option of using Weka (or one of the other tools) is worth looking into.

There are a few algorithms that are good for this:
ARTMAP: groups items by probability against each other (this isn't fast, but it's the best thing for your problem IMO). ARTMAP holds a group of common attributes and determines the likelihood of similarity via percentages.
KMeans: this separates the vectors by their distance from each other.
PCA: separates the average of all the values from the varying bits. This is what you would use to do face detection and background subtraction in computer vision.

The K-nearest neighbor algorithm may be right up your alley.

Check out some of the work of the top teams for the Netflix Prize.

How to rank a million images with a crowdsourced sort

I'd like to rank a collection of landscape images by making a game whereby site visitors can rate them, in order to find out which images people find the most appealing.
What would be a good method of doing that?
Hot-or-Not style? I.e. show a single image, ask the user to rank it from 1-10. As I see it, this allows me to average the scores, and I would just need to ensure that I get an even distribution of votes across all the images. Fairly simple to implement.
Pick A-or-B? I.e. show two images, ask user to pick the better one. This is appealing as there is no numerical ranking, it's just a comparison. But how would I implement it? My first thought was to do it as a quicksort, with the comparison operations being provided by humans, and once completed, simply repeat the sort ad-infinitum.
How would you do it?
If you need numbers, I'm talking about one million images, on a site with 20,000 daily visits. I'd imagine a small proportion might play the game; for the sake of argument, let's say I can generate 2,000 human sort operations a day! It's a non-profit website, and the terminally curious will find it through my profile :)

As others have said, ranking 1-10 does not work that well because people have different levels.
The problem with the Pick A-or-B method is that the results are not guaranteed to be transitive (A can beat B, B can beat C, and yet C can beat A). Having nontransitive comparison operators breaks sorting algorithms. With quicksort, in this example, the letters not chosen as the pivot will be incorrectly ranked against each other.
At any given time, you want an absolute ranking of all the pictures (even if some/all of them are tied). You also want your ranking not to change unless someone votes.
I would use the Pick A-or-B (or tie) method, but determine ranking similar to the Elo ratings system which is used for rankings in 2 player games (originally chess):
"The Elo player-rating system compares players' match records against their opponents' match records and determines the probability of the player winning the matchup. This probability factor determines how many points a player's rating goes up or down based on the results of each match. When a player defeats an opponent with a higher rating, the player's rating goes up more than if he or she defeated a player with a lower rating (since players should defeat opponents who have lower ratings)."
The Elo System:
All new players start out with a base rating of 1600
WinProbability = 1 / (10^((OpponentRating - PlayerRating) / 400) + 1)
ScoringPt = 1 point if they win the match, 0 if they lose, and 0.5 for a draw.
NewRating = OldRating + K-Value * (ScoringPt - WinProbability)
Replace "players" with pictures and you have a simple way of adjusting both pictures' rating based on a formula. You can then perform a ranking using those numeric scores. (K-Value here is the "Level" of the tournament. It's 8-16 for small local tournaments and 24-32 for larger invitationals/regionals. You can just use a constant like 20).
With this method, you only need to keep one number for each picture which is a lot less memory intensive than keeping the individual ranks of each picture to each other picture.
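A minimal sketch of the formulas above applied to pictures, assuming ratings live in a plain dict keyed by image id (the storage and the function names are assumptions). It uses the base rating of 1600 and the constant K of 20 mentioned above.

```python
def win_probability(rating, opponent_rating):
    """Probability that `rating` beats `opponent_rating` under Elo."""
    return 1.0 / (10 ** ((opponent_rating - rating) / 400.0) + 1.0)

def record_comparison(ratings, a, b, score_a, k=20, base=1600.0):
    """Update both pictures in place. score_a: 1.0 if a won, 0.0 if b won, 0.5 for a tie."""
    r_a = ratings.get(a, base)
    r_b = ratings.get(b, base)
    p_a = win_probability(r_a, r_b)
    ratings[a] = r_a + k * (score_a - p_a)
    # b's score and win probability are the complements of a's.
    ratings[b] = r_b + k * ((1.0 - score_a) - (1.0 - p_a))
    return ratings

ratings = {}
record_comparison(ratings, "img_a", "img_b", score_a=1.0)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

After one vote between two unrated pictures, the winner sits at 1610 and the loser at 1590; the full ranking is just a sort on the stored number.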
EDIT: Added a little more meat based on comments.

Most naive approaches to the problem have some serious issues. The worst is how bash.org and qdb.us display quotes: users can vote a quote up (+1) or down (-1), and the list of best quotes is sorted by the total net score. This suffers from a horrible time bias: older quotes have accumulated huge numbers of positive votes via simple longevity even if they're only marginally humorous. This algorithm might make sense if jokes got funnier as they got older but - trust me - they don't.
There are various attempts to fix this - looking at the number of positive votes per time period, weighting more recent votes, implementing a decay system for older votes, calculating the ratio of positive to negative votes, etc. Most suffer from other flaws.
The best solution - I think - is the one that the websites The Funniest, The Cutest, The Fairest, and Best Thing use: a modified Condorcet voting system.
The system gives each one a number based on, out of the things that it has faced, what percentage of them it usually beats. So each one gets the percentage score NumberOfThingsIBeat / (NumberOfThingsIBeat + NumberOfThingsThatBeatMe). Also, things are barred from the top list until they've been compared to a reasonable percentage of the set.
If there's a Condorcet winner in the set, this method will find it. Since that's unlikely, given the statistical nature, it finds the one that's the "closest" to being a Condorcet winner.
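A rough sketch of the percentage score described above. It assumes votes are stored as (winner, loser) tuples and counts every comparison rather than distinct opponents; the `min_coverage` threshold stands in for the "reasonable percentage of the set" and its value is a guess. This is not the actual code of those sites.

```python
def win_percentage_ranking(results, total_items, min_coverage=0.1):
    """Rank items by wins / (wins + losses), hiding items compared too rarely."""
    wins, losses, opponents = {}, {}, {}
    for winner, loser in results:
        wins[winner] = wins.get(winner, 0) + 1
        losses[loser] = losses.get(loser, 0) + 1
        opponents.setdefault(winner, set()).add(loser)
        opponents.setdefault(loser, set()).add(winner)

    ranked = []
    for item, faced in opponents.items():
        # Barred from the list until it has faced enough of the other items.
        if len(faced) / (total_items - 1) < min_coverage:
            continue
        w, l = wins.get(item, 0), losses.get(item, 0)
        ranked.append((w / (w + l), item))
    return [item for score, item in sorted(ranked, key=lambda p: (-p[0], p[1]))]

results = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
top = win_percentage_ranking(results, total_items=4, min_coverage=0.5)
```

In this toy data `d` has only faced one of the three other items, so it is left out of the list until it has been compared more.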
For more information on implementing such systems the Wikipedia page on Ranked Pairs should be helpful.
The algorithm requires people to compare two objects (your Pick-A-or-B option), but frankly, that's a good thing. I believe it's very well accepted in decision theory that humans are vastly better at comparing two objects than they are at abstract ranking. Millions of years of evolution make us good at picking the best apple off the tree, but terrible at deciding how closely the apple we picked hews to the true Platonic Form of appleness. (This is, by the way, why the Analytic Hierarchy Process is so nifty...but that's getting a bit off topic.)
One final point to make is that SO uses an algorithm to find the best answers which is very similar to bash.org's algorithm to find the best quote. It works well here, but fails terribly there - in large part because an old, highly rated, but now outdated answer here is likely to be edited. bash.org doesn't allow editing, and it's not clear how you'd even go about editing decade-old jokes about now-dated internet memes even if you could... In any case, my point is that the right algorithm usually depends on the details of your problem. :-)

I know this question is quite old but I thought I'd contribute
I'd look at the TrueSkill system developed at Microsoft Research. It's like ELO but has a much faster convergence time (looks exponential compared to linear), so you get more out of each vote. It is, however, more complex mathematically.
http://en.wikipedia.org/wiki/TrueSkill

I don't like the Hot-or-Not style. Different people would pick different numbers even if they all liked the image exactly the same. Also I hate rating things out of 10, I never know which number to choose.
Pick A-or-B is much simpler and more fun. You get to see two images, and comparisons are made between the images on the site.

These equations from Wikipedia make it simpler and more effective to calculate Elo ratings. The algorithm for images A and B would be simple:
Get Ne, mA, mB and the ratings RA, RB from your database.
Calculate KA, KB from the number of comparisons performed (Ne) and the number of times each image was compared (m), and QA, QB from the current ratings: QA = 10^(RA/400), QB = 10^(RB/400).
Calculate EA and EB: EA = QA / (QA + QB), EB = QB / (QA + QB).
Set the scores SA, SB: 1 for the winner, 0 for the loser, and 0.5 each for a draw.
Calculate the new ratings for both using: RA' = RA + KA (SA - EA), RB' = RB + KB (SB - EB).
Update the new ratings RA, RB and counts mA, mB in the database.
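
The steps above can be sketched in Python roughly as follows. This is only an illustration: the function name `elo_update` is made up, and the K-factors are fixed at 32 as an assumption, whereas the answer derives KA and KB from Ne and m. The Q and E expressions are the standard ones from the Wikipedia article on Elo.

```python
def elo_update(r_a, r_b, s_a, k_a=32.0, k_b=32.0):
    """Return the new (r_a, r_b) after one comparison.

    s_a is the score of image A: 1 if A won, 0 if A lost, 0.5 for a draw.
    k_a and k_b are assumed constants here; the answer computes them
    from the comparison counts instead.
    """
    # Q values from the current ratings.
    q_a = 10 ** (r_a / 400)
    q_b = 10 ** (r_b / 400)

    # Expected scores; e_a + e_b == 1.
    e_a = q_a / (q_a + q_b)
    e_b = q_b / (q_a + q_b)

    # Image B's score is the complement of image A's.
    s_b = 1.0 - s_a

    return r_a + k_a * (s_a - e_a), r_b + k_b * (s_b - e_b)


# Two equally rated images, A wins the comparison.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
```

With equal K-factors the points gained by one image are exactly the points lost by the other, so the average rating of the collection stays constant.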

You may want to go with a combination.
First phase:
Hot-or-not style (although I would go with a 3 option vote: Sucks, Meh/OK, Cool!)
Once you've sorted the set into the 3 buckets, I would select two images from the same bucket and go with the "Which is nicer" comparison.
You could then use an English Soccer system of promotion and demotion to move the top few "Sucks" into the Meh/OK region, in order to refine the edge cases.

Rating 1-10 won't work; everyone uses the scale differently. Someone who always gives 3-7 ratings would have his ratings eclipsed by people who always give 1 or 10.
A-or-B is more workable.

Wow, I'm late in the game.
I like the Elo system very much, but as Owen says, it seems to me that you'd be slow to build up any significant results.
I believe humans have much greater capacity than just comparing two images, but you want to keep interactions to the bare minimum.
So how about you show n images (n being any number you can visibly display on a screen; this may be 10, 20 or 30, maybe depending on the user's preference) and get them to pick which they think is best in that lot. Now back to Elo. You need to modify your rating system, but keep the same spirit. You have in fact compared one image to n-1 others. So you do your Elo rating n-1 times, but you should divide the change of rating by n-1 to match (so that results with different values of n are coherent with one another).
You're done. You've now got the best of all worlds. A simple rating system working with many images in one click.
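
A rough sketch of that pick-one-of-n update, in Python. The names `expected_score` and `pick_best_update` and the constant K of 32 are assumptions for illustration. One design choice made here that the answer does not specify: every pairwise update is computed from the ratings as they were before the click, so the order of the losers does not matter.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def pick_best_update(ratings, shown, winner, k=32.0):
    """Treat one click as n-1 pairwise wins, each scaled by 1/(n-1).

    ratings: dict mapping image id -> current rating
    shown:   the image ids displayed together
    winner:  the id the user picked from `shown`
    Returns a new dict; the input dict is left untouched.
    """
    losers = [image for image in shown if image != winner]
    if not losers:
        return dict(ratings)

    updated = dict(ratings)
    for loser in losers:
        e_win = expected_score(ratings[winner], ratings[loser])
        # Full Elo change for this pair, divided by n-1.
        delta = k * (1.0 - e_win) / len(losers)
        updated[winner] += delta
        updated[loser] -= delta
    return updated


ratings = {"a": 1500.0, "b": 1500.0, "c": 1500.0}
ratings = pick_best_update(ratings, ["a", "b", "c"], winner="a")
```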

If you prefer using the Pick A or B strategy I would recommend this paper: http://research.microsoft.com/en-us/um/people/horvitz/crowd_pairwise.pdf
Chen, X., Bennett, P. N., Collins-Thompson, K., & Horvitz, E. (2013, February). Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the sixth ACM international conference on Web search and data mining (pp. 193-202). ACM.
The paper describes the Crowd-BT model, which extends the well-known Bradley-Terry pairwise comparison model to a crowdsourced setting. It also gives an adaptive learning algorithm to improve the time and space efficiency of the model. You can find a Matlab implementation of the algorithm on Github (but I'm not sure if it works).
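
For orientation, here is a minimal Python sketch of the plain Bradley-Terry model only, not Crowd-BT and not the paper's adaptive algorithm. It fits one strength per item with the standard iterative (minorization-maximization) update, so that the modelled probability of i beating j is strength[i] / (strength[i] + strength[j]). The function name and the fixed iteration count are assumptions.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict item -> strength, normalized to sum to 1.
    Assumes winner != loser in every pair.
    """
    wins = defaultdict(int)         # item -> number of wins
    pair_counts = defaultdict(int)  # frozenset({i, j}) -> comparisons
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for pair, count in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += count / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}
    return strength


# "a" beats "b" three times out of four.
strengths = fit_bradley_terry([("a", "b")] * 3 + [("b", "a")])
```

An item that never wins ends up with strength 0 under this plain fit; practical systems usually add some smoothing or a prior to avoid that.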

The defunct web site whatsbetter.com used an Elo style method. You can read about the method in their FAQ on the Internet Archive.

Pick A-or-B is the simplest and least prone to bias; however, each human interaction gives you substantially less information. I think Pick is superior because of the bias reduction, and in the limit it provides you with the same information.
A very simple scoring scheme is to keep a count for each picture. When someone gives a positive comparison, increment the count; when someone gives a negative comparison, decrement it.
Sorting a 1-million integer list is very quick and will take less than a second on a modern computer.
That said, the problem is rather ill-posed: it will take you 50 days to show each image just once.
I bet though you are more interested in the most highly ranked images? So, you probably want to bias your image retrieval by predicted rank - so you are more likely to show images that have already achieved a few positive comparisons. This way you will more quickly just start showing 'interesting' images.
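
A small Python sketch of the counting scheme with biased retrieval. The weighting rule (1 plus the positive part of the count) is an assumption chosen for illustration; the answer only says to favour images that already have a few positive comparisons. Recomputing the weights on every call is also a simplification that would need caching for a million images.

```python
import random


def record_comparison(counts, winner, loser):
    """Increment the winner's count and decrement the loser's."""
    counts[winner] = counts.get(winner, 0) + 1
    counts[loser] = counts.get(loser, 0) - 1


def pick_pair(image_ids, counts, rng=random):
    """Pick two distinct images, favouring those with positive counts."""
    def weight(image):
        # Assumed rule: every image keeps a baseline chance of 1.
        return 1 + max(counts.get(image, 0), 0)

    first = rng.choices(image_ids, weights=[weight(i) for i in image_ids])[0]
    rest = [i for i in image_ids if i != first]
    second = rng.choices(rest, weights=[weight(i) for i in rest])[0]
    return first, second


def ranking(image_ids, counts):
    """Images sorted from highest to lowest count."""
    return sorted(image_ids, key=lambda i: counts.get(i, 0), reverse=True)


images = ["img1", "img2", "img3", "img4"]
counts = {}
record_comparison(counts, winner="img2", loser="img1")
record_comparison(counts, winner="img2", loser="img3")
first, second = pick_pair(images, counts)
```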

I like the quick-sort option but I'd make a few tweaks:
Keep the "comparison" results in a DB and then average them.
Get more than one comparison per view by giving the user 4-6 images and having them sort them.
Select what images to display by running qsort and recording and trimming anything that you don't have enough data on. Then when you have enough items recorded, spit out a page.
The other fun option would be to use the crowd to teach a neural-net.
