How to check user choice algorithm - algorithm

I have an algorithm that chooses a list of items that should fit the user's likings.
I'll skip the algorithm's details because of confidentiality issues...
Now, I'm trying to think of a way to check it statistically, with a group of people.
The way I'm checking it now is:
Algorithm gets best results per user.
shuffle top 5 results with lowest 5 results.
make person list the results he liked by order (0 = liked best, 9 = didn't like)
compare user results to algorithm results.
I'm doing this because i figured that to show that algorithm chooses good results, i need to put in some bad results and show that the algorithm knows its a bad result as well.
So, what I'm asking is:
Is shuffling top results with low results is a good idea ?
And if not, do you have an idea on how to get good statistics on how good an algorithm matches user preferences (we have users that can choose stuff) ?

First ask yourself:
What am I trying to measure?
Not to rag on the other submissions here, but while mjv and Sjoerd's answers offer some plausible heuristic reasons for why what you are trying to do may not work as you expect; they are not constructive in the sense that they do not explain why your experiment is flawed, and what you can do to improve it. Before either of these issues can be addressed, what you need to do is define what you hope to measure, and only then should you go about trying to devise an experiment.
Now, I can't say for certain what would constitute a good metric for your purposes, but I can offer you some suggestions. As a starting point, you could try using a precision vs. recall graph:
http://en.wikipedia.org/wiki/Precision_and_recall
This is a standard technique for assessing the performance of ranking and classification algorithms in machine learning and information retrieval (ie web searching). If you have an engineering background, it could be helpful to understand that precision/recall generalizes the notion of precision/accuracy:
http://en.wikipedia.org/wiki/Accuracy_and_precision
Now let us suppose that your algorithm does something like this; it takes as input some prior data about a user then returns a ranked list of other items that user might like. For example, your algorithm is a web search engine and the items are pages; or you have a movie recommender and the items are books. This sounds pretty close to what you are trying to do now, so let us continue with this analogy.
Then the precision of your algorithm's results on the first n is the number of items that the user actually liked out of your first to top n recommendations:
precision = #(items user actually liked out of top n) / n
And the recall is the number of items that you actually got right out of the total number of items:
recall = #(items correctly marked as liked) / #(items user actually likes)
Ideally, one would want to maximize both of these quantities, but they are in a certain sense competing objectives. To illustrate this, consider a few extremal situations: For example, you could have a recommender that returns everything, which would have perfect recall, but very low precision. A second possibility is to have a recommender that returns nothing or only one sure-fire hit, which would have (in a limiting sense) perfect precision, but almost no recall.
As a result, to understand the performance of a ranking algorithm, people typically look at its precision vs. recall graph. These are just plots of the precision vs the recall as the number of items returned are varied:
Image taken from the following tutorial (which is worth reading):
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
Now to approximate a precision vs recall for your algorithm, here is what you can do. First, return a large set of say n, results as ranked by your algorithm. Next, get the user to mark which items they actually liked out of those n results. This trivially gives us enough information to compute the precision at every partial set of documents < n (since we know the number). We can also compute the recall (as restricted to this set of documents) by taking the total number of items liked by the user in the entire set. This, we can plot a precision recall curve for this data. Now there are fancier statistical techniques for estimating this using less work, but I have already written enough. For more information please check out the links in the body of my answer.

Your method is biased. If you use the top 5 and bottom 5 results, It is very likely that the user orders it according to your algorithm. Let's say we have an algorithm which rates music, and I present the top 1 and bottom 1 to the user:
Queen
The Cheeky Girls
Of course the user will mark it exactly like your algorithm, because the difference between the top and bottom is so big. You need to make the user rate randomly selected items.

Independently of the question of mixing top and bottom guesses, an implicit drawback of the experimental process, as described, is that the data related to the user's choice can only be exploited in the context of one particular version of the algorithm:
When / if the algorithm or its parameters are ever slightly tuned, the record of past user's choices cannot be reused to validate the changes to the algorithm.
On mixing high and low results:
The main drawback of producing sets of items by mixing the algorithm's top and bottom guesses is that it may further complicate the choice of the error/distance function used to measure how well the algorithm performed. Unless the two subsets of items (topmost choices, bottom most choices) are kept separately for the purpose of computing distinct measurements, typical statistical measures of the error (say RMSE) will not be a good measurement of the effective algorithm's quality.
For example, an algorithm which frequently suggests, low guesses items which end up being picked as top choices by the user may have the same averaged error rate than an algorithm which never confuses highs with lows, but where there the user tends to reorders the items more within their subset.
A second drawback is that the algorithm evaluation method may merely qualify its ability of filtering the relative like/dislike of users for items it [the algorithm] chooses rather than its ability of producing the user's actual top choices.
In other words the user's actual top choices may never be offered to him; so yeah the algorithm does a good job at guessing that user will like say Rock-and-Roll before Rap, but never guessing that in fact user prefers Classical Baroque music over all.

Related

A simple explanation of Random Forest

I'm trying to understand how random forest works in plain English instead of mathematics. Can anybody give me a really simple explanation of how this algorithm works?
As far as I understand, we feed the features and labels without telling the algorithm which feature should be classified as which label? As I used to do Naive Bayes which is based on probability we need to tell which feature should be which label. Am I completely far off?
If I can get any very simple explanation I'd be really appreciated.
RandomForest uses a so-called bagging approach. The idea is based on the classic bias-variance trade off. Suppose that we have a set (say N) of overfitted estimators that have low bias but high cross-sample-variance. So low bias is good and we want to keep it, high variance is bad and we want to reduce it. RandomForest tries to achieve this by doing a so-called bootstraps/sub-sampling (as #Alexander mentioned, this is a combination of bootstrap sampling on both observations and features). The prediction is the average of individual estimators so the low-bias property is successfully preserved. And further by Central Limit Theorem, the variance of this sample average has a variance equal to variance of individual estimator divided by square root of N. So now, it has both low-bias and low-variance properties, and this is why RandomForest often outperforms stand-alone estimator.
Adding on to the above two answers, Since you mentioned a simple explanation. Here is a write up that I feel is the most simple way you can explain random forests.
Credits go to Edwin Chen for the simple explanation here in layman terms for random forests. Posting the same below.
Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.
Thus, Willow is a decision tree for your movie preferences.
But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).
Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you really really loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (formally, you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all.
By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.
There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardo DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.
And so your friends now form a random forest.
I will try to give another complementary explanation with simple words.
A random forest is a collection of random decision trees (of number n_estimators in sklearn).
What you need to understand is how to build one random decision tree.
Roughly speaking, to build a random decision tree you start from a subset of your training samples. At each node you will draw randomly a subset of features (number determined by max_features in sklearn). For each of these features you will test different thresholds and see how they split your samples according to a given criterion (generally entropy or gini, criterion parameter in sklearn). Then you will keep the feature and its threshold that best split your data and record it in the node.
When the construction of the tree ends (it can be for different reasons: maximum depth is reached (max_depth in sklearn), minimum sample number is reached (min_samples_leaf in sklearn) etc.) you look at the samples in each leaf and keep the frequency of the labels.
As a result, it is like the tree gives you a partition of your training samples according to meaningful features.
As each node is built from features taken randomly, you understand that each tree built in this way will be different. This contributes to the good compromise between bias and variance, as explained by #Jianxun Li.
Then in testing mode, a test sample will go through each tree, giving you label frequencies for each tree. The most represented label is generally the final classification result.

Uncertainty versus randomness

I would like to know the difference between uncertainty and randomness in mathematical fashion. I tried to find it but I get confused , as some people said they are the same? But can any one provide me logical reasoning behind it. If they are not same then please explain it why?
Don't get too hung up on it.
People use different words in different situations.
It's not so much that they have different meanings, as that their meanings are situation-dependent.
Randomness is just a fuzzy general term meaning something is random.
In statistics, uncertainty is used to mean that some property of a distribution, such as its mean, is itself unknown but can be given a distribution.
For example, suppose you want to know the average weight of all people.
You could find it out exactly if you could go around to all people, get their weight, add it all up, and divide by the number of people.
But that's too hard to do, so suppose you just pick 10 people at random and get their average weight, and pretend it's the same as the average of everybody.
That's called the sample mean, but you know it isn't accurate.
It has what is called a standard error, meaning it has uncertainty.
In fact, if you were to do that experiment many times over with different people, you would get a different sample mean every time, and those sample means would themselves form a bell-shaped distribution, the standard deviation of which would be called the standard error, representing its uncertainty.
In general, if you increased the number of people you look at by a factor of 100, you can reduce the standard error, the uncertainty, by a factor of 10.
I bet you can tell that people who take polls for a living care about this stuff very much.
EDIT for the downvoter: In case the downvote is because this doesn't look like a stackoverflow question or answer,
I've made a point of advocating the random pausing method of profiling.
Profiling in large part is perceived to be about measuring (statistically) the time that programming constructs are responsible for.
Often people are inhibited from using that method because they are afraid the results have too much uncertainty.
This post gets very specific about what that uncertainty actually is.
It shows that the bogey-man fear of uncertainty has the effect of preventing people from finding really substantial speedups in their code.
So naivete' about statistics is definitely a serious programming problem.
My view looks at a scenario using three different coloured balls:
I love some of the answers given here. My own view, based on my current research, is that these are two distinct terms. Uncertainty refers to not knowing in advance which ball could be selected when a person, for instance, is given a chance to select one ball from three different coloured balls.
This remains true when each ball has an equal chance of being selected i.e. equal probabilities. However, things soon get complex when each ball has it's own distinct probability. Chances are that the one with the highest probability will be selected. This seems especially true in algorithm development which would almost always select the highest probability compromising the meaning of randomness.
Having said all of this - I believe these concepts remain confusing which has just made me realise the time I need to dedicate on clearly distinguishing between the two to make sure my current research is not confusing. My own predicament is that I need to work on stochastic vs deterministic views. Based on the current view stochastic would be more uncertain than random whereas deterministic would be more probability based i.e. knowing for certain that the highest probability would be chosen; but this seems very far from the truth.
It seems as if uncertainty holds until just before a ball is selected/touched and soon looses its meaning as soon as the ball is picked which should result to its probability being revised. I personally think the terms have theoretical differences which perhaps allows them to be used interchangeably.
Uncertainty in math and science typically means there are a lack of facts, or the facts are unobtainable. Weather forecasting is a great example of uncertainty.
Randomness has many definitions. Commonly it's used in probability / statistics as a measure or quantification of uncertainty. So in my weather example, a 30% chance of rain is a measure of uncertainty. The more general definition (which also applies to math / science) is unpredictable, or lack of order.
There is definitely a fuzzy distinction between the two.
According to the Bayesian interpretation of probability, uncertainty and randomness are just two names for the same thing.
If an experiment is random, then it is uncertain to you. If something is uncertain to you, then it has the randomness property.

Creating a non-perfect game algorithm

I know how algorithms like minimax can be used in order to play perfect games (In this case, I'm looking a game similar to Tic-Tac-Toe)
However, I'm wondering how one would go about creating a non-perfect algorithm, or an AI at different 'skill levels' (Easy, Medium, Hard etc), that a human player would actually have a chance of defeating.
Cut off the search at various depths to limit the skill of the computer. Change the evaluation function to make the computer favor different strategies.
Non-expert human players play with sub-optimal strategies and limited tactics. These roughly correspond to poor evaluation of game states and limited ability to think ahead.
Regarding randomness, a little is desired so the computer doesn't always make the same mistakes and can sometimes luck into doing better or worse than usual. For this, just don't always choose the best path, but choose the among them weighted by their scores. You can make the AI even more interesting by having it refine its evaluation function, i.e. update its weightings, based on the results of the game. This way it can learn a better evaluation function at limited search depth through playing, just as a human might.
One way i use in my games is to utilize random value. For easy game levels, i let the odds of selecting a random number in the favor of the human player. Example:
Easy level: only beat the human if you can randomly select a value less than 10 from the range of 1 to 100
Medium level: beat the human if you can select a random value which is less than 50 from a range of 1 to 100
Hard level: beat the human if you can randomly select a value less than 90 from a range of 1 to 100
I am sure there are better ways, but this might give you an idea
the "simplest" way would be to use a threshold value along with your minmax results, creating a set from those results that exceed the threshold, then randomly select a choice/path for the program to take. the lower the threshold the possibly easier opponent.
i say possibly because even by pure dumb luck the best move could be selected, hence "Beginner's Luck".
essentially, you are looking to increase the entropy (randomness) of the possible outcomes. if you want to specifically dumb down the computer opponent, you could limit the levels your minmax algorithm traverses, or devalue the points for some portion of the algorithm.
It is not easy for a engine to make human mistakes. Reducing the search depth is a straightforward approach but it has its limits. For example, chess engines that are reduced to one ply often give check while one valuable piece is still attacked. When the opponent defends the check with a counterattack, both pieces are en prise. It is unlikely that even an inexperienced human falls for this mistake.
Maybe you can use some ideas from a chess engine called Phalanx:
http://phalanx.sourceforge.net/index.html
It is one of the few open source engine that has a sophisticated difficulty level (-e option). If I'm not mistaken, it performs a normal search but sometimes ignores non-obvious moves. evaluate.c contains a function called blunder, which evaluates whether a move is likely to be overlooked by a human.

Where do mathematical algorithms for Reddit's ranking, as an example, come from?

recently I was looking at Reddit's algorithm for determining what makes a post a "hot" topic and which content is suitable for the reddit homepage.
the article I was reading is here:
http://amix.dk/blog/post/19588
I've noticed they have mathematical logorithms and have created some kind of a mathematical function to determine the hotness/relevance of a post.
In the formulas used, where do each of the mathematical components come from and how do they know to use them?
thank you!
-- Bakz
EDIT: just to clarify, I just graduated high school and apologize if the answer to this question seems pretty obvious. thanks again!
I'll tackle the first formula, for "hotness" of posts. Formulas like this come from requirements. The designers of Reddit have thought about what they want to achieve, and designed formulas accordingly. I can't tell you exactly what requirements they had in mind, but I can look at the implementation and guess that they wanted a system along these lines:
Scores shouldn't need to be recomputed unless the number of votes change. This reduces the number of changes to the database, and makes it easier to achieve consistency if data is replicated. (So any scoring system based on scores getting lower as the article ages will be no good).
If two stories are equally old, the one with more upvotes should be higher. (So there needs to be a contribution from the votes.)
The more upvotes a story gets, the longer it should remain near the top of the ranking.
Old stories shouldn't stay at the top of the rankings for ever, even if they had lots of upvotes. Fairly soon (after a day or two), new stories need to outrank them. (So there needs to be a contribution from the date, and this must outweigh the score due to votes fairly soon, no matter how many votes something gets.)
Stories with more downvotes than upvotes should not appear in the rankings at all.
Now let's look at the formula: log z + yt / 45000 and see how it satisfies these requirements.
If the number of votes does not change, then z, y and t are all unchanged. So the score is unchanged. This satisfies requirement (1).
If two stories have the same age, then they have the same value for t. But the one with more upvotes has a higher value of z, and since log is monotonic, it has a higher score. This satisfies requirement (2).
The more upvotes a story has, the higher its z, so the longer it will be until another story with higher t can outrank it. This satisfies requirement (3).
Logarithm is a function that grows more slowly as it gets larger (take a look at its graph). So a story needs more and more upvotes over time to keep up with newer stories. This satisfies requirement (4).
If the story has more downvotes than upvotes, then z = 1 and y = −1 so the score is negative. This satisfies requirement (5).
The constant 45,000 is a scale factor that brings the upvotes and the age into balance. There are 86,400 seconds in a day, so t gets larger by this amount each day. Dividing t by 45,000 gives 1.92 which means that one day's relative newness is worth is 101.92 = 83 votes, and two days' relative newness are worth roughly 7,000 votes.
They don't come from anywhere. There is no absolute truth to them, nor anything to prove. It's simply a way to quantify an attribute in as most sensible a way as seemed to the development team.
You would use log when you want something to be a factor although a less important one (since large values indeed grow, although very slowly). But by the same token, they could have chosen cube root.
The formulae are simply a representation of those factors which we can presume are those which characteristically belong to something "hot", and a composition of them in such a manner that takes each into account in an appropriate proportion (for example, we'll square those values that have huge importance, and take log of those which are less).
Once they came up with the formula, they probably came up with 10 or 15 different types of posts and plugged the numbers in and saw that that made a lot of sense all round, so stuck with it. In fact, there first few attempts probably didn't come out so well, and after a little fiddling with the numbers arrived at that formula.

How to rank a million images with a crowdsourced sort

I'd like to rank a collection of landscape images by making a game whereby site visitors can rate them, in order to find out which images people find the most appealing.
What would be a good method of doing that?
Hot-or-Not style? I.e. show a single image, ask the user to rank it from 1-10. As I see it, this allows me to average the scores, and I would just need to ensure that I get an even distribution of votes across all the images. Fairly simple to implement.
Pick A-or-B? I.e. show two images, ask user to pick the better one. This is appealing as there is no numerical ranking, it's just a comparison. But how would I implement it? My first thought was to do it as a quicksort, with the comparison operations being provided by humans, and once completed, simply repeat the sort ad-infinitum.
How would you do it?
If you need numbers, I'm talking about one million images, on a site with 20,000 daily visits. I'd imagine a small proportion might play the game, for the sake of argument, lets say I can generate 2,000 human sort operations a day! It's a non-profit website, and the terminally curious will find it through my profile :)
As others have said, ranking 1-10 does not work that well because people have different levels.
The problem with the Pick A-or-B method is that its not guaranteed for the system to be transitive (A can beat B, but B beats C, and C beats A). Having nontransitive comparison operators breaks sorting algorithms. With quicksort, against this example, the letters not chosen as the pivot will be incorrectly ranked against each other.
At any given time, you want an absolute ranking of all the pictures (even if some/all of them are tied). You also want your ranking not to change unless someone votes.
I would use the Pick A-or-B (or tie) method, but determine ranking similar to the Elo ratings system which is used for rankings in 2 player games (originally chess):
The Elo player-rating
system compares players’ match records
against their opponents’ match records
and determines the probability of the
player winning the matchup. This
probability factor determines how many
points a players’ rating goes up or
down based on the results of each
match. When a player defeats an
opponent with a higher rating, the
player’s rating goes up more than if
he or she defeated a player with a
lower rating (since players should
defeat opponents who have lower
ratings).
The Elo System:
All new players start out with a base rating of 1600
WinProbability = 1/(10^(( Opponent’s Current Rating–Player’s Current Rating)/400) + 1)
ScoringPt = 1 point if they win the match, 0 if they lose, and 0.5 for a draw.
Player’s New Rating = Player’s Old Rating + (K-Value * (ScoringPt–Player’s Win Probability))
Replace "players" with pictures and you have a simple way of adjusting both pictures' rating based on a formula. You can then perform a ranking using those numeric scores. (K-Value here is the "Level" of the tournament. It's 8-16 for small local tournaments and 24-32 for larger invitationals/regionals. You can just use a constant like 20).
With this method, you only need to keep one number for each picture which is a lot less memory intensive than keeping the individual ranks of each picture to each other picture.
EDIT: Added a little more meat based on comments.
Most naive approaches to the problem have some serious issues. The worst is how bash.org and qdb.us displays quotes - users can vote a quote up (+1) or down (-1), and the list of best quotes is sorted by the total net score. This suffers from a horrible time bias - older quotes have accumulated huge numbers of positive votes via simple longevity even if they're only marginally humorous. This algorithm might make sense if jokes got funnier as they got older but - trust me - they don't.
There are various attempts to fix this - looking at the number of positive votes per time period, weighting more recent votes, implementing a decay system for older votes, calculating the ratio of positive to negative votes, etc. Most suffer from other flaws.
The best solution - I think - is the one that the websites The Funniest The Cutest, The Fairest, and Best Thing use - a modified Condorcet voting system:
The system gives each one a number based on, out of the things that it has faced, what percentage of them it usually beats. So each one gets the percentage score NumberOfThingsIBeat / (NumberOfThingsIBeat + NumberOfThingsThatBeatMe). Also, things are barred from the top list until they've been compared to a reasonable percentage of the set.
If there's a Condorcet winner in the set, this method will find it. Since that's unlikely, given the statistical nature, it finds the one that's the "closest" to being a Condorcet winner.
For more information on implementing such systems the Wikipedia page on Ranked Pairs should be helpful.
The algorithm requires people to compare two objects (your Pick-A-or-B option), but frankly, that's a good thing. I believe it's very well accepted in decision theory that humans are vastly better at comparing two objects than they are at abstract ranking. Millions of years of evolution make us good at picking the best apple off the tree, but terrible at deciding how closely the apple we picked hews to the true Platonic Form of appleness. (This is, by the way, why the Analytic Hierarchy Process is so nifty...but that's getting a bit off topic.)
One final point to make is that SO uses an algorithm to find the best answers which is very similar to bash.org's algorithm to find the best quote. It works well here, but fails terribly there - in large part because an old, highly rated, but now outdated answer here is likely to be edited. bash.org doesn't allow editing, and it's not clear how you'd even go about editing decade-old jokes about now-dated internet memes even if you could... In any case, my point is that the right algorithm usually depends on the details of your problem. :-)
I know this question is quite old but I thought I'd contribute
I'd look at the TrueSkill system developed at Microsoft Research. It's like ELO but has a much faster convergence time (looks exponential compared to linear), so you get more out of each vote. It is, however, more complex mathematically.
http://en.wikipedia.org/wiki/TrueSkill
I don't like the Hot-or-Not style. Different people would pick different numbers even if they all liked the image exactly the same. Also I hate rating things out of 10, I never know which number to choose.
Pick A-or-B is much simpler and funner. You get to see two images, and comparisons are made between the images on the site.
These equations from Wikipedia makes it simpler/more effective to calculate Elo ratings, the algorithm for images A and B would be simple:
Get Ne, mA, mB and ratings RA,RB from your database.
Calculate KA ,KB, QA, QB by using the number of comparisons performed (Ne) and the number of times that image was compared (m) and current ratings :
Calculate EA and EB.
Score the winner's S : the winner as 1, loser as 0, and if you have a draw as 0.5,
Calculate the new ratings for both using:
Update the new ratings RA,RB and counts mA,mB in the database.
You may want to go with a combination.
First phase:
Hot-or-not style (although I would go with a 3 option vote: Sucks, Meh/OK. Cool!)
Once you've sorted the set into the 3 buckets, then I would select two images from the same bucket and go with the "Which is nicer"
You could then use an English Soccer system of promotion and demotion to move the top few "Sucks" into the Meh/OK region, in order to refine the edge cases.
Ranking 1-10 won't work, everyone has different levels. Someone who always gives 3-7 ratings would have his rankings eclipsed by people who always give 1 or 10.
a-or-b is more workable.
Wow, I'm late in the game.
I like the ELO system very much so, but like Owen says it seems to me that you'd be slow building up any significant results.
I believe humans have much greater capacity than just comparing two images, but you want to keep interactions to the bare minimum.
So how about you show n images (n being any number you can visibly display on a screen, this may be 10, 20, 30 depending on user's preference maybe) and get them to pick which they think is best in that lot. Now back to ELO. You need to modify you ratings system, but keep the same spirit. You have in fact compared one image to n-1 others. So you do your ELO rating n-1 times, but you should divide the change of rating by n-1 to match (so that results with different values of n are coherent with one another).
You're done. You've now got the best of all worlds. A simple rating system working with many images in one click.
If you prefer using the Pick A or B strategy I would recommend this paper: http://research.microsoft.com/en-us/um/people/horvitz/crowd_pairwise.pdf
Chen, X., Bennett, P. N., Collins-Thompson, K., & Horvitz, E. (2013,
February). Pairwise ranking aggregation in a crowdsourced setting. In
Proceedings of the sixth ACM international conference on Web search
and data mining (pp. 193-202). ACM.
The paper tells about the Crowd-BT model which extends the famous Bradley-Terry pairwise comparison model into crowdsource setting. It also gives an adaptive learning algorithm to enhance the time and space efficiency of the model. You can find a Matlab implementation of the algorithm on Github (but I'm not sure if it works).
The defunct web site whatsbetter.com used an Elo style method. You can read about the method in their FAQ on the Internet Archive.
Pick A-or-B its the simplest and less prone to bias, however at each human interaction it gives you substantially less information. I think because of the bias reduction, Pick is superior and in the limit it provides you with the same information.
A very simple scoring scheme is to have a count for each picture. When someone gives a positive comparison increment the count, when someone gives a negative comparison, decrement the count.
Sorting a 1-million integer list is very quick and will take less than a second on a modern computer.
That said, the problem is rather ill-posed - It will take you 50 days to show each image only once.
I bet though you are more interested in the most highly ranked images? So, you probably want to bias your image retrieval by predicted rank - so you are more likely to show images that have already achieved a few positive comparisons. This way you will more quickly just start showing 'interesting' images.
I like the quick-sort option but I'd make a few tweeks:
Keep the "comparison" results in a DB and then average them.
Get more than one comparison per view by giving the user 4-6 images and having them sort them.
Select what images to display by running qsort and recording and trimming anything that you don't have enough data on. Then when you have enough items recorded, spit out a page.
The other fun option would be to use the crowd to teach a neural-net.

Resources