Search Engines Inexact Counting (about xxx results) - algorithm

When you search in Google (I'm almost sure AltaVista did the same thing), it says "Results 1-10 of about xxxx"...
This has always amazed me... What does "about" mean?
How can they count roughly?
I do understand why they can't come up with a precise figure in a reasonable time, but how do they even reach this "approximate" one?
I'm sure there's a lot of theory behind this one that I missed...

Most likely it's similar to the estimated row counts most SQL systems use in query planning: the number of rows in the table (known exactly as of the last time statistics were collected, but generally not up to date), multiplied by an estimated selectivity (usually derived from a statistical distribution model built by sampling a small subset of rows).
The PostgreSQL manual has a section on statistics used by the planner that is fairly informative, at least if you follow the links out to pg_stats and various other sections. I'm sure that doesn't really describe what Google does, but it at least shows one model where you could get the first N rows and an estimate of how many more there might be.
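As a toy illustration of that idea (not PostgreSQL's actual implementation; all names here are made up), statistics collected once can be reused to estimate later counts without a full scan:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Stats:
    row_count: int               # table size as of the last stats collection
    sample_frequencies: Counter  # sampled value distribution of one column

def collect_stats(rows, sample_every=100):
    # The "ANALYZE" step: runs occasionally, so the stats can go stale.
    sample = rows[::sample_every]
    return Stats(len(rows), Counter(sample))

def estimate_rows(stats, value):
    # Planner-style estimate: total rows x estimated selectivity.
    sample_size = sum(stats.sample_frequencies.values())
    selectivity = stats.sample_frequencies[value] / sample_size
    return int(stats.row_count * selectivity)

rows = [i % 7 for i in range(1_000_000)]  # a column with 7 distinct values
stats = collect_stats(rows)
print(estimate_rows(stats, 3))            # roughly 142,857, without a full scan
```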

Not relevant to your question, but it reminds me of a little joke a friend of mine made when doing a simple ego-search (and don't tell me you've never Googled your name). He said something like
"Wow, about 5,000 results in just 0.22 seconds! Now, imagine how many results this is in one minute, one hour, one day!"

I imagine the estimate is based on statistics. They aren't going to count all of the relevant page matches, so what they presumably do (and what I would do) is work out roughly what percentage of pages would match the query, based on some heuristic, and then use that as the basis for the count.
One heuristic might be a sample count: take a random sample of 1,000 or so pages and see what percentage match. It wouldn't take many pages in the sample to get a statistically significant answer.
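A minimal sketch of that sampling heuristic, including the rough 95% confidence interval that tells you when the sample is big enough (the corpus and matcher here are invented for illustration):

```python
import math
import random

def sampled_count(corpus, matches, sample_size=1000):
    # Estimate the total number of matching pages from a random sample.
    sample = random.sample(corpus, sample_size)
    p = sum(1 for page in sample if matches(page)) / sample_size
    stderr = math.sqrt(p * (1 - p) / sample_size)  # std. error of a proportion
    n = len(corpus)
    return int(p * n), (int((p - 1.96 * stderr) * n), int((p + 1.96 * stderr) * n))

corpus = [f"page about {'cats' if i % 10 == 0 else 'dogs'}" for i in range(100_000)]
estimate, interval = sampled_count(corpus, lambda page: "cats" in page)
print(f"about {estimate} results, 95% CI {interval}")  # about 10,000 results
```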

One thing that hasn't been mentioned yet is deduplication. Some search engines (I'm not sure exactly how Google in particular does it) will use heuristics to try and decide if two different URLs contain the same (or extremely similar) content, and are thus duplicate results.
If there are 156 distinct URLs, but 9 of those have been marked as duplicating the content of other results, it is simpler to say "about 150 results" than something like "156 results, of which 147 are unique and 9 are duplicates".

Returning an exact number of results is not worth the overhead of calculating it accurately. Since there's little value added by knowing there were 1,004,345 results rather than "about 1,000,000", it's more important for the user experience to return the results fast than to spend extra time computing an exact total.
From Google themselves:
"Google's calculation of the total number of search results is an estimate. We understand that a ballpark figure is valuable, and by providing an estimate rather than an exact account, we can return quality search results faster."

How does a search engine rank millions of pages within 1 second?

I understand the basics of search engine ranking, including the ideas of "inverted index", "vector space model", "cosine similarity", "PageRank", etc.
However, when a user submits a popular query term, it is very likely that millions of pages contain this term. As a result, the search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" in Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!
How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?
Thanks!
UPDATE:
Most of the responses (including some older discussions) so far seem to attribute the credit to the "inverted index". However, as far as I know, the inverted index only helps find the "relevant pages". In other words, the inverted index lets Google obtain the 900M pages containing "Barack Obama" (out of several billion pages). However, it is still not clear how to "rank" these millions of "relevant pages", based on the threads I have read so far.
The MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow for this use case.
The question would be really relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.
Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided they were included later.
This gives the developers a degree of latitude entirely unavailable in almost all other domains.
The real question to ask is - how precisely do the results match the actual rank assigned to each page?
There are two major factors that influence the time it takes for you to get a response from your search engine.
The first is if you're storing your index on hard disk. If you're using a database, it's very likely that you're using the hard disk at least a little. From a cold boot, your queries will be slow until the data necessary for those queries has been pulled into the database cache.
The other is having a cache for your popular queries. It takes a lot longer to search for a query than it does to return results from a cache. Now, the random access time for a disk is too slow, so they need to have it stored in RAM.
To solve both of these problems, Google uses memcached. It's an application that caches the output of the Google search engine and feeds slightly old results to users. This is fine because most of the time the web doesn't change fast enough for it to be a problem, and because of the significant overlap in searches. You can be almost guaranteed that Barack Obama has been searched for recently.
Another issue that affects search engine latency is network overhead.
Google has been using a custom variant of Linux (IIRC) that has been optimised for use as a web server. They've managed to reduce some of the time it takes to start turning around results for a query.
The moment a query hits their servers, the server immediately responds back to the user with the header for the HTTP response, even before Google has finished processing the query terms.
I'm sure they have a bunch of other tricks up their sleeves, too.
EDIT:
They also keep their inverted lists sorted already, from the indexing process (it's better to process once than for each query).
With these pre-sorted lists, the most expensive operation is list intersection. That said, I'm fairly sure Google doesn't rely on a vector space model, so list intersection isn't as much of a factor for them.
The models that pay off best according to the literature are the probabilistic models. As an example, you may wish to look up Okapi BM25. It does fairly well in practice in my area of research (XML retrieval). When working with probabilistic models, it tends to be much more efficient to process a document at a time instead of a term at a time. This means that instead of getting a list of all the documents that contain a term, we look at each document and rank it based on the query terms it contains (skipping documents that contain none of them).
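For the curious, here is a hedged document-at-a-time sketch using the standard published Okapi BM25 formula over a toy in-memory corpus (not anyone's production code):

```python
import math
from collections import Counter

def bm25_rank(docs, query, k1=1.2, b=0.75):
    tokenized = [doc.lower().split() for doc in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter(t for d in tokenized for t in set(d))  # document frequencies
    terms = query.lower().split()
    scores = {}
    for i, d in enumerate(tokenized):
        tf = Counter(d)
        if not any(t in tf for t in terms):
            continue  # document-at-a-time: skip docs with no query terms
        score = 0.0
        for t in terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores[i] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = ["barack obama biography", "obama speech transcript", "the quick brown fox"]
print(bm25_rank(docs, "barack obama"))  # doc 0 should come first
```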
But if we want to be smart, we can approach the problem in a different way (but only when it appears to be better). If a query term is extremely rare, we can rank by it first, because it has the highest impact. Then we rank by the next best term, and continue until we've determined whether this document is likely to be within our top k results.
One possible strategy is to rank only the top k instead of the entire list.
For example, to find the top 100 results from 1 million hits with a bounded heap, the time complexity is O(n log k). Since k = 100 while n = 1,000,000, the log k factor is a small constant in practice.
So you effectively need only a linear pass over the hits to obtain the top 100 results out of 1 million.
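In Python, for instance, a bounded-heap top-k over a million scored hits is a single streaming pass (a sketch with random scores standing in for real relevance scores):

```python
import heapq
import random

hits = [(random.random(), f"doc{i}") for i in range(1_000_000)]  # (score, doc id)

# heapq.nlargest keeps a heap of only k items while streaming over all n hits,
# so the cost is O(n log k) rather than the O(n log n) of a full sort.
top100 = heapq.nlargest(100, hits)
print(top100[:3])
```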
Also, I guess the use of NoSQL databases instead of RDBMSes helps.
NoSQL databases scale horizontally better and don't create the same bottlenecks. Big players like Google, Facebook, and Twitter use them.
As other comments/answers suggested, the data might already be sorted, and they return offsets into it instead of the whole batch.
The real question is not how they sort that many results that quickly, but how they do it when tens or hundreds of millions of people around the world are querying Google at the same time xD
As Xiao said, just rank the top-k instead of the entire list.
Google tells you there are 937,000,000 results, but it won't show them all to you. If you keep scrolling page after page, after a while it will truncate the results :)
Here you go, I looked it up for you and this is what I found! http://computer.howstuffworks.com/internet/basics/search-engine.htm
This is my theory... It's highly improbable that you are the first person to search for a given keyword. So for every keyword (or combination) searched on a search engine, it maintains a hash of links to relevant web pages. Every time you click a link in the search results, it gets a vote-up in the hash set for that keyword combination. Unfortunately, if you are the first person, it saves your search keyword (for suggesting future searches) and starts hashing that keyword, so you end up with few or no results at all.
The page ranking, as you may know, depends on many other factors too, like backlinks, the number of pages referring to a keyword in searches, etc.
Regarding your update:
The MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow for this use case.
MapReduce is not just designed for batch tasks. There are quite a lot of MapReduce-style frameworks supporting real-time computing: Apache Spark, Storm, Infinispan Distributed Executor, Hazelcast Distributed Executor Service.
Back to your question: MapReduce is the key to distributing the query task across multiple nodes and then merging the results together.
There's no way you can expect an accurate answer to this question here ;) Anyway, here are a couple of things to consider. Google uses a unique infrastructure in every part of it. We cannot even guess the order of complexity of their network equipment or their database storage. That is all I know about the hardware component of this problem.
Now, for the software implementation: as the name says, PageRank is a rank by itself. It doesn't rank the pages when you enter the search query. I assume it ranks them on a totally independent part of the infrastructure, say every hour. And we already know that Google's crawler bots roam the Web 24/7, so I assume that new pages are added into an "unsorted" hash map and then ranked on the next run of the algorithm.
Next, when you type your query, thousands of CPUs independently scan thousands of different parts of the PageRank database with a gapping factor. For example, if the gapping factor is 10, one machine queries the part of the database with PageRank values from 0-9.99, another queries the part from 10-19.99, etc. Since resources aren't an obstacle for Google, they can set the gapping factor very low (for example 1) so that each machine queries fewer than 100k pages, which isn't too much for their hardware.
Then, when they need to compile the results of your query, since they know which machine ranks exactly which part of the database, they can use the "fill the pool" principle. Let n be the number of links on each Google results page. The algorithm that combines the pages returned from queries run on all those machines against all the different parts of the database only needs to fill the first n results. So they take the results from the machine querying the highest-ranked part of the database; if that yields at least n results they're done, if not they move on to the next machine. This takes only O(n*g/r), where n is the number of results served per page, g is the gapping factor, and r is the highest value of PageRank. This assumption is supported by the fact that when you turn to the second page, your query is run once again (notice the different time taken to generate it).
This is just my two cents, but I think I'm pretty accurate with this hypothesis.
EDIT: You might want to check this out for complexity of high-order queries.
I don't know what Google really does, but surely they use approximation. For example, if the search query is "search engine", then the estimated number of results might be (number of documents containing the word "search") + (number of documents containing the word "engine"). With document frequencies precomputed at indexing time, this can be done in O(1). For details, read about the basic structure of Google: http://infolab.stanford.edu/~backrub/google.html.
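A sketch of that heuristic; the frequencies here are invented, and real engines surely do something more refined, but it shows why the count is O(1):

```python
# Document frequencies are precomputed at indexing time, so estimating the
# result count is just a few dictionary lookups (hypothetical numbers).
document_frequency = {"search": 1_200_000, "engine": 400_000}

def estimated_result_count(query):
    return sum(document_frequency.get(term, 0) for term in query.lower().split())

print(estimated_result_count("search engine"))  # about 1,600,000
```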

Where do mathematical algorithms for Reddit's ranking, as an example, come from?

Recently I was looking at Reddit's algorithm for determining what makes a post a "hot" topic and which content is suitable for the Reddit homepage.
the article I was reading is here:
http://amix.dk/blog/post/19588
I've noticed they use mathematical logarithms and have created some kind of mathematical function to determine the hotness/relevance of a post.
In the formulas used, where do each of the mathematical components come from and how do they know to use them?
thank you!
-- Bakz
EDIT: just to clarify, I just graduated high school and apologize if the answer to this question seems pretty obvious. thanks again!
I'll tackle the first formula, for "hotness" of posts. Formulas like this come from requirements. The designers of Reddit have thought about what they want to achieve, and designed formulas accordingly. I can't tell you exactly what requirements they had in mind, but I can look at the implementation and guess that they wanted a system along these lines:
Scores shouldn't need to be recomputed unless the number of votes change. This reduces the number of changes to the database, and makes it easier to achieve consistency if data is replicated. (So any scoring system based on scores getting lower as the article ages will be no good).
If two stories are equally old, the one with more upvotes should be higher. (So there needs to be a contribution from the votes.)
The more upvotes a story gets, the longer it should remain near the top of the ranking.
Old stories shouldn't stay at the top of the rankings for ever, even if they had lots of upvotes. Fairly soon (after a day or two), new stories need to outrank them. (So there needs to be a contribution from the date, and this must outweigh the score due to votes fairly soon, no matter how many votes something gets.)
Stories with more downvotes than upvotes should not appear in the rankings at all.
Now let's look at the formula, log z + yt/45000 (where z = max(|upvotes - downvotes|, 1), y is the sign of upvotes - downvotes, and t is the submission time measured in seconds since Reddit's epoch, 8 December 2005), and see how it satisfies these requirements.
If the number of votes does not change, then z, y and t are all unchanged. So the score is unchanged. This satisfies requirement (1).
If two stories have the same age, then they have the same value for t. But the one with more upvotes has a higher value of z, and since log is monotonic, it has a higher score. This satisfies requirement (2).
The more upvotes a story has, the higher its z, so the longer it will be until another story with higher t can outrank it. This satisfies requirement (3).
The logarithm is a function that grows ever more slowly as its argument gets larger (take a look at its graph). So a story needs more and more upvotes over time to keep up with newer stories. This satisfies requirement (4).
If the story has more downvotes than upvotes, then z = 1 and y = −1 so the score is negative. This satisfies requirement (5).
The constant 45,000 is a scale factor that brings the upvotes and the age into balance. There are 86,400 seconds in a day, so t grows by that much each day. Dividing 86,400 by 45,000 gives 1.92, which means that one day's relative newness is worth 10^1.92 ≈ 83 votes, and two days' relative newness roughly 10^3.84 ≈ 7,000 votes.
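A small Python rendering of the formula as described above (the epoch constant is the one usually cited for Reddit; the real codebase may differ):

```python
from datetime import datetime, timedelta, timezone
from math import log10

EPOCH = datetime(2005, 12, 8, 7, 46, 43, tzinfo=timezone.utc)  # commonly cited Reddit epoch

def hot(upvotes, downvotes, posted):
    s = upvotes - downvotes
    z = max(abs(s), 1)                        # magnitude of the net score
    y = 1 if s > 0 else (-1 if s < 0 else 0)  # sign of the net score
    t = (posted - EPOCH).total_seconds()      # newer posts get a larger t
    return log10(z) + y * t / 45000

# Matching the arithmetic above: a day-old post with 83 net upvotes scores
# about the same as a fresh post with 1.
now = datetime.now(timezone.utc)
print(round(hot(83, 0, now - timedelta(days=1)), 2), round(hot(1, 0, now), 2))
```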
They don't come from anywhere. There is no absolute truth to them, nor anything to prove. It's simply a way to quantify an attribute in the most sensible way the development team could find.
You would use log when you want something to be a factor, but a less important one (large values still grow, just very slowly). By the same token, they could have chosen the cube root.
The formulae are simply a representation of the factors we can presume characterize something "hot", composed so that each is taken into account in an appropriate proportion (for example, squaring values that have huge importance, and taking the log of those that matter less).
Once they came up with the formula, they probably took 10 or 15 different types of posts, plugged the numbers in, and saw that the results made sense all round, so they stuck with it. In fact, their first few attempts probably didn't come out so well, and after a little fiddling with the numbers they arrived at that formula.

How to check user choice algorithm

I have an algorithm that chooses a list of items that should fit the user's likings.
I'll skip the algorithm's details because of confidentiality issues...
Now, I'm trying to think of a way to check it statistically, with a group of people.
The way I'm checking it now is:
The algorithm picks the best results per user.
Shuffle the top 5 results with the lowest 5 results.
Have the person rank the results in order of preference (0 = liked best, 9 = liked least).
Compare the user's ordering to the algorithm's results.
I'm doing this because I figured that to show the algorithm chooses good results, I need to mix in some bad results and show that the algorithm identifies them as bad as well.
So, what I'm asking is:
Is shuffling top results with low results a good idea?
And if not, do you have an idea of how to get good statistics on how well an algorithm matches user preferences (we have users that can choose stuff)?
First ask yourself:
What am I trying to measure?
Not to rag on the other submissions here, but while mjv's and Sjoerd's answers offer some plausible heuristic reasons why what you are trying to do may not work as you expect, they are not constructive in the sense that they do not explain why your experiment is flawed or what you can do to improve it. Before either of these issues can be addressed, you need to define what you hope to measure, and only then should you go about devising an experiment.
Now, I can't say for certain what would constitute a good metric for your purposes, but I can offer you some suggestions. As a starting point, you could try using a precision vs. recall graph:
http://en.wikipedia.org/wiki/Precision_and_recall
This is a standard technique for assessing the performance of ranking and classification algorithms in machine learning and information retrieval (ie web searching). If you have an engineering background, it could be helpful to understand that precision/recall generalizes the notion of precision/accuracy:
http://en.wikipedia.org/wiki/Accuracy_and_precision
Now let us suppose that your algorithm does something like this: it takes as input some prior data about a user, then returns a ranked list of other items that user might like. For example, your algorithm is a web search engine and the items are pages, or you have a movie recommender and the items are movies. This sounds pretty close to what you are trying to do now, so let us continue with this analogy.
Then the precision of your algorithm at n is the fraction of your top n recommendations that the user actually liked:
precision = #(items user actually liked out of top n) / n
And the recall is the number of items that you correctly identified out of the total number of items the user actually likes:
recall = #(items correctly marked as liked) / #(items user actually likes)
Ideally, one would want to maximize both of these quantities, but they are in a certain sense competing objectives. To illustrate this, consider a few extremal situations: For example, you could have a recommender that returns everything, which would have perfect recall, but very low precision. A second possibility is to have a recommender that returns nothing or only one sure-fire hit, which would have (in a limiting sense) perfect precision, but almost no recall.
As a result, to understand the performance of a ranking algorithm, people typically look at its precision vs. recall graph. This is just a plot of precision against recall as the number of items returned is varied:
Image taken from the following tutorial (which is worth reading):
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
Now, to approximate a precision vs. recall curve for your algorithm, here is what you can do. First, return a large set of, say, n results as ranked by your algorithm. Next, have the user mark which items they actually liked out of those n results. This trivially gives us enough information to compute the precision at every cutoff k ≤ n (since we know the counts). We can also compute the recall (restricted to this set of documents) by taking the total number of items the user liked in the entire set. Thus, we can plot a precision-recall curve for this data. There are fancier statistical techniques for estimating this with less work, but I have already written enough; for more information, please check out the links in the body of my answer.
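As a toy version of that procedure, suppose the user marked which of the top n results they liked; the precision and recall at every cutoff then fall out directly (a sketch, with recall restricted to this result set as described):

```python
def precision_recall_points(liked_flags):
    # liked_flags[i] is True if the user liked the i-th ranked result.
    total_liked = sum(liked_flags)
    points, hits = [], 0
    for k, liked in enumerate(liked_flags, start=1):
        hits += liked
        points.append((hits / k, hits / total_liked))  # (precision@k, recall@k)
    return points

feedback = [True, True, False, True, False, False, True, False, False, False]
for p, r in precision_recall_points(feedback):
    print(f"precision={p:.2f}  recall={r:.2f}")
```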
Your method is biased. If you use the top 5 and bottom 5 results, it is very likely that the user will order them according to your algorithm. Let's say we have an algorithm which rates music, and I present the top 1 and bottom 1 to the user:
Queen
The Cheeky Girls
Of course the user will mark it exactly like your algorithm, because the difference between the top and bottom is so big. You need to make the user rate randomly selected items.
Independently of the question of mixing top and bottom guesses, an implicit drawback of the experimental process, as described, is that the data related to the user's choice can only be exploited in the context of one particular version of the algorithm:
When / if the algorithm or its parameters are ever slightly tuned, the record of past user's choices cannot be reused to validate the changes to the algorithm.
On mixing high and low results:
The main drawback of producing sets of items by mixing the algorithm's top and bottom guesses is that it may further complicate the choice of the error/distance function used to measure how well the algorithm performed. Unless the two subsets of items (topmost choices, bottommost choices) are kept separate for the purpose of computing distinct measurements, typical statistical measures of the error (say RMSE) will not be a good measurement of the algorithm's effective quality.
For example, an algorithm which frequently suggests low-guess items that end up being picked as top choices by the user may have the same average error rate as an algorithm which never confuses highs with lows, but where the user tends to reorder the items within each subset.
A second drawback is that the evaluation method may merely measure the algorithm's ability to sort the relative like/dislike of users for the items it [the algorithm] chooses, rather than its ability to produce the user's actual top choices.
In other words, the user's actual top choices may never be offered to them; so yes, the algorithm does a good job of guessing that the user will like, say, rock and roll before rap, while never guessing that the user in fact prefers Baroque classical music over everything else.

algorithms to evaluate user responses

I'm working on a web application which will be used for classifying photos of automobiles. The users will be presented with photos of various vehicles, and will be asked to answer a series of questions about what they see. The results will be recorded to a database, averaged, and displayed.
I'm looking for algorithms to help me identify users which frequently don't vote with the group, indicating that they're probably either not paying attention to the photos, or that they're lying about what they see. I then want to exclude these users, and recalculate the results, such that I can say, with a known amount of confidence, that this particular photo shows a vehicle that is this and that.
This question goes out to all you computer science folks: where can I find such algorithms, or what theoretical background do I need to design them? I'm assuming I'm going to have to learn some probability and statistics, maybe some data mining. Some book recommendations would be great. Thanks!
P.S. These are multiple choice questions.
All of these are good suggestions, thank you! I wish there were a way on Stack Overflow to select multiple correct answers so more of you could be acknowledged for your contributions!
Read The Elements of Statistical Learning, it is a great compendium on data mining.
You may be especially interested in unsupervised algorithms, for example clustering. Assuming that most people do not lie, the biggest cluster is right and the rest is wrong. Mark people accordingly, then apply some Bayesian statistics and you'll be done.
Of course, most data mining techniques are pretty experimental, so don't count on them always being right... or even being right in most cases.
I believe what you described is solved using outlier/anomaly detection.
A number of techniques exist:
statistical-based methods
distance-based methods
model-based methods
I suggest you take a look at these slides from the excellent book Introduction to Data Mining
If you know what answers you are expecting, why do you ask people to vote? By excluding some values you basically turn the vote into something that you like. Automobiles make different impressions on different individuals. If 100 people loved a car, then when someone comes along and says that he or she doesn't like it, you exclude the vote?
But anyway, considering that you still want to do this: first of all you will need a large set of data from "trusted" voters. This will give you an idea of a "good" answer, and from that point you can choose the exclusion threshold.
Without an initial set of data you cannot apply any algorithm, because you will get false results. Consider a first vote of 100 on a scale from 0 to 100. If the second vote is 1, you would exclude it because it is too far from the average.
I think a pretty simple algorithm could accomplish this for you. You could try and get fancier by calculating the standard deviations and such, but I wouldn't bother.
Here's a simple approach that should be sufficient:
For each of your users, calculate the number of questions they answered and the number of times they selected the most popular answer for the question. You can guess that the users with the lowest ratio of popular answers to total answers are providing bogus data.
You probably would not want to throw out the data from users where they've only answered a small number of questions because they likely have just disagreed on a few versus putting in bogus data.
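A minimal sketch of that approach (the data layout and the minimum-answers cutoff are assumptions for illustration):

```python
from collections import Counter

def agreement_ratios(votes, min_answered=10):
    # votes maps question -> {user: answer}. For each user, compute the
    # fraction of their answers matching the most popular answer, skipping
    # users who answered too few questions to judge.
    agree, total = Counter(), Counter()
    for answers in votes.values():
        popular, _ = Counter(answers.values()).most_common(1)[0]
        for user, answer in answers.items():
            total[user] += 1
            agree[user] += answer == popular
    return {u: agree[u] / total[u] for u in total if total[u] >= min_answered}

votes = {
    "photo1": {"alice": "sedan", "bob": "sedan", "carol": "truck"},
    "photo2": {"alice": "red", "bob": "red", "carol": "blue"},
}
print(agreement_ratios(votes, min_answered=2))  # carol's ratio is 0.0
```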
What kind of questions are they (Yes/No, or 1 to 10?).
You may be able to get away with not discarding anything by using a median instead of a mean. With means, extreme outliers in the responses can skew the result, but the median gives you a better answer. So, for example, if you had 5 answers, order them and pick the middle one.
I think what you are saying is that you are concerned that certain people are "outliers", and they are adding noise to your data, making the categorizations less reliable. So, if you have a Chevy Camaro, and most people say it is either a pony car, a muscle car, or a sports car, but you have some goofball who says it's a family sedan, you would want to minimize the impact of his vote.
One thing you could do is provide a Stack Overflow-like reputation score for users:
The more a user is "in agreement" with other users, the better his or her score would be. For a given user (User X), this could be determined by a simple calculation of what percentage of users who responded to a question chose the same category as User X, then averaging this value over all questions answered.
You may want to multiply this value by the total number of questions answered to encourage people to answer as many questions as possible. (Note: if you choose to do this, it would be equivalent to just summing the percentage-agreement scores rather than averaging them.)
You could present the final reputation score to users, making sure to explain that they will be rewarded for how well their responses agree with those of other users. This will encourage people to answer more questions but also to take care in their answers.
Finally, you could calculate a certainty score for a given categorization by adding up the total reputation score of all people who chose a given category.
Some of these ideas may need some refinement, especially since I don't know your exact situation. Certainly, if people can see what other people chose before they vote, it would be way too easy to game the system.
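A rough sketch of such a reputation score under those rules (summing per-question agreement with the other respondents; the data layout is invented):

```python
from collections import Counter

def reputation_scores(votes):
    # votes maps photo -> {user: chosen category}. Each answer earns the
    # fraction of *other* respondents who picked the same category; summing
    # over questions rewards answering more, as noted above.
    rep = Counter()
    for answers in votes.values():
        tally = Counter(answers.values())
        n = len(answers)
        if n < 2:
            continue
        for user, choice in answers.items():
            rep[user] += (tally[choice] - 1) / (n - 1)
    return dict(rep)

votes = {
    "camaro": {"al": "muscle car", "bo": "muscle car", "cy": "family sedan"},
    "miata":  {"al": "sports car", "bo": "sports car", "cy": "sports car"},
}
print(reputation_scores(votes))  # cy loses ground on the camaro question
```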
If you were to collect votes like "on a scale from 1 to 10, how would you rate this car", you could probably use simple average and standard deviation: the smaller the standard deviation, the more unanimous the general consensus is among your voters, and you can flag users who are e.g. 3 standard devs from the average.
For multiple choice, you need to be more careful. Simply discarding all but the most-voted option will do nothing but disgruntle the voters. You need to establish a measure of how significant the winner is with respect to the other options, e.g. flag users who voted for options with less than 1/3 of the winning option's vote count.
Note that I wrote "flag users", not discard votes. If you discard votes, you can't tell how confident you are about the result ("91% voted this to be a Ford Mustang"). If a user has more than a certain percentage of his votes flagged - well, that's up to you.
Your trickiest problem, however, will probably be to collect sufficient votes. Depending on how easy the multiple choice problem is, you probably need several times the number of options as votes, per photo. Otherwise the statistics are meaningless.
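A sketch of the flagging idea for the 1-to-10 case (the thresholds would need tuning; small samples need a looser cutoff than 3 standard deviations):

```python
import statistics

def flag_outlier_raters(ratings, threshold=3.0):
    # ratings maps user -> score for one photo; flag users whose score sits
    # more than `threshold` standard deviations from the mean.
    scores = list(ratings.values())
    mean = statistics.mean(scores)
    stdev = statistics.pstdev(scores)
    if stdev == 0:
        return []
    return [u for u, s in ratings.items() if abs(s - mean) / stdev > threshold]

ratings = {"a": 7, "b": 8, "c": 7, "d": 6, "e": 7, "f": 8, "g": 1}
print(flag_outlier_raters(ratings, threshold=2.0))  # ['g']
```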

How to rank a million images with a crowdsourced sort

I'd like to rank a collection of landscape images by making a game whereby site visitors can rate them, in order to find out which images people find the most appealing.
What would be a good method of doing that?
Hot-or-Not style? I.e. show a single image, ask the user to rank it from 1-10. As I see it, this allows me to average the scores, and I would just need to ensure that I get an even distribution of votes across all the images. Fairly simple to implement.
Pick A-or-B? I.e. show two images, ask user to pick the better one. This is appealing as there is no numerical ranking, it's just a comparison. But how would I implement it? My first thought was to do it as a quicksort, with the comparison operations being provided by humans, and once completed, simply repeat the sort ad-infinitum.
How would you do it?
If you need numbers, I'm talking about one million images, on a site with 20,000 daily visits. I'd imagine a small proportion might play the game; for the sake of argument, let's say I can generate 2,000 human sort operations a day! It's a non-profit website, and the terminally curious will find it through my profile :)
As others have said, ranking 1-10 does not work that well because people have different levels.
The problem with the Pick A-or-B method is that the system is not guaranteed to be transitive (A can beat B, B can beat C, and yet C can beat A). Having non-transitive comparison operators breaks sorting algorithms. With quicksort, in this example, the letters not chosen as the pivot would end up incorrectly ranked against each other.
At any given time, you want an absolute ranking of all the pictures (even if some/all of them are tied). You also want your ranking not to change unless someone votes.
I would use the Pick A-or-B (or tie) method, but determine ranking similar to the Elo ratings system which is used for rankings in 2 player games (originally chess):
"The Elo player-rating system compares players' match records against their opponents' match records and determines the probability of the player winning the matchup. This probability factor determines how many points a player's rating goes up or down based on the results of each match. When a player defeats an opponent with a higher rating, the player's rating goes up more than if he or she defeated a player with a lower rating (since players should defeat opponents who have lower ratings)."
The Elo System:
All new players start out with a base rating of 1600
WinProbability = 1/(10^(( Opponent’s Current Rating–Player’s Current Rating)/400) + 1)
ScoringPt = 1 point if they win the match, 0 if they lose, and 0.5 for a draw.
Player’s New Rating = Player’s Old Rating + (K-Value * (ScoringPt–Player’s Win Probability))
Replace "players" with pictures and you have a simple way of adjusting both pictures' rating based on a formula. You can then perform a ranking using those numeric scores. (K-Value here is the "Level" of the tournament. It's 8-16 for small local tournaments and 24-32 for larger invitationals/regionals. You can just use a constant like 20).
With this method, you only need to keep one number for each picture which is a lot less memory intensive than keeping the individual ranks of each picture to each other picture.
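In code, one matchup updates both pictures' ratings in a couple of lines; a minimal sketch using the formulas above, with K fixed at a constant 20 as suggested:

```python
def elo_update(r_winner, r_loser, k=20):
    # Expected score of the winner, from the rating gap (the formula above).
    expected = 1 / (10 ** ((r_loser - r_winner) / 400) + 1)
    return r_winner + k * (1 - expected), r_loser - k * (1 - expected)

# Two pictures start at the base rating of 1600; the first wins the matchup.
print(elo_update(1600, 1600))  # (1610.0, 1590.0)
```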
EDIT: Added a little more meat based on comments.
Most naive approaches to the problem have some serious issues. The worst is how bash.org and qdb.us display quotes: users can vote a quote up (+1) or down (-1), and the list of best quotes is sorted by the total net score. This suffers from a horrible time bias: older quotes have accumulated huge numbers of positive votes via simple longevity, even if they're only marginally humorous. This algorithm might make sense if jokes got funnier as they got older, but - trust me - they don't.
There are various attempts to fix this - looking at the number of positive votes per time period, weighting more recent votes, implementing a decay system for older votes, calculating the ratio of positive to negative votes, etc. Most suffer from other flaws.
The best solution - I think - is the one used by the websites The Funniest, The Cutest, The Fairest, and Best Thing: a modified Condorcet voting system:
The system gives each one a number based on, out of the things that it has faced, what percentage of them it usually beats. So each one gets the percentage score NumberOfThingsIBeat / (NumberOfThingsIBeat + NumberOfThingsThatBeatMe). Also, things are barred from the top list until they've been compared to a reasonable percentage of the set.
If there's a Condorcet winner in the set, this method will find it. Since that's unlikely, given the statistical nature, it finds the one that's the "closest" to being a Condorcet winner.
For more information on implementing such systems the Wikipedia page on Ranked Pairs should be helpful.
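A toy version of that scoring rule, with the "reasonable percentage of the set" bar reduced to a simple minimum-matchups cutoff (an assumption for illustration):

```python
from collections import Counter

def beat_percentages(matchups, min_matchups=3):
    # matchups is a list of (winner, loser) pairs; score each item by the
    # share of its matchups that it won, withholding barely-compared items.
    wins, played = Counter(), Counter()
    for winner, loser in matchups:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
    return {item: wins[item] / played[item]
            for item in played if played[item] >= min_matchups}

matchups = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B"),
            ("A", "C"), ("B", "C"), ("A", "B"), ("A", "C"), ("B", "A")]
print(beat_percentages(matchups))  # A ~0.86, B ~0.43, C ~0.17
```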
The algorithm requires people to compare two objects (your Pick-A-or-B option), but frankly, that's a good thing. I believe it's very well accepted in decision theory that humans are vastly better at comparing two objects than they are at abstract ranking. Millions of years of evolution make us good at picking the best apple off the tree, but terrible at deciding how closely the apple we picked hews to the true Platonic Form of appleness. (This is, by the way, why the Analytic Hierarchy Process is so nifty...but that's getting a bit off topic.)
One final point to make is that SO uses an algorithm to find the best answers which is very similar to bash.org's algorithm to find the best quote. It works well here, but fails terribly there - in large part because an old, highly rated, but now outdated answer here is likely to be edited. bash.org doesn't allow editing, and it's not clear how you'd even go about editing decade-old jokes about now-dated internet memes even if you could... In any case, my point is that the right algorithm usually depends on the details of your problem. :-)
I know this question is quite old but I thought I'd contribute
I'd look at the TrueSkill system developed at Microsoft Research. It's like ELO but has a much faster convergence time (looks exponential compared to linear), so you get more out of each vote. It is, however, more complex mathematically.
http://en.wikipedia.org/wiki/TrueSkill
I don't like the Hot-or-Not style. Different people would pick different numbers even if they all liked the image exactly the same. Also I hate rating things out of 10, I never know which number to choose.
Pick A-or-B is much simpler and more fun. You get to see two images, and comparisons are made between the images on the site.
These equations from Wikipedia make it simpler/more effective to calculate Elo ratings; filling in the standard logistic Elo formulas, the algorithm for images A and B would be:
Get Ne, mA, mB and the current ratings RA, RB from your database.
Calculate KA and KB using the number of comparisons performed (Ne) and the number of times each image has been compared (mA, mB), e.g. KA = 800/(Ne + mA), so that heavily compared images move more slowly.
Calculate QA = 10^(RA/400) and QB = 10^(RB/400), then the expected scores EA = QA/(QA + QB) and EB = QB/(QA + QB).
Score the result S: the winner gets 1, the loser 0, and each gets 0.5 for a draw.
Calculate the new ratings for both using RA' = RA + KA*(SA - EA) and RB' = RB + KB*(SB - EB).
Update the new ratings RA, RB and counts mA, mB in the database.
You may want to go with a combination.
First phase:
Hot-or-Not style (although I would go with a 3-option vote: Sucks, Meh/OK, Cool!)
Once you've sorted the set into the 3 buckets, then I would select two images from the same bucket and go with the "Which is nicer"
You could then use an English football-style system of promotion and relegation to move the top few "Sucks" images into the Meh/OK region, in order to refine the edge cases.
Ranking 1-10 won't work, everyone has different levels. Someone who always gives 3-7 ratings would have his rankings eclipsed by people who always give 1 or 10.
a-or-b is more workable.
Wow, I'm late in the game.
I like the Elo system very much, but as Owen says, it seems to me that you'd be slow to build up any significant results.
I believe humans have much greater capacity than just comparing two images, but you want to keep interactions to the bare minimum.
So how about you show n images (n being any number you can visibly display on a screen; this may be 10, 20, or 30 depending on the user's preference) and get them to pick the one they think is best in that lot. Now back to Elo. You need to modify your rating system, but keep the same spirit. You have in fact compared one image to n-1 others. So you do your Elo rating n-1 times, but you should divide the change in rating by n-1 (so that results with different values of n are coherent with one another).
You're done. You've now got the best of all worlds. A simple rating system working with many images in one click.
If you prefer using the Pick A or B strategy I would recommend this paper: http://research.microsoft.com/en-us/um/people/horvitz/crowd_pairwise.pdf
Chen, X., Bennett, P. N., Collins-Thompson, K., & Horvitz, E. (2013, February). Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the sixth ACM international conference on Web Search and Data Mining (pp. 193-202). ACM.
The paper describes the Crowd-BT model, which extends the famous Bradley-Terry pairwise comparison model to a crowdsourced setting. It also gives an adaptive learning algorithm to improve the time and space efficiency of the model. You can find a MATLAB implementation of the algorithm on GitHub (but I'm not sure whether it works).
The defunct web site whatsbetter.com used an Elo style method. You can read about the method in their FAQ on the Internet Archive.
Pick A-or-B is the simplest and the least prone to bias; however, each human interaction gives you substantially less information. I think that because of the bias reduction, Pick A-or-B is superior, and in the limit it provides you with the same information.
A very simple scoring scheme is to have a count for each picture. When someone gives a positive comparison increment the count, when someone gives a negative comparison, decrement the count.
Sorting a 1-million integer list is very quick and will take less than a second on a modern computer.
That said, the problem is rather ill-posed: it will take you 50 days just to show each image once.
I bet, though, that you are more interested in the most highly ranked images? If so, you probably want to bias your image retrieval by predicted rank, so that you are more likely to show images that have already achieved a few positive comparisons. This way you will more quickly start showing the "interesting" images.
I like the quicksort option, but I'd make a few tweaks:
Keep the "comparison" results in a DB and then average them.
Get more than one comparison per view by giving the user 4-6 images and having them sort them.
Select what images to display by running qsort and recording and trimming anything that you don't have enough data on. Then when you have enough items recorded, spit out a page.
The other fun option would be to use the crowd to teach a neural-net.
