Finding the average of large list of numbers - algorithm

Came across this interview question.
Write an algorithm to find the mean (average) of a large list. The list could contain trillions or quadrillions of numbers. Each individual number is manageable, in the hundreds, thousands or millions.
Googling it gave me all Median of Medians solutions. How should I approach this problem?
Is divide and conquer enough to deal with trillions of numbers?
How do I deal with a list of such a large size?

If the size of the list is computable, it's really just a matter of how much memory you have available, how long it's supposed to take and how simple the algorithm is supposed to be.
Basically, you can just add everything up and divide by the size.
If you don't have enough memory, dividing each number by the list size first and summing those quotients might work (note that you will probably lose some precision that way).
Another approach would be to recursively split the list into 2 halves and calculating the mean of the sublists' means. Your recursion termination condition is a list size of 1, in which case the mean is simply the only element of the list. If you encounter a list of odd size, make either the first or second sublist longer, this is pretty much arbitrary and doesn't even have to be consistent.
If, however, your list is so gigantic that its size can't be computed, there's no way to split it into 2 sublists. In that case, the recursive approach works pretty much the other way around. Instead of splitting into 2 lists with n/2 elements, you split into n/2 lists with 2 elements (or rather, calculate their means immediately). So basically, you calculate the mean of elements 1 and 2, and that becomes your new element 1; the mean of 3 and 4 is your new second element, and so on. Then apply the same algorithm to the new list until only 1 element remains. If you encounter a list of odd size, either add an element at the end or ignore the last one. If you add one, you should try to get as close as possible to your expected mean.
While this won't calculate the mean mathematically exactly, for lists of that size, it will be sufficiently close. This is pretty much a mean of means approach. You could also go the median of medians route, in which case you select the median of sublists recursively. The same principles apply, but you will generally want to get an odd number.
You could even combine the approaches and calculate the mean if your list is of even size and the median if it's of odd size. Doing this over many recursion steps will generate a pretty accurate result.
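A minimal sketch of that pairwise "mean of means" reduction (my own illustration, written over an in-memory array; a genuinely huge input would be streamed in chunks instead):

class PairwiseMean {
    // Repeatedly replace adjacent pairs by their mean until one value remains.
    // Assumes a non-empty input; an odd-sized level pairs its last element with itself.
    static double pairwiseMean(double[] values) {
        double[] current = values.clone();
        int n = current.length;
        while (n > 1) {
            int half = (n + 1) / 2;
            double[] next = new double[half];
            for (int i = 0; i < half; i++) {
                int j = 2 * i;
                next[i] = (j + 1 < n) ? (current[j] + current[j + 1]) / 2.0
                                      : current[j];
            }
            current = next;
            n = half;
        }
        return current[0];
    }
}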

First of all, this is an interview question. The problem as stated would not arise in practice. Also, the question as stated here is imprecise. That is probably deliberate. (They want to see how you deal with solving an imprecisely specified problem.)
Write an algorithm to find the mean(average) of a large list.
The word "find" is rubbery. It could mean calculate (to some precision) or it could mean estimate.
The phrase "large list" is rubbery. If could mean a list or array data structure in memory, or the "list" could be the result of a database query, the contents of a file or files.
There is no mention of the hardware constraints on the system where this will be implemented.
So the first thing >>I<< would do would be to try to narrow the scope by asking some questions of the interviewer.
But assuming that you can't, then a complete answer would need to cover the following points:
The dataset probably won't fit in memory at the same time. (But if it does, then that is good.)
Calculating the average of N numbers is O(N) if you do it serially. For N this size, it could be an intractable problem.
An alternative is to split the list into sublists of equal size, calculate the average of each sublist, and then take the average of the averages. In theory, this gives you O(N/P) where P is the number of partitions. The parallelism could be implemented with multiple threads, with multiple processes on the same machine, or distributed.
In practice, the limiting factors are going to be computational, memory and/or I/O bandwidth. A parallel solution will be effective if you can address these limits. For example, you need to balance the problem of each "worker" having uncontended access to its "sublist" versus the problem of making copies of the data so that that can happen.
If the list is represented in a way that allows sampling, then you can estimate the average without looking at the entire dataset. In fact, this could be O(C) depending on how you sample. But there is a risk that your sample will be unrepresentative, and the average will be too inaccurate.
In all cases, when doing the calculations you need to guard against (integer) overflow and (floating-point) rounding errors, especially while accumulating the sums (see the sketch after these points).
It would be worthwhile discussing how you would solve this with a "big data" platform (e.g. Hadoop) and the limitations of that approach (e.g. time taken to load up the data ...)
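To make the overflow point concrete, here is a minimal sketch (my own, not part of the answer) of a running mean that never accumulates a huge raw sum, plus the merge step you would use to combine partial results from parallel workers:

import java.util.Iterator;

class StreamingMean {
    // Incremental mean: no giant running sum, so integer overflow is avoided.
    static double runningMean(Iterator<Long> numbers) {
        double mean = 0.0;
        long count = 0;
        while (numbers.hasNext()) {
            long x = numbers.next();
            count++;
            mean += (x - mean) / count;
        }
        return mean;
    }

    // Combine two partial results: mean1 over n1 items, mean2 over n2 items.
    static double mergeMeans(double mean1, long n1, double mean2, long n2) {
        return mean1 + (mean2 - mean1) * ((double) n2 / (n1 + n2));
    }
}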

Related

Algorithmic help needed (N bags and items distributed randomly)

I have encountered an algorithmic problem but am not able to figure out anything better than brute force or reduce it to a better-known problem. Any hints?
There are N bags of variable sizes and N types of items. Each type of items belongs to one bag. There are lots of items of each type and each item may be of a different size. Initially, these items are distributed across all the bags randomly. We have to place the items in their respective bags. However, we can only operate with a pair of bags at one time by exchanging items (as much as possible) and proceeding to the next pair. The aim is to reduce the total number of pairs. Edit: The aim is to find a sequence of transfers that minimizes the total number of bag pairs involved
Clarification:
The bags are not arbitrarily large (you can assume the bag and item sizes to be integers between 0 and 1000 if it helps). You'll frequently encounter scenarios where all the items between 2 bags cannot be swapped due to the limited capacity of one of the bags. This is where the algorithm needs to make an optimisation. Perhaps, if another pair of bags were swapped first, the current swap could be done in one go. To illustrate this, let's consider bags A, B and C and their items 1, 2, 3 respectively. The number in brackets is the size.
A(10) : 3(8)
B(10): 1(2), 1(3)
C(10): 1(4)
The swap orders can be AB, AC, AB or AC, AB. The latter is optimal as the number of swaps is smaller.
Since I cannot come up with an algorithm that will always find an optimal answer, and an approximation of the quality of the solution (the number of swaps) is also fine, I suggest a stochastic local search algorithm with pruning.
Given a random starting configuration, this algorithm considers all possible swaps and makes a weighted decision based on chance: the better a swap is, the more likely it is to be chosen.
The value of a swap would be the sum of the values of the individual item transfers, where a transfer is worth zero if the item does not end up in its own bag and is positive if it does end up there. The value increases as the item's size increases (the idea behind this is that a larger block is harder to move many times than smaller blocks). This fitness function can be replaced by any other fitness function; its effectiveness is unknown until shown empirically.
Since any configuration can be the consequence of many preceding swaps, we keep track of which configurations we have seen before, along with a fitness (based on how many items are in their correct bag - this fitness is not related to the value of a swap) and the list of preceding swaps. If the fitness function for a configuration is the number of items that are in their correct bags, then the total number of items in the problem is the highest fitness (and therefore marks a configuration as a solution).
A swap is not possible if:
Either of the affected bags would be holding more than its capacity after the potential swap.
The new swap brings you back to the last configuration you were in before the last swap you did (i.e. reversed swap).
When we identify potential swaps, we look into our list of previously seen configurations (use a hash function for O(1) lookup). Then we either set its list of preceding swaps to ours (if our list is shorter than its), or we set our list of preceding swaps to its list (if its list is shorter than ours). We can do this because it does not matter which swaps we did, as long as the number of swaps is as small as possible.
If there are no more possible swaps left in a configuration, it means you're stuck. Local search tells you to 'reset', which you can do in many ways, for instance:
Reset to a previously seen state (maybe the best one you've seen so far?)
Reset to a new valid random solution
Note
Since the algorithm only allows you to do valid swaps, all constraints will be met for each configuration.
The algorithm does not guarantee to 'stop' out of the box; you can implement a maximum number of iterations (swaps).
The algorithm does not guarantee to find a correct solution, as it only does its best to find a better configuration each iteration. However, since a perfect solution (set of swaps) should look close to an almost-perfect solution, a human might be able to finish what the local search algorithm could not after it ends in an incomplete configuration (where not every item is in its correct bag).
The fitness functions and strategies used here are very likely not the most efficient out there. You could look around to find better ones. A more efficient fitness function / strategy should result in a good solution faster (fewer iterations).
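To make the "weighted decision" concrete, here is a minimal roulette-wheel sketch (my own illustration, not part of the original answer): given the values of all currently possible swaps, it picks an index with probability proportional to the swap's value.

import java.util.Random;

class SwapPicker {
    // Roulette-wheel selection: the better a swap's value, the more likely it is chosen.
    // Values are assumed to be >= 0; the +1 keeps zero-value swaps selectable.
    static int pickWeighted(double[] swapValues, Random rng) {
        double total = 0;
        for (double v : swapValues) total += v + 1;
        double r = rng.nextDouble() * total;
        for (int i = 0; i < swapValues.length; i++) {
            r -= swapValues[i] + 1;
            if (r <= 0) return i;
        }
        return swapValues.length - 1;   // guard against floating-point drift
    }
}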

Sorting in Computer Science vs. sorting in the 'real' world

I was thinking about sorting algorithms in software, and possible ways one could surmount the O(nlogn) roadblock. I don't think it IS possible to sort faster in a practical sense, so please don't think that I do.
With that said, it seems with almost all sorting algorithms, the software must know the position of each element. Which makes sense, otherwise, how would it know where to place each element according to some sorting criteria?
But when I crossed this thinking with the real world, a centrifuge has no idea what position each molecule is in when it 'sorts' the molecules by density. In fact, it doesn't care about the position of each molecule. However it can sort trillions upon trillions of items in a relatively short period of time, due to the fact that each molecule follows density and gravitational laws - which got me thinking.
Would it be possible with some overhead on each node (some value or method tacked on to each of the nodes) to 'force' the order of the list? Something like a centrifuge, where only each element cares about its relative position in space (in relation to other nodes). Or, does this violate some rule in computation?
I think one of the big points brought up here is the quantum mechanical effects of nature and how they apply in parallel to all particles simultaneously.
Perhaps classical computers inherently restrict sorting to the domain of O(n log n), whereas quantum computers may be able to cross that threshold into O(log n) algorithms that act in parallel.
The point that a centrifuge is basically a parallel bubble sort seems to be correct; that has a time complexity of O(n).
I guess the next thought is that if nature can sort in O(n), why can't computers?
EDIT: I had misunderstood the mechanism of a centrifuge and it appears that it does a comparison, a massively-parallel one at that. However there are physical processes that operate on a property of the entity being sorted rather than comparing two properties. This answer covers algorithms that are of that nature.
A centrifuge applies a sorting mechanism that doesn't really work by means of comparisons between elements, but by a property ('centrifugal force') acting on each individual element in isolation. Some sorting algorithms fall into this theme, especially Radix Sort. When this sorting algorithm is parallelized it should approach the example of a centrifuge.
Some other non-comparative sorting algorithms are Bucket sort and Counting Sort. You may find that Bucket sort also fits into the general idea of a centrifuge (the radius could correspond to a bin).
Another so-called 'sorting algorithm' where each element is considered in isolation is the Sleep Sort. Here time rather than the centrifugal force acts as the magnitude used for sorting.
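For concreteness, here is a small counting sort sketch (my example, not taken from the answer above): each element is placed according to its key value alone, with no pairwise comparisons, much like particles settling by a physical property.

class CountingSortSketch {
    // Sort non-negative ints no larger than maxValue by tallying keys, then
    // emitting them in key order - no element-to-element comparisons at all.
    static int[] countingSort(int[] a, int maxValue) {
        int[] counts = new int[maxValue + 1];
        for (int x : a) counts[x]++;
        int[] out = new int[a.length];
        int i = 0;
        for (int v = 0; v <= maxValue; v++)
            for (int c = 0; c < counts[v]; c++) out[i++] = v;
        return out;
    }
}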
Computational complexity is always defined with respect to some computational model. For example, an algorithm that's O(n) on a typical computer might be O(2^n) if implemented in Brainfuck.
The centrifuge computational model has some interesting properties; for example:
it supports arbitrary parallelism; no matter how many particles are in the solution, they can all be sorted simultaneously.
it doesn't give a strict linear sort of particles by mass, but rather a very close (low-energy) approximation.
it's not feasible to examine the individual particles in the result.
it's not possible to sort particles by different properties; only mass is supported.
Given that we don't have the ability to implement something like this in general-purpose computing hardware, the model may not have practical relevance; but it can still be worth examining, to see if there's anything to be learned from it. Nondeterministic algorithms and quantum algorithms have both been active areas of research, for example, even though neither is actually implementable today.
The trick there is that you only have a probability of sorting your list using a centrifuge. As with other real-world sorts [citation needed], you can change the probability that you have sorted your list, but you can never be certain without checking all the values (atoms).
Consider the question: "How long should you run your centrifuge for?"
If you only ran it for a picosecond, your sample may be less sorted than in the initial state, while if you ran it for a few days, it may be completely sorted. However, you wouldn't know without actually checking the contents.
A real-world example of computer-based "ordering" would be autonomous drones that cooperatively work with each other, known as "drone swarms". The drones act and communicate both as individuals and as a group, and can track multiple targets. The drones collectively decide which drones will follow which targets, and they handle the obvious need to avoid collisions between drones. The early versions of this were drones that moved through waypoints while staying in formation, but the formation could change.
For a "sort", the drones could be programmed to form a line or pattern in a specific order, initially released in any permutation or shape, and collectively and in parallel they would quickly form the ordered line or pattern.
Getting back to a computer based sort, one issue is that there's one main memory bus, and there's no way for a large number of objects to move about in memory in parallel.
know the position of each element
In the case of a tape sort, the position of each element (record) is only "known" to the "tape", not to the computer. A tape based sort only needs to work with two elements at a time, and a way to denote run boundaries on a tape (file mark, or a record of different size).
IMHO, people overthink log(n). O(nlog(n)) IS practically O(n). And you need O(n) just to read the data.
Many algorithms such as quicksort do provide a very fast way to sort elements. You could implement variations of quicksort that would be very fast in practice.
Inherently, all physical systems are infinitely parallel. You might have a buttload of atoms in a grain of sand, and nature has enough computational power to figure out where each electron in each atom should be. So if you had enough computational resources (O(n) processors) you could sort n numbers in log(n) time.
From comments:
Given a physical processor that has k elements, it can achieve a parallelism of at most O(k). If you process n numbers arbitrarily, it would still process them at a rate related to k. Also, you could formulate this problem physically. You could create n steel balls with weights proportional to the numbers you want to encode, which could be solved by a centrifuge in theory. But here the number of atoms you are using is proportional to n, whereas in the standard case you have a limited number of atoms in a processor.
Another way to think about this is, say you have a small processor attached to each number and each processor can communicate with its neighbors, you could sort all those numbers in O(log(n)) time.
I worked in an office during the summers after high school, when I started college. In AP Computer Science I had studied, among other things, sorting and searching.
I applied this knowledge in several physical systems that I can recall:
Natural merge sort to start…
A system printed multipart forms including a file-card-sized tear off, which needed to be filed in a bank of drawers.
I started with a pile of them and sorted the pile to begin with. The first step is picking up 5 or so, few enough to be easily placed in order in your hand. Place the sorted packet down, criss-crossing each stack to keep them separate.
Then, merge each pair of stacks, producing a larger stack. Repeat until there is only one stack.
…Insertion sort to complete
It is easier to file the sorted cards, as each next one is a little farther down the same open drawer.
Radix sort
Nobody else understood how I did this one so fast, despite repeated tries to teach it.
A large box of check stubs (the size of punch cards) needs to be sorted. It looks like playing solitaire on a large table—deal out, stack up, repeat.
In general
30 years ago, I did notice what you’re asking about: the ideas transfer to physical systems quite directly because there are relative costs of comparisons and handling records, and levels of caching.
Going beyond well-understood equivalents
I recall an essay about your topic, and it brought up the spaghetti sort. You trim a length of dried noodle to indicate the key value, and label it with the record ID. This is O(n), simply processing each item once.
Then you grab the bundle and tap one end on the table. They align on the bottom edges, and they are now sorted. You can trivially take off the longest one, and repeat. The read-out is also O(n).
There are two things going on here in the "real world" that don't correspond to algorithms. First, aligning the edges is a parallel operation. Every data item is also a processor (the laws of physics apply to it). So, in general, you scale the available processing with n, essentially dividing your classic complexity by a factor of n.
Second, how does aligning the edges accomplish a sort? The real sorting is in the read-out, which lets you find the longest in one step, even though classically you would have had to compare all of them to find the longest. Again, divide by a factor of n, so finding the largest is now O(1).
Another example is using analog computing: a physical model solves the problem “instantly” and the prep work is O(n). In principle the computation is scaling with the number of interacting components, not the number of prepped items. So the computation scales with n². The example I'm thinking of is a weighted multi-factor computation, which was done by drilling holes in a map, hanging weights from strings passing through the holes, and gathering all the strings on a ring.
Sorting is still O(n) total work; that it finishes faster than that is because of parallelization.
You could view a centrifuge as a Bucketsort of n atoms, parallelized over n cores(each atom acts as a processor).
You can make sorting faster by parallelization, but only by a constant factor, because the number of processors is limited; O(n/C) is still O(n) (CPUs usually have fewer than 10 cores and GPUs fewer than 6000).
The centrifuge is not sorting the nodes; it applies a force to them and they react to it in parallel.
So if you were to implement a bubble sort where each node moves itself up or down in parallel based on its "density", you'd have a centrifuge implementation.
Keep in mind that in the real world you can run a very large number of parallel tasks, whereas in a computer the maximum number of truly parallel tasks equals the number of physical processing units.
In the end, you would also be limited with the access to the list of elements because it cannot be modified simultaneously by two nodes...
Would it be possible with some overhead on each node (some value or method tacked on to each of the nodes) to 'force' the order of the list?
When we sort using computer programs we select a property of the values being sorted. That's commonly magnitude of the number or the alphabetical order.
Something like a centrifuge, where only each element cares about its relative position in space (in relation to other nodes)
This analogy aptly reminds me of simple bubble sort: smaller numbers bubble up in each iteration, like your centrifuge logic.
So to answer this, don't we actually do something of that sort in software based sorting?
First of all, you are comparing two different contexts: one is logic (computers) and the other is physics. It has been shown (so far) that we can model some parts of physics using mathematical formulas, and we as programmers can use these formulas to simulate (some parts of) physics in the logic world (e.g. the physics engine in a game engine).
Second, we have some possibilities in the computer (logic) world that are nearly impossible in physics. For example, we can access memory and find the exact location of each entity at any time, whereas in physics that is a huge problem (Heisenberg's uncertainty principle).
Third, if you want to map a centrifuge and its operation in the real world onto the computer world, it is as if someone had given you a supercomputer with all the rules of physics applied, and you are doing your small sort in it (using a centrifuge). By saying that your sorting problem was solved in O(n), you are ignoring the huge physics simulation going on in the background...
Consider: is "centrifuge sort" really scaling better? Think about what happens as you scale up.
The test tubes have to get longer and longer.
The heavy stuff has to travel further and further to get to the bottom.
The moment of inertia increases, requiring more power and longer times to accelerate up to sorting speed.
It's also worth considering other problems with centrifuge sort. For example, you can only operate on a narrow size scale. A computer sorting algorithm can handle integers from 1 to 2^1024 and beyond, no sweat. Put something that weighs 2^1024 times as much as a hydrogen atom into a centrifuge and, well, that's a black hole and the galaxy has been destroyed. The algorithm failed.
Of course the real answer here is that computational complexity is relative to some computational model, as mentioned in another answer. And "centrifuge sort" doesn't make sense in the context of common computational models, such as the RAM model, the I/O model, or multitape Turing machines.
Another perspective is that what you're describing with the centrifuge is analogous to what's been called the "spaghetti sort" (https://en.wikipedia.org/wiki/Spaghetti_sort). Say you have a box of uncooked spaghetti rods of varying lengths. Hold them in your fist, and loosen your hand to lower them vertically so the ends are all resting on a horizontal table. Boom! They're sorted by height. O(constant) time. (Or O(n) if you include picking the rods out by height and putting them in a . . . spaghetti rack, I guess?)
You can note there that it's O(constant) in the number of pieces of spaghetti, but, due to the finite speed of sound in spaghetti, it's O(n) in the length of the longest strand. So nothing comes for free.

Simple ordering for a linked list

I want to create a doubly linked list with an order sequence (an integer attribute) such that sorting by the order sequence could create an array that would effectively be equivalent to the linked list.
given: a <-> b <-> c
a.index > b.index
b.index > c.index
This index would need to efficiently handle arbitrary numbers of inserts.
Is there a known algorithm for accomplishing this?
The problem is when the list gets large and the index sequence has become packed. In that situation the list has to be scanned to put slack back in.
I'm just not sure how this should be accomplished. Ideally there would be some sort of automatic balancing so that this borrowing is both fast and rare.
The naive solution of shifting all the indices to the left or right of the insertion point by 1 to make room for the insert is O(n).
I'd prefer to use integers, as I know numbers tend to get less reliable in floating point as they approach zero in most implementations.
This is one of my favorite problems. In the literature, it's called "online list labeling", or just "list labeling". There's a bit on it in wikipedia here: https://en.wikipedia.org/wiki/Order-maintenance_problem#List-labeling
Probably the simplest algorithm that will be practical for your purposes is the first one in here: https://www.cs.cmu.edu/~sleator/papers/maintaining-order.pdf.
It handles insertions in amortized O(log N) time, and to manage N items, you have to use integers that are big enough to hold N^2. 64-bit integers are sufficient in almost all practical cases.
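Not the amortized-O(log N) algorithm from the paper, just a rough sketch of the labeling idea (all names here are my own), with a naive O(n) even relabel when a gap runs out:

import java.util.ArrayList;
import java.util.List;

class ListLabels {
    // Insert between two labels if a gap exists, otherwise spread all labels
    // out evenly again. The even relabel is the naive O(n) fallback, not the
    // amortized scheme from the paper.
    static final long GAP = 1L << 32;        // initial spacing between labels
    final List<Long> labels = new ArrayList<>();

    // Insert a new label between positions pos-1 and pos (0 = front).
    void insertAt(int pos) {
        long lo = (pos == 0) ? 0 : labels.get(pos - 1);
        long hi = (pos == labels.size()) ? lo + 2 * GAP : labels.get(pos);
        if (hi - lo < 2) {                    // no room: relabel everything evenly
            for (int i = 0; i < labels.size(); i++) labels.set(i, (i + 1) * GAP);
            lo = (pos == 0) ? 0 : labels.get(pos - 1);
            hi = (pos == labels.size()) ? lo + 2 * GAP : labels.get(pos);
        }
        labels.add(pos, lo + (hi - lo) / 2);  // midpoint of the gap
    }
}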
What I wound up going for was a roll-my-own solution, because it looked like the algorithm wanted to have the entire list in memory before it would insert the next node. And that is no good.
My idea is to borrow some of the ideas from that algorithm. What I did was make IDs ints and sort orders longs. Then the algorithm is lazy, stuffing entries anywhere they'll fit. Once it runs out of space in some little clump somewhere, it begins a scan up and down from the clump and tries to establish an even spacing, such that if there are n items scanned they need to share n^2 padding between them.
In theory this means that over time the list will be evenly padded, and given that my IDs are ints and my sort orders are longs, there will never be a scenario where you will not be able to achieve n^2 padding. I can't speak to the upper bound on the number of operations, but my gut tells me that by doing polynomial work at 1/polynomial frequency, I'll be doing just fine.

Optimized Algorithm: Fastest Way to Derive Sets

I'm writing a program for a competition and I need to be faster than all the other competitors. For this I need a little algorithm help; ideally I'd be using the fastest algorithm.
For this problem I am given 2 things. The first is a list of tuples, each of which contains exactly two elements (strings), each of which represents an item. The second is an integer, which indicates how many unique items there are in total. For example:
# of items = 3
[("ball","chair"),("ball","box"),("box","chair"),("chair","box")]
(The same tuples can be repeated; they are not necessarily unique.) My program is supposed to figure out the maximum number of tuples that can "agree" when the items are sorted into two groups. This means that if all the items are broken into two ideal groups, group 1 and group 2, what is the maximum number of tuples that can have their first item in group 1 and their second item in group 2?
For example, the answer to my earlier example would be 2, with "ball" in group 1 and "chair" and "box" in group 2, satisfying the first two tuples. I do not necessarily need to know which items go in which group, I just need to know what the maximum number of satisfied tuples could be.
At the moment I'm trying a recursive approach, but it's running in O(n^2), which is far too inefficient in my opinion. Does anyone have a method that could produce a faster algorithm?
Thanks!
Speed up approaches for your task:
1. Use integers
Convert the strings to integers (store the strings in an array and use the positions as the numbers in the tuples):
String[] words = {"ball", "chair", "box"};
In the tuples, "ball" now has number 0 (position 0 in the array), "chair" 1, and "box" 2.
Comparing ints is faster than comparing Strings.
2. Avoid recursion
Recursion is slow, due to the recursion overhead.
For example, look at the binary search algorithm in a recursive implementation, then look at how Java implements Arrays.binarySearch() (with a while loop and iteration).
Recursion is helpful if problems are so complex that a non-recursive implementation is too complex for a human brain.
Iteration is faster, but not when you mimic recursive calls by implementing your own stack.
However, you can start with a recursive implementation; once it works and it is a suitable algorithm, then try to convert it to a non-recursive implementation.
3. If possible, avoid objects
If you want the fastest, then now it becomes ugly!
A tuple array can either be stored as an array of a class Point(x, y) or, probably faster,
as an array of int:
Example:
(1,2), (2,3), (3,4) can be stored as array: (1,2,2,3,3,4)
This needs much less memory because an object needs at least 12 bytes (in Java).
Less memory means faster: when the arrays are really big, your structure will hopefully fit in the processor cache, while the object array will not. (A combined sketch of points 1 and 3 follows after point 4.)
4. Programming language
In C it will be faster than in Java.
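A small sketch tying points 1 and 3 together (names are my own, purely illustrative): map each string to an int via its position in the word array, then store the tuples as one flat int array.

import java.util.HashMap;
import java.util.Map;

class FlatTuples {
    // Map each string to its index in words[], then store the tuple list as
    // one flat int array: (a0, b0, a1, b1, ...).
    static int[] encode(String[][] tuples, String[] words) {
        Map<String, Integer> id = new HashMap<>();
        for (int i = 0; i < words.length; i++) id.put(words[i], i);
        int[] flat = new int[tuples.length * 2];
        for (int t = 0; t < tuples.length; t++) {
            flat[2 * t] = id.get(tuples[t][0]);
            flat[2 * t + 1] = id.get(tuples[t][1]);
        }
        return flat;
    }
}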
Maximum cut is a special case of your problem, so I doubt you have a quadratic algorithm for it. (Maximum cut is NP-complete and it corresponds to the case where every tuple (A,B) also appears in reverse as (B,A) the same number of times.)
The best strategy for you to try here is "branch and bound." It's a variant of the straightforward recursive search you've probably already coded up. You keep track of the value of the best solution you've found so far. In each recursive call, you check whether it's even possible to beat the best known solution with the choices you've fixed so far.
One thing that may help (or may hurt) is to "probe": for each as-yet-unfixed item, see if putting that item on one of the two sides leads only to suboptimal solutions; if so, you know that item needs to be on the other side.
Another useful trick is to recurse on items that appear frequently both as the first element and as the second element of your tuples.
You should pay particular attention to the "bound" step --- finding an upper bound on the best possible solution given the choices you've fixed.
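A rough branch-and-bound sketch along those lines (my own illustration; the bound simply assumes every still-undecided tuple could be satisfied, which is the crudest possible bound and the first thing you would want to sharpen):

class MaxAgree {
    int[][] tuples;   // tuples[i] = {firstItem, secondItem}, items numbered 0..n-1
    int n;
    int best = 0;

    int solve(int n, int[][] tuples) {
        this.n = n;
        this.tuples = tuples;
        best = 0;
        branch(new int[n], 0);
        return best;
    }

    // group[i]: 0 = undecided, 1 = group 1, 2 = group 2
    private void branch(int[] group, int item) {
        int satisfied = 0, undecided = 0;
        for (int[] t : tuples) {
            int a = group[t[0]], b = group[t[1]];
            if (a == 1 && b == 2) satisfied++;
            else if (a == 0 || b == 0) undecided++;   // could still go either way
        }
        if (satisfied + undecided <= best) return;    // bound: cannot beat the best so far
        if (item == n) { best = Math.max(best, satisfied); return; }
        group[item] = 1; branch(group, item + 1);
        group[item] = 2; branch(group, item + 1);
        group[item] = 0;                              // undo for the caller
    }
}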

Looking for a sort algorithm with as few as possible compare operations

I want to sort items where the comparison is performed by humans:
Pictures
Priority of work items
...
For these tasks the number of comparisons is the limiting factor for performance.
What is the minimum number of comparisons needed (I assume > N for N items)?
Which algorithm guarantees this minimum number?
To answer this, we need to make a lot of assumptions.
Let's assume we are sorting pictures by cuteness. The goal is to get the maximum usable information from the human in the least amount of time. This interaction will dominate all other computation, so it's the only one that counts.
As someone else mentioned, humans can deal well with ordering several items in one interaction. Let's say we can get eight items in relative order per round.
Each round introduces seven edges into a directed graph where the nodes are the pictures. If node A is reachable from node B, then node A is cuter than node B. Keep this graph in mind.
Now, let me tell you about a problem the Navy and the Air Force solve differently. They both want to get a group of people in height order, and quickly. The Navy tells people to get in line; then, if you're shorter than the guy in front of you, switch places, and repeat until done. In the worst case, that's N*N comparisons.
The Air Force tells people to stand in a square grid. They shuffle front-to-back among sqrt(N) people, which means worst case sqrt(N)*sqrt(N) == N comparisons. However, the people are only sorted along one dimension. So the people then face left and do the same shuffle again. Now we're up to 2*N comparisons, and the sort is still imperfect, but it's good enough for government work. There's a short corner, a tall corner opposite, and a clear diagonal height gradient.
You can see how the Air Force method gets results in less time if you don't care about perfection. You can also see how to get to perfection effectively. You already know that the very shortest and very tallest men are in two corners. The second-shortest might be behind or beside the shortest, the third-shortest might be behind or beside him. In general, someone's height rank is also his maximum possible Manhattan distance from the short corner.
Looking back at the graph analogy, the eight nodes to present each round are eight of those with the currently most common length of longest inbound path. The length of the longest inbound path also represents the node's minimum possible sorted rank.
You'll use a lot of CPU following this plan, but you will make the best possible use of your human resources.
From an assignment I once did on this very subject ...
The comparison counts are for various sorting algorithms operating on data in a random order
Size QkSort HpSort MrgSort ModQk InsrtSort
2500 31388 48792 25105 27646 1554230
5000 67818 107632 55216 65706 6082243
10000 153838 235641 120394 141623 25430257
20000 320535 510824 260995 300319 100361684
40000 759202 1101835 561676 685937
80000 1561245 2363171 1203335 1438017
160000 3295500 5045861 2567554 3047186
These comparison counts are for various sorting algorithms operating on data that starts 'nearly sorted'. Amongst other things it shows the pathological case of quicksort.
Size QkSort HpSort MrgSort ModQk InsrtSort
2500 72029 46428 16001 70618 76050
5000 181370 102934 34503 190391 3016042
10000 383228 226223 74006 303128 12793735
20000 940771 491648 158015 744557 50456526
40000 2208720 1065689 336031 1634659
80000 4669465 2289350 712062 3820384
160000 11748287 4878598 1504127 10173850
From this we can see that merge sort is the best by number of comparisons.
I can't remember what the modifications to the quick sort algorithm were, but I believe it was something that used insertion sorts once the individual chunks got down to a certain size. This sort of thing is commonly done to optimise quicksort.
You might also want to look up Tadao Takaoka's 'Minimal Merge Sort', which is a more efficient version of the merge sort.
Pigeon hole sorting is order N and works well with humans if the data can be pigeon holed. A good example would be counting votes in an election.
You should consider that humans might make non-transitive comparisons, e.g. they favor A over B, B over C but also C over A. So when choosing your sort algorithm, make sure it doesn't completely break when that happens.
People are really good at ordering 5-10 things from best to worst and come up with more consistent results when doing so. I think trying to apply a classical sorting algo might not work here because of the typically human multi-compare approach.
I'd argue that you should have a round robin type approach and try to bucket things into their most consistent groups each time. Each iteration would only make the result more certain.
It'd be interesting to write too :)
If comparisons are expensive relative to book-keeping costs, you might try the following algorithm which I call "tournament sort". First, some definitions:
Every node has a numeric "score" property (which must be able to hold values from 1 to the number of nodes), and "last-beat" and "fellow-loser" properties, which must be able to hold node references.
A node is "better" than another node if it should be output before the other.
An element is considered "eligible" if there are no elements known to be better than it which have not been output, and "ineligible" if any element which has not been output is known to be better than it.
The "score" of a node is the number of nodes it's known to be better than, plus one.
To run the algorithm, initially assign every node a score of 1. Repeatedly compare the two lowest-scoring eligible nodes; after each comparison, mark the loser "ineligible", and add the loser's score to the winner's (the loser's score is unaltered). Set the loser's "fellow loser" property to the winner's "last-beat", and the winner's "last-beat" property to the loser. Iterate this until only one eligible node remains. Output that node, and make eligible all nodes the winner beat (using the winner's "last-beat" and the chain of "fellow-loser" properties). Then continue the algorithm on the remaining nodes.
The number of comparisons with 1,000,000 items was slightly lower than that of a stock library implementation of Quicksort; I'm not sure how the algorithm would compare against a more modern version of QuickSort. Bookkeeping costs are significant, but if comparisons are sufficiently expensive the savings could possibly be worth it. One interesting feature of this algorithm is that it will only perform comparisons relevant to determining the next node to be output; I know of no other algorithm with that feature.
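A rough sketch of that bookkeeping (my own reading of the description, assuming "better" means "compares as smaller"; the repeated re-sorting of the eligible set is deliberately naive, since bookkeeping is not the bottleneck here):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TournamentSort {
    static class Node<T> {
        final T value;
        long score = 1;            // 1 + number of nodes this one is known to beat
        Node<T> lastBeat;          // most recent node this one beat
        Node<T> fellowLoser;       // chains together everything this node's winner beat
        boolean eligible = true, output = false;
        Node(T v) { value = v; }
    }

    static <T> List<T> sort(List<T> items, Comparator<T> cmp) {
        List<Node<T>> nodes = new ArrayList<>();
        for (T item : items) nodes.add(new Node<>(item));
        List<T> result = new ArrayList<>();
        while (result.size() < items.size()) {
            List<Node<T>> live = new ArrayList<>();
            for (Node<T> n : nodes) if (n.eligible && !n.output) live.add(n);
            while (live.size() > 1) {
                live.sort(Comparator.comparingLong(n -> n.score));   // two lowest scores first
                Node<T> a = live.get(0), b = live.get(1);
                Node<T> winner = cmp.compare(a.value, b.value) <= 0 ? a : b;
                Node<T> loser = (winner == a) ? b : a;
                loser.eligible = false;
                winner.score += loser.score;
                loser.fellowLoser = winner.lastBeat;
                winner.lastBeat = loser;
                live.remove(loser);
            }
            Node<T> next = live.get(0);          // sole remaining eligible node: output it
            next.output = true;
            result.add(next.value);
            for (Node<T> n = next.lastBeat; n != null; n = n.fellowLoser)
                n.eligible = true;               // everything it beat becomes eligible again
        }
        return result;
    }
}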
I don't think you're likely to get a better answer than the Wikipedia page on sorting.
Summary:
For arbitrary comparisons (where you can't use something like radix sorting) the best you can achieve is O(n log n)
Various algorithms achieve this - see the "comparison of algorithms" section.
The commonly used QuickSort is O(n log n) in a typical case, but O(n^2) in the worst case; there are often ways to avoid this, but if you're really worried about the cost of comparisons, I'd go with something like MergeSort or a HeapSort. It partly depends on your existing data structures.
If humans are doing the comparisons, are they also doing the sorting? Do you have a fixed data structure you need to use, or could you effectively create a copy using a balanced binary tree insertion sort? What are the storage requirements?
Here is a comparison of algorithms. The two better candidates are Quick Sort and Merge Sort. Quick Sort is in general better, but has a worse worst case performance.
Merge sort is definitely the way to go here, as you can use a Map/Reduce type algorithm to have several humans doing the comparisons in parallel.
Quicksort is essentially a single threaded sort algorithm.
You could also tweak the merge sort algorithm so that instead of comparing two objects you present your human with a list of say five items and ask him or her to rank them.
Another possibility would be to use a ranking system as used by the famous "Hot or Not" web site. This requires many, many more comparisons, but the comparisons can happen in any sequence and in parallel, so this would work faster than a classic sort provided you have enough humans at your disposal.
The question raises more questions, really.
Are we talking a single human performing the comparisons? It's a very different challenge if you are talking a group of humans trying to arrange objects in order.
What about the questions of trust and error? Not everyone can be trusted to get everything right - certain sorts would go catastrophically wrong if at any given point you provided the wrong answer to a single comparison.
What about subjectivity? "Rank these pictures in order of cuteness." Once you get to this point, it could get really complex. As someone else mentions, something like "hot or not" is the simplest conceptually, but isn't very efficient. At its most complex, I'd say that Google is a way of sorting objects into an order, where the search engine is inferring the comparisons made by humans.
The best one would be merge sort.
Its run time is O(n log n) (log base 2).
It is implemented as follows (a sketch follows the steps):
If the list is of length 0 or 1, then it is already sorted.
Otherwise:
Divide the unsorted list into two sublists of about half the size.
Sort each sublist recursively by re-applying merge sort.
Merge the two sublists back into one sorted list.
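A minimal Java sketch of those steps (illustrative only, not tuned):

import java.util.Arrays;

class MergeSortSketch {
    static int[] mergeSort(int[] a) {
        if (a.length <= 1) return a;                        // already sorted
        int mid = a.length / 2;
        int[] left = mergeSort(Arrays.copyOfRange(a, 0, mid));
        int[] right = mergeSort(Arrays.copyOfRange(a, mid, a.length));
        int[] out = new int[a.length];
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length)         // merge the two sorted halves
            out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        while (i < left.length) out[k++] = left[i++];
        while (j < right.length) out[k++] = right[j++];
        return out;
    }
}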
