Sort elements with the fewest comparisons possible - algorithm

I have a folder full of images. There are too many to just 'rank'. I made a program that shows two at a time and lets the user pick which one of the two is better. At the end I would like all of the photos to be ordered from best to worst.
I am purely trying to optimize for the fewest comparisons possible. I don't care if the program runs in n-cubed time. I've read the other questions here on similar topics, but I'm looking for something more advanced.
I'm thinking maybe some sort of algorithm that, based on the comparisons you've already made, chooses the two images whose comparison will offer the most information. Maybe even an algorithm that makes complex connections to help determine the orders and potential orders.
Like I said, I don't care if it is slow; I'm purely trying to minimize comparisons.

If a total order exists, you need at least log2(n!) comparisons in the worst case, which is roughly n*log2(n). This can be proved mathematically: there are n! possible orders and each comparison yields at most one bit of information. There is no way around it. So a regular O(n log(n)) sorting algorithm will do the job.
What you are trying to do is called a 'topological sort'. Google it and read about it on Wikipedia. You can achieve partial orders with fewer comparisons. It's a kind of gradual sort: the more comparisons you get, the better the result will be.
However, what do you do if no total order exists? Humans are not able to generate a total order for subjective tasks.
For example picture 1 is better than 2, 2 is better than 3 but 3 is better than 1.
In this case no sorting algorithm can produce a permutation that matches all the decisions. During a topological sort, you can detect those inconsistent decisions and get rid of them.
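As a rough sketch of how the inconsistent decisions could be detected, the user's choices can be stored as a graph and handed to Python's standard `graphlib` module (3.9+). The function name `order_from_choices` and the `(better, worse)` pair format are assumptions made for this illustration, not something from the question.

```python
from graphlib import TopologicalSorter, CycleError

def order_from_choices(choices):
    """choices: iterable of (better, worse) pairs picked by the user.

    Returns (order, None) when the choices are consistent, or
    (None, cycle) when they contain a loop such as 1 > 2 > 3 > 1.
    """
    graph = {}
    for better, worse in choices:
        # graphlib expects node -> set of predecessors, so the better
        # item is a predecessor of the worse one and is output first.
        graph.setdefault(worse, set()).add(better)
        graph.setdefault(better, set())
    try:
        return list(TopologicalSorter(graph).static_order()), None
    except CycleError as err:
        # args[1] holds the nodes of the detected cycle.
        return None, err.args[1]
```

When a cycle is reported, the program could ask the user to redo one of the comparisons in it, or simply drop one of them.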

You are looking for a sorting algorithm - pick one. Most algorithms just need a comparison function (a < b?). This is where you show the user two pictures and they have to choose the better one.
You might want to read through some of the algorithms and choose the best one for you. E.g. with quicksort, you would pick a random picture and the user would have to compare this picture against all other pictures in the first round, which might be too boring from the end user's perspective.
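To make the "comparison function" idea concrete, here is a minimal sketch that wraps a preference callback so any library sort can use it, while counting how often the user is asked. The names `make_counting_comparator` and `prefer` are made up for this example; in the real program `prefer(a, b)` would show two pictures and return True if the user picks `a`.

```python
from functools import cmp_to_key

def make_counting_comparator(prefer):
    """Wrap prefer(a, b) -> bool as a sort key and count the calls."""
    calls = {"count": 0}

    def compare(a, b):
        calls["count"] += 1
        # Negative means "a goes first", i.e. a is the better item.
        return -1 if prefer(a, b) else 1

    return cmp_to_key(compare), calls

# Stand-in for the human: bigger numbers are "better".
key, calls = make_counting_comparator(lambda a, b: a > b)
ranked = sorted([3, 1, 2], key=key)
```

After the call, `ranked` is ordered best to worst and `calls["count"]` tells you how many questions the user had to answer, which makes it easy to compare algorithms on that metric.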

Related

Sorting based on fuzzy criteria OR Create an acceptable order with only n comparisons

I'm looking for an algorithm to sort a large number of items using the fewest comparisons. My specific case makes it unclear which of the obvious approaches is appropriate: the comparison function is slow and non-deterministic so it can make errors, because it's a human brain.
In other words, I want to sort arbitrary items on my computer into a list from "best" to "worst" by comparing them two at a time. They could be images, strings, songs, anything. My program would display two things for me to compare. The program doesn't know anything about what is being compared; its job is just to decide which pairs to compare. So that gives the following criteria:
It's a comparison sort - The only time the user sees items is when comparing two of them.
It's an out-of-place sort - I don't want to move the actual files, so items can have placeholder values or metadata files
Comparisons are slow - at least compared to a computer. Data locality won't have an effect, but comparing obvious disparities will be quick, similar items will be slow.
Comparison is subjective - comparison results could vary slightly at different times.
Items don't have a total order - the desired outcome is an order that is "good enough" at runtime, which will vary depending on context.
Items will rarely be almost sorted - in fact, the goal is to get random data to an almost-sorted state.
Sets usually will contain runs - If every song on an album is a banger, it might be faster because of (2) to compare them to the next album rather than each other. Imagine a set {10.0, 10.2, 10.9, 5.0, 4.2, 6.9} where integer comparisons are fast but float comparisons are very slow.
There are many different ways to approach this problem. In addition to sorting algorithms, it's similar to creating tournament brackets, and voting systems. As that table illustrates, there are countless ways to define and solve the problem based on various criteria. For this question I'm only interested in treating it as a sorting problem where the user is comparing two items at a time and choosing a preference. So what approach makes sense for either of the two following versions of the question?
How to choose pairs to get the best result in O(n) or fewer operations? (for example compare random pairs of items with n/2 operations, then use n/2 operations to spot check or fine-tune)
How to create the best order with additional operations but no additional comparisons (e.g. similar items are sorted into buckets or losers are removed, anything that doesn't increase the number of comparisons)
The representation of comparison results can be anything that makes the solution convenient - it can be dictionary keys corresponding to the final order, a "score" based on number of comparisons, a database, etc.
Edit: The comments have helped clarify the question in that the goal is similar to something like bucket sort, samplesort or the partitioning phase of quicksort. So the question could be rephrased as how to choose good partitions based on comparisons, but I'm also interested in any other ways of using the comparison results that wouldn't be applicable in a standard in-place comparison sort like keeping a score for each item.
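For illustration only, a single quicksort-style partitioning pass already gives a coarse "better half / worse half" order with n - 1 comparisons. This is a sketch under the assumption that the first item is used as the pivot; `partition_once` and `prefer` are hypothetical names, and `prefer(a, b)` stands for the human saying that `a` is better than `b`.

```python
def partition_once(items, prefer):
    """One partitioning pass: n - 1 comparisons, coarse order out."""
    if not items:
        return []
    pivot, rest = items[0], items[1:]
    better, worse = [], []
    for item in rest:
        # Exactly one question to the user per remaining item.
        (better if prefer(item, pivot) else worse).append(item)
    return better + [pivot] + worse
```

The two buckets are unsorted internally; any remaining comparison budget could be spent on repeating the pass inside whichever bucket matters most.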

Is there a way to calculate the progress of a sorting algorithm?

I'm trying to make a subjective sort based on shell sort. I'm referring to the original (Donald Shell's) algorithm. I already wrote all the logic, which is exactly the same as shell sort, except that instead of the computer calculating which item is greater, the user decides subjectively which is greater.
But the problem is that I would like to display a percentage or something similar so the user knows how far along the sort already is. That's why I want to find a way to calculate it.
I tried asking here (What is the formula to get the number of passes in a shell sort?), but maybe I didn't express myself well last time and they closed the question.
I first tried associating the progress with the number of passes over the array in the shell sort. But later I noticed it is not a fixed number. So if you have an idea of the best way to display the progress of the sort, I would really appreciate it.
I made this formula, which displays the progress as a color based on the number of passes. It is the closest I could get, but it doesn't perfectly match the maximum range of the color list.
(Code in Dart/Flutter)
List<Color> colors = [
Color(0xFFFF0000),//red
Color(0xFFFF5500),
Color(0xFFFFAA00),
Color(0xFFFFFF00),//yellow
Color(0xFFAAFF00),
Color(0xFF00FF00),
Color(0xFF00FF00),//green
];
[...]
style: TextStyle(
color: colors[(((pass - 1) * (colors.length - 1)) / sqrt(a.length).ceil()).floor()]
),
[...]
It doesn't need to be done the way I tried, so if you have an idea of how to display the progress of the sort, please share it.
EDIT: I think I found the answer!! At least for shell sort, it works based on the number of passes through the array.
Just replace sqrt(a.length).ceil() with (log(a.length) / log(2)).floor()
This line:
color: colors[(((pass - 1) * (colors.length - 1)) / (log(a.length) / log(2)).floor()).floor()]),
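As a quick check of that pass count, here is a small sketch in Python (not Dart) that lists the gaps of Shell's original sequence n/2, n/4, ..., 1. The function names are invented for this example. For n >= 2 the number of gaps equals floor(log2(n)), which is why the log-based formula lines up with the passes.

```python
def shell_gaps(n):
    """Gaps used by Shell's original sequence for an array of length n."""
    gaps = []
    gap = n // 2
    while gap > 0:
        gaps.append(gap)
        gap //= 2
    return gaps

def pass_progress(pass_number, n):
    """Fraction of passes completed (1-based pass_number)."""
    total = len(shell_gaps(n))
    return pass_number / total if total else 1.0
```

Note that this measures progress in passes, not in comparisons: passes do not all take the same number of comparisons, so the indicator will not advance evenly.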
How far along you are in many types of sorts usually depends on the initial order of the elements to be sorted.
For shellsort you have the individual passes further complicating the determination process.
As an example and to illustrate the problem, take insertion sort:
It is the fastest sort of all in one specific set of circumstances, namely to sort a vector that is already sorted in the intended direction - requiring n-1 comparisons.
It is one of the slowest in the opposite circumstance, sorting a vector that is already sorted but in the opposite direction - requiring (n*(n-1))/2 comparisons
Assuming that n=100, the best case is 99 and the worst 4950 comparisons. That's a factor of 1:50 in the number of comparisons required. So when you've done 50 comparisons, you're 50% through the best case or 1% through the worst.
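Those two numbers are easy to reproduce with a small counting sketch (the function name is made up for this example; it is a plain textbook insertion sort that counts element comparisons):

```python
def insertion_sort_comparisons(values):
    """Sort a copy of values and return (sorted_list, comparison_count)."""
    a = list(values)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1          # one element comparison
            if a[j] <= key:
                break                 # key is already in place
            a[j + 1] = a[j]           # shift the larger element right
            j -= 1
        a[j + 1] = key
    return a, comparisons

_, best = insertion_sort_comparisons(range(100))           # already sorted
_, worst = insertion_sort_comparisons(range(99, -1, -1))   # reversed
```

Here `best` comes out as 99 and `worst` as 4950, matching n-1 and n*(n-1)/2 for n=100.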
Shellsort's best case on already sorted data is not as good as insertion sort's, but it is nonetheless very good. The opposite case, the worst case for insertion sort, is not actually the worst case for shellsort, and shellsort handles it much faster than insertion sort does. Shellsort's worst case is also much better than insertion sort's worst case. This means that for a given n you will know exactly the best and worst cases for insertion sort, and you will know that shellsort will be somewhat slower in the best case and significantly faster in the worst, if that helps in your quest.
But however you look at it, you won't be able to reliably predict how far along in a (shell)sort you are unless you know how many comparisons are required for the specific data and you only know that after you have sorted it.
Maybe you should use a progress bar like Microsoft uses in Windows: it starts off really quickly but then suddenly realizes that it is halfway along and maybe it should slow down so as not to reach the end even though a lot of sorting remains. The last few millimeters of its travel may take many minutes in some circumstances.

Finding the average of large list of numbers

Came across this interview question.
Write an algorithm to find the mean(average) of a large list. This
list could contain trillions or quadrillions of number. Each number is
manageable in hundreds, thousands or millions.
Googling it gave me all Median of Medians solutions. How should I approach this problem?
Is divide and conquer enough to deal with trillions of numbers?
How do I deal with a list of such a large size?
If the size of the list is computable, it's really just a matter of how much memory you have available, how long it's supposed to take and how simple the algorithm is supposed to be.
Basically, you can just add everything up and divide by the size.
If you don't have enough memory, dividing first might work (Note that you will probably lose some precision that way).
Another approach would be to recursively split the list into 2 halves and calculate the mean of the sublists' means. Your recursion termination condition is a list size of 1, in which case the mean is simply the only element of the list. If you encounter a list of odd size, make either the first or second sublist longer; this is pretty much arbitrary and doesn't even have to be consistent. (Unless you weight each sublist's mean by its length, unequal halves make the result only approximate.)
If, however, your list is so giant that its size can't be computed, there's no way to split it into 2 sublists. In that case, the recursive approach works pretty much the other way around. Instead of splitting into 2 lists with n/2 elements, you split into n/2 lists with 2 elements (or rather, calculate their mean immediately). So basically, you calculate the mean of elements 1 and 2, and that becomes your new element 1. The mean of 3 and 4 is your new second element, and so on. Then apply the same algorithm to the new list until only 1 element remains. If you encounter a list of odd size, either add an element at the end or ignore the last one. If you add one, you should try to get as close as possible to your expected mean.
While this won't calculate the mean mathematically exactly, for lists of that size, it will be sufficiently close. This is pretty much a mean of means approach. You could also go the median of medians route, in which case you select the median of sublists recursively. The same principles apply, but you will generally want to get an odd number.
You could even combine the approaches and calculate the mean if your list is of even size and the median if it's of odd size. Doing this over many recursion steps will generate a pretty accurate result.
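If the sublist sizes are known, the mean-of-means idea can be made exact by weighting each sublist's mean by its length. A minimal sketch, assuming the data arrives as an iterable of in-memory chunks (the name `mean_of_chunks` is made up for this example):

```python
def mean_of_chunks(chunks):
    """Combine per-chunk means into the overall mean, weighted by size."""
    total_count = 0
    mean = 0.0
    for chunk in chunks:
        count = len(chunk)
        if count == 0:
            continue
        chunk_mean = sum(chunk) / count
        total_count += count
        # Move the running mean toward this chunk's mean in proportion
        # to the share of all elements seen so far that it contributes.
        mean += (chunk_mean - mean) * count / total_count
    return mean
```

For example, `mean_of_chunks([[1, 2, 3], [10]])` gives 4.0, whereas the unweighted mean of the two chunk means (2 and 10) would give 6.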
First of all, this is an interview question. The problem as stated would not arise in practice. Also, the question as stated here is imprecise. That is probably deliberate. (They want to see how you deal with solving an imprecisely specified problem.)
Write an algorithm to find the mean(average) of a large list.
The word "find" is rubbery. It could mean calculate (to some precision) or it could mean estimate.
The phrase "large list" is rubbery. It could mean a list or array data structure in memory, or the "list" could be the result of a database query, or the contents of a file or files.
There is no mention of the hardware constraints on the system where this will be implemented.
So the first thing >>I<< would do would be to try to narrow the scope by asking some questions of the interviewer.
But assuming that you can't, then a complete answer would need to cover the following points:
The dataset probably won't fit in memory at the same time. (But if it does, then that is good.)
Calculating the average of N numbers is O(N) if you do it serially. For N this size, it could be an intractable problem.
An alternative is to split into sublists of equal size and calculate the averages, and then the average of the averages. In theory, this gives you O(N/P) where P is the number of partitions. The parallelism could be implemented with multiple threads, with multiple processes on the same machine, or distributed.
In practice, the limiting factors are going to be computational, memory and/or I/O bandwidth. A parallel solution will be effective if you can address these limits. For example, you need to balance the problem of each "worker" having uncontended access to its "sublist" versus the problem of making copies of the data so that that can happen.
If the list is represented in a way that allows sampling, then you can estimate the average without looking at the entire dataset. In fact, this could be O(C) depending on how you sample. But there is a risk that your sample will be unrepresentative, and the average will be too inaccurate.
In all cases doing calculations, you need to guard against (integer) overflow and (floating point) rounding errors. Especially while calculating the sums.
It would be worthwhile discussing how you would solve this with a "big data" platform (e.g. Hadoop) and the limitations of that approach (e.g. time taken to load up the data ...)
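For the serial case, one way to sidestep the huge intermediate sum is to update the mean incrementally as the numbers stream past. This is only a sketch: `running_mean` is a name invented here, and floating-point rounding still accumulates over trillions of updates, so a production version would need more care (for example compensated summation).

```python
def running_mean(numbers):
    """Single pass over any iterable; never holds the full sum."""
    mean = 0.0
    count = 0
    for x in numbers:
        count += 1
        # Incremental update: new_mean = old_mean + (x - old_mean) / count
        mean += (x - mean) / count
    return mean, count
```

Because it only needs an iterator, the same function works on a file being read line by line or a database cursor, and the returned `count` lets partial results from several workers be combined with a size-weighted average.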

ranking based on user preference

I am trying to come up with an algorithm for the following problem.
There is a set of N objects with M different variations of each object. The goal is to find which variation is the best for each object based on feedback from different users.
At the end, the users will be placed in a category to determine which category prefers which variation.
It is required that at most two variations of an object are placed side by side.
The problem with this is that if M is large then the number of possible combinations become too large and the user may become disinterested and potentially skew the results.
The Elo algorithm/score can be used once I know the order of selection from the user, as discussed in this post
Comparison-based ranking algorithm
Question:
Is there an algorithm that can reduce the number of possible combinations presented to a user and still get correct order?
example: 7 different types of fruits. Each fruit is available in 5 different shapes. The users give their ranking of 1-5 for each fruit based on the size they prefer. This means that for each fruit there are max 10 combinations the user has to choose from (since sizes are different, no point presenting as {1,1}). How would I reduce "10 combinations" ?
If the user's preferences are always consistent with a total order, and you can change comparisons to take account of the results of comparisons made so far, you just need an efficient sorting algorithm. For 5 items it seems that you need a minimum of 7 comparisons - see Sorting 5 elements with minimum element comparison. You could also look at http://en.wikipedia.org/wiki/Sorting_network.
In general, when you are trying to produce some sort of experimental design, you will often find that making random comparisons, although not optimum, isn't too far away from the best possible answer.
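The figures of 10 pairs versus 7 comparisons for 5 items can be checked with a few lines. The lower bound used here is the information-theoretic one, ceil(log2(n!)); for n = 5 it happens to be achievable, but that is not true for every n. The function names are made up for this sketch.

```python
from math import comb, factorial

def all_pairs(n):
    """Number of distinct pairs if every pair is shown once."""
    return comb(n, 2)

def comparison_lower_bound(n):
    """ceil(log2(n!)), computed exactly with integer arithmetic."""
    return (factorial(n) - 1).bit_length()
```

`all_pairs(5)` is 10 and `comparison_lower_bound(5)` is 7, so an adaptive sort saves at most 3 questions per fruit here; the gap grows quickly as M gets larger.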

Looking for a sort algorithm with as few as possible compare operations

I want to sort items where the comparison is performed by humans:
Pictures
Priority of work items
...
For these tasks the number of comparisons is the limiting factor for performance.
What is the minimum number of comparisons needed (I assume > N for N items)?
Which algorithm guarantees this minimum number?
To answer this, we need to make a lot of assumptions.
Let's assume we are sorting pictures by cuteness. The goal is to get the maximum usable information from the human in the least amount of time. This interaction will dominate all other computation, so it's the only one that counts.
As someone else mentioned, humans can deal well with ordering several items in one interaction. Let's say we can get eight items in relative order per round.
Each round introduces seven edges into a directed graph where the nodes are the pictures. If node A is reachable from node B, then node A is cuter than node B. Keep this graph in mind.
Now, let me tell you about a problem the Navy and the Air Force solve differently. They both want to get a group of people in height order, and quickly. The Navy tells people to get in line, then if you're shorter than the guy in front of you, switch places, and repeat until done. In the worst case, it's N*N comparisons.
The Air Force tells people to stand in a square grid. They shuffle front-to-back on sqrt(N) people, which means worst case sqrt(N)*sqrt(N) == N comparisons. However, the people are only sorted along one dimension. So therefore, the people face left, then do the same shuffle again. Now we're up to 2*N comparisons, and the sort is still imperfect but it's good enough for government work. There's a short corner, a tall corner opposite, and a clear diagonal height gradient.
You can see how the Air Force method gets results in less time if you don't care about perfection. You can also see how to get to perfection efficiently. You already know that the very shortest and very tallest men are in two corners. The second-shortest might be behind or beside the shortest, and the third-shortest might be behind or beside him. In general, someone's height rank is also his maximum possible Manhattan distance from the short corner.
Looking back at the graph analogy, the eight nodes to present each round are eight of those with the currently most common length of longest inbound path. The length of the longest inbound path also represents the node's minimum possible sorted rank.
You'll use a lot of CPU following this plan, but you will make the best possible use of your human resources.
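The "length of longest inbound path" can be computed with a topological pass over the graph. Below is a rough sketch using Python's `graphlib` (3.9+); it assumes the comparisons are consistent (no cycles) and that each edge is stored as a `(less_cute, cuter)` pair. The function name and edge format are assumptions for this illustration, and choosing which eight nodes to show next is left out.

```python
from graphlib import TopologicalSorter

def longest_inbound_path(nodes, edges):
    """Map each node to the length of its longest inbound path."""
    preds = {n: set() for n in nodes}
    for less_cute, cuter in edges:
        preds[cuter].add(less_cute)
    depth = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        depth[node] = max((depth[p] + 1 for p in preds[node]), default=0)
    return depth
```

With edges a->b, b->c and a->d, node c gets depth 2: at least two pictures are known to be less cute, so it cannot rank below third from the bottom.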
From an assignment I once did on this very subject ...
The comparison counts are for various sorting algorithms operating on data in a random order
Size      QkSort    HpSort   MrgSort     ModQk  InsrtSort
2500       31388     48792     25105     27646    1554230
5000       67818    107632     55216     65706    6082243
10000     153838    235641    120394    141623   25430257
20000     320535    510824    260995    300319  100361684
40000     759202   1101835    561676    685937
80000    1561245   2363171   1203335   1438017
160000   3295500   5045861   2567554   3047186
These comparison counts are for various sorting algorithms operating on data that starts out 'nearly sorted'. Amongst other things, it shows the pathological case of quicksort.
Size      QkSort    HpSort   MrgSort     ModQk  InsrtSort
2500       72029     46428     16001     70618      76050
5000      181370    102934     34503    190391    3016042
10000     383228    226223     74006    303128   12793735
20000     940771    491648    158015    744557   50456526
40000    2208720   1065689    336031   1634659
80000    4669465   2289350    712062   3820384
160000  11748287   4878598   1504127  10173850
From this we can see that merge sort is the best by number of comparisons.
I can't remember what the modifications to the quick sort algorithm were, but I believe it was something that used insertion sorts once the individual chunks got down to a certain size. This sort of thing is commonly done to optimise quicksort.
You might also want to look up Tadao Takaoka's 'Minimal Merge Sort', which is a more efficient version of the merge sort.
Pigeon hole sorting is order N and works well with humans if the data can be pigeon holed. A good example would be counting votes in an election.
You should consider that humans might make non-transitive comparisons, e.g. they favor A over B, B over C but also C over A. So when choosing your sort algorithm, make sure it doesn't completely break when that happens.
People are really good at ordering 5-10 things from best to worst and come up with more consistent results when doing so. I think trying to apply a classical sorting algo might not work here because of the typically human multi-compare approach.
I'd argue that you should have a round robin type approach and try to bucket things into their most consistent groups each time. Each iteration would only make the result more certain.
It'd be interesting to write too :)
If comparisons are expensive relative to book-keeping costs, you might try the following algorithm which I call "tournament sort". First, some definitions:
Every node has a numeric "score" property (which must be able to hold values from 1 to the number of nodes), plus "last-beat" and "fellow-loser" properties, which must be able to hold node references.
A node is "better" than another node if it should be output before the other.
An element is considered "eligible" if there are no elements known to be better than it which have not yet been output, and "ineligible" if any element which has not been output is known to be better than it.
The "score" of a node is the number of nodes it's known to be better than, plus one.
To run the algorithm, initially assign every node a score of 1. Repeatedly compare the two lowest-scoring eligible nodes; after each comparison, mark the loser "ineligible", and add the loser's score to the winner's (the loser's score is unaltered). Set the loser's "fellow-loser" property to the winner's "last-beat", and the winner's "last-beat" property to the loser. Iterate this until only one eligible node remains. Output that node, and make eligible all nodes the winner beat (using the winner's "last-beat" and the chain of "fellow-loser" properties). Then continue the algorithm on the remaining nodes.
The number of comparisons with 1,000,000 items was slightly lower than that of a stock library implementation of Quicksort; I'm not sure how the algorithm would compare against a more modern version of QuickSort. Bookkeeping costs are significant, but if comparisons are sufficiently expensive the savings could possibly be worth it. One interesting feature of this algorithm is that it will only perform comparisons relevant to determining the next node to be output; I know of no other algorithm with that feature.
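Here is one possible reading of that description as a Python sketch. It is not the original implementation: using a heap to find the two lowest-scoring eligible nodes, storing the properties in parallel lists, and the name `better(a, b)` (True when `a` should be output before `b`) are all choices made for this example.

```python
import heapq

def tournament_sort(items, better):
    """Return items ordered best-first using the scheme described above."""
    n = len(items)
    score = [1] * n              # nodes known to be worse, plus one
    last_beat = [None] * n       # most recent node this node defeated
    fellow_loser = [None] * n    # previous node beaten by the same winner
    eligible = [(1, i) for i in range(n)]   # heap of (score, index)
    heapq.heapify(eligible)
    output = []
    while eligible:
        # Play off the two lowest-scoring eligible nodes until one is left.
        while len(eligible) > 1:
            _, a = heapq.heappop(eligible)
            _, b = heapq.heappop(eligible)
            winner, loser = (a, b) if better(items[a], items[b]) else (b, a)
            score[winner] += score[loser]
            fellow_loser[loser] = last_beat[winner]
            last_beat[winner] = loser
            heapq.heappush(eligible, (score[winner], winner))
        _, champion = heapq.heappop(eligible)
        output.append(items[champion])
        # Everything the champion beat directly becomes eligible again.
        node = last_beat[champion]
        while node is not None:
            heapq.heappush(eligible, (score[node], node))
            node = fellow_loser[node]
    return output
```

For example, `tournament_sort([2, 3, 1], lambda a, b: a > b)` returns `[3, 2, 1]`. The sketch assumes the answers are consistent with a total order; it does not try to cope with contradictory comparisons.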
I don't think you're likely to get a better answer than the Wikipedia page on sorting.
Summary:
For arbitrary comparisons (where you can't use something like radix sorting) the best you can achieve is O(n log n)
Various algorithms achieve this - see the "comparison of algorithms" section.
The commonly used QuickSort is O(n log n) in a typical case, but O(n^2) in the worst case; there are often ways to avoid this, but if you're really worried about the cost of comparisons, I'd go with something like MergeSort or a HeapSort. It partly depends on your existing data structures.
If humans are doing the comparisons, are they also doing the sorting? Do you have a fixed data structure you need to use, or could you effectively create a copy using a balanced binary tree insertion sort? What are the storage requirements?
Here is a comparison of algorithms. The two better candidates are Quick Sort and Merge Sort. Quick Sort is in general better, but has a worse worst case performance.
Merge sort is definitely the way to go here, as you can use a Map/Reduce type algorithm to have several humans doing the comparisons in parallel.
Quicksort is essentially a single threaded sort algorithm.
You could also tweak the merge sort algorithm so that instead of comparing two objects you present your human with a list of say five items and ask him or her to rank them.
Another possibility would be to use a ranking system like the one used by the famous "Hot or Not" web site. This requires many, many more comparisons, but the comparisons can happen in any sequence and in parallel, so this would work faster than a classic sort provided you have enough humanoids at your disposal.
The question raises more questions, really.
Are we talking a single human performing the comparisons? It's a very different challenge if you are talking a group of humans trying to arrange objects in order.
What about the questions of trust and error? Not everyone can be trusted to get everything right, and certain sorts would go catastrophically wrong if at any given point you provided the wrong answer to a single comparison.
What about subjectivity? "Rank these pictures in order of cuteness". Once you get to this point, it could get really complex. As someone else mentions, something like "hot or not" is the simplest conceptually, but isn't very efficient. At its most complex, I'd say that Google is a way of sorting objects into an order, where the search engine is inferring the comparisons made by humans.
The best one would be the merge sort
It runs in n*log(n) [base 2] time, which matches the minimum for comparison sorts.
The way it is implemented is
If the list is of length 0 or 1, then it is already sorted.
Otherwise:
Divide the unsorted list into two sublists of about half the size.
Sort each sublist recursively by re-applying merge sort.
Merge the two sublists back into one sorted list.
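The steps above can be sketched as follows, with a counter added so you can see how many questions the human would have to answer. The `prefer(a, b)` callback (True when `a` should come before `b`) and the one-element `counter` list are assumptions made for this example.

```python
def merge_sort(items, prefer, counter):
    """Return a new best-first list; counter[0] accumulates comparisons."""
    if len(items) <= 1:
        return list(items)            # length 0 or 1: already sorted
    mid = len(items) // 2
    left = merge_sort(items[:mid], prefer, counter)
    right = merge_sort(items[mid:], prefer, counter)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        counter[0] += 1               # one question for the human
        if prefer(right[j], left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])    # ties keep the left item first
            i += 1
    merged.extend(left[i:])           # at most one of these is non-empty
    merged.extend(right[j:])
    return merged
```

For 8 items this never needs more than 17 comparisons, compared with 28 if every pair were shown.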

Resources