Looking for a sort algorithm with as few as possible compare operations - algorithm

I want to sort items where the comparison is performed by humans:
Pictures
Priority of work items
...
For these tasks the number of comparisons is the limiting factor for performance.
What is the minimum number of comparisons needed (I assume > N for N items)?
Which algorithm guarantees this minimum number?

To answer this, we need to make a lot of assumptions.
Let's assume we are sorting pictures by cuteness. The goal is to get the maximum usable information from the human in the least amount of time. This interaction will dominate all other computation, so it's the only one that counts.
As someone else mentioned, humans can deal well with ordering several items in one interaction. Let's say we can get eight items in relative order per round.
Each round introduces seven edges into a directed graph where the nodes are the pictures. If node A is reachable from node B, then node A is cuter than node B. Keep this graph in mind.
Now, let me tell you about a problem the Navy and the Air Force solve differently. They both want to get a group of people in height order and quickly. The Navy tells people to get in line, then if you're shorter than the guy in front of you, switch places, and repeat until done. In the worst case, it's N*N comparison.
The Air Force tells people to stand in a square grid. They shuffle front-to-back on sqrt(N) people, which means worst case sqrt(N)*sqrt(N) == N comparisons. However, the people are only sorted along one dimension. So therefore, the people face left, then do the same shuffle again. Now we're up to 2*N comparisons, and the sort is still imperfect but it's good enough for government work. There's a short corner, a tall corner opposite, and a clear diagonal height gradient.
You can see how the Air Force method gets results in less time if you don't care about perfection. You can also see how to get the perfection effectively. You already know that the very shortest and very longest men are in two corners. The second-shortest might be behind or beside the shortest, the third shortest might be behind or beside him. In general, someone's height rank is also his maximum possible Manhattan distance from the short corner.
Looking back at the graph analogy, the eight nodes to present each round are eight of those with the currently most common length of longest inbound path. The length of the longest inbound path also represents the node's minimum possible sorted rank.
You'll use a lot of CPU following this plan, but you will make the best possible use of your human resources.

From an assignment I once did on this very subject ...
The comparison counts are for various sorting algorithms operating on data in a random order
Size QkSort HpSort MrgSort ModQk InsrtSort
2500 31388 48792 25105 27646 1554230
5000 67818 107632 55216 65706 6082243
10000 153838 235641 120394 141623 25430257
20000 320535 510824 260995 300319 100361684
40000 759202 1101835 561676 685937
80000 1561245 2363171 1203335 1438017
160000 3295500 5045861 2567554 3047186
These comparison counts are for various sorting algorithms operating on data that is started 'nearly sorted'. Amongst other things it shows a the pathological case of quicksort.
Size QkSort HpSort MrgSort ModQk InsrtSort
2500 72029 46428 16001 70618 76050
5000 181370 102934 34503 190391 3016042
10000 383228 226223 74006 303128 12793735
20000 940771 491648 158015 744557 50456526
40000 2208720 1065689 336031 1634659
80000 4669465 2289350 712062 3820384
160000 11748287 4878598 1504127 10173850
From this we can see that merge sort is the best by number of comparisons.
I can't remember what the modifications to the quick sort algorithm were, but I believe it was something that used insertion sorts once the individual chunks got down to a certain size. This sort of thing is commonly done to optimise quicksort.
You might also want to look up Tadao Takaoka's 'Minimal Merge Sort', which is a more efficient version of the merge sort.

Pigeon hole sorting is order N and works well with humans if the data can be pigeon holed. A good example would be counting votes in an election.

You should consider that humans might make non-transitive comparisons, e.g. they favor A over B, B over C but also C over A. So when choosing your sort algorithm, make sure it doesn't completely break when that happens.

People are really good at ordering 5-10 things from best to worst and come up with more consistent results when doing so. I think trying to apply a classical sorting algo might not work here because of the typically human multi-compare approach.
I'd argue that you should have a round robin type approach and try to bucket things into their most consistent groups each time. Each iteration would only make the result more certain.
It'd be interesting to write too :)

If comparisons are expensive relative to book-keeping costs, you might try the following algorithm which I call "tournament sort". First, some definitions:
Every node has a numeric "score" property (which must be able to hold values from 1 to the number of nodes), and a "last-beat" and "fellow-loser" properties, which must be able to hold node references.
A node is "better" than another node if it should be output before the other.
An element is considered "eligible" if there are no elements known to be better than it which have been output, and "ineligible" if any element which has not been output is known to be better than it.
The "score" of a node is the number of nodes it's known to be better than, plus one.
To run the algorithm, initially assign every node a score of 1. Repeatedly compare the two lowest-scoring eligible nodes; after each comparison, mark the loser "ineligible", and add the loser's score to the winner's (the loser's score is unaltered). Set the loser's "fellow loser" property to the winner's "last-beat", and the winner's "last-beat" property to the loser. Iterate this until only one eligible node remains. Output that node, and make eligible all nodes the winner beat (using the winner's "last-beat" and the chain of "fellow-loser" properties). Then continue the algorithm on the remaining nodes.
The number of comparisons with 1,000,000 items was slightly lower than that of a stock library implementation of Quicksort; I'm not sure how the algorithm would compare against a more modern version of QuickSort. Bookkeeping costs are significant, but if comparisons are sufficiently expensive the savings could possibly be worth it. One interesting feature of this algorithm is that it will only perform comparisons relevant to determining the next node to be output; I know of no other algorithm with that feature.

I don't think you're likely to get a better answer than the Wikipedia page on sorting.
Summary:
For arbitrary comparisons (where you can't use something like radix sorting) the best you can achieve is O(n log n)
Various algorithms achieve this - see the "comparison of algorithms" section.
The commonly used QuickSort is O(n log n) in a typical case, but O(n^2) in the worst case; there are often ways to avoid this, but if you're really worried about the cost of comparisons, I'd go with something like MergeSort or a HeapSort. It partly depends on your existing data structures.
If humans are doing the comparisons, are they also doing the sorting? Do you have a fixed data structure you need to use, or could you effectively create a copy using a balanced binary tree insertion sort? What are the storage requirements?

Here is a comparison of algorithms. The two better candidates are Quick Sort and Merge Sort. Quick Sort is in general better, but has a worse worst case performance.

Merge sort is definately the way to go here as you can use a Map/Reduce type algorithm to have several humans doing the comparisons in parallel.
Quicksort is essentially a single threaded sort algorithm.
You could also tweak the merge sort algorithm so that instead of comparing two objects you present your human with a list of say five items and ask him or her to rank them.
Another possibility would be to use a ranking system as used by the famous "Hot or Not" web site. This requires many many more comparisons, but, the comparisons can happen in any sequence and in parallel, this would work faster than a classic sort provided you have enough huminoids at your disposal.

The questions raises more questions really.
Are we talking a single human performing the comparisons? It's a very different challenge if you are talking a group of humans trying to arrange objects in order.
What about the questions of trust and error? Not everyone can be trusted or to get everything right - certain sorts would go catastrophically wrong if at any given point you provided the wrong answer to a single comparison.
What about subjectivity? "Rank these pictures in order of cuteness". Once you get to this point, it could get really complex. As someone else mentions, something like "hot or not" is the simplest conceptually, but isn't very efficient. At it's most complex, I'd say that google is a way of sorting objects into an order, where the search engine is inferring the comparisons made by humans.

The best one would be the merge sort
The minimum run time is n*log(n) [Base 2]
The way it is implemented is
If the list is of length 0 or 1, then it is already sorted.
Otherwise:
Divide the unsorted list into two sublists of about half the size.
Sort each sublist recursively by re-applying merge sort.
Merge the two sublists back into one sorted list.

Related

Sorting in Computer Science vs. sorting in the 'real' world

I was thinking about sorting algorithms in software, and possible ways one could surmount the O(nlogn) roadblock. I don't think it IS possible to sort faster in a practical sense, so please don't think that I do.
With that said, it seems with almost all sorting algorithms, the software must know the position of each element. Which makes sense, otherwise, how would it know where to place each element according to some sorting criteria?
But when I crossed this thinking with the real world, a centrifuge has no idea what position each molecule is in when it 'sorts' the molecules by density. In fact, it doesn't care about the position of each molecule. However it can sort trillions upon trillions of items in a relatively short period of time, due to the fact that each molecule follows density and gravitational laws - which got me thinking.
Would it be possible with some overhead on each node (some value or method tacked on to each of the nodes) to 'force' the order of the list? Something like a centrifuge, where only each element cares about its relative position in space (in relation to other nodes). Or, does this violate some rule in computation?
I think one of the big points brought up here is the quantum mechanical effects of nature and how they apply in parallel to all particles simultaneously.
Perhaps classical computers inherently restrict sorting to the domain of O(nlogn), where as quantum computers may be able to cross that threshold into O(logn) algorithms that act in parallel.
The point that a centrifuge being basically a parallel bubble sort seems to be correct, which has a time complexity of O(n).
I guess the next thought is that if nature can sort in O(n), why can't computers?
EDIT: I had misunderstood the mechanism of a centrifuge and it appears that it does a comparison, a massively-parallel one at that. However there are physical processes that operate on a property of the entity being sorted rather than comparing two properties. This answer covers algorithms that are of that nature.
A centrifuge applies a sorting mechanism that doesn't really work by means of comparisons between elements, but actually by a property ('centrifugal force') on each individual element in isolation.Some sorting algorithms fall into this theme, especially Radix Sort. When this sorting algorithm is parallelized it should approach the example of a centrifuge.
Some other non-comparative sorting algorithms are Bucket sort and Counting Sort. You may find that Bucket sort also fits into the general idea of a centrifuge (the radius could correspond to a bin).
Another so-called 'sorting algorithm' where each element is considered in isolation is the Sleep Sort. Here time rather than the centrifugal force acts as the magnitude used for sorting.
Computational complexity is always defined with respect to some computational model. For example, an algorithm that's O(n) on a typical computer might be O(2n) if implemented in Brainfuck.
The centrifuge computational model has some interesting properties; for example:
it supports arbitrary parallelism; no matter how many particles are in the solution, they can all be sorted simultaneously.
it doesn't give a strict linear sort of particles by mass, but rather a very close (low-energy) approximation.
it's not feasible to examine the individual particles in the result.
it's not possible to sort particles by different properties; only mass is supported.
Given that we don't have the ability to implement something like this in general-purpose computing hardware, the model may not have practical relevance; but it can still be worth examining, to see if there's anything to be learned from it. Nondeterministic algorithms and quantum algorithms have both been active areas of research, for example, even though neither is actually implementable today.
The trick is there, that you only have a probability of sorting your list using a centrifuge. As with other real-world sorts [citation needed], you can change the probability that your have sorted your list, but never be certain without checking all the values (atoms).
Consider the question: "How long should you run your centrifuge for?"
If you only ran it for a picosecond, your sample may be less sorted than the initial state.. or if you ran it for a few days, it may be completely sorted. However, you wouldn't know without actually checking the contents.
A real world example of a computer based "ordering" would be autonomous drones that cooperatively work with each other, known as "drone swarms". The drones act and communicate both as individuals and as a group, and can track multiple targets. The drones collectively decide which drones will follow which targets and the obvious need to avoid collisions between drones. The early versions of this were drones that moved through way points while staying in formation, but the formation could change.
For a "sort", the drones could be programmed to form a line or pattern in a specific order, initially released in any permutation or shape, and collectively and in parallel they would quickly form the ordered line or pattern.
Getting back to a computer based sort, one issue is that there's one main memory bus, and there's no way for a large number of objects to move about in memory in parallel.
know the position of each element
In the case of a tape sort, the position of each element (record) is only "known" to the "tape", not to the computer. A tape based sort only needs to work with two elements at a time, and a way to denote run boundaries on a tape (file mark, or a record of different size).
IMHO, people overthink log(n). O(nlog(n)) IS practically O(n). And you need O(n) just to read the data.
Many algorithms such as quicksort do provide a very fast way to sort elements. You could implement variations of quicksort that would be very fast in practice.
Inherently all physical systems are infinitely parallel. You might have a buttload of atoms in a grain of sand, nature has enough computational power to figure out where each electron in each atom should be. So if you had enough computational resources (O(n) processors) you could sort n numbers in log(n) time.
From comments:
Given a physical processor that has k number of elements, it can achieve a parallelness of at most O(k). If you process n numbers arbitrarily, it would still process it at a rate related to k. Also, you could formulate this problem physically. You could create n steel balls with weights proportional to the number you want to encode, which could be solved by a centrifuge in a theory. But here the amount of atoms you are using is proportional to n. Whereas in a standard case you have a limited number of atoms in a processor.
Another way to think about this is, say you have a small processor attached to each number and each processor can communicate with its neighbors, you could sort all those numbers in O(log(n)) time.
I worked in an office summers after high school when I started college. I had studied in AP Computer Science, among other things, sorting and searching.
I applied this knowledge in several physical systems that I can recall:
Natural merge sort to start…
A system printed multipart forms including a file-card-sized tear off, which needed to be filed in a bank of drawers.
I started with a pile of them and sorted the pile to begin with. The first step is picking up 5 or so, few enough to be easily placed in order in your hand. Place the sorted packet down, criss-crossing each stack to keep them separate.
Then, merge each pair of stacks, producing a larger stack. Repeat until there is only one stack.
…Insertion sort to complete
It is easier to file the sorted cards, as each next one is a little farther down the same open drawer.
Radix sort
This one nobody else understood how I did it so fast, despite repeated tries to teach it.
A large box of check stubs (the size of punch cards) needs to be sorted. It looks like playing solitaire on a large table—deal out, stack up, repeat.
In general
30 years ago, I did notice what you’re asking about: the ideas transfer to physical systems quite directly because there are relative costs of comparisons and handling records, and levels of caching.
Going beyond well-understood equivalents
I recall an essay about your topic, and it brought up the spaghetti sort. You trim a length of dried noodle to indicate the key value, and label it with the record ID. This is O(n), simply processing each item once.
Then you grab the bundle and tap one end on the table. They align on the bottom edges, and they are now sorted. You can trivially take off the longest one, and repeat. The read-out is also O(n).
There are two things going on here in the “real world” that don’t correspond to algorithms. First, aligning the edges is a parallel operation. Every data item is also a processor (the laws of physics apply to it). So, in general, you scale the available processing with n, essentially dividing your classic complexity by a factor on n.
Second, how does aligning the edges accomplish a sort? The real sorting is in the read-out which lets you find the longest in one step, even though you did compare all of them to find the longest. Again, divide by a factor of n, so finding the largest is now O(1).
Another example is using analog computing: a physical model solves the problem “instantly” and the prep work is O(n). In principle the computation is scaling with the number of interacting components, not the number of prepped items. So the computation scales with n². The example I'm thinking of is a weighted multi-factor computation, which was done by drilling holes in a map, hanging weights from strings passing through the holes, and gathering all the strings on a ring.
Sorting is still O(n) total time. That it is faster than that is because of Parallelization.
You could view a centrifuge as a Bucketsort of n atoms, parallelized over n cores(each atom acts as a processor).
You can make sorting faster by parallelization but only by a constant factor because the number of processors is limited, O(n/C) is still O(n) (CPUs have usually < 10 cores and GPUs < 6000)
The centrifuge is not sorting the nodes, it applies applies a force to them then they react in parallel to it.
So if you were to implement a bubble sort where each node is moving itself in parallel up or down based on it's "density", you'd have a centrifuge implementation.
Keep in mind that in the real world you can run a very large amount of parallel tasks where in a computer you can have a maximum of real parallel tasks equals to the number of physical processing units.
In the end, you would also be limited with the access to the list of elements because it cannot be modified simultaneously by two nodes...
Would it be possible with some overhead on each node (some value or
method tacked on to each of the nodes) to 'force' the order of the
list?
When we sort using computer programs we select a property of the values being sorted. That's commonly magnitude of the number or the alphabetical order.
Something like a centrifuge, where only each element cares about its
relative position in space (in relation to other nodes)
This analogy aptly reminds me of simple bubble sort. How smaller numbers bubble up in each iteration. Like your centrifuge logic.
So to answer this, don't we actually do something of that sort in software based sorting?
First of all, you are comparing two different contexts, one is logic(computer) and the other is physics which (so far) is proven that we can model some parts of it using mathematical formulas and we as programmers can use this formulas to simulate (some parts of) physics in the logic work (e.g physics engine in game engine).
Second We have some possibilities in the computer (logic) world that is nearly impossible in physics for example we can access memory and find the exact location of each entity at each time but in physics that is a huge problem Heisenberg's uncertainty principle.
Third If you want to map centrifuges and its operation in real world, to computer world, it is like someone (The God) has given you a super-computer with all the rules of physics applied and you are doing your small sorting in it (using centrifuge) and by saying that your sorting problem was solved in o(n) you are ignoring the huge physics simulation going on in background...
Consider: is "centrifuge sort" really scaling better? Think about what happens as you scale up.
The test tubes have to get longer and longer.
The heavy stuff has to travel further and further to get to the bottom.
The moment of inertia increases, requiring more power and longer times to accelerate up to sorting speed.
It's also worth considering other problems with centrifuge sort. For example, you can only operate on a narrow size scale. A computer sorting algorithm can handle integers from 1 to 2^1024 and beyond, no sweat. Put something that weighs 2^1024 times as much as a hydrogen atom into a centrifuge and, well, that's a black hole and the galaxy has been destroyed. The algorithm failed.
Of course the real answer here is that computational complexity is relative to some computational model, as mentioned in other answer. And "centrifuge sort" doesn't make sense in the context of common computational models, such as the RAM model or the IO model or multitape Turing machines.
Another perspective is that what you're describing with the centrifuge is analogous to what's been called the "spaghetti sort" (https://en.wikipedia.org/wiki/Spaghetti_sort). Say you have a box of uncooked spaghetti rods of varying lengths. Hold them in your fist, and loosen your hand to lower them vertically so the ends are all resting on a horizontal table. Boom! They're sorted by height. O(constant) time. (Or O(n) if you include picking the rods out by height and putting them in a . . . spaghetti rack, I guess?)
You can note there that it's O(constant) in the number of pieces of spaghetti, but, due to the finite speed of sound in spaghetti, it's O(n) in the length of the longest strand. So nothing comes for free.

Simple ordering for a linked list

I want to create a doubly linked list with an order sequence (an integer attribute) such that sorting by the order sequence could create an array that would effectively be equivalent to the linked list.
given: a <-> b <-> c
a.index > b.index
b.index > c.index
This index would need to handle efficiently arbitrary numbers of inserts.
Is there a known algorithm for accomplishing this?
The problem is when the list gets large and the index sequence has become packed. In that situation the list has to be scanned to put slack back in.
I'm just not sure how this should be accomplished. Ideally there would be some sort of automatic balancing so that this borrowing is both fast and rare.
The naive solution of changing all the left or right indecies by 1 to make room for the insert is O(n).
I'd prefer to use integers, as I know numbers tend to get less reliable in floating point as they approach zero in most implementations.
This is one of my favorite problems. In the literature, it's called "online list labeling", or just "list labeling". There's a bit on it in wikipedia here: https://en.wikipedia.org/wiki/Order-maintenance_problem#List-labeling
Probably the simplest algorithm that will be practical for your purposes is the first one in here: https://www.cs.cmu.edu/~sleator/papers/maintaining-order.pdf.
It handles insertions in amortized O(log N) time, and to manage N items, you have to use integers that are big enough to hold N^2. 64-bit integers are sufficient in almost all practical cases.
What I wound up going for was a roll-my-own solution, because it looked like the algorithm wanted to have the entire list in memory before it would insert the next node. And that is no good.
My idea is to borrow some of the ideas for the algorithm. What I did was make Ids ints and sort orders longs. Then the algorithm is lazy, stuffing entries anywhere they'll fit. Once it runs out of space in some little clump somewhere it begins a scan up and down from the clump and tries to establish an even spacing such that if there are n items scanned they need to share n^2 padding between them.
In theory this will mean over time the list will be perfectly padded, and given that my IDs are ints and my sort orders are longs, there will never be a scenario where you will not be able to achieve n^2 padding. I can't speak to the upper bounds on the number of operations, but my guts tell me that by doing polynomial work at 1/polynomial frequency, that I'll be doing just fine.

Sort elements with the fewest comparisons possible

I have a folder full of images. There are too many to just 'rank'. I made a program that shows two at a time and let's the user pick which one of the two is better. At the end I would like all of the photos to be ordered from best to worst.
I am purely trying to optimize for the fewest amount of comparisons possible. I don't care if the program runs in n cubed time. I've read the other questions here with similar questions but I'm looking for something more advanced.
I'm thinking maybe some sort of algorithm that based on what comparisons you've already made, the program chooses two images to compare that will offer the most information. Maybe even an algorithm that makes complex connections to help determine the orders and potential orders.
Like I said I don't care if it is slow just purely trying to minimize comparisons
If total order exists, you need at least nlog2(n) comparisons. It can be easily proved mathematically. No way around. So regular sorting algorithms in nlog(n) will do the job.
What you are trying to do is called 'topological sort'. Google it and read about it in wikipedia. You can achieve partial sorts in less comparisons. Its kind of a graduate sort. The more comparisons you get, the better the result will be.
However, what do you do if no total order exists? Humans are not able to generate a total order for subjective tasks.
For example picture 1 is better than 2, 2 is better than 3 but 3 is better than 1.
In this case no sorting algorithm can produce a permutation which will match all the decisions. During topological sort, you can detect those inconsitent decisions and get rid of them.
You are looking for a sorting algorithm - pick one. Most algorithms just need a comparison function (a < b?). This is when you show the user two pictures and he has to choose the better one.
You might wan't to read trough some of the algorithms and choose the best one for you. E.g. on quicksort, you would pick a random picture and the user have to compare this picture against all other pictures in the first round - might be too boring from the end user perspective.

Finding the average of large list of numbers

Came across this interview question.
Write an algorithm to find the mean(average) of a large list. This
list could contain trillions or quadrillions of number. Each number is
manageable in hundreds, thousands or millions.
Googling it gave me all Median of Medians solutions. How should I approach this problem?
Is divide and conquer enough to deal with trillions of number?
How to deal with the list of the such a large size?
If the size of the list is computable, it's really just a matter of how much memory you have available, how long it's supposed to take and how simple the algorithm is supposed to be.
Basically, you can just add everything up and divide by the size.
If you don't have enough memory, dividing first might work (Note that you will probably lose some precision that way).
Another approach would be to recursively split the list into 2 halves and calculating the mean of the sublists' means. Your recursion termination condition is a list size of 1, in which case the mean is simply the only element of the list. If you encounter a list of odd size, make either the first or second sublist longer, this is pretty much arbitrary and doesn't even have to be consistent.
If, however, you list is so giant that its size can't be computed, there's no way to split it into 2 sublists. In that case, the recursive approach works pretty much the other way around. Instead of splitting into 2 lists with n/2 elements, you split into n/2 lists with 2 elements (or rather, calculate their mean immediately). So basically, you calculate the mean of elements 1 and 2, that becomes you new element 1. the mean of 3 and 4 is your new second element, and so on. Then apply the same algorithm to the new list until only 1 element remains. If you encounter a list of odd size, either add an element at the end or ignore the last one. If you add one, you should try to get as close as possible to your expected mean.
While this won't calculate the mean mathematically exactly, for lists of that size, it will be sufficiently close. This is pretty much a mean of means approach. You could also go the median of medians route, in which case you select the median of sublists recursively. The same principles apply, but you will generally want to get an odd number.
You could even combine the approaches and calculate the mean if your list is of even size and the median if it's of odd size. Doing this over many recursion steps will generate a pretty accurate result.
First of all, this is an interview question. The problem as stated would not arise in practice. Also, the question as stated here is imprecise. That is probably deliberate. (They want to see how you deal with solving an imprecisely specified problem.)
Write an algorithm to find the mean(average) of a large list.
The word "find" is rubbery. It could mean calculate (to some precision) or it could mean estimate.
The phrase "large list" is rubbery. If could mean a list or array data structure in memory, or the "list" could be the result of a database query, the contents of a file or files.
There is no mention of the hardware constraints on the system where this will be implemented.
So the first thing >>I<< would do would be to try to narrow the scope by asking some questions of the interviewer.
But assuming that you can't, then a complete answer would need to cover the following points:
The dataset probably won't fit in memory at the same time. (But if it does, then that is good.)
Calculating the average of N numbers is O(N) if you do it serially. For N this size, it could be an intractable problem.
An alternative is to split into sublists of equals size and calculate the averages, and the average of the averages. In theory, this gives you O(N/P) where P is the number of partitions. The parallelism could be implemented with multiple threads, with multiple processes on the same machine, or distributed.
In practice, the limiting factors are going to be computational, memory and/or I/O bandwidth. A parallel solution will be effective if you can address these limits. For example, you need to balance the problem of each "worker" having uncontended access to its "sublist" versus the problem of making copies of the data so that that can happen.
If the list is represented in a way that allows sampling, then you can estimate the average without looking at the entire dataset. In fact, this could be O(C) depending on how you sample. But there is a risk that your sample will be unrepresentative, and the average will be too inaccurate.
In all cases doing calculations, you need to guard against (integer) overflow and (floating point) rounding errors. Especially while calculating the sums.
It would be worthwhile discussing how you would solve this with a "big data" platform (e.g. Hadoop) and the limitations of that approach (e.g. time taken to load up the data ...)

Looking for a multidimensional optimization algorithm

Problem description
There are different categories which contain an arbitrary amount of elements.
There are three different attributes A, B and C. Each element does have an other distribution of these attributes. This distribution is expressed through a positive integer value. For example, element 1 has the attributes A: 42 B: 1337 C: 18. The sum of these attributes is not consistent over the elements. Some elements have more than others.
Now the problem:
We want to choose exactly one element from each category so that
We hit a certain threshold on attributes A and B (going over it is also possible, but not necessary)
while getting a maximum amount of C.
Example: we want to hit at least 80 A and 150 B in sum over all chosen elements and want as many C as possible.
I've thought about this problem and cannot imagine an efficient solution. The sample sizes are about 15 categories from which each contains up to ~30 elements, so bruteforcing doesn't seem to be very effective since there are potentially 30^15 possibilities.
My model is that I think of it as a tree with depth number of categories. Each depth level represents a category and gives us the choice of choosing an element out of this category. When passing over a node, we add the attributes of the represented element to our sum which we want to optimize.
If we hit the same attribute combination multiple times on the same level, we merge them so that we can stripe away the multiple computation of already computed values. If we reach a level where one path has less value in all three attributes, we don't follow it anymore from there.
However, in the worst case this tree still has ~30^15 nodes in it.
Does anybody of you can think of an algorithm which may aid me to solve this problem? Or could you explain why you think that there doesn't exist an algorithm for this?
This question is very similar to a variation of the knapsack problem. I would start by looking at solutions for this problem and see how well you can apply it to your stated problem.
My first inclination to is try branch-and-bound. You can do it breadth-first or depth-first, and I prefer depth-first because I think it's cleaner.
To express it simply, you have a tree-walk procedure walk that can enumerate all possibilities (maybe it just has a 5-level nested loop). It is augmented with two things:
At every step of the way, it keeps track of the cost at that point, where the cost can only increase. (If the cost can also decrease, it becomes more like a minimax game tree search.)
The procedure has an argument budget, and it does not search any branches where the cost can exceed the budget.
Then you have an outer loop:
for (budget = 0; budget < ... ; budget++){
walk(budget);
// if walk finds a solution within the budget, halt
}
The amount of time it takes is exponential in the budget, so easier cases will take less time. The fact that you are re-doing the search doesn't matter much because each level of the budget takes as much or more time than all the previous levels combined.
Combine this with some sort of heuristic about the order in which you consider branches, and it may give you a workable solution for typical problems you give it.
IF that doesn't work, you can fall back on basic heuristic programming. That is, do some cases by hand, and pay attention to how you did it. Then program it the same way.
I hope that helps.

Resources