Splitting a set of objects into several subsets according to a certain evaluation - algorithm

Suppose I have a set of objects, S. There is an algorithm f that, given a set S, builds a certain data structure D on it: f(S) = D. If S is large and/or contains vastly different objects, D becomes large, to the point of being unusable (i.e. not fitting in the allotted memory). To overcome this, I split S into several non-intersecting subsets, S = S1 + S2 + ... + Sn, and build Di for each subset. Using n structures is less efficient than using one, but at least this way I can fit into the memory constraints. Since the size of f(S) grows faster than S itself, the combined size of the Di is much less than the size of D.
However, it is still desirable to reduce n, i.e. the number of subsets; or reduce the combined size of Di. For this, I need to split S in such a way that each Si contains "similar" objects, because then f will produce a smaller output structure if input objects are "similar enough" to each other.
The problem is that while the "similarity" of objects in S and the size of f(S) do correlate, there is no way to compute the latter other than by evaluating f(S), and f is not particularly fast.
The algorithm I currently have iteratively adds each next object from S to one of the Si, chosen so that this results in the least possible (at this stage) increase in the combined size of the Di:
for x in S:
    i = the index i for which
            size(f(Si + {x})) - size(f(Si))
        is minimal
    Si = Si + {x}
This gives practically useful results, but it is certainly pretty far from the optimum (i.e. the minimal possible combined size). It is also slow. To speed it up somewhat, I compute size(f(Si + {x})) - size(f(Si)) only for those i where x is "similar enough" to the objects already in Si.
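For reference, the greedy loop described above could be sketched in Python roughly as follows. This is only a minimal sketch: `structure_size` is a hypothetical stand-in for size(f(·)), and `toy_size` (the squared number of distinct characters) is invented purely so that the example runs and grows faster than its input; neither comes from the real problem.

```python
def greedy_partition(objects, n_subsets, structure_size):
    """Assign each object to the subset whose structure grows the least."""
    subsets = [[] for _ in range(n_subsets)]
    sizes = [structure_size(s) for s in subsets]  # cached size(f(Si))
    for x in objects:
        best_i, best_size, best_growth = None, None, None
        for i, subset in enumerate(subsets):
            new_size = structure_size(subset + [x])
            growth = new_size - sizes[i]
            # Ties are resolved in favour of the lowest index.
            if best_growth is None or growth < best_growth:
                best_i, best_size, best_growth = i, new_size, growth
        subsets[best_i].append(x)
        sizes[best_i] = best_size
    return subsets, sum(sizes)


def toy_size(subset):
    """Hypothetical size(f(S)): squared count of distinct characters."""
    return len(set("".join(subset))) ** 2


words = ["aaa", "xyz", "aab", "xxz", "abb", "zzy"]
subsets, total = greedy_partition(words, 2, toy_size)
print(subsets, total)
```

Caching the current size of every Di halves the number of calls to the expensive function, which matters when f is slow.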
Is there any standard approach to such kinds of problems?
I know of the branch and bound family of algorithms, but it cannot be applied here because it would be prohibitively slow. My guess is that it is simply not possible to compute the optimal distribution of S into Si in reasonable time. But is there some common iteratively improving algorithm?
As comments noted, I never defined "similarity". In fact, all I want is to split S into subsets Si such that the combined size of Di = f(Si) is minimal, or at least small enough. "Similarity" is defined only as this and unfortunately cannot be computed easily. I do have a simple approximation, but it is only that: an approximation.
So, what I need is a (likely heuristic) algorithm that minimizes the sum of size(f(Si)), given that there is no simple way to compute the latter, only approximations that I use to throw away cases that are very unlikely to give good results.

About the slowness: I found that in similar problems a good-enough solution is to compute the match by picking just a fixed number of random candidates.
True, the result will not be the best one (it is often worse than the full "greedy" solution you implemented), but in my experience it is not too bad, and you can choose the speed. It can even be made to run in a prescribed amount of time (that is, you keep searching until the allocated time expires).
Another option I use is to keep searching until I see no improvement for a while.
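A rough Python sketch of both stopping rules (a time budget, and giving up after a run of tries without improvement) might look like this. The names `structure_size`, `time_budget_s` and `patience` are hypothetical, chosen only for illustration.

```python
import random
import time


def sampled_best_subset(x, subsets, structure_size,
                        time_budget_s=0.01, patience=5, rng=None):
    """Pick a subset for x by evaluating random candidates only.

    Stops when the time budget expires or after `patience` consecutive
    candidates that did not improve on the best one found so far.
    At least one candidate is always evaluated.
    """
    rng = rng or random.Random()
    deadline = time.monotonic() + time_budget_s
    best_i, best_growth, stale = None, None, 0
    while best_i is None or (time.monotonic() < deadline and stale < patience):
        i = rng.randrange(len(subsets))
        growth = structure_size(subsets[i] + [x]) - structure_size(subsets[i])
        if best_growth is None or growth < best_growth:
            best_i, best_growth, stale = i, growth, 0
        else:
            stale += 1
    return best_i
```

The same candidate may be drawn more than once; for a small number of subsets it would be cheaper to shuffle the indices once and walk through them.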
To get past the greedy logic you could keep a queue of N "x" elements and try to pack them simultaneously in groups of "k" (with k < N).
In this case I found that it is important to also keep the "age" of an element in the queue and to use it as a bonus in the score, to avoid keeping "bad" elements in the queue forever because others will always match better (this would make the queue search useless, and the results would be basically the same as with the greedy approach).


Ideas for heuristically solving travelling salesman with extra constraints

I'm trying to come up with a fast and reasonably optimal algorithm to solve the following TSP/hamiltonian-path-like problem:
A delivery vehicle has a number of pickups and dropoffs it needs to perform.
For each delivery, the pickup needs to come before the dropoff.
The vehicle is quite small and the packages vary in size. The total carriage cannot exceed some upper bound (e.g. 1 cubic metre).
Each delivery has a deadline.
The planner can run mid-route, so the vehicle will begin with a number of jobs already picked up and some capacity already taken up.
A near-optimal solution should minimise the total cost (for simplicity, distance) between each waypoint. If a solution does not exist because of the time constraints, I need to find a solution that has the fewest number of late deliveries. Some illustrations of an example problem and a non-optimal, but valid solution:
I am currently using a greedy best first search with backtracking bounded to 100 branches. If it fails to find a solution with on-time deliveries, I randomly generate as many as I can in one second (the most computational time I can spare) and pick the one with the fewest number of late deliveries. I have looked into linear programming but can't get my head around it - plus I would think it would be inappropriate given it needs to be run very frequently. I've also tried algorithms that require mutating the tour, but the issue is mutating a tour nearly always makes it invalid due to capacity constraints and precedence. Can anyone think of a better heuristic approach to solving this problem? Many thanks!
Safe Moves
Here are some ideas for safely mutating an existing feasible solution:
Any two consecutive stops can always be swapped if they are both pickups, or both deliveries. This is obviously true for the "both deliveries" case; for the "both pickups" case: if you had room to pick up A, then pick up B without delivering anything in between, then you have room to pick up B first, then pick up A. (In fact a more general rule is possible: In any pure-delivery or pure-pickup sequence of consecutive stops, the stops can be rearranged arbitrarily. But enumerating all the possibilities might become prohibitive for long sequences, and you should be able to get most of the benefit by considering just pairs.)
A pickup of A can be swapped with any later delivery of something else B, provided that A's original pickup comes after B was picked up, and A's own delivery comes after B's original delivery. In the special case where the pickup of A is immediately followed by the delivery of B, they can always be swapped.
If there is a delivery of an item of size d followed by a pickup of an item of size p, then they can be swapped provided that there is enough extra room: specifically, provided that f >= p, where f is the free space available before the delivery. (We already know that f + d >= p, otherwise the original schedule wouldn't be feasible -- this is a hint to look for small deliveries to apply this rule to.)
If you are starting from purely randomly generated schedules, then simply trying all possible moves, greedily choosing the best, applying it and then repeating until no more moves yield an improvement should give you a big quality boost!
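As a minimal sketch of the "try all moves" step, the following Python code does not encode the individual swap rules above; it swaps each pair of consecutive stops and re-checks precedence and capacity directly, which should accept at least the adjacent swaps those rules allow. The route representation (a list of `("pickup" | "dropoff", job)` tuples), the `sizes` dictionary and the `onboard` parameter are assumptions made for this example, and deadlines are deliberately left to the scoring step.

```python
def is_feasible(route, sizes, capacity, onboard=()):
    """Check precedence and capacity for a route.

    route:   list of ("pickup" | "dropoff", job) tuples
    onboard: jobs already in the vehicle when planning starts
    """
    carried = set(onboard)
    load = sum(sizes[j] for j in carried)
    if load > capacity:
        return False
    for kind, job in route:
        if kind == "pickup":
            if job in carried:
                return False          # picked up twice
            carried.add(job)
            load += sizes[job]
            if load > capacity:
                return False          # vehicle would be over-full
        else:
            if job not in carried:
                return False          # dropoff before its pickup
            carried.remove(job)
            load -= sizes[job]
    return True


def feasible_adjacent_swaps(route, sizes, capacity, onboard=()):
    """Yield every route that swaps two consecutive stops and stays feasible."""
    for i in range(len(route) - 1):
        candidate = route[:i] + [route[i + 1], route[i]] + route[i + 2:]
        if is_feasible(candidate, sizes, capacity, onboard):
            yield candidate


# Sizes in litres so that all arithmetic stays in integers.
sizes = {"A": 600, "B": 300}
route = [("pickup", "A"), ("dropoff", "A"), ("pickup", "B"), ("dropoff", "B")]
neighbours = list(feasible_adjacent_swaps(route, sizes, capacity=1000))
print(neighbours)
```

Here only the swap of "dropoff A" with "pickup B" survives, because 600 + 300 still fits into 1000 litres.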
Scoring Solutions
It's very useful to have a way to score a solution, so that they can be ordered. The nice thing about a score is that it's easy to incorporate levels of importance: just as the first digit of a two-digit number is more important than the second digit, you can design the score so that more important things (e.g. deadline violations) receive a much greater weight than less important things (e.g. total travel time or distance). I would suggest something like 1000 * num_deadline_violations + total_travel_time. (This assumes of course that total_travel_time is in units that will stay beneath 1000.) We would then try to minimise this.
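A scoring function along those lines could be sketched as below. The stop representation (`(location, deadline)` tuples, with `None` for stops without a deadline), the `travel_time` callable and the absence of waiting or service time are all simplifying assumptions for this example.

```python
def score_route(stops, start, travel_time, late_weight=1000):
    """Lower is better: late_weight * (late stops) + total travel time.

    stops:       list of (location, deadline or None)
    travel_time: function (from_location, to_location) -> time
    """
    clock = 0
    late = 0
    here = start
    for location, deadline in stops:
        clock += travel_time(here, location)
        here = location
        if deadline is not None and clock > deadline:
            late += 1
    # With no waiting time, the final clock value is the total travel time.
    return late_weight * late + clock


# Toy example on a line: travel time is just the distance between positions.
stops = [(3, None), (10, 8), (4, 20)]
print(score_route(stops, start=0, travel_time=lambda a, b: abs(a - b)))  # 1016
```

The second stop is reached at time 10, after its deadline of 8, so the score is one violation (1000) plus a travel time of 16.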
Managing Solutions
Instead of taking one solution and trying all the above possible moves on it, I would instead suggest using a pool of k solutions (say, k = 10000) stored in a min-heap. This allows you to extract the best solution in the pool in O(log k) time, and to insert new solutions in the same time.
You could initially populate the pool with randomly generated feasible solutions; then on each step, you would extract the best solution in the pool, try all possible moves on it to generate child solutions, and insert any child solutions that are better than their parent back into the pool. Whenever the pool doubles in size, pull out the first (i.e. best) k solutions and make a new min-heap with them, discarding the old one. (Performing this step after the heap grows to a constant multiple of its original size like this has the nice property of leaving the amortised time complexity unchanged.)
It can happen that some move on solution X produces a child solution Y that is already in the pool. This wastes memory, which is unfortunate, but one nice property of the min-heap approach is that you can at least handle these duplicates cheaply when they arrive at the front of the heap: all duplicates will have identical scores, so they will all appear consecutively when extracting solutions from the top of the heap. Thus to avoid having duplicate solutions generate duplicate children "down through the generations", it suffices to check that the new top of the heap is different from the just-extracted solution, and keep extracting and discarding solutions until this holds.
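The pool management described so far might be sketched in Python with `heapq` as follows. The function and parameter names are made up for this example, and solutions are assumed to be comparable values (e.g. tuples), so that `(score, solution)` pairs order cleanly and identical solutions come out of the heap back to back.

```python
import heapq


def improve_pool(initial, score, neighbours, k, max_steps):
    """Best-first local search over a pool of solutions kept in a min-heap.

    `initial` must be non-empty. Returns the best (score, solution) pair seen.
    """
    heap = [(score(s), s) for s in initial]
    heapq.heapify(heap)
    best = min(heap)
    last = None
    for _ in range(max_steps):
        if not heap:
            break
        entry = heapq.heappop(heap)
        if entry == last:                   # duplicate of the previous solution
            continue
        last = entry
        best = min(best, entry)
        parent_score, parent = entry
        for child in neighbours(parent):
            child_score = score(child)
            if child_score < parent_score:  # keep strict improvements only
                heapq.heappush(heap, (child_score, child))
        if len(heap) >= 2 * k:              # pool doubled: keep the best k
            heap = heapq.nsmallest(k, heap)
            heapq.heapify(heap)
    if heap:
        best = min(best, heap[0])           # a child may still be waiting
    return best


# Toy problem: walk integer pairs towards (0, 0).
def toy_score(s):
    return abs(s[0]) + abs(s[1])


def toy_neighbours(s):
    for i in range(2):
        for d in (-1, 1):
            yield s[:i] + (s[i] + d,) + s[i + 1:]


best = improve_pool([(3, -2), (5, 5)], toy_score, toy_neighbours, k=10, max_steps=50)
print(best)
```

Because only strictly better children are inserted, the search cannot cycle and the heap eventually drains on its own.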
A note on keeping worse solutions: It might seem that it could be worthwhile keeping child solutions even if they are slightly worse than their parent, and indeed this may be useful (or even necessary to find the absolute optimal solution), but doing so has a nasty consequence: it means that it's possible to cycle from one solution to its child and back again (or possibly a longer cycle). This wastes CPU time on solutions we have already visited.
You are basically combining the Knapsack Problem with the Travelling Salesman Problem.
Your main problem here actually seems to be the Knapsack Problem, rather than the Travelling Salesman Problem, since it has the one hard restriction (maximum delivery volume). Maybe try to combine the solutions for the Knapsack Problem with those for the Travelling Salesman Problem.
If you really only have one second at most for calculations, a greedy algorithm with backtracking might actually be one of the best solutions that you can get.

Most effective Algorithm to find maximum of double-precision values

What is the most effective way of finding a maximum value in a set of variables?
I have seen solutions, such as
private double findMax(double... vals) {
    double max = Double.NEGATIVE_INFINITY;
    // Keep the largest value seen so far; an empty argument list yields -Infinity.
    for (double d : vals) {
        if (d > max) max = d;
    }
    return max;
}
But, what would be the most effective algorithm for doing this?
You can't reduce the complexity below O(n) if the list is unsorted... but you can improve the constant factor by a lot. Use SIMD. For example, with SSE you would use a packed maximum instruction (MAXPS for four floats, or MAXPD for two doubles per 128-bit register; MAXSS is only the scalar form) to perform several compare+select operations in a single instruction. Unroll the loop a bit to reduce the cost of loop control logic. And then, outside the loop, find the max of the values held in your SSE register.
This gives a benefit for any size list... also using multithreading makes sense for really large lists.
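To make the data flow concrete, here is a small Python sketch of the same structure: several independent running maxima (standing in for SIMD lanes) followed by a final reduction across them. Plain Python gains no speed from this; the function name and the `lanes` parameter are purely illustrative.

```python
def lane_max(values, lanes=4):
    """Maximum computed as independent per-lane maxima plus a final reduction."""
    acc = [float("-inf")] * lanes       # one running maximum per lane
    for i, v in enumerate(values):
        lane = i % lanes                # element i goes to lane i mod lanes
        if v > acc[lane]:
            acc[lane] = v
    return max(acc)                     # "horizontal" max across the lanes


print(lane_max([3.0, -1.5, 7.25, 0.0, 2.0]))  # 7.25
```

The lanes never depend on one another inside the loop, which is what lets a SIMD unit (or several threads working on separate chunks) process them at the same time.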
Assuming the list does not have its elements in any particular order, the algorithm you mentioned in your question is optimal. It must look at every element once, thus it takes time directly proportional to the size of the list, O(n).
There is no algorithm for finding the maximum whose worst-case running time is asymptotically lower than O(n).
Proof: Suppose for a contradiction that there is an algorithm that finds the maximum of a list in less than O(n) time. Then there must be at least one element that it does not examine. If the algorithm selects this element as the maximum, an adversary may choose a value for the element such that it is smaller than one of the examined elements. If the algorithm selects any other element as the maximum, an adversary may choose a value for the element such that it is larger than the other elements. In either case, the algorithm will fail to find the maximum.
EDIT: This was my attempted answer, but please look at the comments, where #BenVoigt proposes a better way to optimize the expression
You need to traverse the whole list at least once
so it'd be a matter of finding a more efficient expression for if (d>max) max=d, if any.
Assuming we need the general case where the list is unsorted (if we keep it sorted we'd just pick the last item as #IgnacioVazquez points in the comments), and researching a little about branch prediction (Why is it faster to process a sorted array than an unsorted array? , see 4th answer) , looks like
if (d>max) max=d;
can be more efficiently rewritten as
max = d > max ? d : max;
The reason is that the first statement is normally translated into a branch (this is entirely compiler and language dependent, but it happens at least in C and C++, and even in a VM-based language like Java), while the second one is translated into a conditional move.
Modern processors pay a big penalty on branches if the prediction goes wrong (the execution pipeline has to be flushed), while a conditional move is a single instruction that involves no branch prediction and so cannot be mispredicted.
The random nature of the elements in the list (one can be greater or lesser than the current maximum with equal probability) will cause many branch predictions to go wrong.
Please refer to the linked question for a nice discussion of all this, together with benchmarks.

Is a linear order the best output of sorting?

Usually a sorting method produces linearly ordered output (such as "1, 7, 8, 13, 109, ..."), which takes O(N) to query.
Why not sort into a non-linear order, so that finding element(s) takes O(log N) or so, by iteration, Newton's method, etc.? Is it expensive to build such a higher-order sorted structure?
Concisely: is it feasible to arrange the sorted results so that they can be accessed by finding the roots of ax^2 + bx + c = 0? (By contrast, the usual case corresponds to finding the root of ax + c = 0.) For example, we have x1 = 1 and x2 = 2 as roots of a quadratic equation and just insert the following xi. Then it becomes possible to use smarter ways to query.
I suppose difficulties can arise in these respects:
Predicting the data can be rather hard, so we cannot construct a general formula that describes the following numbers well (they may be hash values).
Because of the first difficulty, numbers outside a certain range can diverge. Example graphed by Google: the graph. The values outside [-1,3] are really large, and the difficulty of evaluating the original formula increases rapidly.
This is actually equivalent to a hash, which creates a table that contains the values; the production rule is a formula.
Executing a "smarter" query may be expensive because of the complexity of the algorithm itself.
Smarter schemes which take advantage of a known statistical distribution are typically faster by some constant. However, that still keeps them at O(log N), which is the same as a trivial binary search. The reason is that in each step, they typically narrow down the range of elements to search by a factor R > 2 , for simple binary search that's just R=2. But you need log(N)/log(R) steps to narrow it down to exactly one element.
Now whether this is a net win depends on log(R) versus the work needed at each step. A simple comparison (for binary search) takes a few cycles. As soon as you need anything more complex than +-*/ (say exp or log) to predict the location of the next element, the profit from needing fewer steps is gone.
So, in summary: binary search is used because each step is efficient, for many real-world distributions.
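To put numbers on the log(N)/log(R) argument, here is a tiny Python sketch that counts the steps needed when every step shrinks the candidate range by a factor R. The function name is made up for this example, and a constant shrink factor is of course an idealisation.

```python
def steps_to_single_element(n, r):
    """Steps needed if each step shrinks the candidate range by a factor r >= 2."""
    steps = 0
    while n > 1:
        n = -(-n // r)      # ceil(n / r) using integer arithmetic only
        steps += 1
    return steps


for r in (2, 16, 1024):
    print(r, steps_to_single_element(2 ** 20, r))
# 2 20
# 16 5
# 1024 2
```

Going from R = 2 to R = 16 saves a factor of 4 in steps on about a million elements, so each of the smarter steps must cost less than roughly four comparisons to come out ahead.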

Finding the average of large list of numbers

Came across this interview question.
Write an algorithm to find the mean(average) of a large list. This list could contain trillions or quadrillions of numbers. Each number is manageable in hundreds, thousands or millions.
Googling it gave me only Median of Medians solutions. How should I approach this problem?
Is divide and conquer enough to deal with trillions of numbers?
How do I deal with a list of such a large size?
If the size of the list is computable, it's really just a matter of how much memory you have available, how long it's supposed to take and how simple the algorithm is supposed to be.
Basically, you can just add everything up and divide by the size.
If you don't have enough memory, dividing first might work (Note that you will probably lose some precision that way).
Another approach would be to recursively split the list into 2 halves and calculate the mean of the sublists' means. Your recursion termination condition is a list size of 1, in which case the mean is simply the only element of the list. If you encounter a list of odd size, make either the first or second sublist longer; this is pretty much arbitrary and doesn't even have to be consistent.
If, however, your list is so giant that its size can't be computed, there's no way to split it into 2 sublists. In that case, the recursive approach works pretty much the other way around. Instead of splitting into 2 lists with n/2 elements, you split into n/2 lists with 2 elements (or rather, calculate their mean immediately). So basically, you calculate the mean of elements 1 and 2, and that becomes your new element 1. The mean of 3 and 4 is your new second element, and so on. Then apply the same algorithm to the new list until only 1 element remains. If you encounter a list of odd size, either add an element at the end or ignore the last one. If you add one, you should try to get as close as possible to your expected mean.
While this won't calculate the mean mathematically exactly, for lists of that size, it will be sufficiently close. This is pretty much a mean of means approach. You could also go the median of medians route, in which case you select the median of sublists recursively. The same principles apply, but you will generally want to get an odd number.
You could even combine the approaches and calculate the mean if your list is of even size and the median if it's of odd size. Doing this over many recursion steps will generate a pretty accurate result.
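How far a plain mean of means can drift when the sublists differ in size is easy to see in a small Python sketch. The weighted variant shown next to it is not part of the approach above; it is the standard correction of weighting each partial mean by its element count, and the function names are invented for the example.

```python
def mean_of_means(chunks):
    """Unweighted average of the chunk averages (exact only for equal sizes)."""
    means = [sum(c) / len(c) for c in chunks]
    return sum(means) / len(means)


def weighted_mean_of_means(chunks):
    """Weight every chunk mean by its element count; equals the true mean."""
    total_count = sum(len(c) for c in chunks)
    return sum((sum(c) / len(c)) * len(c) for c in chunks) / total_count


chunks = [[1, 2, 3], [10]]
print(mean_of_means(chunks))           # 6.0
print(weighted_mean_of_means(chunks))  # 4.0, the true mean of 1, 2, 3, 10
```

With equally sized chunks the two functions agree, which is why splitting into equal parts is the safe choice.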
First of all, this is an interview question. The problem as stated would not arise in practice. Also, the question as stated here is imprecise. That is probably deliberate. (They want to see how you deal with solving an imprecisely specified problem.)
Write an algorithm to find the mean(average) of a large list.
The word "find" is rubbery. It could mean calculate (to some precision) or it could mean estimate.
The phrase "large list" is rubbery. It could mean a list or array data structure in memory, or the "list" could be the result of a database query, or the contents of a file or files.
There is no mention of the hardware constraints on the system where this will be implemented.
So the first thing >>I<< would do would be to try to narrow the scope by asking some questions of the interviewer.
But assuming that you can't, then a complete answer would need to cover the following points:
The dataset probably won't fit in memory at the same time. (But if it does, then that is good.)
Calculating the average of N numbers is O(N) if you do it serially. For N this size, it could be an intractable problem.
An alternative is to split into sublists of equal size and calculate the averages, and then the average of the averages. In theory, this gives you O(N/P) where P is the number of partitions. The parallelism could be implemented with multiple threads, with multiple processes on the same machine, or distributed.
In practice, the limiting factors are going to be computational, memory and/or I/O bandwidth. A parallel solution will be effective if you can address these limits. For example, you need to balance the problem of each "worker" having uncontended access to its "sublist" versus the problem of making copies of the data so that that can happen.
If the list is represented in a way that allows sampling, then you can estimate the average without looking at the entire dataset. In fact, this could be O(C) for a constant sample size C, depending on how you sample. But there is a risk that your sample will be unrepresentative, and the average will be too inaccurate.
In all cases doing calculations, you need to guard against (integer) overflow and (floating point) rounding errors. Especially while calculating the sums.
It would be worthwhile discussing how you would solve this with a "big data" platform (e.g. Hadoop) and the limitations of that approach (e.g. time taken to load up the data ...)
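For the serial case, one common way to avoid building up a huge sum is an incremental (running) mean, sketched here in Python. This technique is not spelled out in the points above; it is offered as one possible approach to the overflow concern, and the function name is hypothetical.

```python
def running_mean(numbers):
    """Mean of an iterable in one pass, without accumulating a large sum.

    Uses mean_n = mean_(n-1) + (x_n - mean_(n-1)) / n, so the accumulator
    stays within the range of the data and only O(1) memory is needed.
    """
    mean = 0.0
    count = 0
    for x in numbers:
        count += 1
        mean += (x - mean) / count
    if count == 0:
        raise ValueError("mean of an empty sequence is undefined")
    return mean


# Works on any iterator, e.g. values streamed from a file or a query.
print(running_mean(iter([1, 2, 3, 4])))  # 2.5
```

Floating-point rounding still accumulates over trillions of updates, so for a very long stream it may be worth computing running means per block and combining them weighted by block size.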

finding closest hamming distance

I have N < 2^n randomly generated n-bit numbers stored in a file, for which lookups are expensive. Given a number Y, I have to search for a number in the file that is at most k Hamming distance from Y. This calls for C(n,1) + C(n,2) + C(n,3) + ... + C(n,k) lookups in the worst case, which is not feasible in my case. I tried storing the distribution of 1's and 0's at each bit position in memory and prioritizing my lookups accordingly. So, I stored the probability of bit i being 0/1:
Pr(bi=0), Pr(bi=1) for all i from 0 to n-1.
But it didn't help much, since N is too large and there is an almost equal distribution of 1/0 at every bit location. Is there a way this can be done more efficiently? For now, you can assume n=32, N = 2^24.
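To give a feel for the numbers, the worst-case lookup count can be evaluated with a few lines of Python. The value k = 3 below is only an assumed example, since no concrete k is given here.

```python
import math


def worst_case_lookups(n, k):
    """C(n,1) + C(n,2) + ... + C(n,k): candidates at distance 1..k from Y."""
    return sum(math.comb(n, i) for i in range(1, k + 1))


print(worst_case_lookups(32, 3))  # 32 + 496 + 4960 = 5488
```

The count grows quickly with k, which is why probing every candidate individually becomes impractical when each lookup is expensive.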
Google gives a solution to this problem for k=3, n=64, N=2^34 (much larger corpus, fewer bit flips, larger fingerprints) in this paper. The basic idea is that for small k, n/k is quite large, and hence you expect that nearby fingerprints should have relatively long common prefixes if you formed a few tables with permuted bits orders. I am not sure it will work for you, however, since your n/k is quite a bit smaller.
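A much simplified variant of that table idea can be sketched in Python. This is not the paper's exact construction; it rests only on the pigeonhole argument that if two n-bit numbers differ in at most k bits and the bits are cut into k + 1 disjoint blocks, at least one block must match exactly. All function names are made up for this example.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


def build_block_index(numbers, n_bits, k):
    """Index every number under each of its k + 1 bit blocks."""
    blocks = k + 1
    width = -(-n_bits // blocks)            # ceil(n_bits / blocks)
    mask = (1 << width) - 1
    index = defaultdict(list)
    for x in numbers:
        for b in range(blocks):
            index[(b, (x >> (b * width)) & mask)].append(x)
    return index, width, blocks


def query(index, width, blocks, y, k):
    """Return all indexed numbers within Hamming distance k of y."""
    mask = (1 << width) - 1
    found = set()
    for b in range(blocks):
        # Candidates agree with y exactly on block b; verify the full distance.
        for x in index.get((b, (y >> (b * width)) & mask), ()):
            if hamming(x, y) <= k:
                found.add(x)
    return found


numbers = [0x00000000, 0x000000F0, 0xFFFFFFFF, 0x00000007]
index, width, blocks = build_block_index(numbers, n_bits=32, k=3)
print(sorted(query(index, width, blocks, y=0x1, k=3)))  # [0, 7]
```

For n = 32 and k = 3 the blocks are only 8 bits wide, so with N = 2^24 random numbers each bucket would hold tens of thousands of candidates on average; this matches the concern that a small n/k makes the approach much less selective.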
If by "lookup", you mean searching your entire file for a specified number, and then repeating the "lookup" for each possible match, then it should be faster to just read through the whole file once, checking each entry for the Hamming distance to the specified number as you go. That way you only read through the file once instead of C(n,1) + C(n,2) + C(n,3) + ... + C(n,k) times.
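Such a single pass could look like the following Python sketch. The function name is illustrative, and `numbers` may be any iterable, for instance values decoded one by one while reading the file.

```python
def scan_for_neighbours(numbers, y, k):
    """Return every number within Hamming distance k of y, in one pass."""
    matches = []
    for x in numbers:
        # XOR leaves a 1 exactly where the two numbers differ.
        if bin(x ^ y).count("1") <= k:
            matches.append(x)
    return matches


print(scan_for_neighbours([0b0000, 0b0111, 0b1111, 0b1000], y=0b0001, k=2))
# [0, 7, 8]
```

If only one match is needed, the loop can return as soon as the first one is found.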
You can use quantum computation to speed up your search while minimizing the required number of steps. I think Grover's search algorithm will be helpful to you, as it provides a quadratic speed-up for the search problem.
Perhaps you could store it as a graph, with links to the next closest numbers in the set, by hamming distance, then all you need to do is follow one of the links to another number to find the next closest one. Then use an index to keep track of where the numbers are by file offset, so you don't have to search the graph for Y when you need to find its nearby neighbors.
You also say you have 2^24 numbers, which according to wolfram alpha (http://www.wolframalpha.com/input/?i=2^24+*+32+bits) is only 64MB. Could you just put it all in ram to make the accesses faster? Perhaps that would happen automatically with caching on your machine?
If your application can afford to do some extensive preprocessing, you could, as you're generating the n-bit numbers, compute all the other numbers which are at most k distant from each number and store them in a lookup table. It'd be something like a map from a number to a set of numbers. riri claims you can fit it in memory, so hash tables might work well, but otherwise you'd probably need a B+ tree for the map. Of course, this is expensive, as you mentioned before, but if you can do it beforehand, you'd have fast lookups later, either O(1) or O(log(N) + log(2^k)).
