Let's say I have N objects, and each of them has two associated values, A and B. This could be represented as a list of tuples like:
[(3,10), (8,4), (0,0), (20,7),...]
where each tuple is an object and the two values are A and B.
What I want to do is select M of these objects (where M < N) such that the sums of A and B over the selected subset are as balanced as possible. M here is a parameter of the problem; I don't want to find the optimal M. I want to be able to say "give me 100 objects, and make them as balanced as possible".
Any idea if there is an efficient algorithm that can solve this problem (not necessarily completely optimally)? I think this might be related to bin packing, but I'm not really sure.
This is a disguised variant of subset-sum. Replace each (A,B) by A-B, and then the absolute value of the sum of all selected A-B values is the "unbalancedness" of the sums. So you really have to select M of those scalars and try to have a sum as close to 0 as possible.
The "variant" bit is because you have to select exactly M items. (I think this is why your mind went to bin-packing rather than subset-sum.) If you have a black-box subset-sum solver you can account for this too: if the maximum single-pair absolute difference is D, replace each (A,B) by (A-B+D) and have the target sum be M*D. (Don't do that if you're doing a dynamic programming approach, of course, since it increases the magnitude of the numbers you're working with.)
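If the A-B differences are smallish integers, a dynamic programming approach can work directly on the signed differences and track how many items have been chosen, with no need for the D shift. A minimal sketch in Python (the function name and the set-based DP table are my own; recovering the actual subset would additionally need back-pointers):

def most_balanced_sum(items, M):
    # Choose exactly M items so that the sum of (A - B) is as close to 0 as possible.
    # Pseudo-polynomial: fine when the A - B differences are smallish integers.
    vals = [a - b for a, b in items]
    # reachable[c] = set of sums achievable with exactly c chosen items
    reachable = [set() for _ in range(M + 1)]
    reachable[0].add(0)
    for v in vals:
        for c in range(M - 1, -1, -1):       # go downwards so each item is used at most once
            reachable[c + 1].update(s + v for s in reachable[c])
    return min(reachable[M], key=abs)        # the least "unbalanced" achievable sum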
Presuming that you're fine with an approximation (and if you're not, you're going to have a real bad day), I would tend to use Simulated Annealing or Late Acceptance Hill Climbing as a basic approach: start with a greedy initial solution (iteratively add whichever object results in the minimal difference), and then in each round consider randomly replacing one selected object with one not-currently-selected object.
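As a rough illustration of that recipe, here is a sketch in Python. It does plain hill climbing (accepting any non-worsening swap) rather than full Simulated Annealing or Late Acceptance Hill Climbing; all names and the fixed round count are my own, and it assumes M < N:

import random

def greedy_balanced_subset(items, M):
    # Greedy start: repeatedly add the object that keeps |sum(A) - sum(B)| smallest.
    remaining = list(items)
    chosen, diff = [], 0                     # diff = sum(A) - sum(B) over the chosen objects
    for _ in range(M):
        best = min(remaining, key=lambda ab: abs(diff + ab[0] - ab[1]))
        remaining.remove(best)
        chosen.append(best)
        diff += best[0] - best[1]
    return chosen, remaining, diff

def hill_climb(items, M, rounds=100000):
    # Local search: repeatedly try swapping one chosen object for one unchosen one.
    chosen, remaining, diff = greedy_balanced_subset(items, M)
    for _ in range(rounds):
        i, j = random.randrange(M), random.randrange(len(remaining))
        delta = (remaining[j][0] - remaining[j][1]) - (chosen[i][0] - chosen[i][1])
        if abs(diff + delta) <= abs(diff):   # accept any non-worsening swap
            chosen[i], remaining[j] = remaining[j], chosen[i]
            diff += delta
    return chosen, abs(diff)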
I don't generally ask questions on SO, so if this question seems inappropriate for SO, just tell me (help would still be appreciated of course).
I'm still a student and I'm currently taking a class in Algorithms. We recently learned about the Branch-and-Bound paradigm and since I didn't fully understand it, I tried to do some exercises in our course book. I came across this particular instance of the Set Cover problem with a special twist:
Let U be a set of elements and S = {S1, S2, ..., Sn} a set of subsets of U, where the union of all sets Si equals U. Outline a Branch-and-Bound algorithm to find a minimal subset Q of S such that for every element u in U there are at least two sets in Q that contain u. Specifically, elaborate on how to split the problem into subproblems and how to calculate upper and lower bounds.
My first thought was to sort all the sets Si in S in descending order by how many of their elements aren't yet covered at least twice by the currently chosen subsets of S, i.e. our current Q. I was then thinking of solving this recursively: take the first set Si in the sorted order and make two recursive calls, one where I include Si and one where I don't (so from that call onward the subset is no longer considered). If I include Si, I go through each of its elements and increase a counter for it (before the recursive call), so that I eventually know when an element is already covered by two or more chosen subsets. Since I re-sort the not-yet-chosen sets for each recursive call, I would theoretically (in my mind at least) always be making the best possible choice at that moment. And since this creates a binary tree of recursive calls, one branch including the current best subset and one excluding it, I'll eventually cover all 2^n possibilities, meaning eventually I'll find the optimal solution.
My problem now is that I don't really understand how to implement a heuristic for upper and lower bounds, so that the algorithm can discard paths in the binary tree which will never be better than the current best Q. I would appreciate any help I could get.
Here's a simple lower bound heuristic: Find the set containing the largest number of not-yet-twice-covered elements. (It doesn't matter which set you pick if there are multiple sets with the same, largest possible number of these elements.) Suppose there are u of these elements in total, and this set contains k <= u of them. Then you need to add at least u/k further sets (i.e. at least ceil(u/k)) before you have a solution. (Why? See if you can prove this.)
This lower bound works for the regular set cover problem too. As always with branch and bound, using it may or may not result in better overall performance on a given instance than simply using the "heuristic" that always returns 0.
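In code, that bound might look like the following sketch (Python; the cover_count mapping and the set representation of the candidate sets are assumptions of mine):

from math import ceil

def lower_bound(candidate_sets, cover_count):
    # cover_count[u] = how many chosen sets already contain element u
    uncovered = {u for u, c in cover_count.items() if c < 2}
    if not uncovered:
        return 0                                 # already feasible, nothing more needed
    # k = the largest number of still-needy elements any single addable set contains
    k = max((len(s & uncovered) for s in candidate_sets), default=0)
    if k == 0:
        return float("inf")                      # no remaining set helps: this branch is infeasible
    return ceil(len(uncovered) / k)              # at least ceil(u/k) more sets are needed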
First, some advice: don't re-sort S every time you recurse/loop. Sorting is an expensive operation (O(N log N)) so putting it in a loop or a recursion usually costs more than you gain from it. Generally you want to sort once at the beginning, and then leverage that sort throughout your algorithm.
The sort you've chosen, descending by the size of the subsets Si, is a good "greedy" ordering, so I'd say just do that once upfront and don't re-sort after that. You don't get to skip over non-ideal subsets within your recursion, but checking a redundant/non-ideal subset is still faster than re-sorting every time.
Now what upper/lower bounds can you use? Well generally, you want your bounds and bounds-checking to be as simple and efficient as possible because you are going to be checking them a lot.
With this in mind, an upper bound is easy: use the length of the shortest solution that you've found so far. Initially set your upper bound to var bestQlength = int.MaxValue, i.e. some value greater than n, the number of subsets in S. Then on every recursion check whether currentQ.length > bestQlength; if so, this branch exceeds the upper bound and you "prune" it. Obviously, when you find a new solution you also need to check whether it is better (shorter) than your current bestQ, and if so update both bestQ and bestQlength at the same time.
A good lower bound is a bit trickier. The simplest I can think of for this problem: before you add a new subset Si to your currentQ, check whether Si has any elements that are not already covered two or more times by currentQ. If it does not, then Si cannot contribute in any way to the solution you are trying to build, so just skip it and move on to the next subset in S.
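Putting the upper-bound pruning and the "skip sets that cannot help" rule together, a branch-and-bound skeleton might look like the sketch below (Python, all names mine; the sets are assumed to be frozensets, sorted once up front as suggested above, and the ceil(u/k) lower bound from the other answer could be plugged into the pruning test as well):

def min_double_cover(sets, universe):
    # Branch and bound over "take S_i or skip S_i".
    sets = sorted(sets, key=len, reverse=True)   # sort once, up front, largest first
    best = {"size": len(sets) + 1, "choice": None}
    count = {u: 0 for u in universe}             # how often each element is covered so far

    def recurse(i, chosen):
        if len(chosen) >= best["size"]:          # upper-bound pruning
            return
        if all(c >= 2 for c in count.values()):  # feasible (and shorter) solution found
            best["size"], best["choice"] = len(chosen), list(chosen)
            return
        if i == len(sets):
            return
        s = sets[i]
        if any(count[u] < 2 for u in s):         # only take S_i if it can still contribute
            for u in s:
                count[u] += 1
            recurse(i + 1, chosen + [s])         # branch 1: take S_i
            for u in s:
                count[u] -= 1
        recurse(i + 1, chosen)                   # branch 2: skip S_i

    recurse(0, [])
    return best["choice"]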
Input: An array of elements and a partial order on a subset of those elements, seen as a constraint set.
Output: An array (or any ordered sequence) fulfilling the partial order.
The Problem: How can one achieve the reordering efficiently? The number of introduced inversions (or swaps) compared to the original input sequence should be as small as possible. Note that the partial order can be defined over any number of elements (some elements may not be part of it).
The context: It arises from a situation in 2-layer graph crossing reduction: After a crossing reduction phase, I want to reorder some of the nodes (thus, the partial order may contain only a small subset).
In general, I had the idea to weaken this a little and solve the problem only for the elements that are part of the partial order (though I think this could lead to non-optimal results). Thus, if I have a sequence A B C D E and the partial order only contains A, B and E, then C and D will stay in place. It somehow reminds me of the Kemeny score, but I haven't yet been able to turn that into an algorithm.
Just to be sure: I am not searching for a topological sort. This would probably introduce a lot more inversions than required.
Edit 1:
Changed wording (sequence to array).
The amount of additional space for solving the problem can be arbitrarily (well, polynomially bounded) large. Of course, less is better :) So something like O(ArrayLen*ArrayLen) at most would be fantastic.
Why the minimum number of swaps or inversions: As this procedure is part of crossing reduction, the input array's ordering is (hopefully) close to an optimum in terms of edge crossings with the second node layer. Every additional swap or inversion would then probably introduce edge crossings again. But the number of swaps or movements performed while computing the output is not really important (though, again, something linear or quadratic would be cool), as only the quality of the output matters. Right now I require the constraints to form a total order and only inspect the nodes of that order, which makes the problem trivial to solve, but partial order constraints would be more flexible.
I found a paper, which looks promising: "A Fast and Simple Heuristic for Constrained Two-Level Crossing Reduction" by Michael Foster.
Together with the comments below my question, it is answered. Thanks again, @j_random_hacker!
Suppose that I have an object that has N different scalar qualities, each of which I've measured (for example, the (x,y) coordinates at the tips of the major arms of a leaf). Together, I have N such measurements for each object, which I'll save as a 1D list of N reals.
Now I'm given a large number R of such objects, each with its corresponding N-element list. Let's call this the population. We can represent this as a matrix M with R rows, each of N elements.
I'm now given a new object B, with its 1D N-element list. I'd like to hand Mathematica my matrix M and my new object B, and get back a single number that tells me how confident I can be that B belongs to the population represented by M.
I'd also be happy with a probability, or any other number with a simple interpretation. I'm willing to assume that everything is uncorrelated, that the values in columns of M are normally distributed, and other such typical assumptions.
When N=1, Student's t-test seems the right tool. There seem to be tools built into Mathematica that can solve precisely this problem when N>1, but the documentation (and web references) presume more statistical depth than I have, so I don't have confidence that I know what to do. I feel like the solution is tantalizingly just out of reach. If anyone can provide a code example that solves this problem, I would be very grateful.
I have N binary sequences of length L, where N and L may be very large, and the sequences may be very sparse, i.e. have many more 0s than 1s.
I want to select M sequences from them, namely b_1, b_2, b_3..., such that
b_1 | b_2 | b_3 ... | b_M = 1111...11 (L 1s)
Is there an algorithm to achieve it?
My idea is:
STEP 1: For each position from 1 to L, count the number of sequences that have a 1 at that position. Call this the position's "owning number".
STEP 2: Consider the position with the minimum owning number, and among the sequences owning that position choose the one with the maximum number of 1s.
STEP 3: Ignore the chosen sequence, update the owning numbers, and go back to STEP 2.
I believe that my method cannot generate the best answer.
Does anyone have a better idea?
This is the well known set cover problem. It is NP-hard — in fact, its decision version is one of the canonical NP-complete problems and was among the 21 problems included in Karp's 1972 paper — and so no efficient algorithm is known for solving it.
The algorithm you describe in your question is known as the "greedy algorithm" and (unless your problem has some special features that you are not telling us) it's essentially the best known approach. It finds a collection of sets that is no more than O(log L) times the size of the smallest such collection, L being the number of positions to cover.
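For reference, a sketch of that greedy algorithm in Python, representing each sequence by the set of positions where it has a 1 (all names are mine):

def greedy_cover(sequences, L):
    # Each sequence is a 0/1 list of length L; pick, each round, the sequence
    # that covers the most still-uncovered positions.
    position_sets = [{i for i, bit in enumerate(seq) if bit} for seq in sequences]
    uncovered = set(range(L))
    chosen = []
    while uncovered:
        best = max(range(len(sequences)), key=lambda j: len(position_sets[j] & uncovered))
        if not position_sets[best] & uncovered:
            return None                          # some position is covered by no sequence at all
        chosen.append(best)
        uncovered -= position_sets[best]
    return chosen                                # indices of the selected sequences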
Sounds like a typical backtrack task.
Yes, your algorithm sounds reasonable if you want a good answer quickly. If you want the combination with the fewest possible sequences, you can't do better than trying all combinations.
Depending on the exact structure of the problem, there is another technique that often works well (and actually gives an optimal result):
Let x[j] be a boolean variable representing the choice of whether to include the j-th binary sequence in the result. A zero-suppressed binary decision diagram (ZDD) can then represent (maybe succinctly, depending on the characteristics of the problem) the family of all variable sets whose corresponding binary sequences OR together to all ones. Finding the smallest set in that family (thus minimizing the number of sequences included) is relatively easy if the ZDD is succinct. Details can be found in The Art of Computer Programming, section 7.1.4 (volume 4A).
It's also easy to adapt to an exact cover, by taking the family of sets such that there is exactly one 1 for every position.
My best shot so far:
A delivery vehicle needs to make a series of deliveries (d1,d2,...dn), and can do so in any order--in other words, all the possible permutations of the set D = {d1,d2,...dn} are valid solutions--but the particular solution needs to be determined before it leaves the base station at one end of the route (imagine that the packages need to be loaded in the vehicle LIFO, for example).
Further, the cost of the various permutations is not the same. It can be computed as the sum of the squares of the distances traveled between d_{i-1} and d_i, where d_0 is taken to be the base station, with the caveat that any segment that involves a change of direction costs 3 times as much (imagine this is happening on a railroad or in a pneumatic tube, where backing up disrupts other traffic).
Given the set of deliveries D, represented by their distances from the base station (so abs(d_i - d_j) is the distance between two deliveries), and an iterator permutations(D) which will produce each permutation in succession, find a permutation whose cost is less than or equal to that of any other permutation.
Now, a direct implementation from this description might lead to code like this:
function Cost(D) ...

function Best_order(D)
    for D1 in permutations(D)
        Found = true
        for D2 in permutations(D)
            Found = false if cost(D2) < cost(D1)
        return D1 if Found
Which is O(n*(n!)^2), i.e. pretty awful--especially compared to the O(n log(n)) someone with insight would find by simply sorting D.
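For concreteness, here is one possible reading of the cost function in Python, together with the straightforward O(n*n!) brute force; the function names, the exact treatment of the first segment, and the assumption that the 3x penalty multiplies the squared length of the reversing segment are mine:

from itertools import permutations

def cost(order, base=0):
    # Sum of squared hop lengths, tripled whenever the vehicle reverses direction.
    total, position, direction = 0, base, 0
    for d in order:
        step = d - position
        segment = step * step
        if direction and step and (step > 0) != (direction > 0):
            segment *= 3                         # change of direction
        total += segment
        if step:
            direction = step
        position = d
    return total

def best_order_brute_force(D):
    # O(n * n!): evaluate each permutation once instead of comparing all pairs.
    return min(permutations(D), key=cost)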
My question: can you come up with a plausible problem description which would naturally lead the unwary into a worse (or differently awful) implementation of a sorting algorithm?
I assume you're using this question for an interview to see if the applicant can notice a simple solution in a seemingly complex question.
[This assumption is incorrect -- MarkusQ]
You give too much information.
The key to solving this is realizing that the points are in one dimension and that a sort is all that is required. To make this question more difficult hide this fact as much as possible.
The biggest clue is the distance formula. It introduces a penalty for changing directions. The first thing that comes to my mind is minimizing this penalty; to remove it, I have to order the deliveries in a single direction, and this ordering is the natural sort order.
I would remove the penalty for changing directions; it's too much of a giveaway.
Another major clue is the input to the algorithm: a list of integers. Give them a list of permutations, or even all permutations. That sets them up to think that an O(n!) algorithm might actually be expected.
I would phrase it as:
Given a list of all possible permutations of n delivery locations, where each permutation of deliveries (d1, d2, ..., dn) has a cost defined by [the cost formula above], return the permutation P such that the cost of P is less than or equal to that of any other permutation.
All that really needs to be done is read in the first permutation and sort it.
If they construct a single loop to compare the costs, ask them what the big-O runtime of their algorithm is, where n is the number of delivery locations (another trap).
This isn't a direct answer, but I think more clarification is needed.
Is di allowed to be negative? If so, sorting alone is not enough, as far as I can see.
For example:
d0 = 0
deliveries = (-1,1,1,2)
It seems the optimal path in this case would be 1 -> 2 -> 1 -> -1.
Edit: This might not actually be the optimal path, but it illustrates the point.
You could rephrase it, having first found the optimal solution, as
"Give me a proof that the following combination is optimal for the following set of rules, where optimal means that the sum of all stage costs is as small as possible, taking into account that all stages (A..Z) need to be present once and only once.
Combination:
A->C->D->Y->P->...->N
Stage costs:
A->B = 5,
B->A = 3,
A->C = 2,
C->A = 4,
...
...
...
Y->Z = 7,
Z->Y = 24."
That ought to keep someone busy for a while.
This reminds me of the Knapsack problem more than the Traveling Salesman. But Knapsack is also an NP-hard problem, so you might be able to fool people into thinking up an overly complex dynamic-programming solution if they associate your problem with the Knapsack, where the basic problem is:
can a value of at least V be achieved without exceeding the weight W?
Now the thing is that a fairly good solution can be found when the values vj (your distances) are distinct, as such:
The knapsack problem with each type of item j having a distinct value per unit of weight (vj = pj/wj) is considered one of the easiest NP-complete problems. Indeed, empirical complexity is of the order of O((log n)^2) and very large problems can be solved very quickly, e.g. in 2003 the average time required to solve instances with n = 10,000 was below 14 milliseconds using commodity personal computers.
So you might want to state that several stops/packages might share the same vj, inviting people to think about the really hard case:
However, in the degenerate case of multiple items sharing the same value vj it becomes much more difficult, with the extreme case where vj = constant being the subset sum problem with a complexity of O(2^(N/2) * N).
So if you replace value per unit of weight with value per unit of distance, and state that several distances might actually share the same value (the degenerate case), some folks might fall into this trap.
Isn't this just the (NP-Hard) Travelling Salesman Problem? It doesn't seem likely that you're going to make it much harder.
Maybe phrasing the problem so that the actual algorithm is unclear - e.g. by describing the paths as single-rail railway lines so the person would have to infer from domain knowledge that backtracking is more costly.
What about describing the question in such a way that someone is tempted to do recursive comparisons - e.g. "can you speed up the algorithm by using the optimum max subset of your best (so far) results"?
BTW, what's the purpose of this - it sounds like the intent is to torture interviewees.
You need to be clearer on whether the delivery truck has to return to base (making it a round trip), or not. If the truck does return, then a simple sort does not produce the shortest route, because the square of the return from the furthest point to base costs so much. Missing some hops on the way 'out' and using them on the way back turns out to be cheaper.
If you trick someone into a bad answer (for example, by not giving them all the information) then is it their foolishness or your deception that has caused it?
How great is the wisdom of the wise, if they heed not their ego's lies?