This problem came up in the real world, but I've translated it into a more generic "textbook-like" formulation. I suspect it is NP-hard, but I'm particularly interested in knowing if it has a name or is well known, since I think I can't be the first one to encounter it. ;-)
Imagine there is a potluck party with N guests. Each guest may bring his/her "signature dish" to the party, or bring nothing. Each guest either likes or hates each of the dishes that the other guests may bring (and this is known in advance since they are all old friends!), but they all like their own dishes.
Is there a deterministic algorithm that does not take exponential time to find the smallest set of dishes that satisfies the constraint that all guests will find at least one dish to their liking? I say "the" smallest, but actually there may be multiple solutions, and I'd like to know them all if possible.
Or, in a more abstract way, imagine a square matrix where all elements are either 0 or 1, and all diagonal elements are 1. What are the smallest sets of rows such that the sum (or the binary OR) of the rows in each set has no zeroes? (The rows would be the dishes, the columns would be the guests, 1 would mean that a guest likes a dish, and the diagonal elements are 1 since everyone likes their own dish.)
This could be generalized to non-square matrices, or by removing the diagonal=1 rule (although that rule guarantees that there will always be at least one solution). But I don't care about those cases for now...
I already have a program that solves it through an exhaustive search and is fast enough for N around 20, but it takes exponential time. I'm thinking I may need to resort to stochastic algorithms to find good-enough solutions for larger values of N.
Added
Wow, thanks for the quick answer! "Set cover", that's the name I was looking for. :)
This is called the SET COVER problem and it is NP-complete.
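For larger N, the standard greedy heuristic for set cover (repeatedly pick the dish liked by the most still-unsatisfied guests) runs in polynomial time and comes within a ln(n) factor of the optimum. A minimal sketch in Python, assuming likes maps each dish to the set of guests who like it (all names here are illustrative):

    def greedy_set_cover(likes, guests):
        # likes: dict mapping each dish to the set of guests who like it.
        uncovered = set(guests)
        chosen = []
        while uncovered:
            # Pick the dish that satisfies the most still-unsatisfied guests.
            best = max(likes, key=lambda d: len(likes[d] & uncovered))
            chosen.append(best)
            uncovered -= likes[best]
        return chosen

Because of the diagonal=1 rule every guest likes their own dish, so the loop always terminates; note that this finds one good cover, not all minimum covers.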
The set cover problem, as described in the Wikipedia article which Antti Huima linked to, lacks the feature of each guest liking his own dish. Offhand, I don't know whether this makes any difference.
Related
I am looking for a faster way to solve this problem:
Let's suppose we have n boxes and n marbles (each of them is of a different kind). Every box can contain only some kinds of marbles (as shown in the example below), and only one marble fits inside one box. Please read the edits. The whole algorithm has been described in the post linked below, but it was not precisely described, so I am asking for a re-explanation.
The question is: in how many ways can I put the marbles inside the boxes, and can that number be computed in polynomial time?
Example:
n=3
Marbles: 2,5,3
Restrictions of the i-th box (i-th box can only contain those marbles): {5,2},{3,5,2},{3,2}
The answer: 3, because the possible positions of the marbles are: {5,2,3},{5,3,2},{2,5,3}
I have a solution which works in O(2^n), but it is too slow. There is also one limitation about the boxes' tolerance, which I don't find very important, but I will write it down as well. Each box has its own kind restrictions, but there is one list of kinds which is accepted by all of them (in the example above this widely accepted kind is 2).
Edit: I have just found this question but I am not sure if it works in my case, and the dynamic solution is not well described. Could somebody clarify this? This question was answered 4 years ago, so I won't ask it there. https://math.stackexchange.com/questions/2145985/how-to-compute-number-of-combinations-with-placement-restrictions?rq=1
Edit#2: I also have to mention that, excluding the widely accepted kind, the acceptance list of a box has 0, 1, or 2 elements.
Edit#3: This question refers to my previous question (Allowed permutations of numbers 1 to N), which I found too general. I am attaching this link because it contains one more important piece of information - the distance between boxes in which a marble can be put isn't higher than 2.
As noted in the comments, https://cs.stackexchange.com/questions/19924/counting-and-finding-all-perfect-maximum-matchings-in-general-graphs is this problem, with links to papers on how to tackle it, and counting the number of matchings is #P-complete. I would recommend finding those papers.
As for dynamic programming, simply write a recursive solution and then memoize it. (That's top down, and is almost always the easier approach.) For the stack exchange problem with a fixed (and fairly small) number of boxes, that approach is manageable. Unfortunately in your variation with a large number of boxes, the naive recursive version looks something like this (untested, probably buggy):
def solve(balls, box_rules):
    # Precompute, for each ball, the set of boxes that accept it.
    ball_can_go_in = {}
    for ball in balls:
        ball_can_go_in[ball] = set()
    for i in range(len(box_rules)):
        for ball in box_rules[i]:
            ball_can_go_in[ball].add(i)

    def recursive_attempt(n, used_boxes):
        if n == len(balls):
            return 1  # every ball has been placed
        answer = 0
        for box in ball_can_go_in[balls[n]]:
            if box not in used_boxes:
                used_boxes.add(box)
                answer += recursive_attempt(n + 1, used_boxes)
                used_boxes.remove(box)
        return answer

    return recursive_attempt(0, set())
In order to memoize it you have to construct new sets, maybe use bit strings, BUT you're going to find that you're calling it with subsets of n things. There are an exponential number of them. Unfortunately this will take exponential time AND use exponential memory.
If you replace the memoizing layer with an LRU cache, you can control how much memory it uses and probably still get some win from the memoizing. But ultimately you will still use exponential or worse time.
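For concreteness, a sketch of that memoized variant: represent the used boxes as a bitmask so states are hashable, and let functools.lru_cache with a maxsize act as the bounded LRU layer (untested, same caveats as above):

    from functools import lru_cache

    def solve_memo(balls, box_rules, cache_size=2**20):
        # For each ball, the list of boxes that accept it.
        boxes_for = {ball: [i for i, rule in enumerate(box_rules) if ball in rule]
                     for ball in balls}

        @lru_cache(maxsize=cache_size)  # bounded LRU cache, as suggested above
        def count(n, used_mask):
            if n == len(balls):
                return 1
            total = 0
            for box in boxes_for[balls[n]]:
                if not used_mask & (1 << box):
                    total += count(n + 1, used_mask | (1 << box))
            return total

        return count(0, 0)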
If you go that route, one practical tip is sort the balls by how many boxes they go in. You want to start with the fewest possible choices. Since this is trying to reduce exponential complexity, it is worth quite a bit of work on this sorting step. So I'd first pick the ball that goes in the fewest boxes. Then I'd next pick the ball that goes in the fewest new boxes, and break ties by fewest overall. The third ball will be fewest new boxes, break ties by fewest boxes not used by the first, break ties by fewest boxes. And so on.
The idea is to generate and discover forced choices and conflicts as early as possible. In fact this is so important that it is worth a search at every step to try to discover and record forced choices and conflicts that are already visible. It feels counterintuitive, but it really does make a difference.
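A rough sketch of that ordering heuristic, implementing just the first two criteria (fewest new boxes, ties broken by fewest boxes overall); boxes_for is assumed to map each ball to its allowed boxes:

    def order_balls(balls, boxes_for):
        ordered, used = [], set()
        remaining = list(balls)
        while remaining:
            # Fewest boxes not already used by earlier balls; ties by fewest overall.
            nxt = min(remaining,
                      key=lambda b: (len(set(boxes_for[b]) - used),
                                     len(boxes_for[b])))
            ordered.append(nxt)
            used |= set(boxes_for[nxt])
            remaining.remove(nxt)
        return ordered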
But if you do all of this, the dynamic programming approach that was just fine for 5 boxes will become faster, but you'll still only be able to handle slightly larger problems than a naive solution. So go look at the research for better ideas than this dynamic programming approach.
(Incidentally the inclusion-exclusion approach has a term for every subset, so it also will blow up exponentially.)
I wasn't entirely sure the best way to ask this question (or do the research to see if it has been previously answered).
Given a data set where each entry has a Point value and a Dollar value, I'm looking to generate a list of length N entries that yields the highest aggregate Point value whilst staying within budget B.
Example data set:
Item     Points   Dollars
Apple    3.0      $1.00
Pear     2.5      $0.75
Peach    2.8      $0.88
And with this (small) data set, say my budget (B) is $2.25, and list length (N) must be 2. You MUST use the fixed list length, but are not required to use ALL of the budget.
Obviously the example provided is easy to do in one's head, but given a much larger data set, and both higher N and B values, I'm looking for an algorithm that can generate the list. Having a hard time wrapping my head around this one.
Just looking for a pseudo-algorithm, but if you prefer any given language feel free to respond with that!
I am quite positive that this can be reduced to an NP-complete problem, so it is not really worth trying to develop a process that will always give you the 'correct' answer; many people have tried and failed to do this efficiently over large data sets. However, you can use a much more efficient approximation technique: while it will not guarantee the correct answer, many popular approximation algorithms are capable of achieving a high degree of accuracy.
Hope this helps you out :)
This problem is NP-complete (both in NP and NP-hard), meaning that so far no algorithm has been found that solves it in a polynomial amount of time (polynomial in the input size). If you find an algorithm that does, you will have solved one of the greatest problems in computer science (P=NP), which would bring you at least a million-dollar reward.
If you are satisfied with an approximation, I would recommend the Greedy-Algorithm:
https://en.wikipedia.org/wiki/Greedy_algorithm
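To make that concrete, a minimal greedy sketch that ranks items by points per dollar (items, n, and budget are assumed inputs; this is an approximation, not an exact solver):

    def greedy_pick(items, n, budget):
        # items: list of (name, points, dollars) tuples; rank by points per dollar.
        ranked = sorted(items, key=lambda it: it[1] / it[2], reverse=True)
        chosen, spent = [], 0.0
        for name, points, dollars in ranked:
            if len(chosen) == n:
                break
            if spent + dollars <= budget:
                chosen.append((name, points, dollars))
                spent += dollars
        return chosen

On the small data set above this actually returns Pear and Peach (5.3 points) even though Apple and Peach (5.8 points) also fit the budget, which illustrates why greedy only approximates the optimum.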
I have a problem that I have a number of questions about. First, I'm mostly looking for help describing and understanding the problem at hand. Solutions are always welcome, but most importantly I could use some advice from someone more experienced than I. Now, to the problem at hand:
I have a set of orders that each require some number of items. I also have several groupings of items that each contain some number of some items (call them groups). The goal is to find a subset of the orders that can be fulfilled using as few groups as possible and where the total number of items contained within the orders is between n and N.
Edit: The constraints on the number of items contained in the orders (n and N) are chosen independently.
To me at least, that's a really complicated way of saying the problem, so I've been trying to re-phrase it as a knapsack problem (I suspect this might reduce to a subset-sum). To help my conceptual understanding of this I've started using the following definitions:
First, let's say that a dimension exists for each possible item, and something's 'length' in that dimension is the number of that particular type of item it either has or requires.
From this, an order becomes an 'n-dimensional object' where its value in each dimension corresponds to the number of that item that it requires.
In addition, a group can be seen as an 'n-dimensional box' that has space in each dimension corresponding to the number of items it provides.
An object's value is equal to the sum of its lengths in all dimensions.
Boxes can be combined.
Given the above I've rephrased the problem to this:
What is the smallest combination of boxes that can hold a combination of objects with total value between n and N?
Question #1: Is this a correct/useful way to express the problem? Does it seem like I've missed anything obvious?
As I see it, since there are two combinations that I'm looking for I need to break the problem into two parts. So far I think breaking the problem up like this is a good step:
How many objects can box (or combination of boxes) X hold?
Check all (or preferably some small subset of) the possible combinations of boxes and pick the 'best'.
That makes it a little more manageable, but I'm still struggling with the details.
Question #2: Solved To solve the first part I think it's appropriate to say that the cost of an object is equal to the sum of its lengths in all dimensions, as is its value. That places me into a subset-sum problem, right? Obviously it's a special case, but does this problem have a name?
Question #3: Solved I've been looking into subset-sum solutions a lot, but I don't understand how to apply them to something like this in multiple dimensions. I assume it's been done before, but I'm unsure where to start my research. Could someone either describe the principles at work or point me in a research direction?
Edit: After looking at everyone's feedback and digging into the terms I think I've found a good algorithm I can implement to solve part 1. Since I will have a very large number of dimensions compared to the number of items, it looks like using a 'primal effective capacity heuristic' (PECH) will be a good fit. I'd be interested in hearing someone's thoughts about it if they have experience with such an algorithm.
Question #4: For the second part, performance is a concern and I doubt it will be realistic to brute force it. So I intend to treat all combinations of boxes as a really big tree of solutions. The idea is to compute part 1 for all combinations of M-1 boxes, where M is the total number of boxes, somehow determine the 'best' few box combinations from that set, and do the same to their child nodes on the tree (a rough sketch follows). Does this sound like it would help me arrive at something close to optimal? How would I choose the 'best' box combinations?
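In case it helps to see what I mean, here is a rough sketch of that tree search; score stands in for the part-1 solver (higher is better, with infeasible combinations scoring lowest) and beam_width for the 'best few' combinations kept at each level. All names are made up:

    import heapq
    from itertools import combinations

    def shrink_search(boxes, score, beam_width=2):
        # Level 1: all combinations of M-1 boxes.
        frontier = [tuple(c) for c in combinations(boxes, len(boxes) - 1)]
        best = max(frontier, key=score)
        while frontier and len(frontier[0]) > 1:
            # Keep only the most promising combinations, then expand their children.
            keep = heapq.nlargest(beam_width, frontier, key=score)
            frontier = [tuple(c) for combo in keep
                        for c in combinations(combo, len(combo) - 1)]
            if frontier:
                cand = max(frontier, key=score)
                if score(cand) >= score(best):
                    best = cand  # a smaller combination that scores at least as well
        return best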
Thanks for reading! Suggestions for edits and clarifications are welcome.
I’m having trouble solving a deceptively simple problem. My girlfriend and I are trying to formulate weekly meal plans and I had this brilliant idea that I could optimize what we buy in order to maximize the things that we could make from it. The trouble is, the problem is not as easy as it appears. Here’s the problem statement in a nutshell:
The problem:
Given a list of 100 ingredients and a list of 50 dishes that are composed of one or more of the 100 ingredients, find a list of 32 ingredients that can produce the maximum number of dishes.
This problem seems simple, but I'm finding that computing the answer is not trivial. The approach that I've taken is to represent a combination of the 32 ingredients as a 100-bit string with 32 of the bits set. Then I check which dishes can be made with that ingredient combination. If the number of dishes is greater than the current maximum, I save off the list. Then I compute the next valid ingredient combination and repeat, repeat, and repeat.
The number of combinations of the 32 ingredients is staggering! The way that I see it, it would take about 300 trillion years to calculate using my method. I’ve optimized the code so that each combination takes a mere 75 microseconds to figure out. Assuming that I can optimize the code, I might be able to reduce the run time to a mere trillion years.
I’m thinking that a completely new approach is in order. I'm currently coding this in XOJO (REALbasic), but I think the real problem is with approach rather than specific implementation. Anybody have an idea for an approach that has a chance of completion during this century?
Thanks,
Ron
mcdowella's branch and bound solution will be a big improvement over exhaustive enumeration, but it might still take a few thousand years. This is the kind of problem that is really best solved by an ILP solver.
Assuming that the set of ingredients for meal i is given by R[i] = { R[i][1], R[i][2], ..., R[i][|R[i]|] }, you can encode the problem as follows:
Create an integer variable x[i] for each ingredient 1 <= i <= 100. Each of these variables should be constrained to the range [0, 1].
Create an integer variable y[i] for each meal 1 <= i <= 50. Each of these variables should be constrained to the range [0, 1].
For each meal i, create |R[i]| additional constraints of the form y[i] <= x[R[i][j]] for 1 <= j <= |R[i]|. These will guarantee that we can only set y[i] to 1 if all of meal i's ingredients have been included.
Add a constraint that the sum of all x[i] must be <= 32.
Finally, the objective function should be the sum of all y[i], and we should be trying to maximise this.
Solving this will produce assignments for all the variables x[i]: 1 means the ingredient should be included, 0 means it should not.
My feeling is that a commercial ILP solver like CPLEX or Gurobi will probably solve a 150-variable ILP problem like this in milliseconds; even freely available solvers like lp_solve, which as a rule are much slower, should have no problems. In the unlikely case that it seems to be taking forever, you can still solve the LP relaxation, which will be very fast (milliseconds) and will give you (a) an upper bound on the maximum number of meals that can be prepared and (b) "hints" in the variable values: although the x[i] will in general not be exactly 0 or 1, values close to 1 are suggestive of ingredients that should be included, while values close to 0 suggest unhelpful ingredients.
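For what it's worth, that encoding is only a few lines with a Python modelling library such as PuLP (a sketch, assuming R is a list of sets of ingredient indices; PuLP's bundled CBC solver is used by default):

    from pulp import LpProblem, LpMaximize, LpVariable, lpSum

    def best_ingredients(R, num_ingredients=100, budget=32):
        prob = LpProblem("meals", LpMaximize)
        x = [LpVariable("x%d" % i, cat="Binary") for i in range(num_ingredients)]
        y = [LpVariable("y%d" % i, cat="Binary") for i in range(len(R))]
        for i, req in enumerate(R):
            for j in req:
                prob += y[i] <= x[j]  # meal i is cookable only if ingredient j is bought
        prob += lpSum(x) <= budget    # at most 32 ingredients
        prob += lpSum(y)              # objective: number of cookable meals
        prob.solve()
        return [i for i in range(num_ingredients) if x[i].value() == 1]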
There will be a http://en.wikipedia.org/wiki/Branch_and_bound solution to this, but it may be too expensive to get the exact answer - ILP as suggested by j_random_hacker is probably better - the LP relaxation of that is probably a better heuristic than the relaxation proposed here, and the ILP solver will be heavily optimized.
The basic idea is that you do a recursive depth first search of a tree of partial solutions, extending them one at a time. Once you recurse far enough down to reach a fully populated solution you can start keeping track of the best solution found so far. If I label your ingredients A, B, C, D... a partial solution is a list of ingredients of length <= 32. You start with the zero-length solution, then when you visit a partial solution e.g. ABC you consider ABCD, ABCE, ... and so on, and may visit some of these.
For each partial solution you work out the maximum score that any descendant of that solution could achieve. Getting an accurate idea of this is important. Here is one suggestion: suppose you have a partial solution of length 20. This leaves 12 ingredients to be chosen, so the best you could possibly do is to make every dish that requires no more than 12 ingredients beyond the 20 you have chosen so far. Work out how many such dishes there are; that count is one example of an upper bound on the score of any descendant of the partial solution.
Now when you consider extending the partial solution ABC to ABCD or ABCE or ABCF... if you have a best solution found so far you can ignore all extensions that cannot possibly score more than the best solution so far - this means that you don't need to consider all possible combinations of your 32 ingredients.
Once you have worked out which of the possible extensions might contain a new best answer, your recursive search should continue with the most promising of these possible extensions, because this is the one most likely to survive finding a better best solution so far.
One way to make this fast is to code it cleverly so that recursing up and down means only small changes to the existing data structure which you typically make on the way down and reverse on the way up.
Another way is to cut corners. One obvious way is to stop when you run out of time and go with the best solution found so far at that stage. Another way is to discard partial solutions more aggressively. If you have a best score so far of e.g. 100 you could discard partial solutions that couldn't score better than 110. This speeds up the search, and you know that although you might have missed a better answer than 100, whatever you missed could not have scored better than 110.
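A skeletal version of this search, using the bound described above (untested sketch; dishes is assumed to be a list of frozensets of ingredient indices):

    def branch_and_bound(dishes, ingredients, k=32):
        best = [0]

        def bound(chosen, slots):
            # Optimistic: count dishes needing at most `slots` more ingredients.
            return sum(1 for d in dishes if len(d - chosen) <= slots)

        def recurse(chosen, candidates):
            best[0] = max(best[0], sum(1 for d in dishes if d <= chosen))
            slots = k - len(chosen)
            if slots == 0 or bound(chosen, slots) <= best[0]:
                return  # prune: no descendant can beat the best so far
            for i, ing in enumerate(candidates):
                recurse(chosen | {ing}, candidates[i + 1:])

        recurse(frozenset(), list(ingredients))
        return best[0]

For speed you would also want the incremental data-structure updates and the most-promising-first ordering described above; this skeleton only shows the pruning.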
Solving some discrete mathematics huh? Well here is the wiki.
You also have not factored in anything about quantity. For example, flour would be used in a lot of fried recipes, but buying 10 pounds of flour might not be great. And cost might be prohibitive for some ingredients that your solution wants. Not to mention a lot of ingredients are in everything (milk, water, salt, pepper, sugar, things like that).
In reality, optimization to this degree is probably not necessary. But I will not provide relationship advice on SO.
As for a new solution:
I would suggest identifying a lot of what you want to make and with what, and then writing a program to suggest things to make with the rest.
Why not just order the list of ingredients by the number of dishes they are used in?
This would be more like a greedy solution, of course, but it should give you some clues about what ingredients are most often used. From that you can compile a list of dishes that can be cooked already with the top 30 (or whatever) ingredients.
Also you could order the list of remaining (non-cookable) dishes by number of missing ingredients and maybe try to optimize on that to maximize the number of cookable dishes.
To be more "algorithmic", I think a local search is most promising here. Start with a candidate solution (random assignments to the 32 ingredients) and calculate as a fitness function the number of cookable dishes. Then check the neighboring states (switching one ingredient) and move to the state with the highest value. Repeat until a maximum is reached. Do this veeeery often and you should find a good solution. (This would be a simple greedy hill-climbing algorithm)
There are a lot of local search algorithms, you should be able to find more than enough information on the net. Most often you won't find the optimal solution (of course that depends on the problem), but a very good one nonetheless.
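A bare-bones version of such a hill climber might look like this (a sketch; dishes is assumed to be a list of sets of ingredient indices):

    import random

    def hill_climb(dishes, num_ingredients=100, k=32, restarts=100):
        def fitness(chosen):
            return sum(1 for d in dishes if d <= chosen)

        best, best_score = None, -1
        for _ in range(restarts):
            chosen = set(random.sample(range(num_ingredients), k))
            score = fitness(chosen)
            improved = True
            while improved:
                improved = False
                # Neighbours: swap one chosen ingredient for one unchosen one.
                for out in list(chosen):
                    for inn in set(range(num_ingredients)) - chosen:
                        cand = (chosen - {out}) | {inn}
                        if fitness(cand) > score:
                            chosen, score, improved = cand, fitness(cand), True
                            break
                    if improved:
                        break
            if score > best_score:
                best, best_score = chosen, score
        return best, best_score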
We have a simulation program where we take a very large population of individual people and group them into families. Each family is then run through the simulation.
I am in charge of grouping the individuals into families, and I think it is a really cool problem.
Right now, my technique is pretty naive/simple. Each individual record has some characteristics, including married/single, age, gender, and income level. For married people I select an individual and loop through the population and look for a match based on a match function. For people/couples with children I essentially do the same thing, looking for a random number of children (selected according to an empirical distribution) and then loop through all of the children and pick them out and add them to the family based on a match function. After this, not everybody is matched, so I relax the restrictions in my match function and loop through again. I keep doing this, but I stop before my match function gets too ridiculous (marries 85-year-olds to 20-year-olds for example). Anyone who is leftover is written out as a single person.
This works well enough for our current purposes, and I'll probably never get time or permission to rework it, but I at least want to plan for the occasion or learn some cool stuff - even if I never use it. Also, I'm afraid the algorithm will not work very well for smaller sample sizes. Does anybody know what type of algorithms I can study that might relate to this problem or how I might go about formalizing it?
For reference, I'm comfortable with chapters 1-26 of CLRS, but I haven't really touched NP-Completeness or Approximation Algorithms. Not that you shouldn't bring up those topics, but if you do, maybe go easy on me because I probably won't understand everything you are talking about right away. :) I also don't really know anything about evolutionary algorithms.
Edit: I am specifically looking to improve the following:
Less ridiculous marriages.
Fewer single people at the end.
Perhaps what you are looking for is cluster analysis?
Let's try to think of your problem like this (starting by solving the spouse matching):
If you were to have a matrix where each row is a male and each column is a female, and every cell in that matrix is the match function's returned value, then what you are looking for is a selection of cells such that no row or column contains more than one selected cell, and the total sum of the selected cells is maximal. This is very similar to the N Queens Problem, with the modification that each allocation of a "queen" has a reward (which we should maximize).
You could solve this problem by using a graph where:
You have a root,
each of the first row's cells' values is an edge weight leading to a first-depth vertex,
each of the second row's cells' values is an edge weight leading to a second-depth vertex,
Etc.
(Notice that once you find a match for the first female, you shouldn't consider her anymore, and the same goes for every other female you match)
Then finding the maximum allocation can be done by BFS, or better still by A* (notice A* typically looks for minimum cost, so you'll have to modify it a bit).
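As an aside, this matrix formulation is exactly the classic assignment problem, which can be solved optimally in polynomial time (the Hungarian algorithm); scipy has it built in. A minimal sketch with a made-up 3x3 score matrix:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # score[i][j] = the match function's value for male i and female j (made-up data)
    score = np.array([[7.0, 2.0, 1.0],
                      [3.0, 9.0, 4.0],
                      [2.0, 5.0, 8.0]])

    # linear_sum_assignment minimizes cost, so negate the scores to maximize.
    rows, cols = linear_sum_assignment(-score)
    pairs = list(zip(rows, cols))  # at most one selected cell per row and column
    total = score[rows, cols].sum()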
For matching between couples (or singles, more on that later..) and children, I think KNN with some modifications is your best bet, but you'll need to optimize it to your needs. But now I have to address your edit..
How do you measure your algorithm's efficiency?
You need a function that receives the expected distribution of all states (single, married with one child, single with two children, etc.) and the distribution of all states in your solution, and grades the solution accordingly. How do you calculate the expected distribution? That's quite a bit of statistics work..
First you need to know the distribution of all states (single, married.. as mentioned above) in the population,
then you need to know the distribution of ages and genders in the population,
and last thing you need to know - the distribution of ages and genders in your population.
Only then, according to those three, can you calculate how many people you expect to be in each state.. And then you can measure the distance between what you expected and what you got... That is a lot of typing.. Sorry for the general parts...
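As one concrete (and purely illustrative) choice of grading function, the total variation distance between the expected and observed state distributions:

    def distribution_distance(expected, observed):
        # expected / observed: dicts mapping each state (e.g. "single",
        # "married, one child") to its proportion of the population.
        states = set(expected) | set(observed)
        return sum(abs(expected.get(s, 0.0) - observed.get(s, 0.0))
                   for s in states) / 2.0

A result of 0 means the solution matches the expected distribution exactly; 1 means it is as far off as possible.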