Grouping individuals into families - algorithm

We have a simulation program where we take a very large population of individual people and group them into families. Each family is then run through the simulation.
I am in charge of grouping the individuals into families, and I think it is a really cool problem.
Right now, my technique is pretty naive/simple. Each individual record has some characteristics, including married/single, age, gender, and income level. For married people I select an individual, loop through the population, and look for a match based on a match function. For people/couples with children I essentially do the same thing: I draw a random number of children (selected according to an empirical distribution), then loop through all of the children, pick them out, and add them to the family based on a match function. After this, not everybody is matched, so I relax the restrictions in my match function and loop through again. I keep doing this, but I stop before my match function gets too ridiculous (marries 85-year-olds to 20-year-olds, for example). Anyone who is left over is written out as a single person.
This works well enough for our current purposes, and I'll probably never get time or permission to rework it, but I at least want to plan for the occasion or learn some cool stuff - even if I never use it. Also, I'm afraid the algorithm will not work very well for smaller sample sizes. Does anybody know what type of algorithms I can study that might relate to this problem or how I might go about formalizing it?
For reference, I'm comfortable with chapters 1-26 of CLRS, but I haven't really touched NP-Completeness or Approximation Algorithms. Not that you shouldn't bring up those topics, but if you do, maybe go easy on me because I probably won't understand everything you are talking about right away. :) I also don't really know anything about evolutionary algorithms.
Edit: I am specifically looking to improve the following:
Fewer ridiculous marriages.
Fewer single people at the end.

Perhaps what you are looking for is cluster analysis?

Let's try to think of your problem like this (starting with the spouse matching):
Imagine a matrix where each row is a male, each column is a female, and every cell holds the match function's value for that pair. What you are looking for is a selection of cells such that no row or column contains more than one selected cell, and the total sum of the selected cells is maximal. This is very similar to the N Queens Problem, with the modification that each placement of a "queen" carries a reward (which we want to maximize).
You could solve this problem by using a graph where:
You have a root,
each of the first row's cell values is an edge weight leading to a first-depth vertex,
each of the second row's cell values is an edge weight leading to a second-depth vertex,
Etc.
(Notice that when you find a match for the first female, you shouldn't consider her anymore, and the same goes for every other female you match.)
Then finding the maximum allocation can be done by BFS, or better still by A* (notice A* typically looks for minimum cost, so you'll have to modify it a bit).
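Incidentally, that cell-selection problem (at most one selected cell per row and per column, maximum total score) is exactly the classic assignment problem, so instead of a hand-rolled BFS/A* you could hand the score matrix to an off-the-shelf solver. A minimal sketch, assuming you're willing to pull in SciPy; match_score here is a hypothetical stand-in for your match function:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def pair_spouses(males, females, match_score):
        """Pick at most one female per male so the summed match score is maximal.

        males, females: lists of individual records.
        match_score(m, f): your match function, higher = better (hypothetical name).
        If the lists differ in length, the surplus people are simply left unmatched.
        """
        score = np.array([[match_score(m, f) for f in females] for m in males])
        # linear_sum_assignment minimizes by default; maximize=True flips it.
        rows, cols = linear_sum_assignment(score, maximize=True)
        return [(males[r], females[c]) for r, c in zip(rows, cols)]

If you give "ridiculous" pairs a very large negative score, the solver tends to avoid them whenever any better alternative exists, which speaks to the first item in your edit.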
For matching children to couples (or singles, more on that later), I think KNN with some modifications is your best bet, but you'll need to tune it to your needs. Which brings me to your edit...
How do you measure your algorithm's efficiency?
You need a function that takes the expected distribution of all states (single, married with one child, single with two children, etc.) and the distribution of states in your solution, and grades the solution accordingly. How do you calculate the expected distribution? That's quite a bit of statistics work.
First you need to know the distribution of all states (single, married.. as mentioned above) in the population,
then you need to know the distribution of ages and genders in the general population,
and last, you need to know the distribution of ages and genders in your own population.
Only then, according to those three, can you calculate how many people you expect to be in each state, and then measure the distance between what you expected and what you got. That is a lot of typing... sorry for the general parts.
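As a very rough sketch of that grading function, assuming states are just labels with counts and using total variation distance purely as an example metric (all names here are made up):

    from collections import Counter

    def grade_solution(expected, actual_states):
        """expected: dict mapping state label -> expected proportion (sums to 1).
        actual_states: list of state labels, one per family in the solution.
        Returns a score in [0, 1]; 1 means the distributions match exactly."""
        counts = Counter(actual_states)
        total = sum(counts.values())
        states = set(expected) | set(counts)
        observed = {s: counts.get(s, 0) / total for s in states}
        # Total variation distance: half the sum of absolute differences.
        tv = 0.5 * sum(abs(expected.get(s, 0.0) - observed[s]) for s in states)
        return 1.0 - tv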

Related

Algorithm to allocate resources by priority

My problem is the following:
My team and I are moving to another part of the office and we have to decide where everybody will sit. However, everybody has preferences. I would like to find an algorithm which helps us distribute the seats in a way that everybody is satisfied (or at least most of them).
I've started to implement my own algorithm where I ask everybody for their 3 preferred options (the team consists of 10 people and there are 10 places) and use their "seniority" (the length of time they have spent in the team) as a rank between them.
However, I've gotten stuck; I tried to browse the internet for an algorithm which solves a similar problem but didn't find any.
What would be the best way to solve this? Is there any generally known algorithm which solves this or a similar problem?
Thank you!
What first comes to mind for me is the stable marriage problem. Here's the problem statement for the original algorithm:
Given n men and n women, where each person has ranked all members of the opposite sex in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. When there are no such pairs of people, the set of marriages is deemed stable.
Please read up on the Gale–Shapley algorithm, which is what I'll adapt for this problem.
Have each worker make a list of their rankings for all the spots. These will be the "men". Then, each spot will use the seniority ranking as their rankings for the "men". The spots will be the "women" in the Gale-Shapley algorithm.
You will get a seat assignment that has no "unstable marriage". Here's what an unstable marriage is:
There is an element A of the first matched set which prefers some given element B of the second matched set over the element to which A is already matched, and
B also prefers A over the element to which B is already matched.
In this context, an unstable marriage means there is a worker-seat assignment between W1 and S1 such that another worker, W2, has ranked S1 higher than their own seat, and S1 has also ranked W2 higher. Since the seats built their lists from the seniority list, that means W2 has higher seniority.
In effect, this means that you'll get a seating assignment such that no worker has a seat that someone else with higher seniority wants "more".
The bottom of that Wiki article mentions packages in R and Python that have already implemented the algorithm, so it's just up to you to input the preference lists.
Disclaimer: This is probably not the most efficient algorithm. All the seats have the same ranking list, so there's probably a shortcut somewhere. However, it's easier to use a cannon to kill a fly, if the cannon is already written in R/Python for you. Also, this is the only algorithm I remember from uni, so this is the only hammer I have for any nail.
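If you'd rather not rely on those packages, a minimal sketch of Gale-Shapley adapted to workers and seats might look like this (assuming 10 workers, 10 seats, and that every worker ranks every seat; all names are made up):

    def assign_seats(worker_prefs, seniority):
        """worker_prefs: dict worker -> list of seats, most preferred first.
        seniority: list of workers, most senior first (every seat uses this ranking).
        Returns a dict seat -> worker containing no unstable pairs."""
        rank = {w: i for i, w in enumerate(seniority)}   # lower index = more senior
        next_choice = {w: 0 for w in worker_prefs}       # next seat each worker proposes to
        free = list(worker_prefs)                        # workers without a seat yet
        seat_of = {}                                     # seat -> worker
        while free:
            w = free.pop()
            s = worker_prefs[w][next_choice[w]]
            next_choice[w] += 1
            if s not in seat_of:
                seat_of[s] = w
            elif rank[w] < rank[seat_of[s]]:             # the seat prefers the more senior worker
                free.append(seat_of[s])
                seat_of[s] = w
            else:
                free.append(w)                           # rejected; w proposes to its next choice
        return seat_of

Because every seat ranks workers identically (by seniority), this collapses to the shortcut the disclaimer hints at: the most senior worker ends up with their first choice, the next with their first choice among what's left, and so on.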
I decided to implement a brute force solution as lots of the comments suggested.
So:
I asked everybody on the team to give a preference order over the seats (10 down to 1, which I use as the score for the "teamMember-seat" pairings, 10 being the highest score)
collected all of the "teamMember-seat" pairings with scores, e.g. name:Steve, seat:seat1, score:5 (the score comes from the order given in the previous step)
generated all the possible seating combinations from these
e.g.
List1: [name:Steve seat:seat1 score:5], [name:John seat:seat2 score:3] ... [name:X seat:seatY score:X]
List2: [name:Steve seat:seat2 score:4], [name:John seat:seat1 score:4] ... [name:X seat:seatY score:X]
...
ListX: [],[]...
chose the "teamMember-seat" list(s) with the highest score (score of the list is calculated by summing the scores of the "teamMember-seat" pairings)
if there are 2 lists with equal scores, then the algorithm choose that one where the most senior team members get the most preferred seats of them
if still there are more then one list (combination) the algorithm choose one randomly
I'm sure there are some better algorithms to do this as some of you suggested but I've run out of time.
I didn't post the code since it is really long and not too complicated to implement. However, if you need it, don't hesitate to drop a private message.
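That said, the brute-force idea above is compact enough to sketch; this is not the original code, just an illustration with made-up names (10! ≈ 3.6 million orderings, which is easily enumerable):

    from itertools import permutations

    def best_seating(scores, members, seats, seniority_rank):
        """scores[(member, seat)]: 10..1 taken from each member's preference order.
        seniority_rank[member]: 0 for the most senior member, 1 for the next, ...
        Returns the highest-scoring assignment, breaking ties by seniority."""
        best = None
        for perm in permutations(seats):
            assignment = list(zip(members, perm))
            total = sum(scores[(m, s)] for m, s in assignment)
            # Tie-break: compare senior members' individual scores first.
            tiebreak = tuple(scores[(m, s)] for m, s in
                             sorted(assignment, key=lambda ms: seniority_rank[ms[0]]))
            key = (total, tiebreak)
            if best is None or key > best[0]:
                best = (key, assignment)
        return best[1]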

What string distance algorithm is best for measuring typing accuracy?

I'm trying to write a function that detects how accurately the user typed a particular phrase/sentence/word/words. My objective is to build an app to train the user's typing accuracy of certain phrases.
My initial instinct is to use the basic Levenshtein distance algorithm (mostly because that's the only algo I knew off the top of my head).
But after a bit more research, I saw that Jaro-Winkler is a slightly more interesting algorithm because of its consideration for transpositions.
I even found a link that talks about the differences between these algorithms:
Difference between Jaro-Winkler and Levenshtein distance?
Having read all that, in addition to the respective Wikipedia posts, I am still a little clueless as to which algorithm fits my objective the best.
Since you are grading the quality of typing, and you want to train the student to make zero mistakes, you should use Levenshtein distance, because it is less forgiving.
Additionally, Levenshtein score is more intuitive to understand, and easier to represent graphically, than the Jaro-Winkler results. You can modify Levenshtein algorithm to report insertions, deletions, and mistypes separately, and show end-users a list of corrections. Jaro-Winkler, on the other hand, gives you a score that is hard to show to end-user, because penalties for misspelling in the middle are lower than penalties at the end.
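A small sketch of that idea: the standard dynamic-programming Levenshtein computation, with a backtrace so insertions, deletions, and substitutions can be reported separately (plain per-character comparison, no weighting):

    def typing_report(target, typed):
        """Return (distance, counts): Levenshtein distance plus a breakdown of edit types."""
        n, m = len(target), len(typed)
        # d[i][j] = edit distance between target[:i] and typed[:j]
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if target[i - 1] == typed[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion (a character was skipped)
                              d[i][j - 1] + 1,         # insertion (an extra character was typed)
                              d[i - 1][j - 1] + cost)  # substitution or correct keypress
        # Walk back through the table to classify each edit for the report.
        counts = {"insertions": 0, "deletions": 0, "substitutions": 0}
        i, j = n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (target[i - 1] != typed[j - 1]):
                if target[i - 1] != typed[j - 1]:
                    counts["substitutions"] += 1
                i, j = i - 1, j - 1
            elif j > 0 and d[i][j] == d[i][j - 1] + 1:
                counts["insertions"] += 1
                j -= 1
            else:
                counts["deletions"] += 1
                i -= 1
        return d[n][m], counts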
Slightly tongue-in-cheek, but only slightly: build a generative model for typing that gives high (prior) probability to hitting the right letter, and apportion out some probabilities for hitting two neighboring keys at once, two keys from different hands in the wrong order, two keys from the same hand in the wrong order, a key near the correct one, a key far from the correct one, etc. Or perhaps less ad-hoc: give your model a probability for a given sequence of keypresses given the current pair of keys needed to continue the passage. You could do a lot of things with such a model; for example, you could get a "distance"-like metric by giving a likelihood score for the learner's actual performance. But even better would be to give them a report summarizing which kinds of errors they make the most -- after all, why boil their performance down to a single number when many numbers would do? Bonus points if you learn the probabilities for the different kinds of errors from a large corpus of real typists' work.
I mostly agree with the answer given by dasblinkenlight; however, I would suggest using the Damerau-Levenshtein distance instead of plain Levenshtein, that is, including transpositions. Transpositions are fairly frequent and easy to make while typing, and there is no good reason why they should incur a double distance penalty relative to the other possible errors (insertions, deletions, and substitutions).

Need an algorithm approach to calculate meal plan

I’m having trouble solving a deceptively simple problem. My girlfriend and I are trying to formulate weekly meal plans and I had this brilliant idea that I could optimize what we buy in order to maximize the things that we could make from it. The trouble is, the problem is not as easy as it appears. Here’s the problem statement in a nutshell:
The problem:
Given a list of 100 ingredients and a list of 50 dishes that are composed of one or more of the 100 ingredients, find a list of 32 ingredients that can produce the maximum number of dishes.
This problem seems simple, but I’m finding that computing the answer is not trivial. The approach I’ve taken is to represent a combination of 32 ingredients as a 100-bit string with 32 of the bits set. Then I check which dishes can be made with that ingredient set. If the number of dishes is greater than the current maximum, I save off the list. Then I compute the next valid ingredient combination and repeat, repeat, and repeat.
The number of combinations of the 32 ingredients is staggering! The way that I see it, it would take about 300 trillion years to calculate using my method. I’ve optimized the code so that each combination takes a mere 75 microseconds to figure out. Assuming that I can optimize the code, I might be able to reduce the run time to a mere trillion years.
I’m thinking that a completely new approach is in order. I'm currently coding this in XOJO (REALbasic), but I think the real problem is with approach rather than specific implementation. Anybody have an idea for an approach that has a chance of completion during this century?
Thanks,
Ron
mcdowella's branch and bound solution will be a big improvement over exhaustive enumeration, but it might still take a few thousand years. This is the kind of problem that is really best solved by an ILP solver.
Assuming that the set of ingredients for meal i is given by R[i] = { R[i][1], R[i][2], ..., R[i][|R[i]|] }, you can encode the problem as follows:
Create an integer variable x[i] for each ingredient 1 <= i <= 100. Each of these variables should be constrained to the range [0, 1].
Create an integer variable y[i] for each meal 1 <= i <= 50. Each of these variables should be constrained to the range [0, 1].
For each meal i, create |R[i]| additional constraints of the form y[i] <= x[R[i][j]] for 1 <= j <= |R[i]|. These will guarantee that we can only set y[i] to 1 if all of meal i's ingredients have been included.
Add a constraint that the sum of all x[i] must be <= 32.
Finally, the objective function should be the sum of all y[i], and we should be trying to maximise this.
Solving this will produce assignments for all the variables x[i]: 1 means the ingredient should be included, 0 means it should not.
My feeling is that a commercial ILP solver like CPLEX or Gurobi will probably solve a 150-variable ILP problem like this in milliseconds; even freely available solvers like lp_solve, which as a rule are much slower, should have no problems. In the unlikely case that it seems to be taking forever, you can still solve the LP relaxation, which will be very fast (milliseconds) and will give you (a) an upper bound on the maximum number of meals that can be prepared and (b) "hints" in the variable values: although the x[i] will in general not be exactly 0 or 1, values close to 1 are suggestive of ingredients that should be included, while values close to 0 suggest unhelpful ingredients.
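As an illustration of that encoding, here's a sketch using the free PuLP modelling library (which ships with the CBC solver); it assumes dishes is a list of lists of ingredient indices, and the names are made up:

    import pulp

    def plan_ingredients(dishes, n_ingredients=100, max_ingredients=32):
        """dishes: list where dishes[i] is the list of ingredient indices meal i needs."""
        prob = pulp.LpProblem("meal_plan", pulp.LpMaximize)
        x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n_ingredients)]
        y = [pulp.LpVariable(f"y{i}", cat="Binary") for i in range(len(dishes))]
        # A meal only counts if every one of its ingredients is bought: y[i] <= x[j].
        for i, needed in enumerate(dishes):
            for j in needed:
                prob += y[i] <= x[j]
        prob += pulp.lpSum(x) <= max_ingredients   # at most 32 ingredients
        prob += pulp.lpSum(y)                      # objective: number of cookable meals
        prob.solve()
        chosen = [i for i in range(n_ingredients) if x[i].value() > 0.5]
        return chosen, int(pulp.value(prob.objective))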
There will be a branch and bound (http://en.wikipedia.org/wiki/Branch_and_bound) solution to this, but it may be too expensive to get the exact answer. ILP as suggested by j_random_hacker is probably better: its LP relaxation is probably a better heuristic than the relaxation proposed here, and the ILP solver will be heavily optimized.
The basic idea is that you do a recursive depth first search of a tree of partial solutions, extending them one at a time. Once you recurse far enough down to reach a fully populated solution you can start keeping track of the best solution found so far. If I label your ingredients A, B, C, D... a partial solution is a list of ingredients of length <= 32. You start with the zero-length solution, then when you visit a partial solution e.g. ABC you consider ABCD, ABCE, ... and so on, and may visit some of these.
For each partial solution you work out the maximum score that any descendant of that solution could achieve. Getting an accurate idea of this is important. Here is one suggestion: suppose you have a partial solution of length 20. That leaves 12 ingredients still to be chosen, so the best you could possibly do is to complete every dish that needs no more than 12 ingredients beyond the 20 you have already chosen. Count how many such dishes there are; that count is one example of a bound on the best possible score of any descendant of the partial solution.
Now when you consider extending the partial solution ABC to ABCD or ABCE or ABCF... if you have a best solution found so far you can ignore all extensions that cannot possibly score more than the best solution so far - this means that you don't need to consider all possible combinations of your 32 ingredients.
Once you have worked out which of the possible extensions might contain a new best answer, your recursive search should continue with the most promising of these possible extensions, because this is the one most likely to survive finding a better best solution so far.
One way to make this fast is to code it cleverly so that recursing up and down means only small changes to the existing data structure which you typically make on the way down and reverse on the way up.
Another way is to cut corners. One obvious way is to stop when you run out of time and go with the best solution found so far at that stage. Another is to discard partial solutions more aggressively: if your best score so far is, say, 100, you could discard partial solutions that couldn't score better than 110. This speeds up the search, and although you might have missed an answer better than 100, you know that whatever you missed could not have scored better than 110.
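A rough sketch of that recursion, with take-or-skip branching over the ingredients and the "dishes still completable" count as the bound (hypothetical names, and none of the incremental bookkeeping recommended above):

    def branch_and_bound(dishes, n_ingredients=100, limit=32):
        """dishes: list of sets of ingredient indices. Returns (best_count, best_ingredient_set)."""
        dishes = [set(d) for d in dishes]
        best = [0, set()]

        def bound(chosen, nxt):
            # Optimistic bound: count dishes whose missing ingredients all lie ahead
            # of position nxt and would individually fit in the remaining budget.
            remaining = limit - len(chosen)
            return sum(1 for d in dishes
                       if len(d - chosen) <= remaining and all(i >= nxt for i in d - chosen))

        def walk(chosen, nxt):
            made = sum(1 for d in dishes if d <= chosen)
            if made > best[0]:
                best[0], best[1] = made, set(chosen)
            if len(chosen) == limit or nxt == n_ingredients:
                return
            if bound(chosen, nxt) <= best[0]:
                return                     # prune: this subtree cannot beat the best so far
            chosen.add(nxt)                # branch 1: take ingredient nxt
            walk(chosen, nxt + 1)
            chosen.remove(nxt)             # branch 2: skip ingredient nxt
            walk(chosen, nxt + 1)

        walk(set(), 0)
        return best[0], best[1]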
Solving some discrete mathematics huh? Well here is the wiki.
You also have not factored in anything about quantity. For example, flour would be used in a lot of fried recipes, but buying 10 pounds of flour might not be great. And cost might be prohibitive for some ingredients that your solution wants. Not to mention a lot of ingredients are in everything (milk, water, salt, pepper, sugar, things like that).
In reality, optimization to this degree is probably not necessary. But I will not provide relationship advice on SO.
As for a new solution:
I would suggest identifying a lot of what you want to make and with what, and then writing a program to suggest things to make with the rest.
Why not just order the list of ingredients by the number of dishes they are used in?
This would be more like a greedy solution, of course, but it should give you some clues about what ingredients are most often used. From that you can compile a list of dishes that can be cooked already with the top 30 (or whatever) ingredients.
Also you could order the list of remaining (non-cookable) dishes by number of missing ingredients and maybe try to optimize on that to maximize the number of cookable dishes.
To be more "algorithmic", I think a local search is most promising here. Start with a candidate solution (random assignments to the 32 ingredients) and calculate as a fitness function the number of cookable dishes. Then check the neighboring states (switching one ingredient) and move to the state with the highest value. Repeat until a maximum is reached. Do this veeeery often and you should find a good solution. (This would be a simple greedy hill-climbing algorithm)
There are a lot of local search algorithms, you should be able to find more than enough information on the net. Most often you won't find the optimal solution (of course that depends on the problem), but a very good one nonetheless.
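A bare-bones sketch of that local search, with random restarts and single-ingredient swaps as the neighbourhood (this one takes the first improving swap it finds rather than scanning the whole neighbourhood; names are made up):

    import random

    def cookable(chosen, dishes):
        """Number of dishes whose ingredients are all in the chosen set."""
        return sum(1 for d in dishes if d <= chosen)

    def hill_climb(dishes, n_ingredients=100, limit=32, restarts=200):
        """dishes: list of sets of ingredient indices. Greedy local search with random restarts."""
        dishes = [set(d) for d in dishes]
        best_score, best_set = -1, set()
        for _ in range(restarts):
            chosen = set(random.sample(range(n_ingredients), limit))
            improved = True
            while improved:
                improved = False
                score = cookable(chosen, dishes)
                for out in list(chosen):
                    for inc in range(n_ingredients):
                        if inc in chosen:
                            continue
                        candidate = (chosen - {out}) | {inc}
                        if cookable(candidate, dishes) > score:
                            chosen, score = candidate, cookable(candidate, dishes)
                            improved = True
                            break
                    if improved:
                        break              # rescan from the improved state
            if score > best_score:
                best_score, best_set = score, chosen
        return best_score, best_set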

Writing Simulated Annealing algorithm for 0-1 knapsack in C#

I'm in the process of learning about simulated annealing algorithms and have a few questions on how I would modify an example algorithm to solve a 0-1 knapsack problem.
I found this great code on CP:
http://www.codeproject.com/KB/recipes/simulatedAnnealingTSP.aspx
I'm pretty sure I understand how it all works now (except the whole Boltzmann condition, which as far as I'm concerned is black magic, though I understand it's about escaping local optima, and apparently this does exactly that). I'd like to re-design this to solve a 0-1 knapsack-"ish" problem. Basically I'm putting one of 5,000 objects in 10 sacks and need to optimize for the least unused space. The actual "score" I assign to a solution is a bit more complex, but not related to the algorithm.
This seems easy enough. This means the Anneal() function would be basically the same. I'd have to implement the GetNextArrangement() function to fit my needs. In the TSP example, he just swaps two random nodes along the path (i.e., he makes a very small change each iteration).
For my problem, on the first iteration, I'd pick 10 random objects and look at the leftover space. For the next iteration, would I just pick 10 new random objects? Or am I best only swapping out a few of the objects, like half of them or only even one of them? Or maybe the number of objects I swap out should be relative to the temperature? Any of these seem doable to me, I'm just wondering if someone has some advice on the best approach (though I can mess around with improvements once I have the code working).
Thanks!
Mike
With simulated annealing, you want to make neighbour states as close in energy as possible. If the neighbours have significantly greater energy, then it will just never jump to them without a very high temperature -- high enough that it will never make progress. On the other hand, if you can come up with heuristics that exploit lower-energy states, then exploit them.
For the TSP, this means swapping adjacent cities. For your problem, I'd suggest a conditional neighbour selection algorithm as follows:
If there are objects that fit in the empty space, then it always puts the biggest one in.
If no objects fit in the empty space, then pick an object to swap out -- but prefer to swap objects of similar sizes.
That is, objects have a probability inverse to the difference in their sizes. You might want to use something like roulette selection here, with the slice size being something like (1 / (size1 - size2)^2).
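One way to read that suggestion in code (hypothetical names and representation; objects are just sizes here, and the roulette weights follow the 1/(size1 - size2)^2 idea with a small epsilon so equal sizes don't divide by zero):

    import random

    def next_arrangement(packed, unpacked, free_space, epsilon=1e-6):
        """One candidate move: insert the biggest object that still fits, otherwise swap a
        packed object for an unpacked one of similar size."""
        fitting = [o for o in unpacked if o <= free_space]
        if fitting:
            return ("insert", max(fitting))
        incoming = random.choice(unpacked)
        # Roulette selection: prefer swapping out an object close in size to the incoming one.
        weights = [1.0 / ((incoming - o) ** 2 + epsilon) for o in packed]
        outgoing = random.choices(packed, weights=weights, k=1)[0]
        return ("swap", outgoing, incoming)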
Ah, I think I found my answer on Wikipedia... It suggests moving to a "neighbor" state, which usually implies changing as little as possible (like swapping two cities in a TSP tour)...
From: http://en.wikipedia.org/wiki/Simulated_annealing
"The neighbours of a state are new states of the problem that are produced after altering the given state in some particular way. For example, in the traveling salesman problem, each state is typically defined as a particular permutation of the cities to be visited. The neighbours of some particular permutation are the permutations that are produced for example by interchanging a pair of adjacent cities. The action taken to alter the solution in order to find neighbouring solutions is called "move" and different "moves" give different neighbours. These moves usually result in minimal alterations of the solution, as the previous example depicts, in order to help an algorithm to optimize the solution to the maximum extent and also to retain the already optimum parts of the solution and affect only the suboptimum parts. In the previous example, the parts of the solution are the parts of the tour."
So I believe my GetNextArrangement function would want to swap out a random item with an item unused in the set..
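In rough terms (the linked example is C#, so this is only the shape of the move, not a drop-in GetNextArrangement):

    import random

    def get_next_arrangement(packed, unused):
        """Neighbour state: swap one randomly chosen packed item with one unused item."""
        new_packed, new_unused = list(packed), list(unused)
        i = random.randrange(len(new_packed))
        j = random.randrange(len(new_unused))
        new_packed[i], new_unused[j] = new_unused[j], new_packed[i]
        return new_packed, new_unused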

Looking for a multidimensional optimization algorithm

Problem description
There are different categories, each containing an arbitrary number of elements.
There are three different attributes A, B and C. Each element has a different distribution of these attributes, expressed as positive integer values. For example, element 1 has the attributes A: 42, B: 1337, C: 18. The sum of these attributes is not consistent across elements; some elements have more than others.
Now the problem:
We want to choose exactly one element from each category so that
We hit a certain threshold on attributes A and B (going over it is also possible, but not necessary)
while getting a maximum amount of C.
Example: we want to hit at least 80 A and 150 B in sum over all chosen elements and want as many C as possible.
I've thought about this problem and cannot imagine an efficient solution. The sample sizes are about 15 categories from which each contains up to ~30 elements, so bruteforcing doesn't seem to be very effective since there are potentially 30^15 possibilities.
My model is to think of it as a tree whose depth equals the number of categories. Each depth level represents a category and gives us the choice of one element from that category. When passing a node, we add the attributes of the represented element to the running sums we want to optimize.
If we hit the same attribute combination multiple times on the same level, we merge them so that we can strip away repeated computation of already-computed values. If we reach a level where one path has less value in all three attributes than another, we don't follow it any further.
However, in the worst case this tree still has ~30^15 nodes in it.
Can any of you think of an algorithm which might help me solve this problem? Or could you explain why you think no efficient algorithm exists for it?
This question is very similar to a variation of the knapsack problem. I would start by looking at solutions for that problem and see how well you can apply them to your stated problem.
My first inclination is to try branch-and-bound. You can do it breadth-first or depth-first, and I prefer depth-first because I think it's cleaner.
To express it simply, you have a tree-walk procedure walk that can enumerate all possibilities (maybe it just has a 5-level nested loop). It is augmented with two things:
At every step of the way, it keeps track of the cost at that point, where the cost can only increase. (If the cost can also decrease, it becomes more like a minimax game tree search.)
The procedure has an argument budget, and it does not search any branches where the cost can exceed the budget.
Then you have an outer loop:
for (budget = 0; budget < ... ; budget++) {
    walk(budget);
    // if walk finds a solution within the budget, halt
}
The amount of time it takes is exponential in the budget, so easier cases will take less time. The fact that you are re-doing the search doesn't matter much because each level of the budget takes as much or more time than all the previous levels combined.
Combine this with some sort of heuristic about the order in which you consider branches, and it may give you a workable solution for typical problems you give it.
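A compact sketch of that budgeted walk applied to the stated problem, under one particular reading: each category is a list of (A, B, C) tuples, and the "cost" is the C given up relative to taking each category's maximum-C element, which can only grow as choices are fixed (all names are made up):

    def pick_elements(categories, need_a, need_b):
        """categories: list of lists of (a, b, c) tuples; pick one tuple per category so that
        total a >= need_a, total b >= need_b, and total c is maximal (or None if impossible)."""
        max_a = [max(a for a, _, _ in cat) for cat in categories]
        max_b = [max(b for _, b, _ in cat) for cat in categories]
        max_c = [max(c for _, _, c in cat) for cat in categories]

        def walk(level, got_a, got_b, budget):
            if level == len(categories):
                return [] if got_a >= need_a and got_b >= need_b else None
            # Prune: even the most optimistic remaining picks miss a threshold.
            if got_a + sum(max_a[level:]) < need_a or got_b + sum(max_b[level:]) < need_b:
                return None
            for a, b, c in sorted(categories[level], key=lambda e: -e[2]):
                cost = max_c[level] - c              # C given up at this level
                if cost > budget:
                    break                            # later elements only cost more
                rest = walk(level + 1, got_a + a, got_b + b, budget - cost)
                if rest is not None:
                    return [(a, b, c)] + rest
            return None

        for budget in range(sum(max_c) + 1):         # smallest forgone C first => maximum total C
            picks = walk(0, 0, 0, budget)
            if picks is not None:
                return picks
        return None                                  # the thresholds can never be met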
IF that doesn't work, you can fall back on basic heuristic programming. That is, do some cases by hand, and pay attention to how you did it. Then program it the same way.
I hope that helps.

Resources