Algorithm for filling a matrix of item, item pairs - ruby

Hey guys, I have a sort of speed dating type application (not used for dating, just a similar concept) that compares users and matches them in a round based event.
Currently I am storing each user to user comparison (using cosine similarity) and then finding a round in which both users are available. My current set up works fine for smaller scale but I seem to be missing a few matchings in larger data sets.
For example with a setup like so (assuming 6 users, 3 from each group)
Round (User1, User2)
----------------------------
1 (x1,y1) (x2,y2) (x3,y3)
2 (x1,y2) (x2,y3) (x3,y1)
3 (x1,y3) (x2,y1) (x3,y2)
My approach works well right now to ensure I have each user meeting the appropriate user without having overlaps so a user is left out, just not with larger data sets.
My current algorithm
I store a comparison of each user from x to each user from y like so
Round, user1, user2, similarity
And to build the event schedule I simply sort the comparisons by similarity and then iterate over the results, finding an open round for both users, like so:
event.user_maps.all(:order => 'similarity desc').each do |map|
(1..event.rounds).each do |round|
if user_free_in_round?(map.user1) and user_free_in_round?(map.user2)
#creates the pairing and breaks from the loop
end
end
end
This isn't exact code but the general algorithm to build the schedule. Does anyone know a better way of filling in a matrix of item pairings where no one item can be in more than one place in the same slot?
EDIT
For some clarification, the issue I am having is that in larger sets my algorithm of placing highest similarity matches first can sometimes result in collisions. What I mean by that is that the users are paired in such a way that they have no other user to meet with.
Like so:
Round (User1, User2)
----------------------------
1 (x1,y1) (x2,y2) (x3,y3)
2 (x1,y3) (x2,nil) (x3,y1)
3 (x1,y2) (x2,y1) (x3,y2)
I want to be able to prevent this from happening while preserving the need for higher similar users given higher priority in scheduling.
In real scenarios there are far more matches than there are available rounds and an uneven number of x users to y users and in my test cases instead of getting every round full I will only have about 90% or so of them filled while collisions like the above are causing problems.

I think the question still needs clarification even after edit, but I could be missing something.
As far as I can tell, what you want is that each new round should start with the best possible matching (defined as sum of the cosine similarities of all the matched pairs). After any pair (x_i,y_j) have been matched in a round, they are not eligible for the next round.
You could do this by building a bipartite graph where your Xs are nodes in one side and Ys are nodes in another side, and the edge weight is cosine similarity. Then you find the max weighted match in this graph. For the next rounds, you eliminate the edges that have already been used in previous round and run the matching algorithm again. For details on how to code max weight matching in bipartite graph, see here.
BTW, this solution is not optimum since we are proceeding from one round to next in a greedy fashion. I have a feeling that getting the optimum solution would be NP hard, but I don't have a proof so can't be sure.

I agree that the question still needs clarification. As Amit expressed, I have a gut feeling that this is an NP hard problem, so I am assuming that you are looking for an approximate solution.
That said, I would need more information on the tradeoffs you would be willing to make (and perhaps I'm just missing something in your question). What are the explicit goals of the algorithm?
Is there a lower threshold for similarity below which you don't want a pairing to happen? I'm still a bit confused as to why there would be individuals which could not be paired up at all during a given round...
Essentially, you are performing a search over the space of possible pairings, correct? Maybe you could use backtracking or some form of constraint-based algorithm to make sure that you can obtain a complete solution for a given round...?

Related

Mutually Overlapping Subset of Activites

I am prepping for a final and this was a practice problem. It is not a homework problem.
How do I go about attacking this? Also, more generally, how do I know when to use Greedy vs. Dynamic programming? Intuitively, I think this is a good place to use greedy. I'm also thinking that if I could somehow create an orthogonal line and "sweep" it, checking the #of intersections at each point and updating a global max, then I could just return the max at the end of the sweep. I'm not sure how to plane sweep algorithmically though.
a. We are given a set of activities I1 ... In: each activity Ii is represented by its left-point Li and its right-point Ri. Design a very efficient algorithm that finds the maximum number of mutually overlapping subset of activities (write your solution in English, bullet by bullet).
b. Analyze the time complexity of your algorithm.
Proposed solution:
Ex set: {(0,2) (3,7) (4,6) (7,8) (1,5)}
Max is 3 from interval 4-5
1) Split start and end points into two separate arrays and sort them in non-decreasing order
Start points: [0,1,3,4,7] (SP)
End points: [2,5,6,7,8] (EP)
I know that I can use two pointers to sort of simulate the plane sweep, but I'm not exactly sure how. I'm stuck here.
I'd say your idea of a sweep is good.
You don't need to worry about planar sweeping, just use the start/end points. Put the elements in a queue. In every step take the smaller element from the queue front. If it's a start point, increment current tasks count, otherwise decrement it.
Since you don't need to point which tasks are overlapping - just the count of them - you don't need to worry about specific tasks duration.
Regarding your greedy vs DP question, in my non-professional opinion greedy may not always provide valid answer, whereas DP only works for problem that can be divided into smaller subproblems well. In this case, I wouldn't call your sweep-solution either.

Most optimal match-up

Let's assume you're a baseball manager. And you have N pitchers in your bullpen (N<=14) and they have to face M batters (M<=100). Also to mention you know the strength of each of the pitchers and each of the batters. For those who are not familiar to baseball once you brought in a relief pitcher he can pitch to k consecutive batters, but once he's taken out ofthe game he cannot come back.
For each pitcher the probability that he's gonna lose his match-ups is given by (sum of all batter he will face)/(his strength). Try to minimize these probabilities, i.e. try to maximize your chances of winning the game.
For example we have 3 pitchers and they have to face 3 batters. The batters' stregnths are:
10 40 30
While the strength of your pitchers is:
40 30 3
The most optimal solution would be to bring the strongest pitcher to face the first 2 batters and the second to face the third batter. Then the probability of every pitcher losing his game will be:
50/40 = 1.25 and 30/30 = 1
So the probability of losing the game would be 1.25 (This number can be bigger than 100).
How can you find the optimal number? I was thinking to take a greedy approach, but I suspect whether it will always hold. Also the fact that the pitcher can face unlimited number of batters (I mean it's only limited by M) poses the major problem for me.
Probabilities must be in the range [0.0, 1.0] so what you call a probability can't be a probability. I'm just going to call it a score and minimize it.
I'm going to assume for now that you somehow know the order in which the pitchers should play.
Given the order, what is left to decide is how long each pitcher plays. I think you can find this out using dynamic programming. Consider the batters to be faced in order. Build an NxM table best[pitchers, batter] where best[i, j] is the best score you can make considering just the first j batters using the first i pitchers, or HUGE if it does not make sense.
best[1,1] is just the score for the best pitcher against the first batter, and best[1,j] doesn't make sense for any other values of j.
For larger values of i you work out best[i,j] by considering when the last change of pitcher could be, considering all possibilities (so 1, 2, 3...i). If the last change of pitcher was at time t, then look up best[t, j-1] to get the score up to the time just before that change, and then calculate the a/b value to take account of the sum of batter strengths between time t+1 and time i. When you have considered all possible times, take the best score and use it as the value for best[i, j]. Note down enough info (such as the last time of pitcher change that turned out to be best) so that once you have calculated best[N, M], you can trace back to find the best schedule.
You don't actually know the order, and because the final score is the maximum of the a/b value for each pitcher, the order does matter. However, given a separation of players into groups, the best way to assign pitchers to groups is to assign the best pitcher to the group with the highest total score, the next best pitcher to the group with the next best total score, and so on. So you could alternate between dividing batters into groups, as described above, and then assigning pitchers to groups to work out the order the pitchers really should be in - keep doing this until the answer stops changing and hope the result is a global optimum. Unfortunately there is no guarantee of this.
I'm not convinced that your score is a good model for baseball, especially since it started out as a probability but can't be. Perhaps you should work out a few examples (maybe even solving small examples by brute force) and see if the results look reasonable.
Another way to approach this problem is via http://en.wikipedia.org/wiki/Branch_and_bound.
With branch and bound you need some way to describe partial answers, and you need some way to work out a value V for a given partial answer, such that no way of extending that partial answer can possibly produce a better answer than V. Then you run a tree search, extending partial answers in every possible way, but discarding partial answers which can't possibly be any better than the best answer found so far. It is good if you can start off with at least a guess at the best answer, because then you can discard poor partial answers from the start. My other answer might provide a way of getting this.
Here a partial answer is a selection of pitchers, in the order they should play, together with the number of batters they should pitch to. The first partial answer would have 0 pitchers, and you could extend this by choosing each possible pitcher, pitching to each possible number of batters, giving a list of partial answers each mentioning just one pitcher, most of which you could hopefully discard.
Given a partial answer, you can compute the (total batter strength)/(Pitcher strength) for each pitcher in its selection. The maximum found here is one possible way of working out V. There is another calculation you can do. Sum up the total strengths of all the batters left and divide by the total strengths of all the pitchers left. This would be the best possible result you could get for the pitchers left, because it is the result you get if you somehow manage to allocate pitchers to batters as evenly as possible. If this value is greater than the V you have calculated so far, use this instead of V to get a less optimistic (but more accurate) measure of how good any descendant of that partial answer could possibly be.

Optimal placement of objects wrt pairwise similarity weights

Ok this is an abstract algorithmic challenge and it will remain abstract since it is a top secret where I am going to use it.
Suppose we have a set of objects O = {o_1, ..., o_N} and a symmetric similarity matrix S where s_ij is the pairwise correlation of objects o_i and o_j.
Assume also that we have an one-dimensional space with discrete positions where objects may be put (like having N boxes in a row or chairs for people).
Having a certain placement, we may measure the cost of moving from the position of one object to that of another object as the number of boxes we need to pass by until we reach our target multiplied with their pairwise object similarity. Moving from a position to the box right after or before that position has zero cost.
Imagine an example where for three objects we have the following similarity matrix:
1.0 0.5 0.8
S = 0.5 1.0 0.1
0.8 0.1 1.0
Then, the best ordering of objects in the tree boxes is obviously:
[o_3] [o_1] [o_2]
The cost of this ordering is the sum of costs (counting boxes) for moving from one object to all others. So here we have cost only for the distance between o_2 and o_3 equal to 1box * 0.1sim = 0.1, the same as:
[o_3] [o_1] [o_2]
On the other hand:
[o_1] [o_2] [o_3]
would have cost = cost(o_1-->o_3) = 1box * 0.8sim = 0.8.
The target is to determine a placement of the N objects in the available positions in a way that we minimize the above mentioned overall cost for all possible pairs of objects!
An analogue is to imagine that we have a table and chairs side by side in one row only (like the boxes) and you need to put N people to sit on the chairs. Now those ppl have some relations that is -lets say- how probable is one of them to want to speak to another. This is to stand up pass by a number of chairs and speak to the guy there. When the people sit on two successive chairs then they don't need to move in order to talk to each other.
So how can we put those ppl down so that every distance-cost between two ppl are minimized. This means that during the night the overall number of distances walked by the guests are close to minimum.
Greedy search is... ok forget it!
I am interested in hearing if there is a standard formulation of such problem for which I could find some literature, and also different searching approaches (e.g. dynamic programming, tabu search, simulated annealing etc from combinatorial optimization field).
Looking forward to hear your ideas.
PS. My question has something in common with this thread Algorithm for ordering a list of Objects, but I think here it is better posed as problem and probably slightly different.
That sounds like an instance of the Quadratic Assignment Problem. The speciality is due to the fact that the locations are placed on one line only, but I don't think this will make it easier to solve. The QAP in general is NP hard. Unless I misinterpreted your problem you can't find an optimal algorithm that solves the problem in polynomial time without proving P=NP at the same time.
If the instances are small you can use exact methods such as branch and bound. You can also use tabu search or other metaheuristics if the problem is more difficult. We have an implementation of the QAP and some metaheuristics in HeuristicLab. You can configure the problem in the GUI, just paste the similarity and the distance matrix into the appropriate parameters. Try starting with the robust Taboo Search. It's an older, but still quite well working algorithm. Taillard also has the C code for it on his website if you want to implement it for yourself. Our implementation is based on that code.
There has been a lot of publications done on the QAP. More modern algorithms combine genetic search abilities with local search heuristics (e. g. Genetic Local Search from Stützle IIRC).
Here's a variation of the already posted method. I don't think this one is optimal, but it may be a start.
Create a list of all the pairs in descending cost order.
While list not empty:
Pop the head item from the list.
If neither element is in an existing group, create a new group containing
the pair.
If one element is in an existing group, add the other element to whichever
end puts it closer to the group member.
If both elements are in existing groups, combine them so as to minimize
the distance between the pair.
Group combining may require reversal of order in a group, and the data structure should
be designed to support that.
Let me help the thread (of my own) with a simplistic ordering approach.
1. Order the upper half of the similarity matrix.
2. Start with the pair of objects having the highest similarity weight and place them in the center positions.
3. The next object may be put on the left or the right side of them. So each time you may select the object that when put to left or right
has the highest cost to the pre-placed objects. Goto Step 2.
The selection of Step 3 is because if you left this object and place it later this cost will be again the greatest of the remaining, and even more (farther to the pre-placed objects). So the costly placements should be done as earlier as it can be.
This is too simple and of course does not discover a good solution.
Another approach is to
1. start with a complete ordering generated somehow (random or from another algorithm)
2. try to improve it using "swaps" of object pairs.
I believe local minima would be a huge deterrent.

matching algorithm

I'm writing an application which divides a population of users into pairs for the purpose of performing a task together. Each user can specify various preferences about their partner, e.g.
gender
language
age
location (typically, within X miles/kilometers from where the user lives)
Ideally, I would like the user to be able to specify whether each of these preferences is a "nice to have" or a "must have", e.g. "I would prefer to be matched with a native English speaker, but I must not be matched with a female".
My objective is to maximise the overall average quality of the matches. For example, assume there are 4 users in the system, A, B, C, D. These users can be matched in 3 ways:
Option 1 Match Score
A-B 5
C-D 4
---
Average 4.5
Option 2 Match Score
A-C 2
B-D 3
---
Average 2.5
Option 3 Match Score
A-D 1
B-C 9
---
Average 5
So in this contrived example, the 3rd option would be chosen because it has the highest overall match quality, even though A and D are not very well matched at all.
Is there an algorithm that can help me to:
calculate the "match scores" shown above
choose the pairings that will maximise the average match score (while respecting each user's absolute constraints)
It is not absolutely necessary that each user is matched, so given a choice between significantly lowering the overall quality of the matches, and leaving a few users without a match, I would choose the latter.
Obviously, I would like the algorithm that calculates the matches to complete as quickly as possible, because the number of users in the system could be quite large.
Finally, this system of computing match scores and maximising the overall average is just a heurisitic I've come up with myself. If there's a much better way to calculate the pairings, please let me know.
Update
The problem I've described seems to be a similar to the stable marriage problem for which there is a well-known solution. However, in this problem I do not require the chosen pairs to be stable. My goal is to choose the pairs so that the average "match score" is maximized
What maximum match algorithms have you been looking at? I read your question too hastily at first: it seems you don't necessarily restrict yourself to a bipartite graph. This seems trickier.
I believe this problem could be represented as a linear programming problem. And then you can use Simplex method to solve it.
To find a maximum matching in an arbitrary graph there is a weighted variant of Edmond's matching algorithm:
http://en.wikipedia.org/wiki/Edmonds's_matching_algorithm#Weighted_matching
See the footnotes there.
I provided a possible solution to a similar problem here. It's an algorithm for measuring dissimilarity--the more similar measured data is to expected data, the smaller your resulting number will be.
For your application you would set a person's preferences as the expected data and each other person you compare against would be the measured data. You would want to filter the 'measured data' to eliminate those cases like "must not be matched with a female", that you mention in your original question, before running the comparison.
Another option could be using a Chi-Square algorithm.
By the looks of it your problem is not bipartite, therefore it would seem to me that you are looking for a maximum weight matching in a general graph. I don't envy the task of writing this as Edmond's blossum shrinking algorithm is not easy to understand or implement efficiently. There are implementations of this algorithm out there, one such example being the C++ library LEMON (http://lemon.cs.elte.hu/trac/lemon). If you want a maximum cardinality maximum weight matching you will have to use the maximum weight matching algorithm and add a large weight (sum of all the weights) to each edge to force maximum cardinality as the first priority.
Alternatively as you mentioned in one of the comments above that your match terms are not linear and so linear programming is out, you could always take a constraint programming approach which does not require that the terms be linear.

Ticket drafting algorithm

I'm trying to write a program to automate a ticket draft.
We have a certain number of season ticket passes and want to split up the tickets among a group of people. There are X number of games, Y number of season passes, and Z number of people. Each of Z people has ranked the X games.
My code basically goes through the draft order and back picking out the tickets from their ranking if available, otherwise, picking the next highest ranking. For the most part it works. The problem is, there's a point where most of the tickets are taken and the remaining tickets left are ones you already have so you just don't pick them. People therefore have different numbers of tickets. Is there a good way to get around this?
If you have X games and Y season passes, presumably there are X*Y tickets available to give to the Z people, right?
This sounds like it could be treated as an optimization problem, but to do so you have to identify your main goals? I'm guessing you want each person to receive X*Y / Z tickets (split them evenly), but maybe not. I'm guessing you also want to maximize the aggregate satisfaction (defined in some way according to the rankings) in tickets. You would probably want to give a large penalty in satisfaction for a person if he receives more than 1 ticket for the same game. I believe this last aspect might be why the straight draft approach is not the best, but I could be mistaken.
Once you are clear on what you are trying to optimize (if this is indeed an optimization problem), then you can consider the best approach to the problem. This could be your own custom-built solution, or you could try an existing technique (genetic algorithm, etc.). Before doing so though it is important that you frame the problem properly.
If there were no preferences involved, this would be a straight min-cut max flow problem. http://en.wikipedia.org/wiki/Maximum_flow_problem, as follows:
Create a source vertex A. From A, create Z vertices, one for each person. The capacity can be infinite (or very, very large). Create a sink B, and create X vertices, one for each game, linked to B; the capacity should be Y (you have Y tickets per game). From each person, link to each game they've ranked, with capacity 1.
If you look at the wiki link above, there are about 10 algorithms to solve this basic problem. Find one you understand and can implement yourself, because you'll need to modify it slightly. I'm not familiar with all of them, but the ones I know about have a step 'pick an edge' or 'pick a path.' You should modify the 'how you pick an edge' logic to take the priority ordering of the games into account. I'm not sure exactly what the ordering should be (you'll probably need to experiment), but if you say the lowest ranked game is 1, the next is 2, up to X, then a score like 'ranking of the edge - number of games the person is already signed up for' might work.
I think this is a variant of the Stable Marriage Problem or the Stable Roommates Problem for which there are known algorithms for solving.

Resources