Is there a way to modify topological sort to handle concurrent prerequisites?

I've been trying to build a schedule generator for my school using topological sort, but I'm stuck on classes whose prerequisites can be taken concurrently. Is there any clever way to modify topological sort to deal with these concurrent classes? For example, an intro to CS course can be taken either before a Data Structures course or at the same time as it. I'm trying to include the case where they are taken together.

You could create a dummy node combining the two courses (assuming each course has at most a small number of concurrent prerequisites, since you will likely need every combination of them; this works fine if you have only one or two concurrent courses).
The prerequisites of the combined node are the combined prerequisites of both courses, and every course that has either of the two as a prerequisite gets the dummy node as a prerequisite as well.
As post-processing, once the topological sort has finished, you can clean up the redundancies and split dummy nodes back into the original courses.
That said, note that topological sort doesn't guarantee the dummy node is actually used before the original nodes, even when that's possible; it will only be preferred if you tie-break in its favor when you can.

I can't mathematically guarantee its correctness, but this slight modification should work.
Use normal topological sorting with one difference: assign all possible beginning nodes a value of 0, and when a node is dequeued, assign it the maximum of its parents' values plus 1. That way, all nodes with a given value would ideally be parallel and can be picked together.

Kahn's algorithm for topological sorting naturally produces a minimum length schedule with concurrency:
1. Make a dependency graph of all your courses.
2. Select all courses with no dependencies. These can be taken concurrently.
3. Remove the selected courses from the graph.
4. If the graph is not empty, go back to (2).
Of course, students are limited in the number of courses they can take simultaneously, and the problem gets tricky when you also impose a limit on maximum concurrency. Deciding the best courses to take first, when too many courses are available, is an NP-hard problem. There are some heuristics you can try, though, like deferring the jobs with the shortest dependent depth.
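Here is a minimal Python sketch of the layered loop above; all names are my own invention. It treats every prerequisite as strict, so a prerequisite that may be satisfied concurrently could simply be left out of prereqs, or handled with the dummy-node trick from the first answer:

    from collections import defaultdict

    def semester_layers(courses, prereqs):
        # prereqs: list of (a, b) pairs meaning course a must be completed before course b
        indegree = {c: 0 for c in courses}
        dependents = defaultdict(list)
        for a, b in prereqs:
            dependents[a].append(b)
            indegree[b] += 1
        layers = []
        ready = [c for c in courses if indegree[c] == 0]
        while ready:
            layers.append(ready)  # everything in one layer can be taken concurrently
            nxt = []
            for course in ready:
                for d in dependents[course]:
                    indegree[d] -= 1
                    if indegree[d] == 0:
                        nxt.append(d)
            ready = nxt
        return layers  # layers[i] holds the courses takeable in semester i

If prereqs contains a cycle, some courses never reach indegree 0 and are silently dropped, so a real implementation should check that the layers cover all courses.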

If you think about exactly what you want as output, it might become clearer. For instance, if your desired output is a possible list of which courses to take in which semester, then each vertex involved in the topological sort could be “course X on semester Y” rather than just “course X”. Then you'd get these edges, among many others:
intro to CS on semester 1 → data structures on semester 1
intro to CS on semester 1 → data structures on semester 2
This graph would of course be larger than if the vertices were just courses: the number of vertices is now the number of courses times the maximum number of semesters in your education. But in a realistic setting, it appears to me that it wouldn't be too much to handle.
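For illustration, here is a small sketch that generates this expanded edge list; the function name, the concurrent_ok set, and the semester bound are assumptions of mine rather than anything standard:

    def course_semester_edges(prereqs, concurrent_ok, num_semesters):
        # prereqs: (a, b) pairs meaning a is a prerequisite of b
        # concurrent_ok: the (a, b) pairs whose prerequisite may be satisfied in the same semester
        edges = []
        for a, b in prereqs:
            for s in range(1, num_semesters + 1):
                earliest = s if (a, b) in concurrent_ok else s + 1  # same semester allowed?
                for t in range(earliest, num_semesters + 1):
                    edges.append(((a, s), (b, t)))
        return edges

With ("intro to CS", "data structures") in concurrent_ok, the result contains both the semester-1 → semester-1 edge and the semester-1 → semester-2 edge from the example above.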

Related

Algorithm for matching people together based on likes and dislikes

I have a group of about 75 people. Each user has liked or disliked the other 74 users. These people need to be divided into about 15 groups of various sizes (4 to 8 people). They need to be grouped together so that the groups consist only of people who all liked each other, or at least as much as possible.
I'm not sure what the best algorithm is to tackle this problem. Any pointers or pseudo code much appreciated!
This isn't specified quite well enough to suggest a particular algorithm. I suggest clustering and "clique" algorithms, but you'll still need to define your "best grouping" metric. "As much as possible", in the face of trade-offs and undefined desires, is meaningless; your clustering algorithm will need this metric to form your groups.
Data representation is simple: you need a directed graph. An edge from A to B means that A likes B; lack of an edge means A doesn't like B. That will encode the "likes" information in a form tractable to your algorithm. You have 75 nodes and one edge for every "like".
Start by researching clique algorithms; a "clique" is a set in which every member likes every other member. These will likely form the basis of your clustering.
Note, however, that you have to define your trade-offs. For instance, consider the case of 13 nodes consisting of two distinct cliques of 4 and 8 people, plus one person who likes one member of the 8-clique. There are no other "likes" in the graph.
How do you place that 13th person? Do you split the 8-clique and add them to the group with the person they like? If so, do you split off 3 or 4 people from the 8? Is it fair to break 15 or 16 "likes" to put that person with the one person they like -- who doesn't like them? Is it better to add the 13th person to the mutually antagonistic clique of 4?
Your eval function must return a well-ordered metric for all of these situations. It will need to support adding to a group, splitting a large group, etc.
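As a hedged starting point, here is a sketch of that representation plus clique enumeration; it assumes the networkx library is available (any maximal-clique routine would do), and treating only mutual likes as edges is my simplification:

    import networkx as nx

    def mutual_cliques(likes):
        # likes: set of (a, b) pairs meaning "a likes b"
        G = nx.Graph()
        G.add_edges_from(pair for pair in likes if pair[::-1] in likes)  # keep mutual likes only
        return sorted(nx.find_cliques(G), key=len, reverse=True)  # maximal cliques, largest first

Cliques of size 4 to 8 can seed your groups; placing the leftover people is exactly where the trade-off metric discussed above comes in.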
It sounds like a clustering problem.
Each user is a node. If two users liked each other, there is an edge between their nodes.
If the users disliked each other, or one liked the other but not the other way around, then there is no edge between those nodes.
Once you process the like information into a graph, you will get an undirected graph (some nodes may be isolated if no one likes that user). Now the question becomes how to cut that graph into clusters of 4-8 connected nodes, which is a well-studied problem with a lot of possible algorithms:
https://www.google.com/search?q=divide+connected+graph+into+clusters
If you want to differentiate between the case where two people dislike each other and the case where one person likes the other but the other dislikes the first, then you can also introduce edge weights: each like is +1, and each dislike is -1. The question then becomes one of partitioning a weighted graph.

Decision Tree Binary Classifier shortcut (sorting)

Normally, at each node of the decision tree, we consider all features and all splitting points for each feature. We calculate the difference between the entropy of the entire node and the weighted average of the entropies of the potential left and right branches, and the feature + splitting feature_value that gives us the greatest entropy drop is chosen as the splitting criterion for that particular node.
Can someone explain why the above process, which requires (2^m - 2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, gives the same result as trying ONLY m - 1 splits:
sort the m distinct feature_values by the percentage of 1's among the samples in the node that take that feature_value for that feature.
Only try the m-1 ways of splitting the sorted list.
This 'trying only m-1 splits' method is mentioned as a 'shortcut' in the article below, which (by the definition of 'shortcut') means the two methods, despite differing drastically in runtime, produce exactly the same results.
The quote: "For regression and binary classification problems, with K = 2 response classes, there is a computational shortcut [1]. The tree can order the categories by mean response (for regression) or class probability for one of the classes (for classification). Then, the optimal split is one of the L – 1 splits for the ordered list."
The article:
http://www.mathworks.com/help/stats/splitting-categorical-predictors-for-multiclass-classification.html?s_tid=gn_loc_drop&requestedDomain=uk.mathworks.com
Note that I'm talking only about categorical variables.
Can someone explain why the above process, which requires (2^m - 2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, gives the same result as trying ONLY m - 1 splits:
The answer is simple: the two procedures just aren't the same. As you noticed, splitting in the exact way is an NP-hard problem and thus hardly feasible for any problem in practice. Moreover, due to overfitting, the exact split would usually not be optimal in terms of generalization.
Instead, the exhaustive search is replaced by a greedy procedure: sort first, then try all ordered splits. In general this leads to different results than exact splitting.
To improve on the greedy result, one often also applies pruning (which can be seen as another greedy, heuristic method). And newer methods like random forests or BART deal with this problem effectively by averaging over several trees, so that the deviation of a single tree becomes less important.
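To make the shortcut concrete, here is a sketch of the sort-then-split procedure for a single categorical feature with 0/1 labels; the function names and bookkeeping are my own:

    import math
    from collections import defaultdict

    def entropy(pos, total):
        # binary entropy of a node containing `pos` positives out of `total` samples
        if total == 0 or pos == 0 or pos == total:
            return 0.0
        p = pos / total
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def best_ordered_split(values, labels):
        # values: the categorical feature_value of each sample; labels: 0 or 1
        pos, tot = defaultdict(int), defaultdict(int)
        for v, y in zip(values, labels):
            tot[v] += 1
            pos[v] += y
        cats = sorted(tot, key=lambda v: pos[v] / tot[v])  # order by fraction of 1's
        n, n_pos = len(labels), sum(labels)
        parent = entropy(n_pos, n)
        best = None
        left_tot = left_pos = 0
        for i in range(len(cats) - 1):  # only the m - 1 prefix splits of the ordered list
            left_tot += tot[cats[i]]
            left_pos += pos[cats[i]]
            right_tot, right_pos = n - left_tot, n_pos - left_pos
            children = ((left_tot / n) * entropy(left_pos, left_tot)
                        + (right_tot / n) * entropy(right_pos, right_tot))
            if best is None or parent - children > best[0]:
                best = (parent - children, set(cats[:i + 1]))
        return best  # (information gain, categories sent to the left branch)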

Weighted bipartite matching with constraints on degrees of vertices

I have a problem that I was able to conceptualize as follows:
We have a set of n people, and m subsets representing their ethnicities, like White, Hispanic, Asian, etc.
Given any combination of these people, I want to check if it is a diverse group.
A diverse group is a group that satisfies several requirements, each of the form "at least K_i persons in the group belong to subset S_i". Here is the tricky part: one person can only be used to satisfy one requirement. That is, you can't use him/her for multiple requirements.
An example:
Given:
At least two people from Hispanic = {a,b,c}
At least two people from Asian = {a,d,e}
Is the group {a,c,d} a diverse group?
The group {a,c,d} is not diverse because you can't count a as both Hispanic and Asian. But
the group {a,c,d,e,f} is diverse because we have two Hispanics, a and c, and two Asians, d and e.
Attempt:
This is an instance of the assignment problem. The jobs are the ethnicities, and we create as many jobs per ethnicity as the requirement dictates. For example, if we need two Hispanics, then we create two Hispanic jobs. However, only some people are able to do a particular job.
This is my attempt so far:
I will construct a bipartite graph with the set of people P on one side and the set of ethnicities S on the other. We put an edge between a person p_i and an ethnicity S_j if he/she belongs to that ethnicity.
Now, we modify the graph: for every ethnicity S_i, duplicate it K_i times (S_{i,1}, S_{i,2}, ..., S_{i,K_i}) and add the new edges accordingly. Find a maximum matching M of this graph.
Now, merge the S_{i,j}s back into one S_i and there you have a diverse group. However, a maximum matching is only a possible solution to the problem, and mine is a decision problem: I want to check whether a given group is a solution or not.
I think this is an instance of the http://en.wikipedia.org/wiki/Assignment_problem, usually described in terms of assigning people to jobs, so in your case the job is "sit there and look white" or "sit there and look hispanic". Only some people are qualified to do any particular job, and they can only do one job at a time.
Normally the assignment algorithm minimizes a cost, but you can just use cost 0/cost 1 for "is in the right ethnic group" or not.
One means of solving this is the http://en.wikipedia.org/wiki/Hungarian_algorithm. It is often presented for the case in which there are exactly as many workers as jobs, but you can always invent dummy jobs or dummy workers, with every cost involving a dummy set to the same value. Optimizing the problem with dummies then reproduces exactly the relative order of costs you would get if you ignored assignments to dummies, so the optimum with dummies, after ignoring the dummies, is the same choice as the optimum without.
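Here is a sketch of the slot-duplication idea from the question, decided with a plain augmenting-path matching (Kuhn's algorithm) rather than the full Hungarian machinery, since only feasibility matters here; all names are mine. The group is diverse exactly when the matching fills every required slot:

    def is_diverse(group, requirements):
        # requirements: list of (members, k) pairs, e.g. ({"a", "b", "c"}, 2) for "at least two Hispanics"
        slots = [members for members, k in requirements for _ in range(k)]  # one slot per required seat
        match = {}  # person -> index of the slot they currently fill

        def fill(j, seen):
            # try to fill slot j, re-seating already-matched people along an augmenting path
            for p in group:
                if p in slots[j] and p not in seen:
                    seen.add(p)
                    if p not in match or fill(match[p], seen):
                        match[p] = j
                        return True
            return False

        return all(fill(j, set()) for j in range(len(slots)))  # every slot needs a distinct person

On the example above, is_diverse({"a", "c", "d"}, [({"a", "b", "c"}, 2), ({"a", "d", "e"}, 2)]) returns False, while adding e and f makes it True.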

Algorithm for solving resource allocation problems

Hi, I am building a program in which students sign up for an exam that is conducted in several cities throughout the country. While signing up, students provide a list of three cities where they would like to take the exam, in order of preference. So a student may say his first preference for an exam centre is New York, followed by Chicago, followed by Boston.
Now, keeping in mind that the exam centres have limited capacity and cannot accommodate every student's first choice, we would try to give as many students as possible either their first or second choice of centre, and as far as possible avoid having to assign a student their third choice.
Any ideas for an algorithm that would make this process more efficient? The simple way would be to go through the list of students' first choices and allot as many as possible, then go through the list of second choices and allot those. However, this may lead to the students at the front of the list getting their first centre and the last students getting their third choice, or worse, none of their choices. Anything that could make this more efficient?
Sounds like a variant of the classic stable marriage problem or the college admissions problem. Wikipedia lists a linear-time (in the number of preferences; O(n²) in the number of persons) algorithm for the former; the NRMP describes an efficient algorithm for the latter.
I suspect that if you randomly generate each exam place's preference list over the students (one Fisher–Yates shuffle per exam place) and then apply the stable marriage algorithm, you'll get a pretty fair and efficient solution.
This problem could be formulated as an instance of minimum cost flow. Let N be the number of students. Let each student be a source vertex with capacity 1. Let each exam center be a sink vertex with capacity, well, its capacity. Make an arc from each student to his first, second, and third choices. Set the cost of first choice arcs to 0; the cost of second choice arcs to 1; and the cost of third choice arcs to N + 1.
Find a minimum-cost flow that moves N units of flow. Assuming that your solver returns an integral solution (it should; flow LPs are totally unimodular), each student flows one unit to his assigned center. The costs minimize the number of third-choice assignments, breaking ties by the number of second-choice assignments.
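A sketch of this formulation, assuming the networkx library (its max_flow_min_cost routine); the node names "SRC" and "SINK" are mine, and student names are assumed distinct from centre names:

    import networkx as nx

    def assign_centres(choices, capacity):
        # choices: dict student -> (first, second, third) centre preferences
        # capacity: dict centre -> number of seats
        n = len(choices)
        G = nx.DiGraph()
        for student, prefs in choices.items():
            G.add_edge("SRC", student, capacity=1, weight=0)
            for cost, centre in zip((0, 1, n + 1), prefs):  # one third choice outweighs all second choices
                G.add_edge(student, centre, capacity=1, weight=cost)
        for centre, seats in capacity.items():
            G.add_edge(centre, "SINK", capacity=seats, weight=0)
        flow = nx.max_flow_min_cost(G, "SRC", "SINK")
        return {s: c for s, prefs in choices.items() for c in prefs if flow[s].get(c, 0)}

If total capacity is below the number of students, the maximum flow falls short of N and the unmatched students simply don't appear in the returned assignment.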
There is a class of algorithms for allocating limited resources called auctions. Basically, in this case each student would get a certain amount of money (a number they can spend), and your software would run bids between those students. You might use a formula based on preferences.
An example would be tutorial times: by putting down your preferences, you effectively bid more for those times and less for the times you don't want. So if you don't get your preferences, you have more "money" to bid with for other tutorials.

Looking for a multidimensional optimization algorithm

Problem description
There are different categories, each containing an arbitrary number of elements.
There are three attributes, A, B, and C. Each element has a different distribution of these attributes, expressed as positive integer values. For example, element 1 has the attributes A: 42, B: 1337, C: 18. The sum of these attributes is not constant across elements; some elements have more than others.
Now the problem:
We want to choose exactly one element from each category so that
We hit a certain threshold on attributes A and B (going over it is also possible, but not necessary)
while getting a maximum amount of C.
Example: we want to hit at least 80 A and 150 B in sum over all chosen elements and want as many C as possible.
I've thought about this problem and cannot come up with an efficient solution. The sample size is about 15 categories, each containing up to ~30 elements, so brute force doesn't seem very effective, since there are potentially 30^15 possibilities.
My model is a tree whose depth is the number of categories. Each depth level represents a category and offers the choice of one element from that category. When passing a node, we add the attributes of the represented element to the running sum we want to optimize.
If we hit the same attribute combination multiple times at the same level, we merge them, so that we avoid recomputing already-computed values. If we reach a level where one path has lower values in all three attributes than another, we stop following it.
However, in the worst case this tree still has ~30^15 nodes in it.
Can anybody think of an algorithm that would help me solve this problem? Or could you explain why you think no such algorithm exists?
This question is very similar to a variant of the knapsack problem, specifically the multiple-choice knapsack, where exactly one item is picked from each group. I would start by looking at solutions for that problem and see how well you can apply them to your stated problem.
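Concretely, because only the thresholds on A and B matter (any surplus can be capped), a dynamic program over (category, capped A, capped B) states works here; this sketch and its names are my own, not a standard routine:

    def max_C(categories, need_a, need_b):
        # categories: list of lists of (a, b, c) triples; exactly one triple is picked per list
        # dp maps (a_sum, b_sum) -> best achievable C, with sums capped at the thresholds
        dp = {(0, 0): 0}
        for options in categories:
            nxt = {}
            for (a, b), c in dp.items():
                for da, db, dc in options:
                    key = (min(a + da, need_a), min(b + db, need_b))
                    if nxt.get(key, -1) < c + dc:
                        nxt[key] = c + dc
            dp = nxt
        feasible = [c for (a, b), c in dp.items() if a >= need_a and b >= need_b]
        return max(feasible) if feasible else None  # None means the thresholds are unreachable

With thresholds of 80 and 150 this is at most 81 × 151 ≈ 12,000 states per layer, so 15 categories of up to 30 elements cost a few million operations instead of 30^15.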
My first inclination is to try branch-and-bound. You can do it breadth-first or depth-first, and I prefer depth-first because I think it's cleaner.
To express it simply, you have a tree-walk procedure walk that can enumerate all possibilities (maybe it just has a 5-level nested loop). It is augmented with two things:
At every step of the way, it keeps track of the cost at that point, where the cost can only increase. (If the cost can also decrease, it becomes more like a minimax game tree search.)
The procedure has an argument budget, and it does not search any branches where the cost can exceed the budget.
Then you have an outer loop:
for (budget = 0; budget < ...; budget++) {
    walk(budget);
    // if walk finds a solution within the budget, halt
}
The amount of time it takes is exponential in the budget, so easier cases will take less time. The fact that you are re-doing the search doesn't matter much because each level of the budget takes as much or more time than all the previous levels combined.
Combine this with some sort of heuristic about the order in which you consider branches, and it may give you a workable solution for typical problems you give it.
If that doesn't work, you can fall back on basic heuristic programming. That is, do some cases by hand, pay attention to how you did it, and then program it the same way.
I hope that helps.
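For the concrete problem above, here is a depth-first branch-and-bound sketch in that spirit, using an optimistic bound over the remaining categories in place of the explicit budget loop; the names and the bound are my assumptions:

    def best_C(categories, need_a, need_b):
        # suffix[i] = componentwise maxima still obtainable from categories[i:], used for pruning
        suffix = [(0, 0, 0)]
        for options in reversed(categories):
            sa, sb, sc = suffix[0]
            suffix.insert(0, (sa + max(a for a, _, _ in options),
                              sb + max(b for _, b, _ in options),
                              sc + max(c for _, _, c in options)))
        best = [-1]

        def walk(i, a, b, c):
            ra, rb, rc = suffix[i]
            # prune: thresholds unreachable, or C cannot beat the incumbent
            if a + ra < need_a or b + rb < need_b or c + rc <= best[0]:
                return
            if i == len(categories):
                best[0] = c  # the bound above already verified both thresholds
                return
            for da, db, dc in categories[i]:
                walk(i + 1, a + da, b + db, c + dc)

        walk(0, 0, 0, 0)
        return best[0] if best[0] >= 0 else None

Visiting each category's options in descending order of C is the kind of branch-ordering heuristic mentioned above; it tends to raise the incumbent early so the bound prunes more.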
