Automated scheduling based on availability - algorithm

The data:
Students who have varying availability at certain timestamps (hourly) throughout the week.
The challenge:
Creating a schedule based on the data above, where a single faculty member can meet with each student once that week, without any overlap between students.
What I have tried so far
Creating a filter that checks which students have the least availability and prioritizing them
Distributing based on days where more/fewer students are available
However, none of my attempts even came close to what I need, and I struggle to understand the mathematics of it all. How can I best build such a scheduling tool?

Create a bipartite graph with one vertex per student and one vertex per timestamp. A student is connected to a timestamp by an edge if and only if this student is available at that timestamp.
Then search for a maximum matching in this bipartite graph. This is closely related to the assignment problem and can be solved, for instance, with the Hungarian algorithm.
Note that this makes the assumption that timestamps are discrete. This might not correspond to reality, but it's still a good place to start.
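For illustration, here is a minimal sketch of that matching step in Python (Kuhn's augmenting-path algorithm), assuming availability is given as a dict from each student to the set of hourly slots they can attend; all names here are illustrative:

# Maximum bipartite matching via augmenting paths.
def max_matching(availability):
    match = {}          # slot -> student currently holding it

    def try_assign(student, visited):
        for slot in availability[student]:
            if slot in visited:
                continue
            visited.add(slot)
            # Slot is free, or its current owner can be moved to another slot.
            if slot not in match or try_assign(match[slot], visited):
                match[slot] = student
                return True
        return False

    for student in availability:
        try_assign(student, set())
    # Return student -> slot assignments.
    return {student: slot for slot, student in match.items()}

schedule = max_matching({
    "alice": {"mon_9", "mon_10"},
    "bob":   {"mon_9"},
    "carol": {"mon_10", "tue_9"},
})
print(schedule)   # every student gets a distinct slot if such an assignment exists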

Related

Hotel room optimization/sorting algorithm

Is there any well-known room optimization/sorting algorithm for hotels?
The problem is to redistribute rooms to maximize occupancy. Let's say I have 10 rooms, and a start date and end date for every reservation. Some rooms cannot be relocated while others can (flag).
Any hint on the right direction would be great. Thank you.
If you have a set of reservations and a fixed number of rooms, then the question is not how to maximize utilization but to verify whether the reservations can actually be realized at all. The utilization obviously remains the same if all reservations are realized.
Another possible use case is that you have a set of reservations which you know can be realized, and then you try to fit a new reservation in, i.e. a new customer wants to make a reservation and you want to check if you can possibly relocate some of the existing reservations to create room for the new one.
In both cases the actual question is how to check if a given set of reservations can be realized.
For the non-relocatable reservations this is trivial, so assume they can be realized and check whether the relocatable reservations can be realized as well.
The first check is to calculate, for every night, the number of reservations for that night; if on any night the number of reservations exceeds the number of available rooms once the fixed reservations are accounted for, you can't realize the reservations by any trick; your hotel is overbooked for that night.
Otherwise you can use a greedy algorithm to attempt a solution: process the reservations in order of their starting dates and book every reservation into the first room (e.g. in numerical room order) that's available. If this gives you a solution, then you have realized the reservations and you are done.
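As a rough sketch of that greedy pass in Python (reservations as (start, end) pairs; the room count and data layout are illustrative, and the fixed reservations could be pre-seeded into the availability list):

# Greedy first-fit: sort reservations by start date and book each one into
# the first room that is free for its whole interval.
def greedy_assign(reservations, num_rooms):
    room_free_from = [None] * num_rooms   # date from which each room is free again
    assignment = []
    for start, end in sorted(reservations):
        for room, free_from in enumerate(room_free_from):
            if free_from is None or free_from <= start:
                room_free_from[room] = end
                assignment.append(((start, end), room))
                break
        else:
            return None   # greedy pass failed; fall back to graph coloring
    return assignment

print(greedy_assign([(1, 4), (2, 5), (4, 7), (5, 9)], num_rooms=2))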
If that doesn't work, you can use graph coloring to solve the problem, and this is the universal solution. Construct a graph where every reservation is a node and two nodes (reservations) are connected if and only if they overlap time-wise. Include the fixed (non-relocatable) reservations in the graph. Then attempt a complete coloring of the graph with N colors (N = total number of rooms in your hotel) after precoloring the fixed reservations with the room numbers they pertain to.
You can also handle partially flexible reservations in this manner, adding an edge from reservation r to a special node precolored with room n if and only if the reservation can NOT be realized in room n (e.g. lower room class).
This same graph coloring algorithm is used successfully e.g. in compilers for register allocation.
Of course the question then is how to implement graph coloring efficiently; for that there are ready-made implementations.
Good luck!
It is possible to model your problem using mathematical programming or constraint programming, using one of the many readily available tools (try CPLEX or Gurobi for MP, and Gecode or CP Optimizer for CP, just to name a few) for modelling and solving these classes of problems. They usually have APIs which can be called from most programming languages.
I guess this answer arrives after a very long time, but I hope it can still help somebody else :-)
You want to search for Drools Planner: http://www.jboss.org/drools.
Backtracking works pretty well for such problems, in my experience. Just start by assigning the first reservation to a room type. Continue assigning reservations until one remains unassigned. Then backtrack to where you went wrong, and change earlier decisions accordingly.
This way, you will either find a/all feasible solution(s), or you will prove that none exists.
Advantage is that you can prove non-feasibility. Moreover, backtracking allows you to find the "cause" of non-feasibility.
See e.g. http://en.wikipedia.org/wiki/Backtracking
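A minimal backtracking sketch along these lines, in Python (the overlap test, room list and reservation format are illustrative):

# Backtracking: assign each reservation to some compatible room; on a dead
# end, undo the last choice and try the next room.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def backtrack(reservations, rooms, assignment=None, i=0):
    if assignment is None:
        assignment = {}
    if i == len(reservations):
        return assignment                     # all reservations placed
    current = reservations[i]
    for room in rooms:
        clash = any(room == assignment[other] and overlaps(current, other)
                    for other in assignment)
        if not clash:
            assignment[current] = room
            result = backtrack(reservations, rooms, assignment, i + 1)
            if result is not None:
                return result
            del assignment[current]           # undo and try the next room
    return None                               # proves no feasible assignment exists

print(backtrack([(1, 4), (2, 5), (4, 7)], rooms=["101", "102"]))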

Classical task-scheduling assignment

I am working on a flight scheduling app (disclaimer: it's for a college project, so no code answers, please). Please read this question with a quantum of attention before answering, as it has a lot of peculiarities :(
First, some terminology issues:
You have planes and flights, and you have to pair them up. For simplicity's sake, we'll assume that a plane is free as soon as its previous flight lands.
Flights are seen as tasks:
They have a duration
They have dependencies
They have an expected date/time for beginning
Planes can be seen as resources to be used by tasks (or flights, in our terminology).
Flights have a specific type of plane needed. e.g. flight 200 needs a plane of type B.
Planes obviously are of one and only one specific type, e.g., Plane Airforce One is of type C.
A "project" is the set of all the flights by an airline in a given time period.
The functionality required is:
Finding the shortest possible duration for said project
The earliest and latest possible start for a task (flight)
The critical tasks, based on the provided data, complete with identifiers of the preceding tasks
Automatically pairing up flights and planes, so that all flights end up paired with a plane (note: the duration of flights is fixed)
Getting a Gantt diagram of the project's scheduling, in which all flights begin as early as possible, showing all previously referred data graphically (dependencies, time info, etc.)
So the question is: how in the world do I achieve this? Particularly:
We are required to use a graph. What do the graph's edges and nodes respectively symbolise?
Are we required to discard tasks to achieve the critical tasks set?
If you could also recommend some algorithms for us to look up, that'd be great.
Here are some suggestions.
In principle you can have a graph where every node is a flight and there is an edge from flight A to flight B if B depends on A, i.e. B can't take off before A has landed. You can use this dependency graph to calculate the shortest possible duration for the project: find the path through the graph that has the maximum duration when you add the durations of all the flights on the path together. This is the "critical path" of your project.
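A small sketch of that critical-path computation in Python (the durations and dependencies below are made-up illustrations):

# Longest path through the flight dependency DAG = critical path = shortest
# possible project duration.
from functools import lru_cache

duration = {"F1": 2.0, "F2": 3.5, "F3": 1.5, "F4": 2.5}
depends_on = {"F2": ["F1"], "F3": ["F1"], "F4": ["F2", "F3"]}  # F4 waits on F2 and F3

@lru_cache(maxsize=None)
def finish_time(flight):
    # Earliest finish = own duration + latest finish among prerequisites.
    preds = depends_on.get(flight, [])
    return duration[flight] + (max(map(finish_time, preds)) if preds else 0.0)

project_duration = max(finish_time(f) for f in duration)
print(project_duration)   # 2.0 + 3.5 + 2.5 = 8.0 along F1 -> F2 -> F4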
However, the fact that you need to pair flights with planes makes it more difficult, especially as I guess it is assumed that the planes are not allowed to fly without passengers, i.e. a plane must take off from the same city where it last landed.
If you have an excess number of planes, allocating them to the flights can most likely be done easily with a combinatorial optimization algorithm like simulated annealing. If the plan is very tight, i.e. you don't have excess planes, it could be a hard problem.
To set the actual take-off times for your flights, you can for example formulate the allowed schedules as a linear programming problem, or as a semi-definite / quadratic programming problem.
Here are some references:
http://en.wikipedia.org/wiki/Simulated_annealing
http://en.wikipedia.org/wiki/Linear_programming
http://en.wikipedia.org/wiki/Quadratic_programming
http://en.wikipedia.org/wiki/Gradient_descent
http://en.wikipedia.org/wiki/Critical_path_method
Start with drawing out a domain model (class diagram) and make a clear separation in your mind between:
planning-immutable facts: PlaneType, Plane, Flight, FlightBeforeFlightConstraint, ...
planning variables: PlaneToFlightAssignment
Wrap all those instances in that Project class (= a Solution).
Then define a score function (AKA fitness function) on such a Solution. For example, if there are 2 PlaneToFlightAssignments which are not ok with a FlightBeforeFlightConstraint (= flight dependency), then lower the score.
Then it's just a matter of finding the Solution with the best score, by changing the PlaneToFlightAssignment instances. There are several algorithms you can use to find that best solution. If your data set is really, really small (say 10 planes), you might be able to use brute force.
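To make the score idea concrete, a toy score function might look like this in Python; the field names, penalty values and dependency map are illustrative, not any particular library's API:

# Toy score (fitness) function over a list of assignments; depends_on maps a
# flight id to the flight ids it must wait for. Broken hard constraints push
# the score down, so 0 means no violations.
def score(assignments, depends_on):
    by_flight = {a["flight_id"]: a for a in assignments}
    penalty = 0
    for a in assignments:
        if a["required_type"] != a["plane_type"]:
            penalty += 100                        # wrong plane type
        for prereq in depends_on.get(a["flight_id"], []):
            if a["start"] < by_flight[prereq]["end"]:
                penalty += 100                    # violated flight dependency
    for i, a in enumerate(assignments):
        for b in assignments[i + 1:]:
            overlap = a["start"] < b["end"] and b["start"] < a["end"]
            if a["plane_id"] == b["plane_id"] and overlap:
                penalty += 100                    # one plane on two flights at once
    return -penalty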

Algorithm for creating a school timetable

I've been wondering if there are known solutions for algorithm of creating a school timetable. Basically, it's about optimizing "hour-dispersion" (both in teachers and classes case) for given class-subject-teacher associations. We can assume that we have sets of classes, lesson subjects and teachers associated with each other at the input and that timetable should fit between 8AM and 4PM.
I guess that there is probably no accurate algorithm for that, but maybe someone knows a good approximation or hints for developing it.
This problem is NP-Complete!
In a nutshell, one needs to explore all possible combinations to find the list of acceptable solutions. Because of the variations in the circumstances in which the problem appears at various schools (for example: are there constraints with regard to classrooms? Are some of the classes split into sub-groups some of the time? Is this a weekly schedule? etc.), there isn't a well-known problem class which corresponds to all the scheduling problems. Maybe the Knapsack problem has many elements of similarity with these problems at large.
A confirmation that this is both a hard problem and one for which people perennially seek a solution is this (long) list of (mostly commercial) scheduling software tools.
Because of the big number of variables involved, the biggest source of which is, typically, the faculty members' desires ;-) ..., it is typically impractical to consider enumerating all possible combinations. Instead we need to choose an approach which visits a subset of the problem/solution spaces.
- Genetic Algorithms, cited in another answer, are (or, IMHO, seem) well equipped to perform this kind of semi-guided search (the problem being to find a good evaluation function for the candidates to be kept for the next generation)
- Graph Rewriting approaches are also of use with this type of combinatorial optimization problem.
Rather than focusing on particular implementations of an automatic schedule generator program, I'd like to suggest a few strategies which can be applied, at the level of the definition of the problem.
The general rationale is that in most real-world scheduling problems, some compromises will be required; not all constraints, expressed and implied, will be satisfied fully. Therefore we help ourselves by:
Defining and ranking all known constraints
Reducing the problem space by manually providing a set of additional constraints. This may seem counter-intuitive, but for example by providing an initial, partially filled schedule (say roughly 30% of the time slots), in a way that fully satisfies all constraints, and by considering this partial schedule immutable, we significantly reduce the time/space needed to produce candidate solutions. Another way additional constraints help is, for example, "artificially" adding a constraint which prevents teaching some subjects on some days of the week (if this is a weekly schedule...); this type of constraint reduces the problem/solution spaces without, typically, excluding a significant number of good candidates.
Ensuring that some of the constraints of the problem can be quickly computed. This is often associated with the choice of data model used to represent the problem; the idea is to be able to quickly opt for (or prune out) some of the options.
Redefining the problem and allowing some of the constraints to be broken a few times (typically towards the end nodes of the graph). The idea here is to either remove some of the constraints for filling in the last few slots in the schedule, or to have the automatic schedule generator stop shy of completing the whole schedule, instead providing us with a list of a dozen or so plausible candidates. A human is often in a better position to complete the puzzle, as indicated, possibly breaking a few of the constraints, using information which is not typically shared with the automated logic (e.g. the "no mathematics in the afternoon" rule can be broken on occasion for the "advanced math and physics" class; or "it is better to break one of Mr Jones' requirements than one of Ms Smith's" ... ;-) ).
In proofreading this answer, I realize it is quite shy of providing a definite response, but it is nonetheless full of practical suggestions. I hope this helps with what is, after all, a "hard problem".
It's a mess. A royal mess. To add to the answers, already very complete, I want to point out my family experience. My mother was a teacher and used to be involved in the process.
It turns out that having a computer do this is not only difficult to code per se, it is also difficult because there are conditions that are difficult to specify to a pre-baked computer program. Examples:
a teacher teaches both at your school and at another institute. Clearly, if he ends the lesson there at 10.30, he cannot start at your premises at 10.30, because he needs some time to commute between the institutes.
two teachers are married. In general, it's considered good practice not to have two married teachers teaching the same class. These two teachers must therefore have two different classes.
two teachers are married, and their child attends the same school. Again, you have to prevent the two teachers from teaching in the specific class where their child is.
the school has separate facilities, like one day the class is in one institute, and another day the class is in another.
the school has shared laboratories, but these laboratories are available only on certain weekdays (for security reasons, for example, where additional personnel is required).
some teachers have preferences for their free day: some prefer Monday, some Friday, some Wednesday. Some prefer to come early in the morning, some prefer to come later.
you should not have situations where you have a lesson of, say, history in the first hour, then three hours of math, then another hour of history. It does not make sense for the students, nor for the teacher.
you should spread the subjects evenly. It does not make sense to have only math in the first days of the week, and then only literature for the rest of the week.
you should give some teachers two consecutive hours to do evaluation tests.
As you can see, the problem is not NP-complete, it's NP-insane.
So what they do is that they have a large table with small plastic insets, and they move the insets around until a satisfying result is obtained. They never start from scratch: they normally start from the previous year's timetable and make adjustments.
The International Timetabling Competition 2007 had a lesson scheduling track and exam scheduling track. Many researchers participated in that competition. Lots of heuristics and metaheuristics were tried, but in the end the local search metaheuristics (such as Tabu Search and Simulated Annealing) clearly beat other algorithms (such as genetic algorithms).
Take a look at the 2 open source frameworks used by some of the finalists:
JBoss OptaPlanner (Java, open source)
Unitime (Java, open source) - more for universities
One of my half-term assignments was a genetic-algorithm school timetable generator.
The whole table is one "organism". There were some changes and caveats compared to the generic genetic algorithm approach:
Rules were made for "illegal tables": two classes in the same classroom, one teacher teaching two groups at the same time, etc. These mutations were deemed lethal immediately and a new "organism" was sprouted in place of the "deceased" one immediately. The initial organism was generated by a series of random tries to get a legal (if senseless) one. A lethal mutation wasn't counted towards the count of mutations in an iteration.
"Exchange" mutations were much more common than "modify" mutations. Changes were made only between parts of the gene that made sense - no substituting a teacher with a classroom.
Small bonuses were assigned for bundling certain 2-hour blocks together, for assigning the same generic classroom in sequence for the same group, and for keeping a teacher's work hours and a class's load continuous. Moderate bonuses were assigned for giving the correct classrooms for a given subject, keeping class hours within bounds (morning or afternoon), and such. Big bonuses were for assigning the correct number of hours of a given subject, the given workload for a teacher, etc.
Teachers could create their workload schedules of "want to work then", "okay to work then", "don't like to work then", "can't work then", with proper weights assigned. The whole 24 hours were legal work hours, except that night time was very undesired.
The weight function... oh yeah. The weight function was a huge, monstrous product (as in multiplication) of weights assigned to selected features and properties. It was extremely steep, with one property easily able to change it by an order of magnitude up or down - and there were hundreds or thousands of properties in one organism. This resulted in absolutely HUGE numbers as the weights and, as a direct result, the need to use a bignum library (GMP) to perform the calculations. For a small test case of some 10 groups, 10 teachers and 10 classrooms, the initial set started with a score of 10^-200-something and finished with 10^+300-something. It was totally inefficient when it was flatter. Also, the values grew a lot further apart with bigger "schools".
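As a toy illustration of that multiplicative weighting (the properties and weight values below are invented; Python's numbers simply grow as needed):

# Each satisfied property multiplies the organism's score by its weight,
# each violated one divides by it, so single properties swing the score
# by orders of magnitude.
def organism_score(properties, weights):
    score = 1.0
    for prop, satisfied in properties.items():
        score *= weights[prop] if satisfied else 1.0 / weights[prop]
    return score

weights = {"correct_room_type": 10.0, "teacher_hours_continuous": 5.0,
           "subject_weekly_quota_met": 100.0}
print(organism_score({"correct_room_type": True,
                      "teacher_hours_continuous": False,
                      "subject_weekly_quota_met": True}, weights))   # 200.0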
Computation-time-wise, there was little difference between a small population (100) over a long time and a big population (10k+) over fewer generations. The computation over the same time produced about the same quality.
The calculation (on some 1 GHz CPU) would take about 1 h to stabilize near 10^+300, generating schedules that looked quite nice, for said 10x10x10 test case.
The problem is easily parallelizable by providing a networking facility that would exchange the best specimens between computers running the computation.
The resulting program never saw daylight beyond getting me a good grade for the semester. It showed some promise but I never got enough motivation to add a GUI and make it usable to the general public.
This problem is tougher than it seems.
As others have alluded to, this is a NP-complete problem, but let's analyse what that means.
Basically, it means you have to look at all possible combinations.
But "look at" doesn't tell you much what you need to do.
Generating all possible combinations is easy. It might produce a huge amount of data, but you shouldn't have much trouble understanding the concepts of this part of the problem.
The second problem is judging whether a given possible combination is good, bad, or better than the previous "good" solution.
For this you need more than just "is it a possible solution".
For instance, is the same teacher working 5 days a week for X weeks straight? Even if that is a working solution, it might not be a better solution than alternating between two people so that each teacher does one week each. Oh, you didn't think about that? Remember, these are people you're dealing with, not just a resource allocation problem.
Even if one teacher could work full-time for 16 weeks straight, that might be a sub-optimal solution compared to a solution where you try to alternate between teachers, and this kind of balancing is very hard to build into software.
To summarize, producing a good solution to this problem will be worth a lot, to many, many people. Hence, it's not an easy problem to break down and solve. Be prepared to stake out some goals that aren't 100% and call them "good enough".
My timetabling algorithm, implemented in FET (Free Timetabling Software, http://lalescu.ro/liviu/fet/ , a successful application):
The algorithm is heuristic. I named it "recursive swapping".
Input: a set of activities A_1...A_n and the constraints.
Output: a set of times TA_1...TA_n (the time slot of each activity. Rooms are excluded here, for simplicity). The algorithm must put each activity at a time slot, respecting constraints. Each TA_i is between 0 (T_1) and max_time_slots-1 (T_m).
Constraints:
C1) Basic: a list of pairs of activities which cannot be simultaneous (for instance, A_1 and A_2, because they have the same teacher or the same students);
C2) Lots of other constraints (excluded here, for simplicity).
The timetabling algorithm (which I named "recursive swapping"):
Sort the activities, most difficult first. This is not a critical step, but it speeds up the algorithm maybe 10 times or more.
Try to place each activity (A_i) in an allowed time slot, following the above order, one at a time. Search for an available slot (T_j) for A_i, in which this activity can be placed while respecting the constraints. If more slots are available, choose a random one. If none is available, do recursive swapping:
a. For each time slot T_j, consider what happens if you put A_i into T_j. There will be a list of other activities which don't agree with this move (for instance, activity A_k is on the same slot T_j and has the same teacher or same students as A_i). Keep a list of conflicting activities for each time slot T_j.
b. Choose a slot (T_j) with lowest number of conflicting activities. Say the list of activities in this slot contains 3 activities: A_p, A_q, A_r.
c. Place A_i at T_j and make A_p, A_q, A_r unallocated.
d. Recursively try to place A_p, A_q, A_r (if the level of recursion is not too large, say 14, and if the total number of recursive calls counted since step 2) on A_i began is not too large, say 2*n), as in step 2).
e. If successfully placed A_p, A_q, A_r, return with success, otherwise try other time slots (go to step 2 b) and choose the next best time slot).
f. If all (or a reasonable number of) time slots were tried unsuccessfully, return without success.
g. If we are at level 0, and we had no success in placing A_i, place it as in steps 2 b) and 2 c), but without recursion. We now have 3 - 1 = 2 more activities to place. Go to step 2) (some methods to avoid cycling are used here). A compact sketch of these steps is given below.
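Here is that sketch in Python. The conflict test is abstracted into a callback, the recursion limit is the one mentioned above, and this simplification does not roll back moves made at deeper levels nor try alternative slots as step 2 e) would:

import random

MAX_DEPTH = 14   # recursion limit, as in step 2 d)

def conflicting(activity, slot, schedule, conflict):
    # Activities already placed in `slot` that cannot coexist with `activity`.
    return [a for a, s in schedule.items() if s == slot and conflict(activity, a)]

def place(activity, schedule, slots, conflict, depth=0):
    candidates = list(slots)
    random.shuffle(candidates)
    # Step 2: use any slot with no conflicts, if one exists.
    for slot in candidates:
        if not conflicting(activity, slot, schedule, conflict):
            schedule[activity] = slot
            return True
    if depth >= MAX_DEPTH:
        return False
    # Steps 2 a)-c): pick the slot with the fewest conflicts, evict those
    # activities, and try to re-place them recursively (step 2 d).
    slot = min(candidates, key=lambda s: len(conflicting(activity, s, schedule, conflict)))
    evicted = conflicting(activity, slot, schedule, conflict)
    for a in evicted:
        del schedule[a]
    schedule[activity] = slot
    if all(place(a, schedule, slots, conflict, depth + 1) for a in evicted):
        return True
    # Steps 2 e)/f) simplified: undo this move instead of trying other slots.
    del schedule[activity]
    for a in evicted:
        schedule[a] = slot
    return False

def build_timetable(activities, slots, conflict, difficulty):
    schedule = {}
    for activity in sorted(activities, key=difficulty, reverse=True):   # step 1
        if not place(activity, schedule, slots, conflict):
            return None   # step 2 g) would try harder here
    return schedule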
UPDATE: from comments ... should have heuristics too!
I'd go with Prolog ... then use Ruby or Perl or something to clean up your solution into a prettier form.
teaches('Jill', math).
teaches('Joe', history).
involves('MA101', math).
involves('SS104', history).

myHeuristic(D, A, B) :- [test_case] -> D = '<' ; D = '>'.

createSchedule :- findall(Class, involves(Class, Subject), Classes),
                  predsort(myHeuristic, Classes, ClassesNew),
                  createSchedule(ClassesNew, []).
createSchedule(Classes, Scheduled) :- [the actual recursive algorithm].
I am (still) in the process of doing something similar to this problem using the same path I just mentioned. Prolog (as a logic programming language) really makes solving NP-hard problems easier.
Genetic algorithms are often used for such scheduling.
Found this example (Making Class Schedule Using Genetic Algorithm) which matches your requirement pretty well.
Here are a few links I found:
School timetable - Lists some problems involved
A Hybrid Genetic Algorithm for School Timetabling
Scheduling Utilities and Tools
This paper describes the school timetable problem and their approach to the algorithm pretty well: "The Development of SYLLABUS—An Interactive, Constraint-Based Scheduler for Schools and Colleges."[PDF]
The author informs me the SYLLABUS software is still being used/developed here: http://www.scientia.com/uk/
I work on a widely-used scheduling engine which does exactly this. Yes, it is NP-Complete; the best approaches seek to approximate an optimal solution. And, of course there are a lot of different ways to say which one is the "best" solution - is it more important that your teachers are happy with their schedules, or that students get into all their classes, for instance?
The absolute most important question you need to resolve early on is what makes one way of scheduling this system better than another? That is, if I have a schedule with Mrs Jones teaching Math at 8 and Mr Smith teaching Math at 9, is that better or worse than one with both of them teaching Math at 10? Is it better or worse than one with Mrs Jones teaching at 8 and Mr Jones teaching at 2? Why?
The main advice I'd give here is to divide the problem up as much as possible - maybe course by course, maybe teacher by teacher, maybe room by room - and work on solving the sub-problem first. There you should end up with multiple solutions to choose from, and need to pick one as the most likely optimal. Then, work on making the "earlier" sub-problems take into account the needs of later sub-problems in scoring their potential solutions. Then, maybe work on how to get yourself out of painted-into-the-corner situations (assuming you can't anticipate those situations in earlier sub-problems) when you get to a "no valid solutions" state.
A local-search optimization pass is often used to "polish" the end answer for better results.
Note that typically we are dealing with highly resource-constrained systems in school scheduling. Schools don't go through the year with a lot of empty rooms or teachers sitting in the lounge 75% of the day. Approaches which work best in solution-rich environments aren't necessarily applicable in school scheduling.
Generally, constraint programming is a good approach to this type of scheduling problem. A search on "constraint programming" and scheduling, or "constraint-based scheduling", both within Stack Overflow and on Google, will generate some good references. It's not impossible - it's just a little hard to think about when using traditional optimization methods like linear or integer optimization. One output would be: does a schedule exist that satisfies all the requirements? That, in itself, is obviously helpful.
Good luck !
I have designed commercial algorithms for both class timetabling and examination timetabling. For the first I used integer programming; for the second, a heuristic based on maximizing an objective function by choosing slot swaps, very similar to the original manual process that had evolved. The main things in getting such solutions accepted are the ability to represent all the real-world constraints, and that human timetablers are not able to see ways to improve the solution. In the end, the algorithmic part was quite straightforward and easy to implement compared with the preparation of the databases, the user interface, the ability to report statistics like room utilization, user education and so on.
You can tackle it with genetic algorithms, yes. But you shouldn't :). It can be too slow and parameter tuning can be too time-consuming, etc.
There are other successful approaches. All implemented in open source projects:
Constraint based approach
Implemented in UniTime (not really for schools)
You could also go further and use integer programming. Successfully done at the University of Udine and also at the University of Bayreuth (I was involved there), using commercial software (ILOG CPLEX)
Rule-based approach with heuristics - see Drools Planner
Different heuristics - FET and my own
See here for a timetabling software list
I think you should use genetic algorithm because:
It is best suited for large problem instances.
It yields reduced time complexity at the price of an inaccurate answer (not the ultimate best).
You can specify constraints and preferences easily by adjusting fitness penalties for those that are not met.
You can specify a time limit for program execution.
The quality of the solution depends on how much time you intend to spend solving the problem.
Genetic Algorithms Definition
Genetic Algorithms Tutorial
Class scheduling project with GA
Also take a look at: a similar question and another one
This problem is MASSIVE where I work - imagine 1800 subjects/modules, and 350,000 students, each doing 5 to 10 modules, and you want to build an exam timetable in 10 weeks, where papers are 1 hour to 3 days long... one plus point - all exams are online, but bad again, we cannot exceed the system's load of max 5k concurrent users. So yes, we are doing this now in the cloud on scaling servers.
The "solution" we used was simply to order modules by how many other modules they "clash" with, descending (where a student does both), and to "backpack" them, allowing these long papers to actually overlap; otherwise it simply cannot be done.
So when things get too large, I found this "heuristic" to be practical... at least.
I don't know whether anyone will agree with this code, but I developed it with my own algorithm and it is working for me in Ruby. I hope it will help those who are searching for it.
In the following code, periodflag, dayflag, subjectflag and teacherflag are hashes keyed by the corresponding id, with a Boolean flag value.
For any issue, contact me. (-_-)
# For each period in the timetable definition, try to find an unassigned
# subject and a free teacher, then save the assignment and mark the
# period/subject/teacher flags as used.
periodflag.each do |k2, v2|
  if TimetableDefinition.find(k2).period.to_i != 0
    subjectflag.each do |k3, v3|
      if v3 == 0
        if getflag_period(periodflag, k2)
          @teachers = EmployeesSubject.where(subject_name: @subjects.find(k3).name, division_id: division.id).pluck(:employee_id)
          @teacherlists = Employee.find(@teachers)
          teacherflag = Hash[teacher_flag(@teacherlists, teacherflag, flag).to_a.shuffle]
          teacherflag.each do |k4, v4|
            if v4 == 0
              if getflag_subject(subjectflag, k3)
                subjectperiod = TimetableAssign.where("timetable_definition_id = ? AND subject_id = ?", k2, k3)
                if subjectperiod.blank?
                  issubjectpresent = TimetableAssign.where("section_id = ? AND subject_id = ?", section.id, k3)
                  if issubjectpresent.blank?
                    isteacherpresent = TimetableAssign.where("section_id = ? AND employee_id = ?", section.id, k4)
                    if isteacherpresent.blank?
                      # Neither the subject nor the teacher is booked for this
                      # section yet: create a fresh assignment.
                      @finaltt = TimetableAssign.new
                      @finaltt.timetable_struct_id = @timetable_struct.id
                      @finaltt.employee_id = k4
                      @finaltt.section_id = section.id
                      @finaltt.standard_id = standard.id
                      @finaltt.division_id = division.id
                      @finaltt.subject_id = k3
                      @finaltt.timetable_definition_id = k2
                      @finaltt.timetable_day_id = k1
                      set_school_id(@finaltt, current_user)
                      if @finaltt.save
                        setflag_sub(subjectflag, k3, 1)
                        setflag_period(periodflag, k2, 1)
                        setflag_teacher(teacherflag, k4, 1)
                      end
                    end
                  else
                    # The subject already exists for this section: reuse the
                    # teacher from the existing assignment for the new period.
                    @subjectdetail = TimetableAssign.find_by_section_id_and_subject_id(@section.id, k3)
                    @finaltt = TimetableAssign.new
                    @finaltt.timetable_struct_id = @subjectdetail.timetable_struct_id
                    @finaltt.employee_id = @subjectdetail.employee_id
                    @finaltt.section_id = section.id
                    @finaltt.standard_id = standard.id
                    @finaltt.division_id = division.id
                    @finaltt.subject_id = @subjectdetail.subject_id
                    @finaltt.timetable_definition_id = k2
                    @finaltt.timetable_day_id = k1
                    set_school_id(@finaltt, current_user)
                    if @finaltt.save
                      setflag_sub(subjectflag, k3, 1)
                      setflag_period(periodflag, k2, 1)
                      setflag_teacher(teacherflag, k4, 1)
                    end
                  end
                end
              end
            end
          end
        end
      end
    end
  end
end

Clustering Algorithm for Paper Boys

I need help selecting or creating a clustering algorithm according to certain criteria.
Imagine you are managing newspaper delivery persons.
You have a set of street addresses, each of which is geocoded.
You want to cluster the addresses so that each cluster is assigned to a delivery person.
The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
Each cluster should have about the same number of addresses. However, a cluster may have fewer addresses if a cluster's addresses are more spread out. (Worded another way: a minimum number of clusters where each cluster contains a maximum number of addresses, and any address within a cluster must be separated by at most a maximum distance.)
For bonus points, when the data set is altered (address added or removed), and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (ie. this rules out simple k-means clustering which is random in nature). Otherwise the delivery persons will go crazy.
So... ideas?
UPDATE
The street network graph, as described in Arachnid's answer, is not available.
I've written an inefficient but simple algorithm in Java to see how close I could get to doing some basic clustering on a set of points, more or less as described in the question.
The algorithm works on a list of (x,y) coords ps that are specified as ints. It takes three other parameters as well:
radius (r): given a point, the radius within which to scan for nearby points
max addresses (maxA): the maximum number of addresses (points) per cluster
min addresses (minA): the minimum number of addresses per cluster
Set limitA=maxA.
Main iteration:
Initialize empty list possibleSolutions.
Outer iteration: for every point p in ps.
Initialize empty list pclusters.
A worklist of points wps=copy(ps) is defined.
Workpoint wp=p.
Inner iteration: while wps is not empty.
Remove the point wp from wps. Determine all the points wpsInRadius in wps that are at a distance < r from wp. Sort wpsInRadius ascending by the distance from wp. Keep the first min(limitA, sizeOf(wpsInRadius)) points in wpsInRadius. These points form a new cluster (list of points) pcluster. Add pcluster to pclusters. Remove the points in pcluster from wps. If wps is not empty, set wp=wps[0] and continue the inner iteration.
End inner iteration.
A list of clusters pclusters is obtained. Add this to possibleSolutions.
End outer iteration.
We have for each p in ps a list of clusters pclusters in possibleSolutions. Every pclusters is then weighted. If avgPC is the average number of points per cluster in possibleSolutions (global) and avgCSize is the average number of clusters per pclusters (global), then this is the function that uses both these variables to determine the weight:
private static WeightedPClusters weigh(List<Cluster> pclusters, double avgPC, double avgCSize)
{
    double weight = 0;
    for (Cluster cluster : pclusters)
    {
        int ps = cluster.getPoints().size();
        // Penalize deviation from the average number of points per cluster,
        // scaled by the average number of clusters per solution...
        double psAvgPC = ps - avgPC;
        weight += psAvgPC * psAvgPC / avgCSize;
        // ...and penalize spread-out clusters (surface area per point).
        weight += cluster.getSurface() / ps;
    }
    return new WeightedPClusters(pclusters, weight);
}
The best solution is now the pclusters with the least weight. We repeat the main iteration as long as we can find a better solution (less weight) than the previous best one, with limitA=max(minA,(int)avgPC). End main iteration.
Note that for the same input data this algorithm will always produce the same results. Lists are used to preserve order, and there is no randomness involved.
To see how this algorithm behaves, this is an image of the result on a test pattern of 32 points. If maxA=minA=16, then we find 2 clusters of 16 addresses.
(source: paperboyalgorithm at sites.google.com)
Next, if we decrease the minimum number of addresses per cluster by setting minA=12, we find 3 clusters of 12/12/8 points.
(source: paperboyalgorithm at sites.google.com)
And to demonstrate that the algorithm is far from perfect, here is the output with maxA=7, yet we get 6 clusters, some of them small. So you still have to guess too much when determining the parameters. Note that r here is only 5.
(source: paperboyalgorithm at sites.google.com)
Just out of curiosity, I tried the algorithm on a larger set of randomly chosen points. I added the images below.
Conclusion? This took me half a day, it is inefficient, the code looks ugly, and it is relatively slow. But it shows that it is possible to produce some result in a short period of time. Of course, this was just for fun; turning this into something that is actually useful is the hard part.
(source: paperboyalgorithm at sites.google.com)
(source: paperboyalgorithm at sites.google.com)
What you are describing is a (Multi)-Vehicle-Routing-Problem (VRP). There's quite a lot of academic literature on different variants of this problem, using a large variety of techniques (heuristics, off-the-shelf solvers etc.). Usually the authors try to find good or optimal solutions for a concrete instance, which then also implies a clustering of the sites (all sites on the route of one vehicle).
However, the clusters may be subject to major changes with only slightly different instances, which is what you want to avoid. Still, something in the VRP-Papers may inspire you...
If you decide to stick with the explicit clustering step, don't forget to include your distribution point in all clusters, as it is part of each route.
For evaluating the clusters using a graph representation of the street grid will probably yield more realistic results than connecting the dots on a white map (although both are TSP-variants). If a graph model is not available, you can use the taxicab-metric (|x_1 - x_2| + |y_1 - y_2|) as an approximation for the distances.
I think you want a hierarchical agglomeration technique rather than k-means. If you get your algorithm right you can stop it when you have the right number of clusters. As someone else mentioned, you can seed subsequent clusterings with previous solutions, which may give you a significant performance improvement.
You may want to look closely at the distance function you use, especially if your problem has high dimension. Euclidean distance is the easiest to understand but may not be the best, look at alternatives such as Mahalanobis.
I'm presuming that your real problem has nothing to do with delivering newspapers...
Have you thought about using an economic/market based solution? Divide the set up by an arbitrary (but constant to avoid randomness effects) split into even subsets (as determined by the number of delivery persons).
Assign a cost function to each point by how much it adds to the graph, and give each extra point an economic value.
Iterate, allowing each person in turn to auction off their worst point, and give each person a maximum budget.
This probably matches fairly well how the delivery people would think in real life, as people will find swaps, or will say "my life would be so much easier if I didn't do this one or two". It is also pretty flexible (for example, it would allow one point miles away from any others to be given a premium fairly easily).
I would approach it differently: Considering the street network as a graph, with an edge for each side of each street, find a partitioning of the graph into n segments, each no more than a given length, such that each paperboy can ride a single continuous path from the start to the end of their route. This way, you avoid giving people routes that require them to ride the same segments repeatedly (eg, when asked to cover both sides of a street without covering all the surrounding streets).
This is a very quick and dirty method of discovering where your "clusters" lie. This was inspired by the game "Minesweeper."
Divide your entire delivery space up into a grid of squares. Note - it will take some tweaking of the size of the grid before this will work nicely. My intuition tells me that a square size roughly the size of a physical neighbourhood block will be a good starting point.
Loop through each square and store the number of delivery locations (houses) within each block. Use a second loop (or some clever method on the first pass) to store the number of delivery points for each neighbouring block.
Now you can operate on this grid in a similar way to photo manipulation software. You can detect the edges of clusters by finding blocks where some neighbouring blocks have no delivery points in them.
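A tiny Python sketch of those two steps (the grid cell size and the edge test are illustrative):

from collections import Counter

CELL = 100.0   # grid square size, in the same units as the coordinates

def bin_homes(homes):
    # Count delivery locations per grid square.
    return Counter((int(x // CELL), int(y // CELL)) for x, y in homes)

def neighbour_total(counts, cell):
    cx, cy = cell
    return sum(counts.get((cx + dx, cy + dy), 0)
               for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0))

counts = bin_homes([(12, 8), (15, 30), (230, 240), (250, 260)])
# Squares with few or no occupied neighbours hint at the edge of a cluster.
edge_squares = [c for c in counts if neighbour_total(counts, c) == 0]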
Finally you need a system that combines number of deliveries made as well as total distance travelled to create and assign routes. There may be some isolated clusters with just a few deliveries to be made, and one or two super clusters with many homes very close to each other, requiring multiple delivery people in the same cluster. Every home must be visited, so that is your first constraint.
Derive a maximum allowable distance to be travelled by any one delivery person on a single run. Next do the same for the number of deliveries made per person.
The first ever run of the routing algorithm would assign a single delivery person, send them to any random cluster with not all deliveries completed, let them deliver until they hit their delivery limit or they have delivered to all the homes in the cluster. If they have hit the delivery limit, end the route by sending them back to home base. If they could safely travel to the nearest cluster and then home without hitting their max travel distance, do so and repeat as above.
Once the route is finished for the current delivery person, check if there are homes that have not yet had a delivery. If so, assign another delivery person, and repeat the above algorithm.
This will generate the initial routes. I would store all the info - the location and dimensions of each square, the number of homes within a square and all of its direct neighbours, the cluster to which each square belongs, the delivery people and their routes - in a database.
I'll leave the recalculation procedure up to you - but having all the current routes, clusters, etc. in a database will enable you to keep all historic routes, and also to try various scenarios to see how best to adapt to changes while making the least possible changes to existing routes.
This is a classic example of a problem that deserves an optimized solution rather than trying to solve for "The OPTIMUM". It's similar in some ways to the "Travelling Salesman Problem", but you also need to segment the locations during the optimization.
I've used three different optimization algorithms to good effect on problems like this:
Simulated Annealing
Great Deluge Algorithm
Genetic Algorithms
Using an optimization algorithm, I think you've described the following "goals":
The geographic area for each paper boy should be minimized.
The number of subscribers served by each should be approximately equal.
The distance travelled by each should be about equal.
(And one you didn't state, but might matter) The route should end where it began.
Hope this gets you started!
* Edit *
If you don't care about the routes themselves, that eliminates goals 3 and 4 above, and perhaps allows the problem to be more tailored to your bonus requirements.
If you take demographic information into account (such as population density, subscription adoption rate and subscription cancellation rate), you could probably use the optimization techniques above to eliminate the need to rerun the algorithm at all as subscribers adopted or dropped your service. Once the clusters were optimized, they would stay in balance because the rates of each for an individual cluster matched the rates for the other clusters.
The only time you'd have to rerun the algorithm would be when an external factor (such as a recession/depression) caused changes in the behavior of a demographic group.
Rather than a clustering model, I think you really want some variant of the Set Covering location model, with an additional constraint on the number of addresses covered by each facility. I can't really find a good explanation of it online. You can take a look at this page, but they're solving it using areal units and you probably want to solve it in either Euclidean or network space. If you're willing to dig up something in dead-tree format, check out chapter 4 of Network and Discrete Location by Daskin.
A good survey of simple clustering algos. There is more though:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html
Perhaps a minimum spanning tree of the customers, broken into sets based on locality to the paper boy. Prim's or Kruskal's to get the MST, with the distance between houses as the weight.
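A hedged sketch of that idea with SciPy (build the MST over the house coordinates, then cut edges longer than a threshold so the remaining components become the clusters; the coordinates and threshold are illustrative):

import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

houses = np.array([[0, 0], [1, 1], [2, 0], [50, 50], [51, 52]])
mst = minimum_spanning_tree(distance_matrix(houses, houses)).toarray()
mst[mst > 10.0] = 0                     # cut long edges (threshold = 10.0)
n_clusters, labels = connected_components(mst, directed=False)
print(n_clusters, labels)               # e.g. 2 clusters: [0 0 0 1 1]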
I know of a pretty novel approach to this problem that I have seen applied to bioinformatics, though it is valid for any sort of clustering problem. It's certainly not the simplest solution, but one that I think is very interesting. The basic premise is that clustering involves multiple objectives. For one, you want to minimise the number of clusters, the trivial solution being a single cluster with all the data. The second standard objective is to minimise the amount of variance within a cluster, the trivial solution being many clusters each with only a single data point. The interesting solutions come about when you try to include both of these objectives and optimise the trade-off.
At the core of the proposed approach is something called a memetic algorithm that is a little like a genetic algorithm, which steve mentioned, however it not only explores the solution space well but also has the ability to focus in on interesting regions, i.e. solutions. At the very least I recommend reading some of the papers on this subject as memetic algorithms are an unusual approach, though a word of warning; it may lead you to read The Selfish Gene and I still haven't decided whether that was a good thing... If algorithms don't interest you then maybe you can just try and express your problem as the format requires and use the source code provided. Related papers and code can be found here: Multi Objective Clustering
This is not directly related to the problem, but something I've heard and which should be considered if this is truly a route-planning problem you have. This would affect the ordering of the addresses within the set assigned to each driver.
UPS has software which generates optimum routes for their delivery people to follow. The software tries to maximize the number of right turns that are taken during the route. This saves them a lot of time on deliveries.
For people that don't live in the USA the reason for doing this may not be immediately obvious. In the US people drive on the right side of the road, so when making a right turn you don't have to wait for oncoming traffic if the light is green. Also, in the US, when turning right at a red light you (usually) don't have to wait for green before you can go. If you're always turning right then you never have to wait for lights.
There's an article about it here:
http://abcnews.go.com/wnt/story?id=3005890
You can have k-means or expectation maximization remain as unchanged as possible by using the previous clusters as a clustering feature. Getting each cluster to have the same number of items seems a bit trickier. I can think of how to do it as a post-clustering step by doing k-means and then shuffling some points until things balance, but that doesn't seem very efficient.
A trivial answer which does not get any bonus points:
One delivery person for each address.
As has been mentioned, a Vehicle Routing Problem is probably better suited... Although it strictly isn't designed with clustering in mind, it will optimize the assignment based on the nearest addresses. Therefore your clusters will actually be the recommended routes.
If you provide a maximum number of deliverers and try to reach the optimal solution, this should tell you the minimum number that you require. This deals with point 2.
The same number of addresses can be obtained by providing a limit on the number of addresses to be visited, basically assigning a stock value (now it's a capacitated vehicle routing problem).
Adding time windows or the hours that the delivery persons work helps reduce the load if addresses are more spread out (now a capacitated vehicle routing problem with time windows).
If you use a nearest-neighbour algorithm then you can get identical results each time; removing a single address shouldn't have too much impact on your final result, so this should deal with the last point.
I'm actually working on a C# class library to achieve something like this, and I think it's probably the best route to go down, although not necessarily easy to implement.
I acknowledge that this will not necessarily provide clusters of roughly equal size:
One of the best current techniques in data clustering is Evidence Accumulation. (Fred and Jain, 2005)
What you do is:
Given a data set with n patterns.
Use an algorithm like k-means over a range of k, or use a set of different algorithms; the goal is to produce an ensemble of partitions.
Create a co-association matrix C of size n x n.
For each partition p in the ensemble:
3.1 Update the co-association matrix: for each pattern pair (i, j) that belongs to the same cluster in p, set C(i, j) = C(i, j) + 1/N.
Use a clustering algorithm such as Single Link and apply the matrix C as the proximity measure. Single Link gives a dendrogram as a result, in which we choose the clustering with the longest lifetime.
I'll provide descriptions of SL and k-means if you're interested.
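A rough sketch of steps 1-4 with scikit-learn and SciPy; the range of k, the number of runs, and the final cut into a fixed number of clusters (instead of the longest-lifetime criterion) are illustrative simplifications:

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Evidence accumulation: accumulate a co-association matrix over many
# k-means partitions, then single-link cluster it.
def evidence_accumulation(X, ks=range(2, 8), runs_per_k=5, n_final=3):
    n = len(X)
    C = np.zeros((n, n))
    n_partitions = 0
    for k in ks:
        for _ in range(runs_per_k):
            labels = KMeans(n_clusters=k, n_init=1).fit_predict(X)
            C += (labels[:, None] == labels[None, :]).astype(float)
            n_partitions += 1
    C /= n_partitions                       # C[i, j] = co-assignment frequency
    D = 1.0 - C                             # turn similarity into distance
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="single")
    return fcluster(Z, t=n_final, criterion="maxclust")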
I would use a basic algorithm to create a first set of paperboy routes according to where they live, and current locations of subscribers, then:
when paperboys are:
Added: They take locations from one or more paperboys working in the same general area from where the new guy lives.
Removed: His locations are given to the other paperboys, using the closest locations to their routes.
when locations are:
Added : Same thing, the location is added to the closest route.
Removed: just removed from that boy's route.
Once a quarter, you could re-calculate the whole thing and change all the routes.

Teacher time schedule algorithm

This is a problem I've had on my mind for a long time. Being the son of a teacher and a programmer, it occurred to me early on... but I still haven't found a solution for it.
So this is the problem. One needs to create a time schedule for a school, using some constraints. These are generally divided into two categories:
Sanity Checks
A teacher cannot teach two classes at the same time
A student cannot follow two lessons at the same time
Some teachers must have at least one day off during the week
All the days of the week should be covered by the time table
Subject X must have exactly so-and-so hours each week
...
Preferences
Each teacher's schedule should be as compact as possible (i.e. the teacher should work all hours for the day in a row with no pauses if possible)
Teachers that have days off should be able to express a preference on which day
Teachers that work part-time should be able to express a preference whether to work in the beginning or the end of the school day.
...
Now, after a few years of not finding a solution (and learning a thing or two in the meanwhile...), I realized that this smells like an NP-hard problem.
Is it proven as NP-hard?
Does anyone have an idea on how to crack this thing?
Looking at this question made me think about this problem, and whether genetic algorithms would be usable in this case. However it would be pretty hard to mutate possibilities while maintaining the sanity check rules. Also it's not clear to me how to distinguish incompatible requirements.
A small addendum to better specify the problem. This applies to Italian-style schools, where all students are grouped into classes (for example: year 1, section A) and the teachers move between classes. All students of the same class have the same schedule, and have no choice over which lessons to attend.
I am one of the developers who works on the scheduler part of a student information system.
During our original approach to the scheduling problem, we researched genetic algorithms to solve constraint satisfaction problems, and even though we were successful initially, we realized that there was a less complicated solution to the problem (after attending a school scheduling workshop).
Our current implementation works great, and uses brute force with smart heuristics to get a valid schedule in a short amount of time. The master schedule (assignment of the classes to the teachers) is built first, taking into consideration all the constraints that each teacher has while minimizing the possibility of conflicts for the students (based on their course requests). The students are then scheduled into the classes using the same method.
Doing this allows you to have the machine build a master schedule first, and then have a human tweak it if needed.
The scheduler's current implementation is written in Perl, but other options we visited early on were Prolog and CLIPS (an expert system).
I think you might be missing some constraints.
One would prefer where possible to have teachers scheduled to classes for which they are certified.
One would suspect that the classes that are requested, and the expected headcount in each would be significant.
I think the place to start would be to list all of your constraints, figure out a data structure to represent them.
Then create some sort of engine that builds a trial solution, then evaluates it for fitness according to the constraints.
You could then throw the fun genetic algorithms part at it, and see if you can get the fitness to increase over time as the genes mix.
Start with a small set of constraints, and increase them as you see success (if you see success)
There might be a way to take the constraints and shoehorn them together with something like a linear programming algorithm.
I agree. It sounds like a fun challenge
This is a mapping problem:
you need to map every hour in the week and every teacher to an activity (teaching a certain class, or a free hour).
Split the problem:
Create the list of teachers, classes and preferences then let the user populate some of the preferences on a map to have as a starting point.
Randomly take one element from the list and put it at a random free position on the map if it doesn't violate any sanity checks, until the list is empty. If at any iteration you can't place an element on the map without violating a sanity check, shift two positions on the map and try again.
When the map is filled, try shifting positions on the map to optimize the result.
In steps 2 and 3 show each iteration to the user: items left in the list, positions on the map and the next computed move and let the user intervene.
I did not try this, but this would be my initial approach.
I've tackled similar planning/scheduling problems in the past and the AI technique that is often best suited for this class of problem is "Constraint Based Reasoning".
It's basically a brute-force method like Laurenty suggested, but the approach involves applying constraints in an efficient order to cause the infeasible solutions to fail fast - to minimise the computation required.
Googling "Constraint Based Reasoning" brings up a lot of resources on the technique and its application to scheduling problems.
Answering my own question:
The FET project mentioned by gnud uses this algorithm:
Some words about the algorithm: FET uses a heuristical algorithm, placing the activities in turn, starting with the most difficult ones. If it cannot find a solution it points you to the potential impossible activities, so you can correct errors. The algorithm swaps activities recursively if that is possible in order to make space for a new activity, or, in extreme cases, backtracks and switches order of evaluation. The important code is in src/engine/generate.cpp. Please e-mail me for details or join the mailing list. The algorithm mimics the operation of a human timetabler, I think.
Link
Following up the "Constraint Based Reasoning" lead by Stringent Software on Wikipedia led me to these pages, which have an interesting paragraph:
Solving a constraint satisfaction problem on a finite domain is an NP-complete problem in general. Research has shown a number of polynomial-time subcases, mostly obtained by restricting either the allowed domains or constraints or the way constraints can be placed over the variables. Research has also established relationship of the constraint satisfaction problem with problems in other areas such as finite model theory and databases.
This reminds me of this blog post about scheduling a conference, with a video explanation here.
How I would do it:
Have the population include two things:
Who teaches what class (I expect the teachers to teach one subject).
What a class takes on a specific time slot.
This way we can't have conflicts (a teacher in 2 places, or a class having two subjects at the same time).
The fitness function would include:
How many time slots each teacher gives per week.
How many time slots a teacher has on the same day (They can't have a full day of teaching, this too must be balanced).
How many time slots of the same subject a class has on the same day (They can't have a full day of math!).
Maybe take the standard deviation of all of them, since they should be balanced (see the sketch below).
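A small sketch of such a fitness function in Python; the tuple layout (day, slot, class, subject, teacher) is just an assumed representation:

from collections import Counter
from statistics import pstdev

def fitness(schedule):
    # schedule: list of (day, slot, klass, subject, teacher) assignments.
    per_teacher_week = Counter(t for _, _, _, _, t in schedule)
    per_teacher_day = Counter((t, d) for d, _, _, _, t in schedule)
    per_class_subject_day = Counter((c, s, d) for d, _, c, s, _ in schedule)
    # Balanced timetables have low spread in all three counters.
    spread = sum(pstdev(counts.values() or [0])
                 for counts in (per_teacher_week, per_teacher_day, per_class_subject_day))
    return -spread   # higher fitness = more balanced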
Looking at this question made me think about this problem, and whether genetic algorithms would be usable in this case. However it would be pretty hard to mutate possibilities while maintaining the sanity check rules. Also it's not clear to me how to distinguish incompatible requirements.
Genetic Algorithms are very well suited to problems such as this. Once you come up with a decent representation of the chromosome (in this case, probably a vector representing all of the available class slots) you're most of the way there.
Don't worry about maintaining sanity checks during the mutation phase. Mutation is random. Sanity and preference checks both belong in the selection phase. A failed sanity check would drastically lower the fitness of an individual, while a failed preference would only mildly lower the fitness.
Incompatible requirements are a different problem altogether. If they're completely incompatible, you'll get a population that doesn't converge on anything useful.
Good luck. Being the son of a father with this sort of problem is what took me to the research group that I ended up in ...
When I was a kid my father scheduled match officials in a local sports league, this had a similarly long list of constraints and I tried to write something to help. When I got to University I even used it as my final year project eventually settling on a Monte Carlo implementation (using a Simulated Annealing model).
The basic idea is that if it's not NP, it's pretty close, so rather than assuming there is a solution, I would set out to find the best one within a given timeframe. I would weight all the constraints with costs for breaking them: sanity checks would have huge costs, and the preferences would have lower costs (but increasing with more breaks, so breaking a preference once would cost less than half the cost of breaking it twice).
The basic idea is that I started with a "random" solution and costed it; then made changes by swapping a small number of assignments, re-valued it and then probabilistically accepted or declined the change.
After thousands of iterations you inch closer to an acceptable solution.
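In skeleton form (Python), with the problem-specific parts (random_solution, cost as a weighted sum of broken constraints, small_change doing a few swaps) left as placeholders:

import math, random

# Simulated-annealing skeleton for the approach described above.
def anneal(random_solution, cost, small_change,
           iterations=100_000, t_start=10.0, t_end=0.01):
    current = random_solution()
    current_cost = cost(current)
    best, best_cost = current, current_cost
    for i in range(iterations):
        t = t_start * (t_end / t_start) ** (i / iterations)   # cooling schedule
        candidate = small_change(current)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; sometimes accept a worse solution.
        if delta <= 0 or random.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best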
Believe me, though, that this class of problems has research groups churning out PhDs so you're in very good company.
You might also find some interest in the Linear Programming arena, e.g. simplex and so on.
Yes, I think this is NP-complete - or at least finding the optimal solution is NP-complete.
I worked on a similar problem in college when I told a friend's father (who was a teacher) that I could solve his scheduling problems for him if he did not find a suitable program for it (this was back in 1990 or so).
I had no idea what I had got myself into. Luckily for us, all I had to do was find one solution that fit the constraints. But in my testing I was always worried about determining IF there was a solution at all. He never had too many constraints, and the program used different heuristics and backtracking. It was a lot of fun.
I think Bill Gates also worked on a system like this in high school or college for his high school. Not sure though.
Good luck. All my notes are gone and I never got around to implementing a solution that I could market. It was a specialty project that I re-coded as I learned new languages (Basic, Scheme, C, VB, C++)
Have fun with it
I see that this problem can be solved by a Prolog program connected to a database, and the program can produce the schedule given a set of constraints.
Read about "constraint satisfaction problem Prolog".

Resources