DEAP Genetic Algorithm to solve my particular timetable problem

I would like to use DEAP to solve my timetable problem.
I have the following parameters:
n teachers (around 15)
m rooms (around 8) where teachers teach some lessons
8 hours per day, 5 days per week, so 40 hours per room
I have many constraints:
teachers have min, max hours per week per room: min_week, max_week
teachers have max hours per day: max_day
not all teachers are assigned to all rooms...
it is something like this:
Room1 : teach1 [from min_week=10 to max_week=12], teach2 [10, 12], teach7[21,22]
Room2 : teach1[10, 12], teach2[10, 12], teach5[18, 22]
Room3 : teach3[19, 22], teach4[19, 22]
and so on...
There are other constraints but I don't want to make it more difficult.
I have seen online examples with teachers, classrooms, groups of students, and rooms.
I have fewer elements but maybe stricter constraints; I mean very few solutions among a huge number of combinations (according to my structure).
I tried to solve it by creating a chromosome composed of 40 integers (the total number of hours per week for each room). Each integer is between 1 and the number of combinations of all teachers among rooms.
The combinations are generated avoiding the assignment of a teacher to more than one room.
But for each gene I have around 3600 combinations, and with 40 genes per chromosome the space of all possible solutions is around 10^142.
If I run my code with a population of 400 individuals and 2000 generations the solution is far from an acceptable one.
I think it is due to the weakness of the chromosome structure that takes into consideration a huge amount of unfeasible solutions.
Is a space of 10^142 possible chromosomes too big to be solved with DEAP?
Is there any possible improvement of the chromosome structure that would make the space of possible chromosomes smaller?

Related

Interview Question: Maximize stock profits

So I came across this question during an interview and was unable to solve it. Hopefully, someone can advise. The problem is this.
Imagine you have S amounts of savings in integers. You are thinking of buying stocks. Someone has provided you with 2 N-sized arrays of stock purchase prices, and their sell prices the following day. Write an algorithm that is able to take in S and the 2 arrays, and return the maximum profits that you are able to achieve the following day. Note that the lengths of the 2 arrays are equal. The number at index i in the first array shows the purchase price of the ith stock and the number at index i in the second array shows the sell price of the ith stock.
e.g. S = 10, buy_price = [4, 6, 5], sell_price = [6, 10, 7]: your savings are 10, and there are 3 stock options. For the first option the purchase price is 4 and the selling price the next day is 6. For the second option the purchase price is 6 and the selling price is 10 the next day, and so on. The maximum profit here is 6, where you buy the stocks that cost 4 and 6 and sell them the next day. Your function should return 6 here.
My initial approach was finding the profits/cost ratio of each stock and sorting them. However, this may not lead to the most ideal stock purchases. For example, for this case S = 10, buy_price = [6, 5, 5], sell_price = [12, 9, 8], the best option is not to buy the stock that cost 6 even though it has the highest profits/cost ratio (you can't buy anything with your remaining 4 savings) but to buy the other 2 stock options for a maximum profit of 7.
Does anyone have any idea of how to tackle this problem? Thank you!
If we consider the prices to be weights and the profits to be values, this problem is exactly the 0/1 knapsack problem. This problem is weakly NP-complete: no algorithm polynomial in the size of the input is known, but with dynamic programming you can solve it in time polynomial in the budget (pseudo-polynomial time).
Thus, if the budget is reasonably small, this problem is efficiently solvable with a dynamic programming approach.
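For example, a standard 0/1 knapsack table over the budget solves both of the question's examples directly (a minimal sketch, with weight = buy price and value = sell price minus buy price):

```python
def max_profit(savings, buy_price, sell_price):
    """0/1 knapsack: capacity = savings, weight = buy price,
    value = sell price - buy price."""
    dp = [0] * (savings + 1)  # dp[b] = best profit achievable with budget b
    for cost, sale in zip(buy_price, sell_price):
        profit = sale - cost
        # iterate budgets downwards so each stock is bought at most once
        for b in range(savings, cost - 1, -1):
            dp[b] = max(dp[b], dp[b - cost] + profit)
    return dp[savings]

print(max_profit(10, [4, 6, 5], [6, 10, 7]))  # 6
print(max_profit(10, [6, 5, 5], [12, 9, 8]))  # 7
```

The running time is O(S * n), polynomial in the budget S rather than in its number of digits, which is exactly the pseudo-polynomial behaviour described above.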
I think the problem should be solvable by integer programming, which can be set up using the Excel solver. I think the problem itself can't be solved by a greedy algorithm, but please correct me if I am wrong.

Algorithm for heavily restricted Knapsack problem

I have a following problem to solve:
We have 53 weeks in a year, and for each week we need to choose one model from the list [A1, A2, ..., F149, F150]: in total around 750 models in 6 classes: A, B, C, D, E, F.
Models can repeat, and each has a specific value (from around 3 to 10) and a weight. The goal is to achieve a target total value of 280 ± 5% with a minimal weight by the end of the year.
However there are a ton of restrictions. For example:
Models must be held for at least 4 weeks in a row: if we have chosen A1 for week 1, then we need to choose A1 for weeks 2, 3, 4;
if we have chosen a model of class E or F, then, after it ends, we cannot choose E or F for another 4 weeks;
throughout the year we can only choose 23 models of class D;
and so on.
What I've tried so far:
Based on the target value, create a corridor of allowed values throughout the year (the corridor figure is omitted here).
Starting at week 1, choose a random model for the week from the list of allowed models, then modify the list of allowed models for the following weeks based on that choice.
If our choice satisfies the criteria (i.e. it also lies within the corridor), then week += 1. If not, delete this possibility.
If there are no more models for this week, go back one week, delete the possibility we chose before, and choose at random from what's left.
Pictorially, the algorithm is like following the branches of a tree: if a branch is bad, return to the fork and cut off the bad branch.
This algorithm can generate a random valid solution (in 5 to 80 minutes, with a mean time of about 25 minutes). Then I need to generate more of those and choose the one that has the least weight, which is not a very good approach, I presume.
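The branch-and-prune search described above can be sketched roughly like this. Note that `allowed`, `value`, and the corridor bounds are placeholders for the real restrictions, and the toy instance at the bottom is invented for illustration:

```python
import random

def search(weeks, options, value, allowed, lo, hi):
    """Depth-first backtracking over the weeks, pruning dead branches.
    `allowed(week, option, history)` is assumed to encode the rolling
    restrictions (minimum holding time, class cool-downs, quotas, ...)."""
    choices = []

    def extend(week, total):
        if week == weeks:
            return lo <= total <= hi          # final corridor check
        candidates = [o for o in options if allowed(week, o, choices)]
        random.shuffle(candidates)            # random valid solution, as in the question
        for o in candidates:
            choices.append(o)
            if extend(week + 1, total + value[o]):
                return True
            choices.pop()                     # bad branch: back to the fork
        return False

    return choices if extend(0, 0) else None

# Toy instance: 4 weeks, two models, no extra restrictions, corridor [12, 14].
plan = search(4, ["A1", "D1"], {"A1": 3, "D1": 10},
              lambda week, o, history: True, 12, 14)
print(plan)  # only the all-A1 plan has a total value inside the corridor
```

This only finds *a* feasible solution, not the lightest one; for optimality the same tree can be searched exhaustively with branch-and-bound, or the whole model handed to an integer-programming solver.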
Question
The question is: what is the optimal way to solve the problem? The priority is to find the solution with minimal weight at the target value, not the fastest algorithm. But it should at least finish in a finite amount of time =)
The problem statement above is a bit oversimplified and due to the complexity of calculations and the amount of combinations, there is no way to consider and compare all combinations.

Optimal sequence of non-overlapping purchases

I think this is a scheduling problem, but I'm not even sure on that much! What I want is to find the optimal sequence of non-overlapping purchase decisions, when I have full knowledge of their value and what opportunities are coming up in the future.
Imagine a wholesaler who sells various goods that I want to buy for my own shop. At any time they may have multiple special offers running; I will sell at full price, so their discount is my profit.
I want to maximize profit, but the catch is that I can only buy one thing at a time, and no such thing as credit, and worse, there is a delivery delay. The good news is I will sell the items as soon as they are delivered, and I can then go spend my money again. So, one path through all the choices might be: I buy 100kg apples on Monday, they are delivered on Tuesday. I then buy 20 nun costumes delivered, appropriately enough, on Sunday. I skip a couple of days, as I know on Wednesday they'll have a Ferrari at a heavy discount. So I buy one of those, it is delivered the following Tuesday. And so on.
You can consider compounding profits or not. The algorithm comes down to a decision at each stage between choosing one of today's special offers, or waiting a day because something better is coming tomorrow.
Let's abstract that a bit. Buy and delivery become days-since-epoch. Profit is written as sell-price divided by buy-price. I.e. 1.00 means break-even, 1.10 means a 10% profit, 2.0 means I doubled my money.
buy delivery profit
1 2 1.10 Apples
1 3 1.15 Viagra
2 3 1.15 Notebooks
3 7 1.30 Nun costumes
4 7 1.28 Priest costumes
6 7 1.09 Oranges
6 8 1.11 Pears
7 9 1.16 Yellow shoes
8 10 1.15 Red shoes
10 15 1.50 Red Ferrari
11 15 1.40 Yellow Ferrari
13 16 1.25 Organic grapes
14 19 1.30 Organic wine
NOTES: opportunities exist only on the buy day (e.g. the organic grapes get made into wine if no-one buys them!), and I get to sell on the same day as delivery, but cannot buy my next item until the following day. So I cannot sell my nun costumes at t=7 and immediately buy yellow shoes at t=7.
I was hoping there exists a known best algorithm, and that there is already an R module for it, but algorithms or academic literature would also be good, as would anything in any other language. Speed matters, but mainly when the data gets big, so I'd like to know if it is O(n²), or whatever.
By the way, does the best algorithm change if there is a maximum possible delivery delay? E.g. if delivery - buy <= 7
Here is the above data as CSV:
buy,delivery,profit,item
1,2,1.10,Apples
1,3,1.15,Viagra
2,3,1.15,Notebooks
3,7,1.30,Nun costumes
4,7,1.28,Priest costumes
6,7,1.09,Oranges
6,8,1.11,Pears
7,9,1.16,Yellow shoes
8,10,1.15,Red shoes
10,15,1.50,Red Ferrari
11,15,1.40,Yellow Ferrari
13,16,1.25,Organic grapes
14,19,1.30,Organic wine
Or as JSON:
{"headers":["buy","delivery","profit","item"],"data":[[1,2,1.1,"Apples"],[1,3,1.15,"Viagra"],[2,3,1.15,"Notebooks"],[3,7,1.3,"Nun costumes"],[4,7,1.28,"Priest costumes"],[6,7,1.09,"Oranges"],[6,8,1.11,"Pears"],[7,9,1.16,"Yellow shoes"],[8,10,1.15,"Red shoes"],[10,15,1.5,"Red Ferrari"],[11,15,1.4,"Yellow Ferrari"],[13,16,1.25,"Organic grapes"],[14,19,1.3,"Organic wine"]]}
Or as an R data frame:
structure(list(buy = c(1L, 1L, 2L, 3L, 4L, 6L, 6L, 7L, 8L, 10L,
11L, 13L, 14L), delivery = c(2L, 3L, 3L, 7L, 7L, 7L, 8L, 9L,
10L, 15L, 15L, 16L, 19L), profit = c(1.1, 1.15, 1.15, 1.3, 1.28,
1.09, 1.11, 1.16, 1.15, 1.5, 1.4, 1.25, 1.3), item = c("Apples",
"Viagra", "Notebooks", "Nun costumes", "Priest costumes", "Oranges",
"Pears", "Yellow shoes", "Red shoes", "Red Ferrari", "Yellow Ferrari",
"Organic grapes", "Organic wine")), .Names = c("buy", "delivery",
"profit", "item"), row.names = c(NA, -13L), class = "data.frame")
LINKS
Are there any R Packages for Graphs (shortest path, etc.)? (igraph offers a shortest.paths function and in addition to the C library, has an R package and a python interface)
The simplest way to think of this problem is as analogous to a shortest-path problem (although treating it as a maximum flow problem probably is technically better). The day numbers, 1 ... 19, can be used as node names; each node j has a link to node j+1 with weight 1, and each product (b,d,g,p) in the list adds a link from day b to day d+1 with weight g. As we progress through the nodes when path-finding, we keep track of the best multiplied values seen so far at each node.
The Python code shown below runs in time O(V+E) where V is the number of vertices (or days), and E is the number of edges. In this implementation, E = V + number of products being sold. Added note: The loop for i, t in enumerate(graf): treats each vertex once. In that loop, for e in edges: treats edges from the current vertex once each. Thus, no edge is treated more than once, so performance is O(V+E).
Edited note 2: krjampani claimed that O(V+E) is slower than O(n log n), where n is the number of products. However, the two orders are not comparable unless we make assumptions about the number of days considered. Note that if delivery delays are bounded and product dates overlap, then number of days is O(n) whence O(V+E) = O(n), which is faster than O(n log n).
However, under a given set of assumptions the run time orders of my method and krjampani's can be the same: For large numbers of days, change my method to create graph nodes only for days in the sorted union of x[0] and x[1] values, and using links to day[i-1] and day[i+1] instead of to i-1 and i+1. For small numbers of days, change krjampani's method to use an O(n) counting sort.
The program's output looks like the following:
16 : 2.36992 [11, 15, 1.4, 'Yellow Ferrari']
11 : 1.6928 [8, 10, 1.15, 'Red shoes']
8 : 1.472 [4, 7, 1.28, 'Priest costumes']
4 : 1.15 [1, 3, 1.15, 'Viagra']
which indicates that we arrived at day 16 with compounded profit 2.36992, after selling Yellow Ferraris on day 15; arrived at day 11 with profit 1.6928, after selling Red shoes; and so forth. Note: the dummy entry at the beginning of the products list, and the removal of quotes around the numbers, are the main differences vs. the JSON data. The entry in list element graf[j] starts out as [1, j-1, 0, [[j+1,1,0]]], that is, of the form [best-value-so-far, best-from-node#, best-from-product-key, edge-list]. Each edge-list is a list of lists of the form [next-node#, edge-weight, product-key]. Having product 0 be a dummy product simplifies initialization.
products = [[0,0,0,""], [1,2,1.10,"Apples"], [1,3,1.15,"Viagra"],
            [2,3,1.15,"Notebooks"], [3,7,1.30,"Nun costumes"],
            [4,7,1.28,"Priest costumes"], [6,7,1.09,"Oranges"],
            [6,8,1.11,"Pears"], [7,9,1.16,"Yellow shoes"],
            [8,10,1.15,"Red shoes"], [10,15,1.50,"Red Ferrari"],
            [11,15,1.40,"Yellow Ferrari"], [13,16,1.25,"Organic grapes"],
            [14,19,1.30,"Organic wine"]]
hiDay = max(x[1] for x in products)
# graf[j] = [best-value-so-far, best-from-node#, best-from-product-key, edge-list]
graf = [[1, i-1, 0, [[i+1, 1, 0]]] for i in range(2 + hiDay)]
for i, x in enumerate(products):
    b, d, g, p = x
    graf[b][3].append([d+1, g, i])   # add an edge for each product
for i, t in enumerate(graf):
    if i > hiDay:
        break
    valu = t[0]     # best value of path to current node
    edges = t[3]    # list of edges out of current node
    for e in edges:
        link, gain, prod = e
        v = valu * gain
        if v > graf[link][0]:
            graf[link][0:3] = [v, i, prod]
day = hiDay + 1     # start at the final node, so a product delivered on the last day counts
while day > 0:
    if graf[day][2] > 0:
        print(day, ":\t", graf[day][0], products[graf[day][2]])
    day = graf[day][1]
This problem maps naturally to the problem of finding the maximum weight independent intervals among a set of weighted intervals. Each item in your input set corresponds to an interval whose start and end points are the buy and delivery dates and the item's profit represents the weight of the interval. The maximum weight independent intervals problem is to find a set of disjoint intervals whose total weight is the maximum.
The problem can be solved in O(n log n) as follows. Sort the intervals by their end points (see the figure). We then travel through each interval i in the sorted list and compute the optimal solution for the subproblem that consists of intervals from 1...i in the sorted list. The optimal solution of the problem for intervals 1...i is the maximum of:
1. The optimal solution of the problem for intervals `1...(i-1)` in the
sorted list or
2. Weight of interval `i` + the optimal solution of the problem for intervals
`1...j`, where j is the last interval in the sorted list whose end-point
is less than the start-point of `i`.
Note that this algorithm runs in O(n log n) and computes the value of the optimal solution for every prefix of the sorted list.
After we run this algorithm, we can travel through the sorted-list in reverse order and find the intervals present in the optimal solution based on the values computed for each prefix.
EDIT:
For this to work correctly the weights of the intervals should be the actual profits of the corresponding items (i.e. they should be sell_price - buy_price).
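A sketch of that recurrence, assuming additive weights (sell price minus buy price, per the note above) and the strict "end-point before start-point" disjointness the question requires (sell on day d, next buy no earlier than d+1):

```python
import bisect

def max_weight_intervals(intervals):
    """intervals: (start, end, weight) triples; returns the maximum
    total weight of pairwise disjoint intervals in O(n log n)."""
    intervals = sorted(intervals, key=lambda iv: iv[1])  # sort by end point
    ends = [iv[1] for iv in intervals]
    best = [0] * (len(intervals) + 1)  # best[i] = optimum over first i intervals
    for i, (s, e, w) in enumerate(intervals):
        # j = number of earlier intervals whose end point is strictly before s
        j = bisect.bisect_left(ends, s, 0, i)
        best[i + 1] = max(best[i], best[j] + w)  # skip interval i, or take it
    return best[-1]

print(max_weight_intervals([(1, 3, 4), (2, 5, 2), (4, 6, 3)]))  # 4 + 3 = 7
```

The binary search replaces the "find the last interval whose end-point is less than the start-point of i" step; the prefix array `best` is exactly the per-prefix optimum described above, so a reverse pass over it recovers the chosen intervals.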
Update 2: Running time
Let V be the number of days (following jwpat7's notation).
If V is much smaller than O(n log n), we can use the counting sort to sort the intervals in O(n + V) time and use an array of size V to record the solutions to the subproblems. This approach results in a time complexity of O(V + n).
So the running time of the algorithm is min(O(V+n), O(n log n)).
This is a dynamic programming problem. Making an overall optimal choice only requires making optimal choices at each step. You can make a table that describes the optimal choice at each step based on the previous state and the profit of taking various steps from that state. You can collapse a large set of possibilities into a smaller set by eliminating the possibilities that are clearly non-optimal as you go.
In your problem, the only state that affects choices is the delivery date. For example, on day one, you have three choices: You can buy apples, set your profit to 1.10, and set your delivery date to 2; buy viagra, set your profit to 1.15, and set your delivery date to 3; or buy nothing, set your profit to zero, and set your delivery date to 2. We can represent these alternatives like this:
(choices=[apples], delivery=2, profit=1.10) or
(choices=[viagra], delivery=3, profit=1.15) or
(choices=[wait], delivery=2, profit=0.00)
It isn't going to make any difference whether you buy viagra or buy nothing on the first day as far as making future decisions. Either way, the next day you can make a purchase is day two, so you can eliminate waiting as an alternative since the profit is lower. However, if you buy apples, that will affect future decisions differently than if you buy viagra or wait, so it is a different alternative you have to consider. That just leaves you with these alternatives at the end of day one.
(choices=[apples], delivery=2, profit=1.10) or
(choices=[viagra], delivery=3, profit=1.15)
For day two, you need to consider your alternatives based on what the alternatives were on day one. This produces three possibilities:
(choices=[apples,notebooks], delivery=3, profit=2.25) or
(choices=[apples,wait], delivery=3, profit=1.10) or
(choices=[viagra,wait], delivery=3, profit=1.15)
All three of these choices put you in the same state as far as future decisions are considered, since they all put the delivery date at 3, so you simply choose the one with maximum profit:
(choices=[apples,notebooks], delivery=3, profit=2.25)
Going on to day three, you have two alternatives
(choices=[apples,notebooks,wait], delivery=4, profit=2.25)
(choices=[apples,notebooks,nun costumes], delivery=7, profit=3.55)
both of these alternatives have to be kept, since they will affect future decisions in different ways.
Note that we're just making future decisions based on the delivery date and the profit. We keep track of the choices just so that we can report the best set of choices at the end.
Now maybe you can see the pattern. You have a set of alternatives, and whenever you have multiple alternatives that have the same delivery date, you just choose the one with the maximum profit and eliminate the others. This process of collapsing your alternatives keeps the problem from growing exponentially, allowing it to be solved efficiently.
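That collapsing rule is easy to code directly. Here is a minimal sketch (my own arrangement, not the answerer's code): states are keyed by the next day we are free to buy, and profits compound multiplicatively as in the question's table:

```python
def best_plan(products, last_day):
    """products: (buy, delivery, gain, name) tuples with multiplicative gain.
    For each 'next free buying day' keep only the maximum-profit alternative."""
    states = {1: (1.0, [])}                  # free-day -> (profit, choices)
    for day in range(1, last_day + 1):
        if day not in states:
            continue
        profit, choices = states[day]
        if profit > states.get(day + 1, (0.0,))[0]:
            states[day + 1] = (profit, choices)             # wait a day
        for b, d, g, name in products:
            if b == day and profit * g > states.get(d + 1, (0.0,))[0]:
                states[d + 1] = (profit * g, choices + [name])  # buy, free on d+1
    return max(states.values(), key=lambda t: t[0])

products = [(1, 2, 1.10, "Apples"), (1, 3, 1.15, "Viagra"),
            (2, 3, 1.15, "Notebooks"), (3, 7, 1.30, "Nun costumes"),
            (4, 7, 1.28, "Priest costumes"), (6, 7, 1.09, "Oranges"),
            (6, 8, 1.11, "Pears"), (7, 9, 1.16, "Yellow shoes"),
            (8, 10, 1.15, "Red shoes"), (10, 15, 1.50, "Red Ferrari"),
            (11, 15, 1.40, "Yellow Ferrari"), (13, 16, 1.25, "Organic grapes"),
            (14, 19, 1.30, "Organic wine")]
print(best_plan(products, 19))
```

On the question's data this keeps at most one alternative per free day, so the state set stays linear in the number of days instead of growing exponentially.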
You can solve this as a linear programming problem. This is the standard approach to solving logistics problems, such as those faced by airlines and corporations, with much larger problem spaces than yours. I won't formally define your problem here, but in broad terms: Your objective function is the maximisation of profit. You can represent the buy days, and the "only one purchase per day" as linear constraints.
The standard linear programming algorithm is the simplex method. Although it has exponential worst case behaviour, in practice, it tends to be very efficient on real problems. There are lots of good freely available implementations. My favourite is the GNU Linear Programming Kit. R has an interface to GLPK. Lp_solve is another well-known project, which also has an R interface. The basic approach in each case is to formally define your problem, then hand that off to the third party solver to do its thing.
To learn more, I recommend you take a look at the Algorithm Design Manual, which should give you enough background to do further research online. p.412 onwards is a thorough summary of linear programming and its variations (e.g. integer programming if you have integrality constraints).
If you've never heard of linear programming before, you might like to take a look at some examples of how it can be used. I really like this simple set of tutorial problems in Python. They include maximising profit on tins of cat food, and solving a Sudoku problem.

Find all possible combinations of a scores consistent with data

So I've been working on a problem in my spare time and I'm stuck. Here is where I'm at. I have a number 40. It represents players. I've been given other numbers 39, 38, .... 10. These represent the scores of the first 30 players (1 -30). The rest of the players (31-40) have some unknown score. What I would like to do is find how many combinations of the scores are consistent with the given data.
So for a simpler example: if you have 3 players. One has a score of 1. Then the number of possible combinations of the scores is 3 (0,2; 2,0; 1,1), where (a,b) stands for the number of wins for player one and player two, respectively. A combination of (3,0) wouldn't work because no person can have 3 wins. Nor would (0,0) work because we need a total of 3 wins (and wouldn't get it with 0,0).
I've found the total possible number of games. This is the total number of games played, which means it is the total number of wins. (There are no ties.) Finally, I have a variable for the max wins per player (which is one less than the total number of players. No player can have more than that.)
I've tried finding the number of unique combinations by spreading out N wins to each player and then subtracting combinations that don't fit the criteria. E.g., to figure out many ways to give 10 victories to 5 people with no more than 4 victories to each person, you would use:
C(14,4) - C(5,1)*C(9,4) + C(5,2)*C(4,4) = 381. C(14,4) comes from the formula C(n+k-1, k-1) (google "stars and bars", I believe). The next term picks off the ones where someone gets 5 or more (not allowed), and the last adds back in the ones we subtracted twice.
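That inclusion-exclusion count is easy to sanity-check by brute force over the small example:

```python
from itertools import product
from math import comb

# Brute force: all ways to hand 0-4 wins to each of 5 people, totalling 10.
brute = sum(1 for wins in product(range(5), repeat=5) if sum(wins) == 10)

# The inclusion-exclusion formula above.
formula = comb(14, 4) - comb(5, 1) * comb(9, 4) + comb(5, 2) * comb(4, 4)

print(brute, formula)  # 381 381
```

The brute force is only feasible for tiny instances (5^5 tuples here), but it confirms the formula before you trust it at C(780, 39) scale.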
Yeah, there has got to be an easier way. Lastly, the numbers get so big that I'm not sure that my computer can adequately handle them. We're talking about C(780, 39), which is 1.15495183 × 10^66. Regardless, there should be a better way of doing this.
To recap: you have 40 people. The scores of the first 30 people are 10-39. The last ten people have unknown scores. How many score assignments can you generate that meet the criteria: all the scores add up to the total possible wins, and each player gets no more than 39 wins?
Thoughts?
Generating functions:
Since the question is more about math, but still on a programming Q&A site, let me give you a partial solution that works for many of these problems, using a symbolic algebra system (like Maple or Mathematica). I highly recommend that you grab an intro combinatorics book; these kinds of questions are answered there.
First of all the first 30 players who score 10-39 (with a total score of 735) are a bit of a red herring - what we would like to do is solve the other problem, the remaining 10 players whose score could be in the range of (0...39).
If we think of the possible scores of the players as the polynomial:
f(x) = x^0 + x^1 + x^2 + ... x^39
Where a value of x^2 is the score of 2 for example, consider what this looks like
f(x)^10
This represents the combined score of all 10 players; i.e., the coefficient of x^385 is 2002, which represents the fact that there are 2002 ways for the 10 players to score 385. Wolfram Alpha (a programming language, IMO) can evaluate this for us.
If you'd like to know how many possible score sequences there are in total, just substitute x=1 into the expression, giving 40^10 = 10,485,760,000,000,000 (no surprise, since each of the 10 players independently takes one of 40 scores!)
Why is this useful?
While I know it may seem silly to some to set up all this machinery for a simple problem that can be solved on paper - the approach of generating functions is useful when the problem gets messy (and asymptotic analysis is possible). Consider the same problem, but now we restrict the players to only score prime numbers (2,3,5,7,11,...). How many possible ways can the 10 of them score a specific number, say 344? Just modify your f(x):
f(x) = x^2 + x^3 + x^5 + x^7 + x^11 ...
and repeat the process! (I get [x^344]f(x)^10 = 1390).
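If you'd rather not rely on a CAS, the same coefficients can be extracted with a few lines of plain polynomial arithmetic; a naive convolution is plenty fast at this size:

```python
def poly_mul(a, b):
    """Multiply two polynomials held as coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

f = [1] * 40        # f(x) = x^0 + x^1 + ... + x^39
g = [1]             # start from the constant polynomial 1
for _ in range(10):
    g = poly_mul(g, f)   # g = f(x)^10

print(g[385])       # 2002 ways for the ten players to total 385
print(sum(g))       # f(1)^10 = 40^10 possible score sequences
```

For the prime-scores variant, just build `f` with 1-coefficients at the prime exponents instead and repeat.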

Adjusting votes based on different numbers of voters

I have a 1-to-5 voting system and I'm trying to figure out the best way to find the most popular item voted on, taking into consideration the total possible number of votes cast. To get a vote total, I'm counting "1" votes as -3, "2" votes as -2, "3" votes as +1, "4" votes as +2, and "5" votes as +3, so a "1" vote cancels out a "5" vote and vice versa.
For this example, say we have 3 films playing in 3 different size theaters.
Film 1: 800 seats / Film 2: 400 seats / Film 3: 180 seats
In a way, we're limiting the total amount of votes based on seats, so I would like a way for the film in the smaller theater to not get automatically overwhelmed by the film in the larger theater. It's likely that there will be more votes cast in the larger theater, resulting in a higher total score.
Edit 10/18:
Alright, hopefully I can explain this better. I'm working for a film festival, and we're balloting the first screening of each film in the fest. Therefore, each film will have from 0 to a maximum number of votes based on the size of each theater. I'm looking to find the most popular film in 3 categories: narrative, documentary, short film. By popular I mean a combination of highest average vote and number of votes.
It seems like a weighted average is what i'm looking for, giving less weight to votes from a bigger theater and more weight to votes from a smaller theater to even things out.
You're working with weighted averages.
Instead of just adding up and dividing by the total number of elements (arithmetic mean):
a + b + c
---------
3
You are adding weights to each element, as they are not all evenly distributed, and dividing by the sum of the weights:
w1*a + w2*b + w3*c
------------------
   w1 + w2 + w3
In your case, the weights could be this:
# of people in current theater
--------------------------------
# of people in all the theaters
Let's try a test case:
Theater 1: 100 people (rating: 1)
Theater 2: 1,000,000 people (rating: 5)
Average = (100 / (100 + 1000000)) * 1 + (1000000 / (100 + 1000000)) * 5
        = 4.9996
(The weights here already sum to 1, so no further division is needed.)
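As a sanity check, here is the weighted mean computed directly, dividing by the sum of the weights (the total head-count) rather than by the number of elements:

```python
def weighted_average(ratings, weights):
    """Weighted mean; the weights need not be normalised in advance."""
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)

print(weighted_average([1, 5], [100, 1_000_000]))  # close to 5: the big theater dominates
```

With head-count weights the larger theater dominates, so to keep a small theater from being overwhelmed you would pick weights that shrink, rather than grow, with theater size.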
Well, depending on your goals it sounds like you are interested in some sort of weighted average.
Continuing your film example, it sounds to me like you are trying to rate how "good" the films are. To do this, you don't want to factor the number of views of any particular film too highly into the final determination. However, you have to take it into account somewhat since a film that only got viewed 5 times and had an average rating of +2.7 has much less credibility than a film with 10,000 views getting the same rating.
You might consider simply not including a film in the results unless it has a minimum number of votes.
Given a uniform (even) distribution of votes across {1,2,3,4,5}, the expected rating of your film is 0.2. This is because the votes {1 and 5} cancel each other out, as do {2 and 4}, but the vote 3 contributes an expected value of 1/5 = 0.2. So if people give ratings of {1,2,3,4,5} with equal probability, then you would expect a film (no matter how many people see it) to have an average rating close to 0.2.
So I think the best option for you would be to add up all the scores received and simply divide by the number of people who have seen each film. This should be a good guess at people's sentiment toward the film as the average of the distribution should not get larger simply because more people see the film.
If I were you, I would also suggest adding a small penalty term to your final result, to take into account the fact that some people didn't even want to go see the movie. If lots of people didn't want to see the movie in the first place, but the 5 or so people that saw it gave it a 5* rating, that doesn't make it a good movie, does it?
So a final solution I would recommend: Add up all the points as you have described, and divide by the total number of people who have gone to the cinema. While not perfect (whatever perfect means), it should give you some indication of what people like and don't like. This essentially means people who chose not to see a movie are adding zero to the points total, but still affect the average because the end result is divided by a larger number.