I want to place a set of rectangles into an area with a given width and minimize the total height. I have made a solution in MiniZinc, but the running time is quite long.
The final result came after a couple of seconds, but the run didn't stop until 10 minutes had passed. Is there a way to make the running time a bit faster or otherwise improve the performance? And should I use cumulative constraints?
This appears to be a straightforward Rectangle Packing Problem. There are algorithms that work well for this specific problem and its variations, such as square packing and rectilinear packing with/without rotations.
In MiniZinc in particular, this is a topic discussed in the second MiniZinc Coursera course: Advanced Modeling for Discrete Optimization. It should give you all the knowledge required to create a good model for the problem.
Here are some comments in addition to @Dekker1's answer.
First some questions: Which solver did you use? Was the first solution it found already the optimal one?
You might get a faster solve time with some other FlatZinc solver. I tested the model that you originally included in the question (and later removed) with some different FlatZinc solvers. I also printed out the objective value (makespan) to compare the intermediate solutions.
Gecode: Finds a solution immediately with a makespan of 69, but then takes a long time to find the optimal value (which is 17). After 15 minutes no improvement had been made, and I stopped the run. For Gecode you might get (much) better results with different search strategies; see more on this here: https://www.minizinc.org/doc-2.3.1/en/lib-annotations.html#search-annotations .
Chuffed: Finds a makespan of 17 almost directly, but it took 8min28s in total to prove that 17 is the optimal value. Testing with free search is not faster (9min23s).
OR-tools: Finds the optimal makespan of 17, and proves that it is optimal, in 0.6s:
startX: [0, 0, 0, 3, 3, 13, 0, 14, 6, 9, 4, 6, 0, 10, 7, 15]
startY: [0, 3, 7, 0, 6, 5, 12, 0, 2, 6, 12, 0, 15, 12, 12, 12]
makespan: 17
----------
==========
The OR-tools solver can sometimes be faster when using free search (the -f flag), but in this case it's slower: 4.2s, at least with just 1 thread. When adding some more threads (here 12), the optimal solution was found in 0.396s with the free search flag.
There are quite a lot of different FlatZinc solvers that one can test. See the latest MiniZinc Challenge page for some of them: https://www.minizinc.org/challenge2021/results2021.html .
Regarding cumulative, it seems that some solvers might be faster with this constraint while others are slower. The best way to find out is to compare runs with and without the constraint on some different problem instances.
In summary, one might occasionally have to experiment with different constraints and/or solvers and/or search strategies.
Related
I am solving a scheduling task in SWI-Prolog using the CLPFD library. Since it is the first time I am solving something more serious than SEND+MORE=MONEY, I could probably use some good advice from more experienced users. Let me briefly describe the domain/the task.
Domain
I have a "calendar" for a month. Everyday there are 2 for the whole day, 2 for the whole night (long 12h service). There are also, only Mon-Fri 10 more workers for 8 hours (short service).
The domain constraints are, obviously:
There are no forbidden consecutive services (no day service right after a night one and vice versa, no short day service after a night one)
A worker can serve at most 2 night services in a row
Each worker has a limited amount of hours for a month
There are 19 workers available
My approach is as follows:
Variables
For every field in the calendar I have a variable defined:
DxD_y where x is number of the day and y is 1 or 2 for the long day service
DxN_y where x is number of the day and y is 1 or 2 for the long night service
DxA_y where x is number of the day and y is 0 .. 9 for the short day service
SUM_x where x is a worker number (1..19) denoting sum of hours for a worker
Each of the D variables has a domain 1..19. To simplify it for now, SUM_X #=< 200 for each X.
Constraints
all_distinct() over the variables of the same day - each worker can serve only one service per day
global_cardinality() to count the number of occurrences of each number 1..19 in the lists of short and long services - this defines the variables LSUM_X and SSUM_X, the number of occurrences of worker X in long or short services
SUM_X #= 12*LSUM_X + 8*SSUM_X for each worker
DxN_y #\= Dx+1D_z - to avoid a long day service right after a night one
a bunch of similar constraints like the one above to cover all the domain constraints
DxNy #= Dx+1Ny #==> DxNy #\= Dx+2Ny - to avoid three consecutive night services; there are such constraints for each combination of x and y
Notes
All the variables and constraints are stated directly in the .pl script. I do not use Prolog predicates to generate the constraints, because I have a model in a .NET application (frontend) and I can easily generate all of this from the .NET code into Prolog code.
I think my approach is good overall. Running the scheduler on a smaller example (7 days, 4 long services, 1 short service, 8 workers) works well. I was also able to get some valid results for the full-blown case: 30 days, 19 workers, 4 long and 10 short services per day.
However, I am not completely satisfied with the current status. Let me explain why.
Questions
I read some articles about modelling scheduling problems, and some of them use a slightly different approach: introducing only boolean variables, one for each combination of my variables (calendar fields) and workers, to flag whether the worker is assigned to a particular calendar field. Is this a better approach?
If you compare the overall work-hour limits with the overall hours in the calendar, you find that the workers cannot be 100% utilized. But the solver most likely builds the solution this way: utilize the first worker 100%, then grab the next one, and so on. So the SUMs in the solution look like [200,200,200...200,160,140,80,50,0]. I would be glad if the workers were more or less equally utilized. Is there some simple/efficient way to achieve that? I considered defining the differences between workers and minimizing them, but that sounds very complex to me and I am afraid it would take ages to compute. I use labeling([random_variable(29)], Vars), but that only reorders the variables, so the gaps are still there, only in a different order. Probably I want the labeling procedure to take the values in some order other than just up or down (in some pseudo-random way).
How should I order the constraints? I think the order of the constraints matters with respect to the efficiency of the labeling.
How can I debug/optimize the performance of the labeling? I hoped solving this type of task would take a few seconds, or at most a couple of minutes in the case of very tight conditions on the sums. For example, labeling with the bisect option took ages.
I can provide some more code examples if needed.
That's a lot of questions, let me try to address some.
... introducing only boolean variables for each combination of my variables (calendar fields) and workers to flag whether the worker is assigned to a particular calendar field. Is this a better approach?
This is typically done when a MILP (Mixed Integer Linear Programming) solver is used, where higher-level concepts (such as alldifferent etc.) have to be expressed as linear inequalities. Such formulations usually require lots of boolean variables. Constraint Programming is more flexible here and offers more modelling choices, but unfortunately there is no simple answer; it depends on the problem. Your choice of variables influences both how hard it is to express your problem constraints and how efficiently the model solves.
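Just to make the boolean/linear style concrete, here is a tiny sketch of how alldifferent over three integer slots would be encoded with 0/1 variables and linear inequalities. The use of the PuLP library is purely my choice for illustration, and the slot/value names are made up:

import pulp

# 3 slots, each must take one value from 1..4, all values pairwise different
slots, values = range(3), range(1, 5)
b = {(s, v): pulp.LpVariable(f"b_{s}_{v}", cat="Binary") for s in slots for v in values}

prob = pulp.LpProblem("alldifferent_as_milp", pulp.LpMinimize)
prob += pulp.lpSum(b.values())                        # dummy objective, we only need feasibility
for s in slots:
    prob += pulp.lpSum(b[s, v] for v in values) == 1  # every slot takes exactly one value
for v in values:
    prob += pulp.lpSum(b[s, v] for s in slots) <= 1   # no value is used twice
prob.solve()
print({s: next(v for v in values if b[s, v].value() == 1) for s in slots})

In a CP model you would instead simply post all_distinct/1 on three integer variables; that is the flexibility mentioned above.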
So the SUMs in the solution look like [200,200,200...200,160,140,80,50,0]. I would be glad if the workers were more or less equally utilized. Is there some simple/efficient way to achieve that?
You already mention the idea of minimizing differences, and this is how such a balancing requirement would usually be implemented. It does not need to be complicated. If we originally have this unbalanced first solution:
?- length(Xs,5), Xs#::0..9, sum(Xs)#=20, labeling(Xs).
Xs = [0, 0, 2, 9, 9]
then simply minimizing the maximum of list elements will already give you (in combination with the sum-constraint) a balanced solution:
?- length(Xs,5), Xs#::0..9, sum(Xs)#=20, Cost#=max(Xs), minimize(labeling(Xs),Cost).
Xs = [4, 4, 4, 4, 4]
Cost = 4
You could also minimize the difference between minimum and maximum:
?- length(Xs,5), Xs#::0..9, sum(Xs)#=20, Cost#=max(Xs)-min(Xs), minimize(labeling(Xs),Cost).
Xs = [4, 4, 4, 4, 4]
Cost = 0
or even the sum of squares.
[Sorry, my examples are for ECLiPSe rather than SWI/clpfd, but they should show the general idea.]
How should I order the constraints? I think the order of constraints matters with respect to efficiency of the labeling.
You should not worry about this. While it might have some influence, it is too unpredictable and depends too much on implementation details to allow for any general recommendations. This is really the job of the solver implementer.
How to debug/optimize performance of labeling?
For realistic problems, you will often need (a) a problem-specific labeling heuristic, and (b) some variety of incomplete search.
Visualization of the search tree or the search progress can help with tailoring heuristics. You can find some discussion of these issues in chapter 6 of this online course.
We have a whole bunch of machines which use a whole bunch of data stores. We want to transfer all the machines' data to new data stores. These new stores vary in the amount of storage space available to the machines. Furthermore each machine varies in the amount of data it needs stored. All the data of a single machine must be stored on a single data store; it cannot be split. Other than that, it doesn't matter how the data is apportioned.
We currently have more data than we have space, so it is inevitable that some machines will have to have their data left where it is until we find some more. In the meantime, does anyone know an algorithm (a relatively simple one: I'm not that smart) that will provide an optimal or near-optimal allocation for the storage we have (i.e. the least amount of space left over on the new stores after allocation)?
I realise this sounds like a homework problem, but I assure you it's real!
At first glance this may appear to be the multiple knapsack problem (http://www.or.deis.unibo.it/knapsack.html, chapter 6.6 "Multiple knapsack problem - Approximate algorithms"), but actually it is a scheduling problem because it involves a time element. Needless to say, it is complicated to solve these types of problems. One way is to model them as a network flow and use a network flow library such as GOBLIN.
In your case, note that you actually do not want to fill the stores optimally, because if you do, smaller data packages will be more likely to be stored, since they lead to tighter packings. That is bad, because if large packages get left on the machines, your future packings will get worse and worse. What you want to do is prioritize storing the larger packages, even if that means leaving more unused space on the stores, because that gives you more flexibility in the future.
Here is how to solve this problem with a simple algorithm:
(1) Determine the bin sizes and sort them. For example, if you have 3 stores with space 20 GB, 45 GB and 70 GB, then your targets are { 20, 45, 70 }.
(2) Sort all the data packages by size. For example, you might have data packages: { 2, 2, 4, 6, 7, 7, 8, 11, 13, 14, 17, 23, 29, 37 }.
(3) If any single package fills more than 95% of a store (without exceeding it), put it in that store and go back to step (1). Not the case here.
(4) Generate all the combinations of two packages.
(5) If any of the combinations fills more than 95% of a store (again without exceeding it), put it in that store. If there is a tie, prefer the combination with the bigger package. In my example, there are two such pairs: { 37, 8 } = 45 and { 17, 2 } = 19. (Notice that using { 17, 2 } trumps using { 13, 7 }.) If you find one or more matches, go back to step (1).
Okay, now we just have one store left, 70, and the following packages: { 2, 4, 6, 7, 7, 11, 13, 14, 23, 29 }.
(6) Increase the combination size by 1 and go back to step (5). For example, in our case we find that no 3-element combination adds up to over 95% of 70, but the 4-element combination { 29, 23, 14, 4 } = 70 does. At the end we are left with the packages { 2, 6, 7, 7, 11, 13 } on the machines. Notice that these are mostly the smaller packages.
Notice that combinations are tested in reverse lexical order (biggest first). For example, if you have "abcde" where e is the biggest, then the reverse lexical order for 3-element combinations is:
cde
bde
ade
bce
ace
etc.
This algorithm is very simple and will yield a good result for your situation.
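Here is a rough Python sketch of the steps above. The function name and the 95% fill ratio parameter are just illustrative, and it brute-forces over combinations, which is fine for a small number of packages; it does reproduce the example assignment for the 20/45/70 stores:

from itertools import combinations

def pack_stores(store_sizes, packages, fill_ratio=0.95):
    # Greedy: for each store, find the smallest combination of packages
    # (biggest packages tried first) that fills at least fill_ratio of the
    # store without exceeding it, then remove those packages from the pool.
    remaining = sorted(packages, reverse=True)
    assignment = {}
    for store in sorted(store_sizes):
        placed = None
        for k in range(1, len(remaining) + 1):
            # combinations() over a descending list yields "biggest first"
            # order, matching the reverse-lexical preference described above
            for combo in combinations(remaining, k):
                if store * fill_ratio <= sum(combo) <= store:
                    placed = combo
                    break
            if placed:
                break
        if placed:
            assignment[store] = list(placed)
            for p in placed:
                remaining.remove(p)
    return assignment, remaining   # remaining = data left on the machines

stores = [20, 45, 70]
packages = [2, 2, 4, 6, 7, 7, 8, 11, 13, 14, 17, 23, 29, 37]
print(pack_stores(stores, packages))
# -> ({20: [17, 2], 45: [37, 8], 70: [29, 23, 14, 4]}, [13, 11, 7, 7, 6, 2])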
I'm gathering data from a biological monitoring system. They need to know the average value of the plateaus after changes to the system are made, as shown below.
This is data for about 4 minutes; as shown, there is a decent lag time between the event and the steady-state response.
These values won't always be at this level. They want me to find where the steady-state response starts and average the values during that time. My boss, who is a biologist, said there may be overshoot and random fluctuations... and that I might need to use a z-transform. Unfortunately he wasn't more specific than that.
I feel decently competent as a programmer, but I wasn't sure what the most efficient way of finding these values would be.
Any algorithms, insights or approaches would be greatly appreciated. Thanks.
You may actually get a good start by just analyzing the first derivative: consider the process steady when the first derivative is close to zero. But please note that this is no 'silver bullet' type of solution; expect some nasty corner cases.
Anyway, based on the above, a simple demonstration follows:
import numpy as np

# create first some artificial observations
obs = np.array([[0, 1, 1.5, 3.5, 4, 4.5, 7, 9.2, 10.5, 15],
                [1, 2, 6, 6.01, 5.5, 4, 4.7, 3.3, 3.7, 3.65]])
x = np.linspace(obs[0][0], obs[0][-1], 100)  # num must be an int, not 1e2
y = np.interp(x, obs[0], obs[1])
# and add some noise to it
y += 1e-3 * np.random.randn(y.shape[0])

# now find the steady state based on abs(first derivative) < trh, but
# smooth the signal first by convolving it with a suitable kernel
y_s = np.convolve(y, [.2, .6, .2])           # 'full' mode pads one sample at each end
d, trh = np.diff(y_s), .015
stable = (np.abs(d) < trh)[:-1]              # align with x (length 100)

# and inspect visually
from pylab import grid, plot, show
plot(x, y), plot(x, y_s[1:-1])
plot(x[stable], np.ones(stable.sum()), 's')
grid(True), show()
With output like the following (the red dots indicate the assumed steady-state process):
A simple method may be to calculate and track a moving average (that is, the average of the last N samples). When the average changes by less than a threshold, you can assume it has reached the steady state.
The trick lies in choosing N and the threshold appropriately. You may be able to guess reasonable values, or you can use several events' worth of data to train the system.
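To make that concrete, here is a minimal Python sketch; the function name is made up, and the window size and threshold are placeholders that you would have to tune to your signal:

import numpy as np

def plateau_average(samples, window=20, threshold=0.05):
    # Moving average of the last `window` samples; declare steady state at the
    # first point where consecutive averages change by less than `threshold`.
    samples = np.asarray(samples, dtype=float)
    means = np.convolve(samples, np.ones(window) / window, mode="valid")
    settled = np.flatnonzero(np.abs(np.diff(means)) < threshold)
    if settled.size == 0:
        return None                      # the signal never settled
    start = settled[0] + window          # first raw-sample index after settling
    return samples[start:].mean()        # average value of the plateau

Note that consecutive window averages differ by only one sample, so the threshold has to be correspondingly tight; requiring the change to stay small for several consecutive steps makes the detection more robust.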
It looks like an interesting project—good luck!
I want to create software for planning the routes of buses (and their optimal loading) for transporting disabled kids.
These buses have the following specifications:
m seats (maximum 7, as there is a driver and an assistant)
o "seats" for wheelchairs (maximum 4)
a fixed maximum load (in Austria: 9 or 20 persons; 9 for e.g. a Ford Transit, 20 for e.g. a Mercedes-Benz Sprinter)
Specifications for the routes:
the journey to the institutes must be shorter than 2 hours for a kid (not for the bus)
for optimization: it may be optimal to mix institutes
Example
The optimal route 1 would be:
6, 1, 7, group (2, 3, 4, 5), institute A (exit for 1, 2, 3, 4, 5, 6), 8, 9, institute B (exit for 7, 8, 9) or
1, 7, 6, group (2, 3, 4, 5), institute A (exit for 1, 2, 3, 4, 5, 6), 8, 9, institute B (exit for 7, 8, 9) or
7, 1, 6, group (2, 3, 4, 5), institute A (exit for 1, 2, 3, 4, 5, 6), 8, 9, institute B (exit for 7, 8, 9) or
...
depending on the specific roads (i.e. the road distances for the triangles 1-6-3 and 7-1-6).
This is a simple example. When it comes to transporting wheelchairs it's more complex.
Edit:
NOTE: there are more than 2 institutes, as there are more than 9 kids; this was just for the sake of an example. In the real world there would be 600 kids and 20 institutes...
What data do I need?
My guesses were: coordinates, distances between points (not the air-line distance, but rather the road distance), the type of "seat usage" (seat or wheelchair), and somehow road specifications (maybe obsolete given the distances).
Can anyone please come up with some idea, algorithm, logic, feedback, or (free! since disabled-kid transportation is no enterprise business) software I may use to obtain the data (e.g. coordinates, distances, ...)?
Oh, and I must say: I'm not a trained software engineer, so it's somewhat hard for me to go through scientific literature, but I'm willing to get my hands dirty!
Well, this is actually what I do for a living. Basically, we solve this using MIP with column generation and a path model. Seeing that the problem is quite small, I would think you can use the simpler edge-flow model with a reasonable result. That will save you the column generation, which is quite some work. I'd suggest starting out by calculating the flow on a given set of routes before thinking about generating the routes themselves --- in fact, I would simply do that "by hand", using the routing calculator and dual costs as a guide.
Specifically, you need to create a graph where each pickup and delivery point is a node, and each bus route is a set of connected nodes. Connect as appropriate; this is really easier to draw than to write :) Then make an LP system that models the flow, constraining the flow to the buses' capacities, and either requiring that all passengers are delivered or paying a heavy toll for not doing so.
Once that is in place, create boolean variables for each route and multiply them with the capacities: this will enable you to turn bus routes on and off.
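Just to illustrate the last two ideas (capacity limits, a heavy toll for undelivered passengers, and on/off route variables), here is a deliberately tiny sketch using the PuLP library; it leaves out the actual graph/flow structure, and all route names, capacities and costs are made up:

import pulp

# Toy instance: two candidate routes; a real model would have flow variables per edge
routes = {"A": {"capacity": 7, "cost": 100}, "B": {"capacity": 7, "cost": 120}}
kids_to_transport = 9
penalty_per_kid_left = 1000            # the "heavy toll" for not delivering a passenger

prob = pulp.LpProblem("bus_routes", pulp.LpMinimize)
use = {r: pulp.LpVariable(f"use_{r}", cat="Binary") for r in routes}
carried = {r: pulp.LpVariable(f"carried_{r}", lowBound=0, cat="Integer") for r in routes}
left_behind = pulp.LpVariable("left_behind", lowBound=0, cat="Integer")

# A route only carries kids if it is switched on, and never above its capacity
for r, info in routes.items():
    prob += carried[r] <= info["capacity"] * use[r]
# Every kid is either carried by some route or counted as left behind
prob += pulp.lpSum(carried.values()) + left_behind == kids_to_transport
# Objective: pay for each route used plus the toll for every kid not transported
prob += pulp.lpSum(info["cost"] * use[r] for r, info in routes.items()) \
        + penalty_per_kid_left * left_behind

prob.solve()
print({r: carried[r].value() for r in routes}, left_behind.value())

In a full model the carried variables would be replaced by flow variables on the edges of the graph described above.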
Ask for details as needed; the above is just a rough outline.
EDIT:
OK, reading the responses, I think I have to say that to solve this problem the way I suggest, you need at least some knowledge of linear programming and graph theory. And yes, this is a problem that is very hard... so hard that I consider it unsolvable with current computer technology, except for very small systems. Seeing that this one actually is very small, I think it would be possible, and you are very welcome to contact our company for help (contact#ange.dk). However, professional assistance in optimization is not exactly cheap.
However, all is not lost! There are simpler ways, though the result will not be as good. When you cannot model, simulate! Write a simulation that, given the bus routes, passengers etc., shows how the passengers move along the bus routes. Compute a score, where each bus you use costs something, each kilometer costs something, and each passenger not transported costs a lot. Then look at the result, change the routes, and work toward the best (cheapest) solution you can come up with. It is probably not a bad solution.
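For the scoring part, something as small as this is enough to compare hand-made plans; the cost weights, function name and data layout here are entirely made up, so pick values that reflect your real priorities:

def score_plan(plan, all_kids, cost_per_bus=500, cost_per_km=2, penalty_per_kid=10000):
    # plan: list of routes, each a dict with the route length in km and the kids it carries
    total, transported = 0, set()
    for route in plan:
        total += cost_per_bus + cost_per_km * route["km"]
        transported |= set(route["kids"])
    total += penalty_per_kid * len(set(all_kids) - transported)
    return total   # lower is better; tweak the routes by hand and re-score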
Again, creating a program that will generate a solution for the above problem from scratch is not a suitable undertaking for someone not versed in LP + MIP + graph theory. But perhaps something less ambitious will do?
I will be on vacation for the next week or so.
I have a couple of numerical datasets for which I need to create a concept hierarchy. So far I have been doing this manually by observing the data (and a corresponding line chart). Based on my intuition, I created some acceptable hierarchies.
This seems like a task that can be automated. Does anyone know if there is an algorithm to generate a concept hierarchy for numerical data?
To give an example, I have the following dataset:
Bangladesh 521
Brazil 8295
Burma 446
China 3259
Congo 2952
Egypt 2162
Ethiopia 333
France 46037
Germany 44729
India 1017
Indonesia 2239
Iran 4600
Italy 38996
Japan 38457
Mexico 10200
Nigeria 1401
Pakistan 1022
Philippines 1845
Russia 11807
South Africa 5685
Thailand 4116
Turkey 10479
UK 43734
US 47440
Vietnam 1042
for which I created the following hierarchy:
LOWEST ( < 1000)
LOW (1000 - 2500)
MEDIUM (2501 - 7500)
HIGH (7501 - 30000)
HIGHEST ( > 30000)
Maybe you need a clustering algorithm?
Quoting from the link:
Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.
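For a quick experiment, one-dimensional k-means over the figures in the question is only a few lines. This assumes scikit-learn is installed, and dedicated one-dimensional schemes such as the ones mentioned in the other answers are usually a better fit, so treat it as a baseline:

import numpy as np
from sklearn.cluster import KMeans

# The 25 values from the question, in the same order as listed there
values = np.array([521, 8295, 446, 3259, 2952, 2162, 333, 46037, 44729, 1017,
                   2239, 4600, 38996, 38457, 10200, 1401, 1022, 1845, 11807,
                   5685, 4116, 10479, 43734, 47440, 1042]).reshape(-1, 1)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(values)
for cluster in range(5):
    print(cluster, sorted(values[labels == cluster].ravel()))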
Jenks Natural Breaks is a very efficient single-dimension clustering scheme: http://www.spatialanalysisonline.com/OUTPUT/html/Univariateclassificationschemes.html#_Ref116892931
As the comments have noted, this is very similar to k-means. However, I've found it even easier to implement, particularly the variation found in Borden Dent's Cartography: http://www.amazon.com/Cartography-Thematic-Borden-D-Dent/dp/0697384950
I think you're looking for something akin to data discretization, which is fairly common in AI for converting continuous data (or discrete data with such a large number of classes as to be unwieldy) into discrete classes.
I know Weka uses Fayyad & Irani's MDL method as well as Kononenko's MDL method; I'll see if I can dig up some references.
This is only a 1-dimensional problem, so there is a dynamic programming solution. Assume that it makes sense to take the points in sorted order and then make n-1 cuts to generate n clusters. Assume that you can write down a penalty function f() for each cluster, such as the variance within the cluster, or the distance between the min and max in the cluster. You can then minimise the sum of f() evaluated over all clusters.

Work one point at a time, from left to right. At each point, for each cluster count from 1 to the desired number of clusters minus 1, work out the best way to split the points so far into that many clusters, and store the cost of that answer and the location of its rightmost split. You can work this out for point P and cluster count c as follows: consider all possible cuts to the left of P. For each cut, add f() evaluated on the group of points to the right of the cut to the (stored) cost of the best solution for cluster count c-1 at the point just to the left of the cut. Once you have worked your way to the far right, do the same trick once more to work out the best answer for the full cluster count, and use the stored locations of the rightmost splits to recover all the splits that give that best answer.

This might actually be more expensive than a k-means variant, but it has the advantage of guaranteeing to find a globally best answer (for your chosen f(), under these assumptions).
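A compact Python sketch of that dynamic programme; the default penalty f() here is the max-min range of a cluster, purely as an example, and you can swap in within-cluster variance or anything else:

def best_clusters(points, n_clusters, penalty=lambda seg: max(seg) - min(seg)):
    # cost[c][i] = best total penalty for splitting the first i sorted points
    # into c clusters; cut[c][i] remembers where the rightmost split was made.
    pts = sorted(points)
    n = len(pts)
    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(n_clusters + 1)]
    cut = [[0] * (n + 1) for _ in range(n_clusters + 1)]
    cost[0][0] = 0.0
    for c in range(1, n_clusters + 1):
        for i in range(c, n + 1):
            for j in range(c - 1, i):          # last cluster is pts[j:i]
                total = cost[c - 1][j] + penalty(pts[j:i])
                if total < cost[c][i]:
                    cost[c][i] = total
                    cut[c][i] = j
    clusters, i = [], n                        # walk the stored cuts backwards
    for c in range(n_clusters, 0, -1):
        j = cut[c][i]
        clusters.append(pts[j:i])
        i = j
    return list(reversed(clusters))

# e.g. best_clusters(values, 5) on the country figures in the question gives 5 bands

The run time is O(k*n^2) penalty evaluations for k clusters and n points, which is negligible for a couple of dozen values.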
A genetic hierarchical clustering algorithm might also be worth a look.
I was wondering: apparently what you are looking for are clean breaks. So before launching yourself into complicated algorithms, you may perhaps consider a differential approach. Take the sorted values:
[1, 1.2, 4, 5, 10]
The relative increase from each value to the next is:
[20%, 233%, 25%, 100%]
Now, depending on the number of breaks we are looking for, it's a matter of selecting the largest ones:
2 categories: [1, 1.2] + [4, 5, 10]
3 categories: [1, 1.2] + [4, 5] + [10]
I don't know about you, but it feels natural to me, and you can even use a threshold approach, saying that a variation of less than x% is not worth considering as a cut.
For example, here 4 categories do not seem to make much sense.
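A minimal Python sketch of that idea; the function name is made up, it assumes strictly positive values, and it simply cuts at the n-1 largest relative jumps:

def relative_breaks(values, n_categories):
    # Sort, compute the relative jump between consecutive values, and cut
    # at the (n_categories - 1) largest jumps.
    vals = sorted(values)
    jumps = [(vals[i + 1] - vals[i]) / vals[i] for i in range(len(vals) - 1)]
    biggest = sorted(range(len(jumps)), key=lambda i: jumps[i], reverse=True)
    cuts = sorted(biggest[:n_categories - 1])
    groups, start = [], 0
    for c in cuts:
        groups.append(vals[start:c + 1])
        start = c + 1
    groups.append(vals[start:])
    return groups

print(relative_breaks([1, 1.2, 4, 5, 10], 2))   # [[1, 1.2], [4, 5, 10]]
print(relative_breaks([1, 1.2, 4, 5, 10], 3))   # [[1, 1.2], [4, 5], [10]]

A threshold variant would instead keep every jump larger than some x% as a cut, letting the data decide the number of categories.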