Simple storage allocation algorithm

We have a whole bunch of machines which use a whole bunch of data stores. We want to transfer all the machines' data to new data stores. These new stores vary in the amount of storage space available to the machines. Furthermore, each machine varies in the amount of data it needs stored. All the data of a single machine must be stored on a single data store; it cannot be split. Other than that, it doesn't matter how the data is apportioned.
We currently have more data than we have space, so it is inevitable that some machines will need to have their data left where it is, until we find some more. In the meantime, does anyone know an algorithm (relatively simple: I'm not that smart) that will provide optimal or near-optimal allocation for the storage we have (i.e. the least amount of space left over on the new stores, after allocation)?
I realise this sounds like a homework problem, but I assure you it's real!

At first glance this may appear to be the multiple knapsack problem (http://www.or.deis.unibo.it/knapsack.html, chapter 6.6 "Multiple knapsack problem - Approximate algorithms"), but it is actually a scheduling problem because it involves a time element. Needless to say, these types of problems are complicated to solve. One way is to model them as network flow and use a network flow library like GOBLIN.
In your case, note that you actually do not want to fill the stores optimally, because if you do, smaller data packages will be more likely to be stored, since they lead to tighter packings. This is bad: if large packages get left on the machines, your future packings will get worse and worse. What you want to do is prioritize storing larger packages, even if that means leaving more extra space on the stores, because then you will gain more flexibility in the future.
Here is how to solve this problem with a simple algorithm:
(1) Determine the bin sizes and sort them. For example, if you have 3 stores with space 20 GB, 45 GB and 70 GB, then your targets are { 20, 45, 70 }.
(2) Sort all the data packages by size. For example, you might have data packages: { 2, 2, 4, 6, 7, 7, 8, 11, 13, 14, 17, 23, 29, 37 }.
(3) If any single package sums to > 95% of a store (without exceeding it), put it in that store and go to step (1). That is not the case here.
(4) Generate all the combinations (unordered pairs) of two packages.
(5) If any of the combinations sums to > 95% of a store (without exceeding it), put it in that store. If there is a tie, prefer the combination with the bigger package. In my example, there are two such pairs: { 37, 8 } = 45 and { 17, 2 } = 19. (Notice that using { 17, 2 } trumps using { 13, 7 }.) If you find one or more matches, go back to step (1).
Okay, now we just have one store left (70) and the following packages: { 2, 4, 6, 7, 7, 11, 13, 14, 23, 29 }.
(6) Increase the combination size by 1 and go to step (5). For example, in our case we find that no 3-combination adds up to over 95% of 70, but the 4-combination { 29, 23, 14, 4 } = 70 does. At the end we are left with the packages { 2, 6, 7, 7, 11, 13 } still on the machines. Notice these are mostly the smaller packages.
Notice that combinations are tested in reverse lexical order (biggest first). For example, if you have "abcde" where e is the biggest, then the reverse lexical order for 3-element combinations is:
cde
bde
ade
bce
ace
etc.
This algorithm is very simple and will yield a good result for your situation.
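For concreteness, here is a minimal Python sketch of the above. The function name, the parameterized 95% threshold, and the return format are my own choices, not part of the original answer:

from itertools import combinations

def allocate(store_sizes, package_sizes, fill=0.95):
    # Greedy search: try ever-larger combinations of packages until one
    # fills some store to at least `fill` of its capacity.
    stores = sorted(enumerate(store_sizes), key=lambda s: s[1])  # (index, capacity)
    packages = sorted(package_sizes, reverse=True)               # biggest first
    assignment = {}                                              # store index -> packages
    k = 1
    while stores and k <= len(packages):
        placed = False
        # combinations() over a descending list visits big-package combos
        # first, which implements the "prefer a bigger package" tie-break.
        for combo in combinations(packages, k):
            total = sum(combo)
            for store in stores:
                idx, cap = store
                if fill * cap <= total <= cap:
                    assignment[idx] = combo
                    stores.remove(store)
                    for p in combo:
                        packages.remove(p)
                    placed = True
                    break
            if placed:
                break
        k = 1 if placed else k + 1   # restart at step (1) after a placement
    return assignment, packages      # leftover packages stay on the machines

print(allocate([20, 45, 70], [2, 2, 4, 6, 7, 7, 8, 11, 13, 14, 17, 23, 29, 37]))
# -> ({1: (37, 8), 0: (17, 2), 2: (29, 23, 14, 4)}, [13, 11, 7, 7, 6, 2])

This reproduces the worked example: { 2, 6, 7, 7, 11, 13 } are left over on the machines.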

Related

Placing array of rectangles inside a given area

I want to place a set of rectangles into an area with a given width and minimize the total height. I have made a solution in MiniZinc, but the running time is quite long.
The final result came after a couple of seconds, but the run didn't stop until 10 minutes had passed. Is there a way to make the running time a bit faster / improve the performance? And should I use cumulative constraints?
This appears to be a straightforward rectangle packing problem. There are algorithms that work well for this specific problem and its variations, such as square packing and rectilinear packing with/without rotations.
For MiniZinc in particular, this is a topic discussed in the second MiniZinc Coursera course, Advanced Modeling for Discrete Optimization. This should give you all the required knowledge to create a good model for the problem.
Here are some comments in addition to @Dekker1's answer.
First, some questions: which solver did you use? And was the first answer found already the optimal solution?
You might get a faster solve time with some other FlatZinc solver. I tested the model that you originally included in the question (and later removed) with some different FlatZinc solvers. I also printed out the objective value (makespan) to compare the intermediate solutions.
Gecode: finds a solution immediately with a makespan of 69, but then takes a long time to find the optimal value (which is 17). After 15 minutes no improvement had been made and I stopped the run. For Gecode you might get (much) better results with different search strategies; see more on this here: https://www.minizinc.org/doc-2.3.1/en/lib-annotations.html#search-annotations .
Chuffed: finds a makespan of 17 almost directly, but it took 8min28s in all to prove that 17 is the optimal value. Testing with free search is not faster (9min23s).
OR-tools: finds the makespan of 17 (and proves that it's optimal) in 0.6s.
startX: [0, 0, 0, 3, 3, 13, 0, 14, 6, 9, 4, 6, 0, 10, 7, 15]
startY: [0, 3, 7, 0, 6, 5, 12, 0, 2, 6, 12, 0, 15, 12, 12, 12]
makespan: 17
----------
==========
The OR-tools solver can sometimes be faster when using free search (the -f flag), but in this case it's slower: 4.2s, at least with just 1 thread. When adding some more threads (here 12), the optimal solution was found in 0.396s with the free search flag.
There are quite a lot of different FlatZinc solvers that one can test. See the latest MiniZinc Challenge page for some of them: https://www.minizinc.org/challenge2021/results2021.html .
Regarding cumulative: it seems that some solvers might be faster with this constraint and some slower. The best way is to compare with and without the constraint on some different problem instances.
In summary, one might occasionally have to experiment with different constraints and/or solvers and/or search strategies.
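If you want to experiment with OR-tools directly from Python rather than through FlatZinc, here is a minimal CP-SAT sketch of this kind of rectangle packing model. The rectangle sizes and area width are invented for illustration, not the OP's data:

from ortools.sat.python import cp_model

widths  = [4, 3, 3, 2]           # invented example rectangles
heights = [3, 4, 2, 2]
W = 6                            # fixed width of the area

model = cp_model.CpModel()
H = sum(heights)                 # trivial upper bound on the total height
makespan = model.NewIntVar(0, H, "makespan")

x_ivs, y_ivs = [], []
for i, (w, h) in enumerate(zip(widths, heights)):
    x = model.NewIntVar(0, W - w, "x%d" % i)
    y = model.NewIntVar(0, H - h, "y%d" % i)
    x_ivs.append(model.NewIntervalVar(x, w, x + w, "xi%d" % i))
    y_ivs.append(model.NewIntervalVar(y, h, y + h, "yi%d" % i))
    model.Add(y + h <= makespan)

model.AddNoOverlap2D(x_ivs, y_ivs)   # no two rectangles may overlap
model.Minimize(makespan)             # minimize the total height

solver = cp_model.CpSolver()
if solver.Solve(model) == cp_model.OPTIMAL:
    print("optimal height:", solver.Value(makespan))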

What Does 8 Words of data mean?

3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253
You are asked to optimize a cache design for the given references. There are
three direct-mapped cache designs possible, all with a total
of 8 words of data: C1 has 1-word blocks, C2 has 2-word blocks, and C3
has 4-word blocks. In terms of miss rate, which cache design is the
best? If the miss stall time is 25 cycles, and C1 has an access time
of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the
best cache design?
Okay, so that's the question I need to answer, and I am kind of confused. I understand how a cache works, and I understand how to calculate a miss or a hit depending on the tag and index and so on. But my question is: how many blocks am I using for these caches? I know that we're using 3 different caches with different block sizes, so we can place more addresses into a block; for C2, for example, we can place 2 words, so 2 addresses. But what does it mean when it says "8 words of data"? I am having trouble understanding this question.
I assume that the more words per block, the better the hit rate, since we're able to store more addresses. But what does 8 words of data mean exactly? I guess that's my question.
Caches do not only contain data: they also hold bookkeeping information, which has to be kept but is never handed back as data. With this in mind, "words of data" means 4-byte-long contiguous cache storage segments meant for storing data. The Index and Tag fields, for example, are not data.
So with 8 words of data in total, C1 has eight 1-word blocks, C2 has four 2-word blocks, and C3 has two 4-word blocks.
See www.csbio.unc.edu/mcmillan/Media/Comp411S12PS6Sol.pdf and https://cseweb.ucsd.edu/classes/su07/cse141/cache-handout.pdf
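If it helps to experiment, here is a small Python sketch (my own, not from the answer) that counts the misses for the three designs. It treats the references as word addresses, which is how this classic exercise is usually stated:

refs = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]

def misses(block_words, total_words=8):
    n_blocks = total_words // block_words
    lines = [None] * n_blocks          # one stored tag per cache line
    count = 0
    for addr in refs:
        block = addr // block_words    # block address of the reference
        index = block % n_blocks       # the line the block maps to
        tag = block // n_blocks
        if lines[index] != tag:        # miss: fetch the block into the line
            count += 1
            lines[index] = tag
    return count

for name, bw in (("C1", 1), ("C2", 2), ("C3", 4)):
    print(name, misses(bw), "misses out of", len(refs))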

Optimizing Algorithm for Price of Delivery

I have a customer who ordered 3 items online.
Then I have a list of stores for each item, sorted by cheapest delivery rate.
For example, each item has a list of 5 stores, so there are 5^3 = 125 combinations of stores.
So
Item 1 - store 1, store 9, store 4, store 3, store 2
Item 2 - store 9, store 10, store 1, store 2, store 5
Item 3 - store 5, store 1, store 4, store 8, store 7
So stores 1, 9, and 5 have the lowest delivery rates for Items 1, 2, and 3 respectively.
But you can see that I can send both Items 1 and 2 from store 9 or store 2, and I can send all three items from store 1.
When sending a package we might use a box with certain dimensions, and maybe sending Items 1 and 2 together from store 9 will be cheaper than sending Item 1 from store 1 and Item 2 from store 9.
The same applies to store 1: maybe sending all 3 items from store 1 in one box will be cheaper than sending them separately from stores 1, 9, and 5.
Right now I am thinking about checking the box delivery rate of every store that carries 2 or more of the items and trying to determine the lowest price.
Note that sometimes the customer can order more than 10 items, and the number of combinations is then 5^10 or more, which is huge.
I am wondering if there is any quicker way to find the best price.
The way I would approach this is using integer programming, defining two sets of variables.
The first set of variables would be binary variables that indicate whether an item is sent from a particular store. In your example, there would be 15 such binary variables, one for each item-store pairing. We can call the binary variable for item i and store s x_is.
The other set of variables would be a binary indicator for whether we ship any items from store s. In your example, there would be 9 such binary variables (you have stores 1, 2, 3, 4, 5, 7, 8, 9, and 10). We can call the binary variable for store s y_s.
Then you would need to add constraints that make sure that an item i is sent from a store s (aka x_is = 1) only when we ship from that store (aka y_s = 1). You can do this by adding a constraint x_is <= y_s for all items i and stores s.
Now you can build an objective that separately captures the cost of providing item i from store s (these are the coefficients on the x_is variables) and the cost of shipping a package from store s (these are the coefficients on the y_s variables). Your goal is to minimize this objective.
You can solve these models using any number of different programming languages. One of the simplest might be the Excel Solver package, though there are integer programming solvers for all major programming languages.
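As a concrete illustration, here is a sketch of this model in Python with OR-tools CP-SAT. All the costs (in cents) and the reduced store list are invented purely for the example:

from ortools.sat.python import cp_model

items  = [1, 2, 3]
stores = [1, 2, 9]                              # a subset of the stores, for brevity
item_cost = {(1, 1): 500, (1, 2): 900, (1, 9): 600,   # cost of item i from store s
             (2, 1): 700, (2, 2): 800, (2, 9): 400,
             (3, 1): 300, (3, 2): 950, (3, 9): 990}
ship_cost = {1: 1000, 2: 800, 9: 600}           # per-package shipping cost from s

model = cp_model.CpModel()
x = {(i, s): model.NewBoolVar("x_%d_%d" % (i, s)) for i in items for s in stores}
y = {s: model.NewBoolVar("y_%d" % s) for s in stores}

for i in items:
    model.Add(sum(x[i, s] for s in stores) == 1)   # each item ships from one store
    for s in stores:
        model.AddImplication(x[i, s], y[s])        # x_is <= y_s

model.Minimize(sum(item_cost[i, s] * x[i, s] for i in items for s in stores)
               + sum(ship_cost[s] * y[s] for s in stores))

solver = cp_model.CpSolver()
if solver.Solve(model) == cp_model.OPTIMAL:
    for s in stores:
        if solver.Value(y[s]):
            print("store", s, "ships items",
                  [i for i in items if solver.Value(x[i, s])])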

What is the best data structure to store ranges for fast queries?

I have this situation where I have N timelines each containing blocks. The blocks contain tokens with a specific index and know their maximum and minimum token indexes. There's also an index mapping blocks' first indexes to a (timeline, block) pair. Here is an example:
Timeline 1: [1 2 5 8 9 11] [14 17 18 21] [22 23 25 26] ...
Timeline 2: [3 4 6 7 10 12] [13 15 16 19 20 24] [27 28 34 45] ...
Index:
1 -> timeline 1, block 1
3 -> timeline 2, block 1
13 -> timeline 2, block 2
14 -> timeline 1, block 2
22 -> timeline 1, block 3
27 -> timeline 2, block 3
As you can see, there's no missing token (no gap).
Those data structures are what I have initially. What would be the best alternative data structure to optimize queries of a specific token index? Say I want to retrieve token 19. Now what I have to do is: a dichotomic search in the index to find the good blocks for each timeline, and then a full search within each block. With token 19, the dichotomic search would result in blocks (1, 2) and (2, 2) which can contain 19, and then do a full linear search to find token 19 (no dichotomic search within blocks is possible here since tokens have various sizes and are not contained in any data structure yet).
Thank you!
Edit: I'm thinking about using an interval tree containing intervals of all the timelines. The problem is that a query would still result in many intervals. Plus, it doesn't optimize too much compared to binary searches.
You could have an array A of t pointers to objects, each holding a pointer to the token plus its timeline and block (assuming you can hold references in an array using whatever mechanism your language likes). I'm not sure what you can do if you can't binary search inside blocks.
The simplest way to my mind (if it does not take a lot of memory) is to create a flat array where the index is your query token (19 in your example) and the value is the token value that corresponds to it. An array is a good fit, as you don't have gaps. Constructing this array is O(n) and searching it is O(1). But this will only bring a benefit if the number of queries is relatively large, as the existing structure is already well optimized. (You should actually test which way is quicker.)
Constructing the array (in Python; a dict is used so no preallocation is needed):
array = {}
for timeline in timelines:
    for block in timeline:
        for token in block:
            array[token.index] = token.value
If that is too costly, try saving only the timeline number for each token. This way you will not have to search every timeline when a query comes: all you have to do is take the timeline, binary search for the block, and do a plain forward search inside the block, as in the sketch below.
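For example, in Python (a sketch of my own; blocks are modeled as plain sorted lists):

import bisect

def find_token(timelines, wanted):
    # Per timeline: binary search for the only block whose range can
    # contain the token, then scan that block linearly.
    for t, blocks in enumerate(timelines, start=1):
        starts = [b[0] for b in blocks]            # blocks sorted by first token
        i = bisect.bisect_right(starts, wanted) - 1
        if i >= 0 and blocks[i][0] <= wanted <= blocks[i][-1]:
            if wanted in blocks[i]:                # linear scan inside the block
                return t, i + 1                    # (timeline, block) pair
    return None

timelines = [
    [[1, 2, 5, 8, 9, 11], [14, 17, 18, 21], [22, 23, 25, 26]],
    [[3, 4, 6, 7, 10, 12], [13, 15, 16, 19, 20, 24], [27, 28, 34, 45]],
]
print(find_token(timelines, 19))   # -> (2, 2)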
Maybe you can use a sparse space-filling curve? An index is a function that reduces the dimension; a space-filling curve does the same, but it also adds spatial information to the index. Other data structures for a space-filling curve, or for a spatial index, are quadtrees and k-d trees, so you could use a quadtree or a k-d tree to search.

Algorithm/logic to balance the load and determine the routes of buses

I want to create software for planning the routes of buses (and their optimal loading) for disabled-kids transportation.
These buses have the following specifications:
m seats (maximum 7, as there is a driver and an assistant)
o "seats" for wheelchairs (maximum 4)
a fixed maximum load (in Austria: 9 or 20 persons; 9 for e.g. a Ford Transit, 20 for e.g. a Mercedes-Benz Sprinter)
Specifications for the routes:
the journey to the institutes must be shorter than 2 hours for a kid (not for the bus)
for optimization: it may be optimal to mix institutes
Example
The optimal route 1 would be:
6, 1, 7, group (2, 3, 4, 5), institute A (exit for 1, 2, 3, 4, 5, 6), 8, 9, institute B (exit for 7, 8, 9) or
1, 7, 6, group (2, 3, 4, 5), institute A (exit for 1, 2, 3, 4, 5, 6), 8, 9, institute B (exit for 7, 8, 9) or
7, 1, 6, group (2, 3, 4, 5), institute A (exit for 1, 2, 3, 4, 5, 6), 8, 9, institute B (exit for 7, 8, 9) or
...
depending on the specific roads (i.e. the road distances for the triangles 1-6-3 and 7-1-6).
This is a simple example; when it comes to transporting wheelchairs it's more complex.
Edit:
NOTE: there are more than 2 institutes, as there are more than 9 kids; this was just an example. In the real world there would be 600 kids and 20 institutes...
What data do I need?
My guesses were: coordinates, distances between points (not straight-line distance, but road distance), type of "seat usage" (seat or wheelchair), and perhaps road specifications (which may be unnecessary given the distances).
Can anyone please come up with some idea, algorithm, logic, feedback, or (free! as disabled-kids transportation is no enterprise business) software I could use to obtain such data (e.g. coordinates, distances, ...)?
Oh, and I must say I'm not a trained software engineer, so it's somewhat hard for me to go through scientific literature, but I'm willing to get my hands dirty!
Well, this is actually what I do for a living. Basically, we solve this using MIP with column generation and a path model. Seeing that the problem is quite small, I think you could use the simpler edge-flow model with a reasonable result. That will save you doing the column generation, which is quite some work. I'd suggest starting out by calculating the flow on a given set of routes before thinking about generating the routes themselves; in fact, I would simply do that "by hand" using the routing calculator and dual costs as a guide.
Specifically, you need to create a graph where each pickup and delivery point is a node, and each bus route is a set of connected nodes. Connect as appropriate; this is really easier to draw than to write :) Then build an LP system that models the flow, constraining the flow to the bus's capacity, and either requiring that all passengers are delivered or paying a heavy toll for not doing so.
Once that is in place, create boolean variables for each route and multiply them with the capacity: this will enable you to turn bus routes on and off.
Ask for details as needed; the above is just a rough outline.
EDIT:
OK, reading the responses, I think I have to say that to solve this problem the way I suggest, you need at least some knowledge of linear programming and graph theory. And yes, this problem is very hard... so hard that I consider it unsolvable, except for very small systems, using current computer technology. Seeing that this actually is a very small system, I think it would be possible, and you are very welcome to contact our company for help (contact#ange.dk). However, professional assistance in optimization is not exactly cheap.
However, all is not lost! There are simpler ways, though the result will not be as good. When you cannot model, simulate! Write a simulation that, given the bus routes, passengers, etc., shows how the passengers move along the bus routes. Make a score where each bus you use costs something, each kilometer costs something, and each passenger not transported costs a lot. Then look at the result, change the routes, and work toward the best (cheapest) solution you can come up with. It is probably not a bad solution.
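To make the simulation idea concrete, here is a tiny Python sketch; all numbers, field names, and the data layout are invented. It walks one route, tracks the load, and turns distance and capacity violations into a score you can compare across hand-made route variants:

def score_route(route, seat_cap=7, chair_cap=4, km_cost=1.5, violation_cost=1000.0):
    seats = chairs = 0
    km = 0.0
    violations = 0
    for stop in route:
        km += stop["dist_km"]                  # road distance from the previous stop
        seats += stop.get("seats", 0)          # +n pickup, -n dropoff
        chairs += stop.get("chairs", 0)
        if seats > seat_cap or chairs > chair_cap:
            violations += 1                    # the bus is over capacity here
    return km * km_cost + violations * violation_cost

route = [
    {"dist_km": 0.0, "seats": +2},                   # pick up two kids
    {"dist_km": 3.5, "seats": +1, "chairs": +1},     # one more kid, one wheelchair
    {"dist_km": 5.0, "seats": -3, "chairs": -1},     # institute: everyone exits
]
print(score_route(route))   # lower is better; compare against route variants

This sketch ignores the 2-hour ride-time limit, but you could penalize that the same way by tracking each kid's time on board.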
Again, creating a program that will generate a solution to the above problem from scratch is not a suitable undertaking for someone not versed in LP, MIP, and graph theory. But perhaps something less ambitious will do?
I will be on vacation for the next week or so.
