Optimization Algorithm for Collecting a Set of Items - algorithm

This is analogous to an algorithm I'm in need of...
I want to purchase a complete set of trading cards - one of each card.
Each card can be bought from different vendors at different prices. If I purchase a given card from a certain vendor, they sometimes offer one or more additional cards as a bundle for an additional charge. Bundles can sometimes offer significant discounts, but some vendors sell singles that are really cheap without needing bundles.
If I have a complete price list for every single card and bundle from every vendor, can someone suggest an algorithm that would efficiently calculate (or approximate) the cheapest way to purchase a complete set of cards?

First, my intuition tells me that this is at least an NP-hard problem. I don't think that you can check if a provided shopping cart is actually the cheapest in polynomial time. If you buy one card from a different vendor and choose an entirely different bundle, you may be missing cards. You may now buy those missing cards from a different vendor for a new bundle, causing a whole cascade of changes.
I'll try to formulate this as an application of a known NP-hard problem that has approximation algorithms.
Ok, so you know not only the prices of cards and bundles, but exactly which cards you need to buy to activate the bundle offers.
This sounds like you could adapt a (modified) minimum spanning tree (MST) algorithm here.
Here's how I think you should form your DIRECTED graph:
Make one node for each card. Call this type of node C.
Create an extra node for each card that a vendor sells. Call this type of node CV.
Connect corresponding CV's as the source to C's as the destination. The weight is 0.
For each CV that has a bundle offer, create a node BCV. Connect from CV to BCV with the bundle's price.
From each BCV, connect to all the C's in that bundle with a weight of 0.
Now make a root node connecting to all nodes of type CV, each weight equal to the vendor's price for that card.
From this formulation, you need an MST that only has to reach nodes of type C. This MST would tell you which cards to buy and which bundles to buy. If you connect to a card's CV, that means you've gotten the corresponding C (the CV-to-C weights are 0, so they're free). If you additionally choose to connect to the corresponding BCV, then you're connected to all the C's in that bundle for free.
It turns out that these kinds of trees are called Steiner trees, and finding a minimum one is known to be NP-hard. This paper and this paper that I found through a Google search seem to present approximation algorithms for it. There are more that you can look up. However, it doesn't look like anyone has actually implemented this in some kind of Python package that's readily usable.
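For concreteness, here is a rough sketch (not the answer's original code) of how the directed graph above could be assembled with networkx, using made-up vendor data. Note that networkx only ships a Steiner tree approximation for undirected graphs, so applying it to an undirected view of this graph is a crude heuristic at best.

import networkx as nx

cards = ["c1", "c2", "c3"]
# hypothetical offers: (vendor, card, single price, optional bundle as (extra charge, extra cards))
offers = [
    ("v1", "c1", 2.0, (1.5, ["c2"])),
    ("v1", "c3", 4.0, None),
    ("v2", "c2", 1.0, None),
    ("v2", "c3", 3.0, (2.0, ["c1"])),
]

G = nx.DiGraph()
G.add_node("root")
for vendor, card, price, bundle in offers:
    cv = ("CV", vendor, card)
    G.add_edge("root", cv, weight=price)            # root -> CV: vendor's price for the single card
    G.add_edge(cv, ("C", card), weight=0.0)         # CV -> C: the card itself is then free
    if bundle is not None:
        extra_charge, extra_cards = bundle
        bcv = ("BCV", vendor, card)
        G.add_edge(cv, bcv, weight=extra_charge)    # CV -> BCV: pay the bundle surcharge
        for b in extra_cards:
            G.add_edge(bcv, ("C", b), weight=0.0)   # BCV -> C: bundled cards are free

terminals = ["root"] + [("C", c) for c in cards]
# networkx only approximates *undirected* Steiner trees, so this is just a rough heuristic:
from networkx.algorithms.approximation import steiner_tree
T = steiner_tree(G.to_undirected(), terminals, weight="weight")
print(sorted(T.edges(data="weight"), key=str))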

You can find an approximate solution (and, in certain cases, the full solution) with convex optimization. The trick is that you may not need to constrain the card purchases to be an integer, because the optimum number will often be an integer anyway. Problem setup:
Say there are N cards in the complete set. Assign an index to each card, 0 thru N-1.
For each possible bundle purchase i, create a length-N vector Cards_i that is 1 at the indices of cards it contains, and 0 at all other indices.
The bundle purchase also has a price, Price_i. Your free optimization variable is Choice_i.
So you have 2 constants and one optimization variable per bundle. Then your optimization problem is:
Minimize sum(Price_i * Choice_i)
Subject to:
sum(Cards_i * Choice_i) >= ones(N,1)
Choice_i >= 0
The ones(N,1) is your requirement for the number of each card you want to own.
You should be able to set that up pretty easily with the cvxpy library: http://www.cvxpy.org/ After solving, you'll need to check that the resulting Choice_i values are all near 0 or near 1. If they're not, it means your data has a pathological combination of bundles as described in mcdowella's comment below. In that case you'll need to do an NP-hard search as described in the other answers, or fall back on a hack such as buying those cards individually and accepting that the result isn't guaranteed optimal.
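For illustration, here is a minimal cvxpy sketch of that relaxed problem; the offer matrix and prices below are invented. Swapping nonneg=True for integer=True (with a MIP-capable solver installed) would give the exact, NP-hard version of the same model.

import cvxpy as cp
import numpy as np

N = 4  # cards in the complete set
# one row per purchasable offer (single card or bundle); 1 where the offer contains that card
Cards = np.array([
    [1, 0, 0, 0],   # single: card 0
    [0, 1, 0, 0],   # single: card 1
    [1, 1, 0, 0],   # bundle: cards 0 and 1
    [0, 0, 1, 1],   # bundle: cards 2 and 3
    [0, 0, 0, 1],   # single: card 3
])
Price = np.array([3.0, 4.0, 5.5, 6.0, 2.5])

Choice = cp.Variable(len(Price), nonneg=True)            # relaxed; check it comes out near 0/1
problem = cp.Problem(cp.Minimize(Price @ Choice),
                     [Cards.T @ Choice >= np.ones(N)])   # own at least one of every card
problem.solve()
print(problem.value, np.round(Choice.value, 3))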

Related

Algorithm for picking orders from warehouses

I'll explain my problem with an example.
Let's say we have:
An order from a certain store for five products. We'll name those products A, B, C, D and E, with their quantities in the order: A(19), B(25), C(6), D(33), E(40).
A single truck that can fit a different amount of each product: A(30), B(40), C(25), D(50), E(30).
Ex: transporting A and B together, I loaded the truck with A(19), so that's about two thirds of what my truck can handle. That leaves one third for B, which means I can only transport 1/3 of B's maximum truck capacity, which is 40/3 ≈ 13.
A set of warehouses which contain different amounts of each product.
I made an Excel spreadsheet with more useful info about those warehouses (quantities, distance from each other, distance from the store).
I want to deliver this order to the store with the least number of trips and distance traveled.
Is there an algorithm for this kind of problem, or something close that I can modify?
EDIT: Updated Links.
I would advise not to reinvent the wheel as the very first step of your work. Developing/adapting a custom algorithm for such a problem would be a very painful venture in my opinion. I would suggest using either a constraint satisfaction programming (CSP) toolkit or a mixed integer programming (MIP) solver directly.
My point is that it would be much easier to encode your problem using such tools. If the performance/accuracy isn't enough for you, you could then design a custom solution based on your preliminary results.
For CSP I would suggest MiniZinc, which has decent documentation and examples.
You could start your MIP research with GLPK. It's not very powerful, but it's definitely capable of dealing with some toy examples.
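As a flavor of what such an encoding might look like (PuLP is used here purely as a compact illustration; a MiniZinc or GLPK model would express the same thing), this sketch encodes just the capacity-mixing rule from the example above for a single trip, with the routing and warehouse choice left out:

import pulp

capacity = {"A": 30, "B": 40, "C": 25, "D": 50, "E": 30}   # max units per product if carried alone
remaining = {"A": 19, "B": 25, "C": 6, "D": 33, "E": 40}   # units still owed to the store

prob = pulp.LpProblem("one_trip", pulp.LpMaximize)
load = {p: pulp.LpVariable(f"load_{p}", lowBound=0, upBound=remaining[p], cat="Integer")
        for p in capacity}

prob += pulp.lpSum(load.values())   # objective: deliver as many units as possible this trip
# mixing rule from the example: fractions of truck capacity used by each product sum to at most 1
prob += pulp.lpSum(load[p] * (1.0 / capacity[p]) for p in capacity) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({p: int(load[p].value()) for p in capacity})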

Algorithms for Minimum resource requirements

I have a problem for which I have made some solutions, but I am not happy with their scalability. I'm looking for input on some different approaches/algorithms for solving it.
Problem:
Software can run on electronic controllers (ECUs) and requires different resources to run a given feature. It may require a given amount of storage or RAM, or a digital or analog input or output, for instance. If we have multiple features and multiple controller options, we want to find the combination that minimizes the hardware requirements (cost). I'll reduce the resources to letters to keep the examples simple.
Example 1:
Feature1(A)
ECU1(A,B,C)
First, a trivial example. Let's assume that a feature requires 1 unit of resource A, and the ECU has 1 unit each of resources A, B and C available. It is obvious that the feature will fit in the ECU, with resources B and C left over.
Example 2:
Feature2(A,B)
ECU2(A|B,B,C)
In this example, Feature 2 requires resources A and B, and the ECU has 3 resources, the first of which can be used as A or B. In this case, you can again see that the feature will fit in the ECU, but only if you check in a certain order. If you assign F(A) to E(A|B), then F(B) to E(B), it works; but if you assign F(B) to E(A|B), then there is no resource left on the ECU for F(A), so it doesn't appear to fit. This would lead one to the observation that we should prefer non-OR'd resources first to avoid such a conflict.
A real-world example of the above: an analog input that can also be used as a digital input.
Example 3
Feature3(A,B,C)
ECU3(A|B|C, B|C, A|C)
Now things are a little bit more complicated, but it is still quite obvious to a person that the feature will fit into the ECU.
My problems are simply scaled-up versions of these examples (i.e. multiple features per ECU, with more ECUs to choose from).
Algorithms
GA
My first approach to this was to use a genetic algorithm. For a given set of features, i.e. F(A,B,C,D), and a list of currently available ECUs, find which single ECU or combination of ECUs fits the requirements.
ECUs would initially be randomly selected, and features checked that they fitted and then added to them. If a feature didn't fit, another ECU was added to the architecture. A population of these architectures was created and ranked based on the lowest cost of housing all the features. Architectures could then be mated in successive generations, with mutations and such, to improve fitness.
This approach worked quite well, but tended to get stuck in local minima (not the cheapest option), based on a golden example I had worked out by hand.
Combinatorial / Permutations
My next approach was to work out all of the possible permutations (expanding the ORs from above) for an ECU to see if the features fit.
If we go back to example 2 and expand the ORs, we get 2 permutations:
Feature2(A,B)
ECU2(A|B,B,C) = (A,B,C), (B,B,C)
From here it is trivial to check that the feature fits in the first permutation, but not the second.
...and for example 3 there are 12 permutations
Feature3(A,B,C)
ECU3(A|B|C, B|C, A|C) = (A,B,A), (B,B,A), (C,B,A), (A,C,A), (B,C,A), (C,C,A), (A,B,C), (B,B,C), (C,B,C), (A,C,C), (B,C,C), (C,C,C)
Again it is trivial to check that feature 3 fits in at least one of the permutations (3rd, 5th & 7th).
Based on this approach I was able to get a solution as well, but I have ECUs with so many OR'd inputs that there are millions of ECU permutations, which drastically increased the run time (minutes). I can live with this, but first wanted to see if there was a better way to skin the cat, apart from parallelizing this approach.
So that is the problem...
I have more ideas on how to approach it, but I assume there is a fancy name for such a problem, or an algorithm that has been around for 20+ years that I'm not familiar with, and I was hoping someone could point me toward some papers or the names of relevant algorithms.
The obvious remark of simply summing the feature resource requirements and creating a new monolithic ECU is not an option. Lastly, no, this is not in any way associated with any assignment or problem given by a school or university.
Sorry for the long question, but hopefully I've sufficiently described what I am trying to do and this piques the interest of someone out there.
Sincerely, Paul.
Looks like plugging an individual feature into an ECU can be solved as bipartite matching.
You make a bipartite graph:
left side corresponds to feature requirements
right side corresponds to ECU subnodes
edges connect left- and right-side vertices that share a letter
Let me explain by example 2:
Feature2(A,B)
ECU2(A|B,B,C)
How graph looks:
2 left vertices: L1 (A), L2 (B)
3 right vertices: R1 (A|B), R2 (B), R3 (C)
3 edges: L1-R1 (A-A|B), L2-R1 (B-A|B), L2-R2 (B-B)
Then you find a maximum matching for this bipartite graph. There are a few well-known algorithms for it:
https://en.wikipedia.org/wiki/Matching_(graph_theory)
If the maximum matching covers every feature vertex, we can use it to plug in the feature.
If the maximum matching does not cover every feature vertex, we are short of resources.
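A quick sketch of that check for Example 2, using networkx's Hopcroft-Karp implementation (the node names just mirror the description above; this only tests a single feature against a single ECU):

import networkx as nx

feature_needs = {"L1": {"A"}, "L2": {"B"}}                  # Feature2(A, B)
ecu_slots = {"R1": {"A", "B"}, "R2": {"B"}, "R3": {"C"}}    # ECU2(A|B, B, C)

G = nx.Graph()
G.add_nodes_from(feature_needs, bipartite=0)
G.add_nodes_from(ecu_slots, bipartite=1)
for f, needed in feature_needs.items():
    for slot, offered in ecu_slots.items():
        if needed & offered:                 # the slot can supply the required resource
            G.add_edge(f, slot)

matching = nx.bipartite.hopcroft_karp_matching(G, top_nodes=feature_needs)
fits = all(f in matching for f in feature_needs)
print(matching, fits)                        # e.g. L1 -> R1 and L2 -> R2, so the feature fits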
Unfortunately, this approach works like a greedy algorithm: it does not know about upcoming features and does not tweak the solution to fit more features later. Partial optimization for simple cases can work as you described in the question, but in general it's a dead end; only an algorithm that accounts for the whole feature set at once can produce an overall effective solution.
You can try to add several features to one ECU simultaneously. If you want to add a new feature to a given ECU, you can try matching all already-assigned features plus the candidate feature. In that case a locally optimal solution will be found for the given feature set (if it's possible to plug them all into one ECU).
I don't have enough reputation to comment, so here's what I wanted to propose for your problem:
Like GA, there are some other random-based approaches too, e.g. the Bayesian approach, decision trees, etc.
In my opinion a decision tree would suit your problem, as it shows, for a given input dataset/attributes, a path to each class (in your case ECUs), which helps select the right class/ECU. Train your system with some sample data sets so that it can decide the right ECU for your actual data set/features.
Check out Decision Trees - Machine Learning for more information. Hope it helps!

Algorithm for optimal packing with known inventory

Hospitals are changing the way they sterilize their equipment. Previously the local surgeons kept all their own equipment and made their own surgery trays. Now they have to conform to a country-wide standard. They want to know how many of the new trays they can make from their existing stock, and how much new equipment they need to buy.
The inventory of medical equipment looks like this:
http://pastebin.com/rstWSurU
Each hospital has codes for various medical equipment and then a number for how many they have of the corresponding item.
3 surgery trays with their corresponding items are shown in this dictionary:
http://pastebin.com/bUAZhanK
There are a total of 144 different operation trays.
The hospitals will be told they need 25 of tray x, 30 of tray y, etc.
They would like to maximize the number of trays they can finish with their current stock. They would also like to know what equipment they need to purchase in order to finish the remaining trays.
I have thought about two possible solutions. One is representing the problem as a linear programming problem. The other is solving the first 90% of the problem with a round-robin brute force, then solving the remaining 10% with a randomized algorithm run several times and picking the best of those tries.
I would love to hear if anyone knows a smart way of how to tackle this problem!
If I understand this correctly, we can optimize for each hospital separately. My guess is that the following would be a good start for an MIP (Mixed Integer Programming) model:
I use the following indices: i for items and t for trays. x(t,i) indicates how many units of item i we assign to trays of type t. y(t) counts the number of trays of each type that we can compose using the available items. From the solution we can calculate the shortages that we need to order.
Of course we are just maximizing the number of trays we can make. There is no consideration of balancing (many trays of one type and few or zero of another). I mitigate this a little bit by not allowing more trays to be created than required (if we have more items, they need to go to other types of trays). This requirement is formulated as an upper bound on y(t).
For large problems we can restrict the (t,i) combinations to the ones that can actually occur (items that appear in a given tray type). This will make the model smaller.
A further optimization would be to substitute out the variables x(t,i).
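A rough sketch of such a model with PuLP and invented data (tray contents, stock, and demand are made up, and x(t,i) is already substituted out as suggested, so treat this as an approximation of the answer's formulation rather than the formulation itself):

import pulp

need = {("trayX", "item1"): 2, ("trayX", "item2"): 1,   # items required per tray of each type
        ("trayY", "item2"): 3}
avail = {"item1": 10, "item2": 12}                      # current stock at this hospital
demand = {"trayX": 4, "trayY": 5}                       # trays requested

prob = pulp.LpProblem("max_trays", pulp.LpMaximize)
y = {t: pulp.LpVariable(f"y_{t}", lowBound=0, upBound=demand[t], cat="Integer") for t in demand}

prob += pulp.lpSum(y.values())                          # maximize the number of trays we can compose
for i in avail:                                         # item usage is need[t,i] * y[t], bounded by stock
    prob += pulp.lpSum(need.get((t, i), 0) * y[t] for t in demand) <= avail[i]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({t: int(y[t].value()) for t in demand})
# shortages to order: items needed to reach full demand minus current stock, clipped at zero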
Allowing surplus items to be shipped to other hospitals would make the model more difficult. In that case we could end up with a model that needs to look at all hospitals simultaneously. It may be an interesting case for some decomposition approach.

How to find Best Price for a Deck of Collectible Cards?

Or The Traveling Salesman plays Magic!
I think this is a rather interesting algorithmic challenge. Curious if anyone has any good suggestions for solving it, or if it is already solvable in a known way.
TCGPlayer.com sells collectible cards for a variety of games, including Magic the Gathering. Instead of just selling cards from their inventory they are actually a re-seller from multiple vendors (50+). Each vendor has a different inventory of cards and a different price per card. Each vendor also charges a flat rate for shipping (usually). Given all of that, how would one find the best price for a deck of cards (say 40 - 100 cards)?
Just finding the best price for each card doesn't work because if you order 10 cards from 10 different vendors then you pay shipping 10 times, but if you order all 10 from one vendor you only pay shipping once.
The other night I wrote a simple HTML Scraper (using HTML Agility Pack) that grabs all the different prices for each card, and then finds all the vendors that carry all the cards in the deck, totals the price of the cards from each vendor and sorts by price. That was really easy. The total prices ended up being near the total median price for all the cards.
I did notice that some of the individual cards ended up being much higher than the median price. That raises the question of splitting an order over multiple vendors, but only if enough savings could be made by splitting the order up to cover the additional shipping (each added vendor adds another shipping charge).
Logically it seems that the best price will probably only involve a few different vendors, but if the cards are expensive enough (and some are) then in theory ordering each card from a different vendor could still result in enough savings to justify all the extra shipping.
If you were going to tackle this how would you do it? Pure brute force figuring every possible combination of card / vendor combinations? A process that is more likely to be done in my lifetime would seem to involve a methodical series of estimates over a fixed number of iterations. I have a couple ideas, but am curious what others might suggest.
I am looking more for the algorithm than actual code. I am currently using .NET though, if that makes any difference.
I would just be greedy.
Assume that you are going to eat the shipping costs and buy from all vendors. Work out the absolute lowest card total you can get. Then, for each vendor, work out how much being able to buy some cards from them versus someone else saves you. Order the vendors by (shipping cost - incremental savings).
Starting with the vendors who provide the least value, axe that vendor, redistribute their cards to the other vendors, and recalculate incremental savings. Wash, rinse, and repeat until your most marginal vendor is saving you money.
This should find a good solution but is not guaranteed to find the best solution. Finding the absolute best solution, though, seems likely to be NP-hard.
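A rough sketch of that greedy in Python, with invented prices (it simply recomputes the best assignment after each elimination rather than tracking incremental savings explicitly, which is the same idea in a simpler form):

shipping = {"v1": 3.0, "v2": 2.5, "v3": 4.0}
price = {"v1": {"cardA": 1.0, "cardB": 4.0},
         "v2": {"cardA": 1.5, "cardB": 2.0},
         "v3": {"cardA": 0.8, "cardB": 2.2}}
deck = ["cardA", "cardB"]

def order_cost(vendors):
    """Shipping for every vendor used plus the cheapest available price for each card."""
    total = sum(shipping[v] for v in vendors)
    for card in deck:
        offers = [price[v][card] for v in vendors if card in price[v]]
        if not offers:
            return float("inf")      # this vendor set can't cover the deck
        total += min(offers)
    return total

vendors = set(shipping)
while len(vendors) > 1:
    # axe the vendor whose removal lowers (or least raises) the total; stop when nobody is worth axing
    candidate = min(vendors, key=lambda v: order_cost(vendors - {v}))
    if order_cost(vendors - {candidate}) >= order_cost(vendors):
        break
    vendors.remove(candidate)

print(vendors, order_cost(vendors))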
This is isomorphic to the uncapacitated facility location problem.
card in the deck : client
vendor : possible facility location
vendor shipping rate : cost of opening a facility at a location
cost of a card with a particular vendor : "distance" from a client to a facility
Facility location is a well-studied problem in the combinatorial optimization literature.
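Under that mapping, the exact problem can be written as a small MIP. Here is a hedged sketch with PuLP and toy prices (open_v = pay that vendor's shipping, buy[(card, vendor)] = buy the card there); it is an illustration of the formulation, not a production solver:

import pulp

shipping = {"v1": 3.0, "v2": 2.5, "v3": 4.0}                       # "cost of opening a facility"
price = {("cardA", "v1"): 1.0, ("cardA", "v2"): 1.5, ("cardA", "v3"): 0.8,
         ("cardB", "v1"): 4.0, ("cardB", "v2"): 2.0}               # "distance" from client to facility
deck = ["cardA", "cardB"]

prob = pulp.LpProblem("deck_order", pulp.LpMinimize)
open_v = {v: pulp.LpVariable(f"open_{v}", cat="Binary") for v in shipping}
buy = {cv: pulp.LpVariable(f"buy_{cv[0]}_{cv[1]}", cat="Binary") for cv in price}

prob += (pulp.lpSum(shipping[v] * open_v[v] for v in shipping)
         + pulp.lpSum(price[cv] * buy[cv] for cv in price))
for c in deck:                                                      # every card is bought exactly once
    prob += pulp.lpSum(buy[(c, v)] for v in shipping if (c, v) in price) == 1
for (c, v) in price:                                                # can only buy from an "opened" vendor
    prob += buy[(c, v)] <= open_v[v]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([v for v in shipping if open_v[v].value() > 0.5],
      [cv for cv in price if buy[cv].value() > 0.5])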
Interesting question! :)
So if we have n cards and m vendors, the brute force approach might have to check up to m^n combinations, right? (A bit less, since not every vendor has every card, but I guess that doesn't really matter in the grand scheme of things ;)
Let's for a second assume each vendor has each card and then see later-on how things change if they don't.
1. Find the cheapest one-vendor solution.
2. Order the cards by price, and find the most expensive card that's cheaper at another vendor.
3. For all cards from vendor 1, move them to vendor 2 if they're cheaper there.
4. If having added vendor 2 doesn't make the order cheaper, undo and terminate; otherwise repeat from step 2.
So if one vendor doesn't have all cards, you have to start with a multi-vendor situation. For each vendor, you might start by buying all cards that exist there, then apply the algorithm to the remaining cards.
Obviously, you may not be able to exploit all subtleties in the pricing with this method. But if we assume that a large portion of the price differences is made up by individual high-price cards, I think you can find a reasonable solution this way.
OK, after writing all this I realized the m^n assumption is actually wrong.
Once you have chosen a set of vendors to buy from, you can simply choose the cheapest vendor for each card. This is a great advantage because the individual choices of where to buy each card don't interfere with each other.
What does this mean for our problem? At first glance, it means that the selection of vendors is the hard part (in terms of computational complexity), not the individual allocation of your buying choices. So instead of m^n, you get 2^m possible configurations in the worst case. What we need is a heuristic for choosing vendors rather than choosing individual cards, which might make the heuristic above even more justifiable.
I myself have pondered this. Consider the following:
If it takes you a week to figure out, code, and debug an algorithm that only provides a 1% discount, would you do it?
The answer is probably "No" (unless you're spending your entire life savings on cards, in which case you may be crazy). =) ... or you're Amazon.com.
Consequently, there is already an easy approximation algorithm:
Wait until you're buying lots of cards (to reduce the shipping overhead).
Buy the cards from 3 vendors:
- the two with the cheapest-but-most-diverse inventories
- a third which isn't really cheap but definitely has every card you'd want.
Optimize accordingly (for each card, buy from the cheapest of the three).
Also consider local vendors you could just walk to, pre-constructed decks, and trading.
Based on firsthand and secondhand experience, I can say you will find that you can get near the median price, with perhaps a few dollars more shipping than you would otherwise pay, while still getting around the median on each card. You may have to pay a tiny bit more for understocked cards, but those will be few and far between, and the shipping savings will make up for it.
I recall the old programming adage: "Never optimize, until it's absolutely necessary; chances are you won't need to, or would have optimized the wrong thing." (e.g. your time is a resource too, and also has monetary value)
edit: Given that, this is an amazingly cool problem and one should solve it if one has time.
My algorithm goes like this:
For each card, calculate the average available price, i.e. the sum of the prices from each vendor divided by the number of vendors.
Now, for that card, select the vendors that offer it at or below the average price.
Now for each card we have a list of vendors. Take the intersection; this way we end up with a set of vendors providing the maximum number of cards at or below the average price.
I'm still thinking over the next steps, but I'm putting the rough idea down here.
Now we are left with the cards where a vendor would be providing only that single card. For such cards, look into the price lists of the already-shortlisted vendors carrying the most cards, and if the price difference is less than the shipping cost, add the card to that vendor's order.
I know this will require a lot of optimization, but this is what I have roughly figured out. Hope this helps.
How about this:
Calculate the average price per ordered card across all vendors.
For each vendor that has at least one of the cards, calculate the total savings for all cards in the order as the difference between each card's price at that vendor and the average price.
Start with the vendor with the highest total savings and select all of those cards from that vendor.
Continue to select vendors with the next highest total savings until you have all of the cards in the order selected. Skip vendors that don't have cards that you still need.
From the selected list of vendors, redistribute the card purchases to the vendors with the best price for that card.
From the remaining list of vendors, and if the list is small enough, you could then brute force any vendors with a low card count to see if you could move the cards to other vendors to eliminate the shipping cost.
I actually wrote this exact thing last year. The first thing I do after loading all the prices is I weed out my card pool:
Each vendor can have multiple versions of each card, as there are reprints. Find the cheapest one.
Eliminate any card whose price at a vendor is greater than the cheapest card-plus-shipping combo elsewhere. That is, if I can buy the card cheaper as a one-off order from another vendor than by adding it to an existing order from your store, I will buy it from the other vendor.
Eliminate any vendor whose offering I can buy cheaper (for every card) from another vendor. Basically, if another vendor out-prices you on every card, and on the total + shipping, then you are gone.
Unfortunately, this still leaves a huge pool.
Then I do some sorting and some brute-force-depth-first summing and some pruning and eventually end up with a result.
Anyway, I tuned it up to the point that I can do 70 cards and, within a minute, get within 5% of the optimal goal. And in an hour, less than 2%. And then, a couple of days later, the actual, final result.
I am going to read more about facility planning. Thanks for that tip!
What about using a genetic algorithm? I think I'll try that one myself. You might seed the pool by adding both a chromosome with the lowest prices and another with the lowest shipping costs.
BTW, did you finally implement any of the solutions presented here? which one? why?
Cheers!

Clustering Algorithm for Paper Boys

I need help selecting or creating a clustering algorithm according to certain criteria.
Imagine you are managing newspaper delivery persons.
You have a set of street addresses, each of which is geocoded.
You want to cluster the addresses so that each cluster is assigned to a delivery person.
The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
Each cluster should have about the same number of addresses. However, a cluster may have fewer addresses if its addresses are more spread out. (Worded another way: the minimum number of clusters where each cluster contains at most a maximum number of addresses, and any two addresses within a cluster are separated by at most a maximum distance.)
For bonus points, when the data set is altered (an address added or removed) and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (i.e. this rules out simple k-means clustering, which is random in nature). Otherwise the delivery persons will go crazy.
So... ideas?
UPDATE
The street network graph, as described in Arachnid's answer, is not available.
I've written an inefficient but simple algorithm in Java to see how close I could get to doing some basic clustering on a set of points, more or less as described in the question.
The algorithm works on a list of (x,y) coords ps that are specified as ints. It takes three other parameters as well:
radius (r): given a point, what is the radius for scanning for nearby points
max addresses (maxA): what are the maximum number of addresses (points) per cluster?
min addresses (minA): minimum addresses per cluster
Set limitA=maxA.
Main iteration:
Initialize empty list possibleSolutions.
Outer iteration: for every point p in ps.
Initialize empty list pclusters.
A worklist of points wps=copy(ps) is defined.
Workpoint wp=p.
Inner iteration: while wps is not empty.
Remove the point wp from wps. Determine all the points wpsInRadius in wps that are at a distance < r from wp. Sort wpsInRadius in ascending order of distance from wp. Keep the first min(limitA, sizeOf(wpsInRadius)) points in wpsInRadius. These points form a new cluster (list of points) pcluster. Add pcluster to pclusters. Remove the points in pcluster from wps. If wps is not empty, set wp=wps[0] and continue the inner iteration.
End inner iteration.
A list of clusters pclusters is obtained. Add this to possibleSolutions.
End outer iteration.
We have for each p in ps a list of clusters pclusters in possibleSolutions. Every pclusters is then weighted. If avgPC is the average number of points per cluster in possibleSolutions (global) and avgCSize is the average number of clusters per pclusters (global), then this is the function that uses both these variables to determine the weight:
private static WeightedPClusters weigh(List<Cluster> pclusters, double avgPC, double avgCSize)
{
    double weight = 0;
    for (Cluster cluster : pclusters)
    {
        int ps = cluster.getPoints().size();
        // penalize clusters whose size deviates from the average number of points per cluster
        double psAvgPC = ps - avgPC;
        weight += psAvgPC * psAvgPC / avgCSize;
        // penalize spread-out clusters: surface area per point
        weight += cluster.getSurface() / ps;
    }
    return new WeightedPClusters(pclusters, weight);
}
The best solution is now the pclusters with the least weight. We repeat the main iteration as long as we can find a better solution (less weight) than the previous best one with limitA=max(minA,(int)avgPC). End main iteration.
Note that for the same input data this algorithm will always produce the same results. Lists are used to preserve order and there is no randomness involved.
To see how this algorithm behaves, this is an image of the result on a test pattern of 32 points. If maxA=minA=16, then we find 2 clusters of 16 addresses.
(source: paperboyalgorithm at sites.google.com)
Next, if we decrease the minimum number of addresses per cluster by setting minA=12, we find 3 clusters of 12/12/8 points.
(source: paperboyalgorithm at sites.google.com)
And to demonstrate that the algorithm is far from perfect, here is the output with maxA=7, yet we get 6 clusters, some of them small. So you still have to guess too much when determining the parameters. Note that r here is only 5.
(source: paperboyalgorithm at sites.google.com)
Just out of curiosity, I tried the algorithm on a larger set of randomly chosen points. I added the images below.
Conclusion? This took me half a day, it is inefficient, the code looks ugly, and it is relatively slow. But it shows that it is possible to produce some result in a short period of time. Of course, this was just for fun; turning this into something that is actually useful is the hard part.
(source: paperboyalgorithm at sites.google.com)
(source: paperboyalgorithm at sites.google.com)
What you are describing is a (Multi)-Vehicle-Routing-Problem (VRP). There's quite a lot of academic literature on different variants of this problem, using a large variety of techniques (heuristics, off-the-shelf solvers etc.). Usually the authors try to find good or optimal solutions for a concrete instance, which then also implies a clustering of the sites (all sites on the route of one vehicle).
However, the clusters may be subject to major changes with only slightly different instances, which is what you want to avoid. Still, something in the VRP-Papers may inspire you...
If you decide to stick with the explicit clustering step, don't forget to include your distribution in all clusters, as it is part of each route.
For evaluating the clusters, using a graph representation of the street grid will probably yield more realistic results than connecting the dots on a white map (although both are TSP variants). If a graph model is not available, you can use the taxicab metric (|x_1 - x_2| + |y_1 - y_2|) as an approximation for the distances.
I think you want a hierarchical agglomerative technique rather than k-means. If you get your algorithm right, you can stop it when you have the right number of clusters. As someone else mentioned, you can seed subsequent clusterings with previous solutions, which may give you a significant performance improvement.
You may want to look closely at the distance function you use, especially if your problem has high dimension. Euclidean distance is the easiest to understand but may not be the best, look at alternatives such as Mahalanobis.
I'm presuming that your real problem has nothing to do with delivering newspapers...
Have you thought about using an economic/market-based solution? Divide the set with an arbitrary (but consistent, to avoid randomness effects) split into even subsets, as determined by the number of delivery persons.
Assign a cost function to each point by how much it adds to the graph, and give each extra point an economic value.
Iterate allowing each person in turn to auction their worst point, and give each person a maximum budget.
This probably matches fairly well how the delivery people would think in real life, as people will find swaps, or will say "my life would be so much easier if I didn't do this one or two." It is also pretty flexible (for example, it would allow one point miles away from any others to be given a premium fairly easily).
I would approach it differently: Considering the street network as a graph, with an edge for each side of each street, find a partitioning of the graph into n segments, each no more than a given length, such that each paperboy can ride a single continuous path from the start to the end of their route. This way, you avoid giving people routes that require them to ride the same segments repeatedly (eg, when asked to cover both sides of a street without covering all the surrounding streets).
This is a very quick and dirty method of discovering where your "clusters" lie. This was inspired by the game "Minesweeper."
Divide your entire delivery space up into a grid of squares. Note - it will take some tweaking of the size of the grid before this will work nicely. My intuition tells me that a square size roughly the size of a physical neighbourhood block will be a good starting point.
Loop through each square and store the number of delivery locations (houses) within each block. Use a second loop (or some clever method on the first pass) to store the number of delivery points for each neighbouring block.
Now you can operate on this grid in a similar way to photo manipulation software. You can detect the edges of clusters by finding blocks where some neighbouring blocks have no delivery points in them.
Finally you need a system that combines number of deliveries made as well as total distance travelled to create and assign routes. There may be some isolated clusters with just a few deliveries to be made, and one or two super clusters with many homes very close to each other, requiring multiple delivery people in the same cluster. Every home must be visited, so that is your first constraint.
Derive a maximum allowable distance to be travelled by any one delivery person on a single run. Next do the same for the number of deliveries made per person.
The first ever run of the routing algorithm would assign a single delivery person, send them to any random cluster with not all deliveries completed, let them deliver until they hit their delivery limit or they have delivered to all the homes in the cluster. If they have hit the delivery limit, end the route by sending them back to home base. If they could safely travel to the nearest cluster and then home without hitting their max travel distance, do so and repeat as above.
Once the route is finished for the current delivery person, check if there are homes that have not yet had a delivery. If so, assign another delivery person, and repeat the above algorithm.
This will generate initial routes. I would store all the info - the location and dimensions of each square, the number of homes within a square and all of its direct neighbours, the cluster to which each square belongs, the delivery people and their routes - in a database.
I'll leave the recalc procedure up to you - but having all the current routes, clusters, etc in a database will enable you to keep all historic routes, and also try various scenarios to see how to best to adapt to changes creating the least possible changes to existing routes.
This is a classic example of a problem that deserves an optimized solution rather than trying to solve for "The OPTIMUM". It's similar in some ways to the "Travelling Salesman Problem", but you also need to segment the locations during the optimization.
I've used three different optimization algorithms to good effect on problems like this:
Simulated Annealing
Great Deluge Algorithm
Genetic Algorithms
Using an optimization algorithm, I think you've described the following "goals":
1. The geographic area for each paperboy should be minimized.
2. The number of subscribers served by each should be approximately equal.
3. The distance travelled by each should be about equal.
4. (And one you didn't state, but might matter) The route should end where it began.
Hope this gets you started!
* Edit *
If you don't care about the routes themselves, that eliminates goals 3 and 4 above, and perhaps allows the problem to be more tailored to your bonus requirements.
If you take demographic information into account (such as population density, subscription adoption rate and subscription cancellation rate) you could probably use the optimization techniques above to eliminate the need to rerun the algorithm at all as subscribers adopted or dropped your service. Once the clusters were optimized, they would stay in balance because the rates of each for an individual cluster matched the rates for the other clusters.
The only time you'd have to rerun the algorithm would be when an external factor (such as a recession/depression) caused changes in the behavior of a demographic group.
Rather than a clustering model, I think you really want some variant of the Set Covering location model, with an additional constraint on the number of addresses covered by each facility. I can't really find a good explanation of it online. You can take a look at this page, but they're solving it using areal units and you probably want to solve it in either Euclidean or network space. If you're willing to dig up something in dead-tree format, check out chapter 4 of Network and Discrete Location by Daskin.
Good survey of simple clustering algos. There is more though:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html
Perhaps a minimum spanning tree of the customers, broken into sets based on locality to each paperboy. Prim's or Kruskal's to get the MST, with the distance between houses as the weight.
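A tiny sketch of that idea with networkx (hypothetical coordinates; splitting the resulting tree into per-paperboy sets is not shown):

import math
import networkx as nx

houses = {"h1": (0, 0), "h2": (1, 2), "h3": (4, 1), "h4": (5, 5)}

G = nx.Graph()
for a in houses:
    for b in houses:
        if a < b:
            (x1, y1), (x2, y2) = houses[a], houses[b]
            G.add_edge(a, b, weight=math.hypot(x1 - x2, y1 - y2))   # Euclidean distance between houses

mst = nx.minimum_spanning_tree(G)   # Kruskal by default; pass algorithm="prim" for Prim's
print(sorted(mst.edges(data="weight")))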
I know of a pretty novel approach to this problem that I have seen applied to bioinformatics, though it is valid for any sort of clustering problem. It's certainly not the simplest solution, but one that I think is very interesting. The basic premise is that clustering involves multiple objectives. For one, you want to minimise the number of clusters, the trivial solution being a single cluster with all the data. The second standard objective is to minimise the amount of variance within a cluster, the trivial solution being many clusters each with only a single data point. The interesting solutions come about when you try to include both of these objectives and optimise the trade-off.
At the core of the proposed approach is something called a memetic algorithm that is a little like a genetic algorithm, which steve mentioned, however it not only explores the solution space well but also has the ability to focus in on interesting regions, i.e. solutions. At the very least I recommend reading some of the papers on this subject as memetic algorithms are an unusual approach, though a word of warning; it may lead you to read The Selfish Gene and I still haven't decided whether that was a good thing... If algorithms don't interest you then maybe you can just try and express your problem as the format requires and use the source code provided. Related papers and code can be found here: Multi Objective Clustering
This is not directly related to the problem, but something I've heard and which should be considered if this is truly a route-planning problem you have. This would affect the ordering of the addresses within the set assigned to each driver.
UPS has software which generates optimum routes for their delivery people to follow. The software tries to maximize the number of right turns that are taken during the route. This saves them a lot of time on deliveries.
For people that don't live in the USA the reason for doing this may not be immediately obvious. In the US people drive on the right side of the road, so when making a right turn you don't have to wait for oncoming traffic if the light is green. Also, in the US, when turning right at a red light you (usually) don't have to wait for green before you can go. If you're always turning right then you never have to wait for lights.
There's an article about it here:
http://abcnews.go.com/wnt/story?id=3005890
You can have k-means or expectation maximization remain as unchanged as possible by using the previous clusters as a clustering feature. Getting each cluster to have the same number of items seems a bit trickier. I can think of how to do it as a post-clustering step by doing k-means and then shuffling some points until things balance, but that doesn't seem very efficient.
A trivial answer which does not get any bonus points:
One delivery person for each address.
As has been mentioned, a Vehicle Routing Problem is probably better suited... Although it strictly isn't designed with clustering in mind, it will optimize assignments based on the nearest addresses. Therefore your clusters will actually be the recommended routes.
If you provide a maximum number of deliverers and try to reach the optimal solution, this should tell you the minimum number that you require. This deals with point 2.
Roughly the same number of addresses per route can be obtained by providing a limit on the number of addresses to be visited, basically assigning a stock value (now it's a capacitated vehicle routing problem).
Adding time windows or the hours that the delivery persons work helps reduce the load if addresses are more spread out (now a capacitated vehicle routing problem with time windows).
If you use a nearest neighbour algorithm, then you can get identical results each time; removing a single address shouldn't have too much impact on your final result, so that should deal with the last point.
I'm actually working on a C# class library to achieve something like this, and think it's probably the best route to go down, although not necessarily easy to implement.
I acknowledge that this will not necessarily provide clusters of roughly equal size:
One of the best current techniques in data clustering is Evidence Accumulation. (Fred and Jain, 2005)
What you do is:
Given a data set with n patterns:
1. Use an algorithm like k-means over a range of k, or use a set of different algorithms; the goal is to produce an ensemble of partitions.
2. Create a co-association matrix C of size n x n.
3. For each partition p in the ensemble:
3.1 Update the co-association matrix: for each pattern pair (i, j) that belongs to the same cluster in p, set C(i, j) = C(i, j) + 1/N, where N is the number of partitions in the ensemble.
4. Use a clustering algorithm such as Single Link and apply the matrix C as the proximity measure. Single Link gives a dendrogram as a result, in which we choose the clustering with the longest lifetime.
I'll provide descriptions of SL and k-means if you're interested.
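For anyone who wants to try it, here is a rough sketch of the recipe above with scikit-learn and scipy; the range of k, the number of runs, and the dendrogram cut are arbitrary choices for illustration, not values from the paper.

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

X = np.random.rand(100, 2)                 # the geocoded addresses as (x, y) points
n = len(X)
runs = [(k, seed) for k in range(5, 15) for seed in range(3)]

C = np.zeros((n, n))                       # co-association matrix
for k, seed in runs:                       # the ensemble of partitions
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
    C += (labels[:, None] == labels[None, :]) / len(runs)

D = 1.0 - C                                # turn co-association (similarity) into a distance
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D, checks=False), method="single")   # single link on the co-associations
clusters = fcluster(Z, t=0.5, criterion="distance")         # cut chosen by hand, not by "lifetime"
print(np.bincount(clusters))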
I would use a basic algorithm to create a first set of paperboy routes according to where they live, and current locations of subscribers, then:
when paperboys are:
Added: they take locations from one or more paperboys working in the same general area as where the new guy lives.
Removed: His locations are given to the other paperboys, using the closest locations to their routes.
when locations are:
Added : Same thing, the location is added to the closest route.
Removed: just removed from that boy's route.
Once a quarter, you could re-calculate the whole thing and change all the routes.
