Storage assignment algorithm

Storage assignment algorithm - algorithm

I'm trying to code a storage assignment algorithm but I'm not sure what the best way would be to model the warehouse the algorithm runs on.
The warehouse consists of shelves that can fit one item per storage location, walking ways and measuring points. Items can only be retrieved from the front of the storage locations denoted with broken lines. (The image below is just a basic representation of the warehouse. In the final version it is supposed to be tested with various amounts of storage locations and SKUs.
The idea is to measure the distance from a measuring point to a storage location for each SKU and minimize the overall distances.
The algorithm itself follows a two step approach:
First a simple greedy is used to find a feasible starting solution.
Secondly, the main algorithm is an adapeted version of binary search that runs multiple
binary search iterations through the set of potentially optimal maximum distance combinations obtained from the greedy before and assigns the SKUs to the storage location that minimizes the objective value.
My basic idea was to model the storage locations as a graph from each measuring point to the storage locations with arcs representing the distance but I'm not 100% sure if this would make sense.
So what are your ideas?
Disclaimer: The main idea is based on the paper 'Scattered Storage: How to Distribute Stock Keeping Units All Around a Mixed-Shelves Warehouse' published by Boysen & Weidinger in 2018.

Interesting problem. However, I think you might be looking at the problem from the wrong angle. Rather than searching for where to put it, find the "cost function" from all measuring points to all storage locations. Then have each storage location store it's cost. Then you throw all the (available) storage locations into a priority queue.
If you need a storage location pull the next one from the priority queue. If a location frees up, add it the queue.
Make a grid graph that represents the path that can be traversed. IE: if (0,0) is the top left corner. Then there is no direct connection between (0,1) and (1,1) as that storage isn't accessible from the left side. However, there is connection between (1,0) and (1,1). Once you have this, you can run all shortest paths to find all the distances. You'll likely need to be able to mark squares as either a) Measuring location, b) walkway or c) Storage location.
The cost function and related bits is the real tricky thing to get right in terms of "real world practicality". Here are some things to consider:
In simple terms you are just looking for the distances to all measuring stations from each storage location and the cost is probably the average. In more complex terms, you may need to consider throughput. By this I mean it may not make sense to take something out a storage location, put something back, take it out again, put it back again and so on. That may cause bottlenecks as now everyone is trying to store stuff in the same general area and you might have some traffic issues with too many people in the same area. In this case, you may need to add some "randomness" to the measurements. For example the middle 2 have the same weight, but are on opposite ends of the warehouse (in theory). It would be best if some randomness was used to ensure there is a 50:50 chance of either one being the next place an item goes. Though that alone isn't likely to be enough if this is a real issue.
You may not actually want to minimize distance to a all measuring locations practically speaking. There are likely cases where certain SKU's are more relevant distance wise to certain measuring locations. In which case, you may want to bias the priority value in that direction. IE: An SKU that is almost always going to be moved to M1 should more likely be placed in a storage location closer to that. Of course that requires something more complex than a priority queue to get working right as you need to be able to search the available storage locations for the one closest to M1.
You may need to consider order items can be stored. IE: If a storage location is 2 deep (all the ones you have are 1 deep), you may want to fill the location further back first. Though I suspect this probably isn't the issue.
Vertical storage locations. Once you have a 2D grid working, a 3D grid isn't significantly more complex to implement, useful if storage locations are actually multiple level allowing for items to be placed on different shelves or something (or just stacked). The issue here is just like #3. Do you fill things storage locations up top to bottom? Bottom to top? Or random order? Of course, quite possible your needs are such that vertical storage isn't possible or simply impractical (too tall, too fragile, unstackable, no shelving, etc).
#2 This can be further enhanced by keeping note of which items are being fetched/placed. The system can also track how long it typically takes to fetch/place items and direct placement of SKUs to other areas if others are expected to be in that area around the time the item is being fetched/placed in it's storage location.

Related

Finding POIs that are near or contain a certain location

I have an application that does the following:
Receives a device's location
Fetches a route (collection of POIs, or Points of Interest) assigned to that device
Determines if the device is near any of the POIs in the route
The route's POIs can be either a point with a radius, in which case it should detect if the device is within the radius of the point; or a polygon, where it should detect if the device is inside of it.
Here is a sample of a route with 3 POIs, two of them are points with different radii, and the other one is a polygon:
https://jsonblob.com/285c86cd-61d5-11e7-ae4c-fd99f61d20b8
My current algorithm is programmed in PHP with a MySQL database. When a device sends a new location, the script loads all the POIs for its route from the database into memory, and then iterates through them. For POIs that are points, it uses the Haversine formula to find if the device is within the POI's radius, and for POIs that are polygons it uses a "point in polygon" algorithm to find if the device is inside of it or not.
I would like to rewrite the algorithm with the goal of using less computing resources than the current one. We receive about 100 locations per second and they each have to be checked against routes that have about 40 POIs on average.
I can use any language and database to do so, which ones would you recommend for the best possible performance?

I'd use a database (e.g., Postgresql) that supports spatial queries.
That will let you create a spatial index that puts a bounding box around each POI. You can use this to do an initial check to (typically) eliminate the vast majority of POIs that aren't even close to the current position (i.e., where the current position isn't inside their bounding box).
Then when you've narrowed it down to a few POIs, you can test the few that are lest using roughly the algorithm you are now--but instead of testing 40 POIs per point, you might be testing only 2 or 3.
Exactly how well this will work will depend heavily upon how close to rectangular your POIs are. Circular is close enough that it tends to give pretty good results.
Others may depend--for example, a river that runs nearly north and south may work quite well. If you have a river that runs mostly diagonally, it may be worthwhile to break it up into a number of square/rectangular segments instead of treating the whole thing as a single feature, since the latter will create a bounding box with a lot of space that's quite a ways away from the river.

Dividing the world in a thousand or so locations

Background: I want to create a weather service, and since most available APIs limit the number of daily calls, I want to divide the planet in a thousand or so areas.
Obviously, internet users are not uniformly distributed, so the sampling should be finer around densely populated regions.
How should I go about implementing this?
Where can I find data regarding geographical internet user density?
The algorithm will probably be something similar to k-means. However, implementing it on a sphere with oceans may be a bit tricky. Any insight?
Finally, maybe there is a way I can avoid doing all of this?

Very similar to k-mean is the centroidal Voronoi diagram (it is the continuous version of k-means). However, this would produce a uniform tesselation of your sphere that does not account for user density as you wish.
So a similar solution is the same technique but used with a Power Diagram : a Power Diagram is a Voronoi Diagram that accounts for a density (by assigning a weight to each Voronoi seed). Such diagram can be computed using an embedding in a 3D space (instead of 2D) that consists of the first two (x,y) coordinates plus a third one which is the square root of [any large positive constant minus the weight for the given point].
Using that, you can obtain a tesselation of your domain accounting for a user density.

You don't care about internet user density in general. You care about the density of users using your service - and you don't care where those users are, you care where they ask about. So once your site has been going for more than a day you can use the locations people ask about the previous day to work out what the areas should be for the next day.
Dynamic programming on a tree is easy. What I would do for an algorithm is to build a tree of successively more finely divided cells. More cells mean a smaller error, because people get predictions for points closer to them, and you can work out the error, or at least the relative error between more cells and fewer cells. Starting from the bottom up work out the smallest possible total error contributed by each subtree, allowing it to be divided in up to 1,2,3,..N. ways. You can work out the best possible division and smallest possible error for each k=1..N for a node by looking at the smallest possible error you have already calculated for each of its descendants, and working out how best to share out the available k divisions between them.
I would try to avoid doing this by thinking of a different idea. Depending on the way you look at life, there are at least two disadvantages of this:
1) You don't seem to be adding anything to the party. It looks like you are interposing yourself between organizations that actually make weather forecasts and their clients. Organizations lose direct contact with their clients, which might for instance lose them advertising revenue. Customers get a poorer weather forecast.
2) Most sites have legal terms of service, which must clients can ignore without worrying. My guess is that you would be breaking those terms of service, and if your service gets popular enough to be noticed they will be enforced against you.

Sort POIs by distance from current location

Trover is an awesome app: it shows you a stream of discoveries (POIs) people have uploaded - sorted by the distance from any location you specify (usually your current location). The further you scroll through the feed, the farther away the displayed discoveries are. An indicator tells you quite accurately how far the currently shown discoveries are (see screenshots on Website).
This is different from most other location based apps that deliver their results (POIs) based on fixed regions (e.g. give me all Pizzerias withing a 10km radius) which can be implemented using a single spacial datastructure (or an SQL engine supporting spatial data types). Deliverying the results the way Trover does is considerably harder:
You can query POIs for arbitrary locations. Give Trover a location in the far East of Russia and it will deliver discoveries where the first one is 2000km away and continuously increasing from there.
The result list of POIs is not limited by some spatial range. If you scroll long enough through the feed you will probably see discoveries which are on the other side of the globe.
The above points require a semi-strict ordering of their POIs for any location. The fact that you can scroll down and reload more discoveries implies that they can deliver specific sections of the sorted data (e.g. give me the next 20 discoveries that are at least 100km away from my current location).
It's fast, the fetching and distance indications are instant. The discoveries must be pre-sorted. I don't know how many discoveries they have in their DB but it must be more than what you want to sort ad hoc.
I find these characteristics quite remarkable and wonder how this is implemented. Any suggestions what kind of data-structure, algorithms or caching might be used?

I don't get the question. What do want an answer to?
Edit:
They might use a graph-database where one edge represent the distance between the nodes. That way you can get the distance by the relationships of nearby POIs. You would calculate the distance and create edges to nearby nodes. To get the distance of an arbitrary point you just do a circle-distance calculation, for another node you just add up the edges value as they represent the distance (this is for the case of getting the walking,biking, or car calculation). The adding up might not be the closest way but will give a relative indication which it seems like they use.

How to subdivide a 2d game world for better collision detection

I'm developing a game which features a sizeable square 2d playing area. The gaming area is tileless with bounded sides (no wrapping around). I am trying to figure out how I can best divide up this world to increase the performance of collision detection. Rather than checking each entity for collision with all other entities I want to only check nearby entities for collision and obstacle avoidance.
I have a few special concerns for this game world...
I want to be able to be able to use a large number of entities in the game world at once. However, a % of entities won't collide with entities of the same type. For example projectiles won't collide with other projectiles.
I want to be able to use a large range of entity sizes. I want there to be a very large size difference between the smallest entities and the largest.
There are very few static or non-moving entities in the game world.
I'm interested in using something similar to what's described in the answer here: Quadtree vs Red-Black tree for a game in C++?
My concern is how well will a tree subdivision of the world be able to handle large size differences in entities? To divide the world up enough for the smaller entities the larger ones will need to occupy a large number of regions and I'm concerned about how that will affect the performance of the system.
My other major concern is how to properly keep the list of occupied areas up to date. Since there's a lot of moving entities, and some very large ones, it seems like dividing the world up will create a significant amount of overhead for keeping track of which entities occupy which regions.
I'm mostly looking for any good algorithms or ideas that will help reduce the number collision detection and obstacle avoidance calculations.

If I were you I'd start off by implementing a simple BSP (binary space partition) tree. Since you are working in 2D, bound box checks are really fast. You basically need three classes: CBspTree, CBspNode and CBspCut (not really needed)
CBspTree has one root node instance of class CBspNode
CBspNode has an instance of CBspCut
CBspCut symbolize how you cut a set in two disjoint sets. This can neatly be solved by introducing polymorphism (e.g. CBspCutX or CBspCutY or some other cutting line). CBspCut also has two CBspNode
The interface towards the divided world will be through the tree class and it can be a really good idea to create one more layer on top of that, in case you would like to replace the BSP solution with e.g. a quad tree. Once you're getting the hang of it. But in my experience, a BSP will do just fine.
There are different strategies of how to store your items in the tree. What I mean by that is that you can choose to have e.g. some kind of container in each node that contains references to the objects occuping that area. This means though (as you are asking yourself) that large items will occupy many leaves, i.e. there will be many references to large objects and very small items will show up at single leaves.
In my experience this doesn't have that large impact. Of course it matters, but you'd have to do some testing to check if it's really an issue or not. You would be able to get around this by simply leaving those items at branched nodes in the tree, i.e. you will not store them on "leaf level". This means you will find those objects quick while traversing down the tree.
When it comes to your first question. If you only are going to use this subdivision for collision testing and nothing else, I suggest that things that can never collide never are inserted into the tree. A missile for example as you say, can't collide with another missile. Which would mean that you dont even have to store the missile in the tree.
However, you might want to use the bsp for other things as well, you didn't specify that but keep that in mind (for picking objects with e.g. the mouse). Otherwise I propose that you store everything in the bsp, and resolve the collision later on. Just ask the bsp of a list of objects in a certain area to get a limited set of possible collision candidates and perform the check after that (assuming objects know what they can collide with, or some other external mechanism).
If you want to speed up things, you also need to take care of merge and split, i.e. when things are removed from the tree, a lot of nodes will become empty or the number of items below some node level will decrease below some merge threshold. Then you want to merge two subtrees into one node containing all items. Splitting happens when you insert items into the world. So when the number of items exceed some splitting threshold you introduce a new cut, which splits the world in two. These merge and split thresholds should be two constants that you can use to tune the efficiency of the tree.
Merge and split are mainly used to keep the tree balanced and to make sure that it works as efficient as it can according to its specifications. This is really what you need to worry about. Moving things from one location and thus updating the tree is imo fast. But when it comes to merging and splitting it might become expensive if you do it too often.
This can be avoided by introducing some kind of lazy merge and split system, i.e. you have some kind of dirty flagging or modify count. Batch up all operations that can be batched, i.e. moving 10 objects and inserting 5 might be one batch. Once that batch of operations is finished, you check if the tree is dirty and then you do the needed merge and/or split operations.
Post some comments if you want me to explain further.
Cheers !
Edit
There are many things that can be optimized in the tree. But as you know, premature optimization is the root to all evil. So start off simple. For example, you might create some generic callback system that you can use while traversing the tree. This way you dont have to query the tree to get a list of objects that matched the bound box "question", instead you can just traverse down the tree and execute that call back each time you hit something. "If this bound box I'm providing intersects you, then execute this callback with these parameters"

You most definitely want to check this list of collision detection resources from gamedev.net out. It's full of resources with game development conventions.
For other than collision detection only, check their entire list of articles and resources.

My concern is how well will a tree
subdivision of the world be able to
handle large size differences in
entities? To divide the world up
enough for the smaller entities the
larger ones will need to occupy a
large number of regions and I'm
concerned about how that will affect
the performance of the system.
Use a quad tree. For objects that exist in multiple areas you have a few options:
Store the object in both branches, all the way down. Everything ends up in leaf nodes but you may end up with a significant number of extra pointers. May be appropriate for static things.
Split the object on the zone border and insert each part in their respective locations. Creates a lot of pain and isn't well defined for a lot of objects.
Store the object at the lowest point in the tree you can. Sets of objects now exist in leaf and non-leaf nodes, but each object has one pointer to it in the tree. Probably best for objects that are going to move.
By the way, the reason you're using a quad tree is because it's really really easy to work with. You don't have any heuristic based creation like you might with some BSP implementations. It's simple and it gets the job done.
My other major concern is how to
properly keep the list of occupied
areas up to date. Since there's a lot
of moving entities, and some very
large ones, it seems like dividing the
world up will create a significant
amount of overhead for keeping track
of which entities occupy which
regions.
There will be overhead to keeping your entities in the correct spots in the tree every time they move, yes, and it can be significant. But the whole point is that you're doing much much less work in your collision code. Even though you're adding some overhead with the tree traversal and update it should be much smaller than the overhead you just removed by using the tree at all.
Obviously depending on the number of objects, size of game world, etc etc the trade off might not be worth it. Usually it turns out to be a win, but it's hard to know without doing it.

There are lots of approaches. I'd recommend settings some specific goals (e.g., x collision tests per second with a ratio of y between smallest to largest entities), and do some prototyping to find the simplest approach that achieves those goals. You might be surprised how little work you have to do to get what you need. (Or it might be a ton of work, depending on your particulars.)
Many acceleration structures (e.g., a good BSP) can take a while to set up and thus are generally inappropriate for rapid animation.
There's a lot of literature out there on this topic, so spend some time searching and researching to come up with a list candidate approaches. Mock them up and profile.

I'd be tempted just to overlay a coarse grid over the play area to form a 2D hash. If the grid is at least the size of the largest entity then you only ever have 9 grid squares to check for collisions and it's a lot simpler than managing quad-trees or arbitrary BSP trees. The overhead of determining which coarse grid square you're in is typically just 2 arithmetic operations and when a change is detected the grid just has to remove one reference/ID/pointer from one square's list and add the same to another square.
Further gains can be had from keeping the projectiles out of the grid/tree/etc lookup system - since you can quickly determine where the projectile would be in the grid, you know which grid squares to query for potential collidees. If you check collisions against the environment for each projectile in turn, there's no need for the other entities to then check for collisions against the projectiles in reverse.

Clustering Algorithm for Paper Boys

I need help selecting or creating a clustering algorithm according to certain criteria.
Imagine you are managing newspaper delivery persons.
You have a set of street addresses, each of which is geocoded.
You want to cluster the addresses so that each cluster is assigned to a delivery person.
The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
Each cluster should have about the same number of addresses. However, a cluster may have less addresses if a cluster's addresses are more spread out. (Worded another way: minimum number of clusters where each cluster contains a maximum number of addresses, and any address within cluster must be separated by a maximum distance.)
For bonus points, when the data set is altered (address added or removed), and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (ie. this rules out simple k-means clustering which is random in nature). Otherwise the delivery persons will go crazy.
So... ideas?
UPDATE
The street network graph, as described in Arachnid's answer, is not available.

I've written an inefficient but simple algorithm in Java to see how close I could get to doing some basic clustering on a set of points, more or less as described in the question.
The algorithm works on a list if (x,y) coords ps that are specified as ints. It takes three other parameters as well:
radius (r): given a point, what is the radius for scanning for nearby points
max addresses (maxA): what are the maximum number of addresses (points) per cluster?
min addresses (minA): minimum addresses per cluster
Set limitA=maxA.
Main iteration:
Initialize empty list possibleSolutions.
Outer iteration: for every point p in ps.
Initialize empty list pclusters.
A worklist of points wps=copy(ps) is defined.
Workpoint wp=p.
Inner iteration: while wps is not empty.
Remove the point wp in wps. Determine all the points wpsInRadius in wps that are at a distance < r from wp. Sort wpsInRadius ascendingly according to the distance from wp. Keep the first min(limitA, sizeOf(wpsInRadius)) points in wpsInRadius. These points form a new cluster (list of points) pcluster. Add pcluster to pclusters. Remove points in pcluster from wps. If wps is not empty, wp=wps[0] and continue inner iteration.
End inner iteration.
A list of clusters pclusters is obtained. Add this to possibleSolutions.
End outer iteration.
We have for each p in ps a list of clusters pclusters in possibleSolutions. Every pclusters is then weighted. If avgPC is the average number of points per cluster in possibleSolutions (global) and avgCSize is the average number of clusters per pclusters (global), then this is the function that uses both these variables to determine the weight:
private static WeightedPClusters weigh(List<Cluster> pclusters, double avgPC, double avgCSize)
{
double weight = 0;
for (Cluster cluster : pclusters)
{
int ps = cluster.getPoints().size();
double psAvgPC = ps - avgPC;
weight += psAvgPC * psAvgPC / avgCSize;
weight += cluster.getSurface() / ps;
}
return new WeightedPClusters(pclusters, weight);
}
The best solution is now the pclusters with the least weight. We repeat the main iteration as long as we can find a better solution (less weight) than the previous best one with limitA=max(minA,(int)avgPC). End main iteration.
Note that for the same input data this algorithm will always produce the same results. Lists are used to preserve order and there is no random involved.
To see how this algorithm behaves, this is an image of the result on a test pattern of 32 points. If maxA=minA=16, then we find 2 clusters of 16 addresses.
(source: paperboyalgorithm at sites.google.com)
Next, if we decrease the minimum number of addresses per cluster by setting minA=12, we find 3 clusters of 12/12/8 points.
(source: paperboyalgorithm at sites.google.com)
And to demonstrate that the algorithm is far from perfect, here is the output with maxA=7, yet we get 6 clusters, some of them small. So you still have to guess too much when determining the parameters. Note that r here is only 5.
(source: paperboyalgorithm at sites.google.com)
Just out of curiosity, I tried the algorithm on a larger set of randomly chosen points. I added the images below.
Conclusion? This took me half a day, it is inefficient, the code looks ugly, and it is relatively slow. But it shows that it is possible to produce some result in a short period of time. Of course, this was just for fun; turning this into something that is actually useful is the hard part.
(source: paperboyalgorithm at sites.google.com)
(source: paperboyalgorithm at sites.google.com)

What you are describing is a (Multi)-Vehicle-Routing-Problem (VRP). There's quite a lot of academic literature on different variants of this problem, using a large variety of techniques (heuristics, off-the-shelf solvers etc.). Usually the authors try to find good or optimal solutions for a concrete instance, which then also implies a clustering of the sites (all sites on the route of one vehicle).
However, the clusters may be subject to major changes with only slightly different instances, which is what you want to avoid. Still, something in the VRP-Papers may inspire you...
If you decide to stick with the explicit clustering step, don't forget to include your distribution in all clusters, as it is part of each route.
For evaluating the clusters using a graph representation of the street grid will probably yield more realistic results than connecting the dots on a white map (although both are TSP-variants). If a graph model is not available, you can use the taxicab-metric (|x_1 - x_2| + |y_1 - y_2|) as an approximation for the distances.

I think you want a hierarchical agglomeration technique rather than k-means. If you get your algorithm right you can stop it when you have the right number of clusters. As someone else mentioned you can seed subsequent clusterings with previous solutions which may give you a siginificant performance improvement.
You may want to look closely at the distance function you use, especially if your problem has high dimension. Euclidean distance is the easiest to understand but may not be the best, look at alternatives such as Mahalanobis.
I'm presuming that your real problem has nothing to do with delivering newspapers...

Have you thought about using an economic/market based solution? Divide the set up by an arbitrary (but constant to avoid randomness effects) split into even subsets (as determined by the number of delivery persons).
Assign a cost function to each point by how much it adds to the graph, and give each extra point an economic value.
Iterate allowing each person in turn to auction their worst point, and give each person a maximum budget.
This probably matches fairly well how the delivery people would think in real life, as people will find swaps, or will say "my life would be so much easier if I didn't do this one or two. It is also pretty flexible (for example, would allow one point miles away from any others to be given a premium fairly easily).

I would approach it differently: Considering the street network as a graph, with an edge for each side of each street, find a partitioning of the graph into n segments, each no more than a given length, such that each paperboy can ride a single continuous path from the start to the end of their route. This way, you avoid giving people routes that require them to ride the same segments repeatedly (eg, when asked to cover both sides of a street without covering all the surrounding streets).

This is a very quick and dirty method of discovering where your "clusters" lie. This was inspired by the game "Minesweeper."
Divide your entire delivery space up into a grid of squares. Note - it will take some tweaking of the size of the grid before this will work nicely. My intuition tells me that a square size roughly the size of a physical neighbourhood block will be a good starting point.
Loop through each square and store the number of delivery locations (houses) within each block. Use a second loop (or some clever method on the first pass) to store the number of delivery points for each neighbouring block.
Now you can operate on this grid in a similar way to photo manipulation software. You can detect the edges of clusters by finding blocks where some neighbouring blocks have no delivery points in them.
Finally you need a system that combines number of deliveries made as well as total distance travelled to create and assign routes. There may be some isolated clusters with just a few deliveries to be made, and one or two super clusters with many homes very close to each other, requiring multiple delivery people in the same cluster. Every home must be visited, so that is your first constraint.
Derive a maximum allowable distance to be travelled by any one delivery person on a single run. Next do the same for the number of deliveries made per person.
The first ever run of the routing algorithm would assign a single delivery person, send them to any random cluster with not all deliveries completed, let them deliver until they hit their delivery limit or they have delivered to all the homes in the cluster. If they have hit the delivery limit, end the route by sending them back to home base. If they could safely travel to the nearest cluster and then home without hitting their max travel distance, do so and repeat as above.
Once the route is finished for the current delivery person, check if there are homes that have not yet had a delivery. If so, assign another delivery person, and repeat the above algorithm.
This will generate initial routes. I would store all the info - the location and dimensions of each square, the number of homes within a square and all of its direct neighbours, the cluster to which each square belongs, the delivery people and their routes - I would store all of these in a database.
I'll leave the recalc procedure up to you - but having all the current routes, clusters, etc in a database will enable you to keep all historic routes, and also try various scenarios to see how to best to adapt to changes creating the least possible changes to existing routes.

This is a classic example of a problem that deserves an optimized solution rather than trying to solve for "The OPTIMUM". It's similar in some ways to the "Travelling Salesman Problem", but you also need to segment the locations during the optimization.
I've used three different optimization algorithms to good effect on problems like this:
Simulated Annealing
Great Deluge Algorithm
Genetic Algoritms
Using an optimization algorithm, I think you've described the following "goals":
The geographic area for each paper
boy should be minimized.
The number of subscribers served by
each should be approximately equal.
The distance travelled by each
should be about equal.
(And one you didn't state, but might
matter) The route should end where
it began.
Hope this gets you started!
* Edit *
If you don't care about the routes themselves, that eliminates goals 3 and 4 above, and perhaps allows the problem to be more tailored to your bonus requirements.
If you take demographic information into account (such as population density, subscription adoption rate and subscription cancellation rate) you could probably use the optimization techniques above to eliminate the need to rerun the algorithm at all as subscribers adopted or dropped your service. Once the clusters were optimized, they would stay in balance because the rates of each for an individual cluster matched the rates for the other clusters.
The only time you'd have to rerun the algorithm was when and external factor (such as a recession/depression) caused changes in the behavior of a demographic group.

Rather than a clustering model, I think you really want some variant of the Set Covering location model, with an additional constraint to cover the number of addresses covered by each facility. I can't really find a good explanation of it online. You can take a look at this page, but they're solving it using areal units and you probably want to solve it in either euclidean or network space. If you're willing to dig up something in dead tree format, check out chapter 4 of Network and Discrete Location by Daskin.

Good survey of simple clustering algos. There is more though:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html

Perhaps a minimum spanning tree of the customers, broken into set based on locality to the paper boy. Prims or Kruskal to get the MST with the distance between houses for the weight.

I know of a pretty novel approach to this problem that I have seen applied to Bioinformatics, though it is valid for any sort of clustering problem. It's certainly not the simplest solution but one that I think is very interesting. The basic premise is that clustering involves multiple objectives. For one you want to minimise the number of clusters, the trival solution being a single cluster with all the data. The second standard objective is to minimise the amount of variance within a cluster, the trivial solution being many clusters each with only a single data point. The interesting solutions come about when you try to include both of these objectives and optimise the trade-off.
At the core of the proposed approach is something called a memetic algorithm that is a little like a genetic algorithm, which steve mentioned, however it not only explores the solution space well but also has the ability to focus in on interesting regions, i.e. solutions. At the very least I recommend reading some of the papers on this subject as memetic algorithms are an unusual approach, though a word of warning; it may lead you to read The Selfish Gene and I still haven't decided whether that was a good thing... If algorithms don't interest you then maybe you can just try and express your problem as the format requires and use the source code provided. Related papers and code can be found here: Multi Objective Clustering

This is not directly related to the problem, but something I've heard and which should be considered if this is truly a route-planning problem you have. This would affect the ordering of the addresses within the set assigned to each driver.
UPS has software which generates optimum routes for their delivery people to follow. The software tries to maximize the number of right turns that are taken during the route. This saves them a lot of time on deliveries.
For people that don't live in the USA the reason for doing this may not be immediately obvious. In the US people drive on the right side of the road, so when making a right turn you don't have to wait for oncoming traffic if the light is green. Also, in the US, when turning right at a red light you (usually) don't have to wait for green before you can go. If you're always turning right then you never have to wait for lights.
There's an article about it here:
http://abcnews.go.com/wnt/story?id=3005890

You can have K means or expected maximization remain as unchanged as possible by using the previous cluster as a clustering feature. Getting each cluster to have the same amount of items seems bit trickier. I can think of how to do it as a post clustering step by doing k means and then shuffling some points until things balance but that doesn't seem very efficient.

A trivial answer which does not get any bonus points:
One delivery person for each address.

You have a set of street
addresses, each of which is geocoded.
You want to cluster the addresses so that each cluster is
assigned to a delivery person.
The number of delivery persons, or clusters, is not fixed. If needed,
I can always hire more delivery
persons, or lay them off.
Each cluster should have about the same number of addresses. However,
a cluster may have less addresses if a
cluster's addresses are more spread
out. (Worded another way: minimum
number of clusters where each cluster
contains a maximum number of
addresses, and any address within
cluster must be separated by a maximum
distance.)
For bonus points, when the data set is altered (address added or
removed), and the algorithm is re-run,
it would be nice if the clusters
remained as unchanged as possible (ie.
this rules out simple k-means
clustering which is random in nature).
Otherwise the delivery persons will go
crazy.
As has been mentioned a Vehicle Routing Problem is probably better suited... Although strictly isn't designed with clustering in mind, it will optimize to assign based on the nearest addresses. Therefore you're clusters will actually be the recommended routes.
If you provide a maximum number of deliverers then and try to reach the optimal solution this should tell you the min that you require. This deals with point 2.
The same number of addresses can be obtained by providing a limit on the number of addresses to be visited, basically assigning a stock value (now its a capcitated vehicle routing problem).
Adding time windows or hours that the delivery persons work helps reduce the load if addresses are more spread out (now a capcitated vehicle routing problem with time windows).
If you use a nearest neighbour algorithm then you can get identical results each time, removing a single address shouldn't have too much impact on your final result so should deal with the last point.
I'm actually working on a C# class library to achieve something like this, and think its probably the best route to go down, although not neccesairly easy to impelement.

I acknowledge that this will not necessarily provide clusters of roughly equal size:
One of the best current techniques in data clustering is Evidence Accumulation. (Fred and Jain, 2005)
What you do is:
Given a data set with n patterns.
Use an algorithm like k-means over a range of k. Or use a set of different algorithms, the goal is to produce an ensemble of partitions.
Create a co-association matrix C of size n x n.
For each partition p in the ensemble:
3.1 Update the co-association matrix: for each pattern pair (i, j) that belongs to the same cluster in p, set C(i, j) = C(i, j) + 1/N.
Use a clustering algorihm such as Single Link and apply the matrix C as the proximity measure. Single Link gives a dendrogram as result in which we choose the clustering with the longest lifetime.
I'll provide descriptions of SL and k-means if you're interested.

I would use a basic algorithm to create a first set of paperboy routes according to where they live, and current locations of subscribers, then:
when paperboys are:
Added: They take locations from one or more paperboys working in the same general area from where the new guy lives.
Removed: His locations are given to the other paperboys, using the closest locations to their routes.
when locations are:
Added : Same thing, the location is added to the closest route.
Removed: just removed from that boy's route.
Once a quarter, you could re-calculate the whole thing and change all the routes.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio