Strategy to find your best route via Public Transportation only?


Finding routes for a car is pretty easy: you store a weighted graph of all the roads and you could use Dijkstra's algorithm [1]. A bus route is less obvious. With a bus you have to represent things like "wait 10 minutes for the next bus" or "walk one block to another bus stop" and feed those into your pathfinding algorithm.
It's not even always simple for cars. In some cities, some roads are one-way-only into the city in the morning, and one-way-only out of the city in the evening. Some advanced GPSs know how to avoid busy routes during rush hour.
How would you efficiently represent this kind of time-dependent graph and find a route? There is no need for a provably optimal solution; if the traveler wanted to be on time, they would buy a car. ;-)
[1] A wonderful algorithm to mention in an example because everyone's heard of it, though A* is a likelier choice for this application.

I have been involved in the development of a journey planner system for Stockholm Public Transportation in Sweden. It was based on Dijkstra's algorithm, but with termination before every node in the system was visited. Today, when there are reliable coordinates available for each stop, I guess the A* algorithm would be the choice.
Data about upcoming traffic was extracted from several databases regularly and compiled into large tables loaded into the memory of our search server cluster.
One key to a successful algorithm was using a path cost function based on travel and waiting time multiplied by different weights. Known in Swedish as "kresu"-time, these weighted times reflect the fact that, for example, one minute's waiting time is typically equivalent in "inconvenience" to two minutes of travelling time.
KRESU weight table
x1 - Travel time
x2 - Walking between stops
x2 - Waiting at a stop during the journey. Stops under a roof, with shops, etc. can get a slightly lower weight, and crowded stations a higher one, to tune the algorithm.
The weight for the waiting time at the first stop is a function of traffic intensity and can be between 0.5 and 3.
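A minimal Python sketch of such a weighted cost function. The weights are the ones from the table above; the `Leg` type, the `kresu_cost` name, the `weight_override` field and the sample journey are assumptions made for illustration, not the code of the Stockholm system.

```python
from dataclasses import dataclass
from typing import Optional

# Weights from the table above.
KRESU_WEIGHTS = {"travel": 1.0, "walk": 2.0, "wait": 2.0}


@dataclass
class Leg:
    kind: str                                # "travel", "walk" or "wait"
    minutes: float
    weight_override: Optional[float] = None  # hypothetical per-stop tuning knob


def kresu_cost(legs, first_wait_weight=1.0):
    """Return the weighted ("kresu") time of a journey in minutes.

    first_wait_weight applies to the first waiting leg only and is
    expected to lie between 0.5 and 3, depending on traffic intensity.
    """
    total = 0.0
    first_wait_seen = False
    for leg in legs:
        if leg.kind == "wait" and not first_wait_seen:
            weight = first_wait_weight
            first_wait_seen = True
        elif leg.weight_override is not None:
            weight = leg.weight_override
        else:
            weight = KRESU_WEIGHTS[leg.kind]
        total += weight * leg.minutes
    return total


journey = [
    Leg("wait", 4),      # wait at the first stop
    Leg("travel", 12),
    Leg("walk", 3),
    Leg("wait", 5),      # transfer wait, weighted x2
    Leg("travel", 8),
]
print(kresu_cost(journey, first_wait_weight=1.5))  # 6 + 12 + 6 + 10 + 8 = 42.0
```

A sheltered stop could be modelled with something like `Leg("wait", 5, weight_override=1.8)`; the value 1.8 is only an example.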
Data structure
Area
A named area where your journey can start or end. A Bus Stop could be an area with two Stops. A larger Station with several platforms could be one area with one stop for each platform.
Data: Name, Stops in area
Stops
An array with all bus stops, train and underground stations. Note that you usually need two stops, one for each direction, because it takes some time to cross the street or walk to the other platform.
Data: Name, Links, Nodes
Links
A list with other stops you can reach by walking from this stop.
Data: Other Stop, Time to walk to other Stop
Lines/Tours
You have a number on the bus and a destination. The bus starts at one stop and passes several stops on its way to the destination.
Data: Number, Name, Destination
Nodes
Usually you have a timetable with at least the times when the bus should be at the first and last stop in a Tour. Each time a bus/train passes a stop you add a new node to the array. This table can have millions of values per day.
Data: Line/Tour, Stop, Arrival Time, Departure Time, Error margin, Next Node in Tour
Search
Array with the same size as the Nodes array used to store how you got there and the path cost.
Data: Back-link with Previous Node (not set if Node is unvisited), Path Cost (infinite for unvisited)
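The records above could be laid out roughly as follows in Python. This is only a sketch: the field names follow the "Data:" lines, times are assumed to be minutes after midnight, the Area and Lines/Tours records are left out for brevity, and the three stops and the single tour are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Link:                    # walking connection to another stop
    other_stop: int            # index into the stops array
    walk_minutes: int


@dataclass
class Node:                    # one vehicle passing one stop
    tour: int                  # index into a tours array (not shown)
    stop: int                  # index into the stops array
    arrival: int               # minutes after midnight (assumed unit)
    departure: int
    error_margin: int
    next_node: Optional[int]   # next node in the same tour, None at the end


@dataclass
class Stop:
    name: str
    links: List[Link] = field(default_factory=list)
    nodes: List[int] = field(default_factory=list)   # indices into nodes


@dataclass
class SearchEntry:             # one per node, reset for every query
    previous: Optional[int] = None     # back-link, None while unvisited
    path_cost: float = float("inf")    # infinite while unvisited


stops = [Stop("A northbound"), Stop("B northbound"), Stop("B southbound")]
stops[1].links.append(Link(other_stop=2, walk_minutes=3))  # cross the street

# One tour (index 0) passing A at 08:00 and B at 08:10.
nodes = [
    Node(tour=0, stop=0, arrival=480, departure=481, error_margin=1, next_node=1),
    Node(tour=0, stop=1, arrival=490, departure=491, error_margin=1, next_node=None),
]
for index, node in enumerate(nodes):
    stops[node.stop].nodes.append(index)

search = [SearchEntry() for _ in nodes]   # same size as the nodes array
```

Keeping the search state in a separate array of the same size means the large timetable tables stay read-only and can be shared between concurrent queries, which is presumably why the description separates them.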

What you're talking about is more complicated than the mathematical models that can be described with simple data structures like graphs and with "simple" algorithms like Dijkstra's. What you are asking for is a more complex problem, like those encountered in the world of automated logistics management.
One way to think about it is that you are asking a multi-dimensional problem, you need to be able to calculate:
Distance optimization
Time optimization
Route optimization
"Time horizon" optimization (if it's 5:25 and the bus only shows up at 7:00, pick another route.)
Given all of these circumstances you can attempt to do deterministic modeling using complex multi-layered data structures. For example, you could still use a weighted digraph to represent the existing potential routes, wherein each node also contains a finite state automaton which adds a weight bias to a route depending on time values (so by crossing a node at 5:25 you get a different value than if your simulation crossed it at 7:00).
However, I think that at this point you are going to find yourself with a simulation that is more and more complex, and which most likely does not provide a "great" approximation of optimal routes when the advice is transferred into the real world. It turns out that software and mathematical modeling and simulation is at best a weak tool when encountering real-world chaotic behaviors and dynamism.
My suggestion would be to use an alternate strategy. I would attempt to use a genetic algorithm in which the DNA for an individual encodes a potential route; I would then create a fitness function which encodes costs and weights in a more "easy to maintain" lookup-table fashion. Then I would let the genetic algorithm attempt to converge on a near-optimal solution for a public transport route finder. On modern computers a GA such as this is probably going to perform reasonably well, and it should be at least relatively robust to real-world dynamism.
I think that most systems that do this sort of thing take the "easy way out" and simply do something like an A* search algorithm, or something similar to a greedy costed weighted digraph walk. The thing to remember is that the users of the public transport don't themselves know what the optimal route would be, so a 90% optimal solution is still going to be a great solution for the average case.

Some data points to be aware of from the public transportation arena:
Each transfer incurs a 10-minute penalty (unless it is a timed transfer) in the rider's mind. That is to say, mentally a trip involving a single bus that takes 40 minutes is roughly equivalent to a 30-minute trip that requires a transfer.
Maximum distance that most people are willing to walk to a bus stop is 1/4 mile. Train station / Light rail about 1/2 mile.
Distance is irrelevant to the public transportation rider. (Only time is important)
Frequency matters (if a connection is missed how long until the next bus). Riders will prefer more frequent service options if the alternative is being stranded for an hour for the next express.
Rail has a higher preference than bus ( more confidence that the train will come and be going in the right direction)
Having to pay a new fare is a big hit. (add about a 15-20min penalty)
Total trip time matters as well (with above penalties)
How seamless is the connection? Does the rider have to exit a train station and cross a busy street? Or is it just a matter of stepping off a train and walking 4 steps to a bus?
Crossing busy streets -- another big penalty on transfers -- may miss connection because can't get across street fast enough.
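To make the penalties above concrete, here is a small Python sketch of a "perceived trip time". The function name and parameters are hypothetical; the 10-minute transfer penalty and the 15-20 minute fare penalty are the rules of thumb listed above, and the remaining factors (frequency, rail preference, street crossings) are left out.

```python
def perceived_minutes(trip_minutes, transfers=0, timed_transfers=0,
                      new_fares=0, fare_penalty=15):
    """Rough rider-perceived duration of a trip, in minutes.

    Each untimed transfer adds 10 minutes; each new fare adds
    fare_penalty minutes (somewhere between 15 and 20).
    """
    untimed = transfers - timed_transfers
    return trip_minutes + 10 * untimed + fare_penalty * new_fares


direct = perceived_minutes(40)                      # one 40-minute bus
with_transfer = perceived_minutes(30, transfers=1)  # 30 minutes plus a transfer
print(direct, with_transfer)                        # both come out at 40
```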

if the cost of each leg of the trip is measured in time, then the only complication is factoring in the schedule - which just changes the cost at each node to a function of the current time t, where t is just the total trip time so far (assuming schedules are normalized to start at t=0).
so instead of Node A having a cost of 10 minutes, it has a cost of f(t) defined as:
t1 = nextScheduledStop(t)   // the next stop time at or after time t
baseTime = 10               // for example, a 10-minute trip for this leg
return (t1 - t) + baseTime
wait-time is thus included dynamically in the cost of each leg, and walks between bus stops are just arcs with a constant time cost
with this representation you should be able to apply A* or Dijkstra's algorithm directly
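A runnable Python sketch of this idea, using Dijkstra's algorithm with a time-dependent leg cost. The stop names, timetables and walking arc are made up for illustration, and `leg_cost` plays the role of f(t) above. It assumes that leaving later never gets you there earlier, which holds for fixed timetables like these.

```python
import bisect
import heapq

# Hypothetical network: each ride has sorted departure times and a ride time.
RIDES = {
    ("A", "B"): ([0, 20, 40], 10),   # departs A at t=0, 20, 40; 10-minute trip
    ("B", "D"): ([15, 45], 10),
    ("A", "C"): ([5, 35], 8),
    ("C", "D"): ([15, 30], 6),
}
WALKS = {("B", "C"): 4}              # walking arcs have a constant cost


def leg_cost(departures, base_time, t):
    """f(t): wait for the next departure at or after t, then ride."""
    i = bisect.bisect_left(departures, t)
    if i == len(departures):
        return float("inf")          # no more service
    return (departures[i] - t) + base_time


def earliest_arrival(source, target, t0=0):
    """Dijkstra where each arc cost depends on the time we reach its tail."""
    best = {source: t0}
    heap = [(t0, source)]
    while heap:
        t, stop = heapq.heappop(heap)
        if stop == target:
            return t
        if t > best.get(stop, float("inf")):
            continue                 # stale queue entry
        arcs = [(v, leg_cost(dep, base, t))
                for (u, v), (dep, base) in RIDES.items() if u == stop]
        arcs += [(v, minutes) for (u, v), minutes in WALKS.items() if u == stop]
        for nxt, cost in arcs:
            arrival = t + cost
            if arrival < best.get(nxt, float("inf")):
                best[nxt] = arrival
                heapq.heappush(heap, (arrival, nxt))
    return float("inf")


print(earliest_arrival("A", "D"))    # 21: A->C departs at 5, C->D departs at 15
```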

"Finding routes for a car is pretty easy: you store a weighted graph of all the roads and you could use Dijkstra's algorithm. A bus route is less obvious."
It may be less obvious, but the reality is that it's merely another dimension to the car problem, with the addition of infinite cost calculation.
For instance, you mark the buses whose time is past as having infinite cost - they then aren't included in the calculation.
You then get to decide how to weight each aspect.
Transit Time might get weighted by 1
Waiting time might get weighted by 1
Transfers might get weighted by 0.5 (since I'd rather get there sooner and have an extra transfer)
Then you calculate all the routes in the graph using any usual cost algorithm with the addition of infinite cost:
Each time you move along an edge you have to keep track of 'current' time (add up the transit time) and if you arrive at a vertex you have to assign infinite cost to any buses that are prior to your current time. The current time is incremented by the waiting time at that vertex until the next bus leaves, then you're free to move along another edge and find the new cost.
In other words, there's a new constraint, "current time" which is the time of the first bus starting, summed with all the transit and waiting times of buses and stops traveled.
It complicates the algorithm only a little bit, but the algorithm is still the same. You can see that most algorithms can be applied to this, some might require multiple passes, and a few won't work because you can't add the time-->infinite cost calculation inline. But most should work just fine.
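As a rough Python sketch of the "current time" bookkeeping, the function below scores one candidate route. The weights are the example values given earlier (with the transfer weight applied once per transfer, which is an assumption); the `route_cost` name and the `(departure, arrival)` representation are hypothetical.

```python
import math

WEIGHTS = {"transit": 1.0, "waiting": 1.0, "transfer": 0.5}


def route_cost(start_time, rides):
    """Cost of riding the given buses in order; all times are in minutes.

    Each ride is a (departure, arrival) pair. A bus that left before we
    reach its stop gets infinite cost, which removes the route from
    consideration.
    """
    current = start_time
    cost = 0.0
    for index, (departure, arrival) in enumerate(rides):
        if departure < current:
            return math.inf                       # that bus is already gone
        cost += WEIGHTS["waiting"] * (departure - current)
        cost += WEIGHTS["transit"] * (arrival - departure)
        if index > 0:
            cost += WEIGHTS["transfer"]           # one penalty per transfer
        current = arrival
    return cost


print(route_cost(0, [(5, 20), (25, 40)]))   # 40.5
print(route_cost(0, [(5, 20), (18, 30)]))   # inf: second bus leaves at 18, we arrive at 20
```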
You can simplify it further by simply assuming that the buses are on a schedule, and there's ALWAYS another bus, but it increases the waiting time. Do the algorithm only adding up the transit costs, then go through the tree again and add waiting costs depending on when the next bus is coming. It will sometimes result in less efficient versions, but the total graph of even a large city is actually pretty small, so it's not really an issue. In most cases one or two routes will be the obvious winners.
Google has this, but also includes additional edges for walking from one bus stop to another so you might find a slightly more optimal route if you're willing to walk in cities with large bus systems.
-Adam

The way I think of this problem is that ultimately you are trying to optimize your average speed from your starting point to your ending point. In particular, you don't care at all about total distance traveled if going [well] out of your way saves time. So, a basic part of the solution space is going to need to be identifying efficient routes available that cover non-trivial parts of the total distance at relatively high speeds between start and finish.
To your original point, the typical automotive route algorithms used by GPS navigation units to plan the trip by car give a good bound for a target optimal total time and for evaluating routes. In other words, your bus-based trip would be doing really well to approach a car-based solution. Clearly, the bus-route-based system is going to have many more constraints than the car-based solutions, but having the car solution as a reference (time and distance) gives the bus algorithm a framework to optimize against*. So, put loosely, you want to morph the car solution towards the set of possible bus solutions in an iterative fashion or, perhaps more likely, take possible bus solutions and score them against your car-based solution to know if you are doing "well" or not.
Making this somewhat more concrete, for a specific departure time there are only going to be a limited number of buses available within any reasonable period of time that can cover a significant percentage of your total distance. Based on the straight automotive analysis reasonable period of time and significant percentage of distance become quantifiable using some mildly subjective metrics. Certainly, it becomes easier to score each possibility relative to the other in a more absolute sense.
Once you have a set of possible major segment(s) available as possible answers within the solution, you then need to hook them together with other possible walking and waiting paths....or if sufficiently far apart recursive selection of additional short bus runs. Intuitively, it doesn't seem that there is really going to be a prohibitive set of choices here because of the Constraints Paradox (see footnote below). Even if you can't brute force all possible combinations from there, what remains should be able to be optimized using a simulated annealing (SA) type algorithm. A Monte Carlo method would be another option.
The way we've broken the problem down to this point leaves us with something quite analogous to how SA algorithms are applied to the automated layout and routing of ASIC chips and FPGAs, and to the placement and routing of printed circuit boards. There is quite a bit of published work on optimizing that type of problem.
* Note: I usually refer to this as "The Constraints Paradox" - my term. While people naturally think of more constrained problems as harder to solve, the constraints reduce choices, and fewer choices means easier to brute force. When you can brute force, then even the optimal solution is available.

Basically, a node in your graph should not only represent a location, but also the earliest time you can get there. You can think of it as graph exploration in the (place,time) space. Additionally, if you have (place, t1) and (place,t2) where t1<t2, discard (place,t2).
Theoretically, this will get the earliest arrival time for all possible destinations from your starting node. In practice, you need some heuristic to prune roads that take you too far away from your destination.
You also need some heuristic to consider the promising routes before the less promising ones - if a route leads away from your destination, it is less likely (but not totally unlikely) to be good.
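A small Python sketch of searching the (place, time) space with an A*-style heuristic. Everything concrete here is an assumption for illustration: the coordinates, the timetable, and the use of straight-line distance divided by a top speed as the optimistic estimate of the remaining time.

```python
import heapq
import math

COORDS = {"A": (0, 0), "B": (3, 0), "C": (3, 4), "X": (-5, 0)}  # km, made up
MAX_SPEED = 1.0        # km per minute, assumed upper bound for any vehicle
# Timetable: stop -> list of (departure, next_stop, arrival), in minutes.
TRIPS = {
    "A": [(0, "X", 6), (2, "B", 6), (10, "C", 16)],
    "B": [(8, "C", 13)],
    "X": [(7, "C", 30)],
}


def lower_bound(stop, goal):
    """Optimistic remaining minutes: straight line at top speed."""
    return math.dist(COORDS[stop], COORDS[goal]) / MAX_SPEED


def a_star(start, goal, t0):
    """Earliest arrival time at goal, or infinity if it is unreachable."""
    earliest = {start: t0}           # best known time per place
    heap = [(t0 + lower_bound(start, goal), t0, start)]
    while heap:
        _, t, stop = heapq.heappop(heap)
        if stop == goal:
            return t
        if t > earliest[stop]:
            continue                 # (place, t2) is dominated by (place, t1)
        for departure, nxt, arrival in TRIPS.get(stop, []):
            if departure >= t and arrival < earliest.get(nxt, math.inf):
                earliest[nxt] = arrival
                priority = arrival + lower_bound(nxt, goal)
                heapq.heappush(heap, (priority, arrival, nxt))
    return math.inf


print(a_star("A", "C", 0))   # 13, via B; stop X leads away and is never expanded
```

The heuristic only reorders the exploration. As long as it never overestimates the remaining time, the first time the goal is popped it carries the earliest arrival.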

I think your problem is more complicated than you expect. A recent COST action is focused on solving this problem: http://www.cost.esf.org/domains_actions/tud/Actions/TU1004 : "Modelling Public Transport Passenger Flows in the Era of Intelligent Transport Systems".
From my point of view, regular shortest-path search (SPS) algorithms are not suitable for this. You have a dynamic network state, where certain options to travel forward are discontinuous (a route is always "open" for a car, bike, or pedestrian, while a transit connection is available only during a certain dwell time).
I think new polycriterial (time, reliability, cost, comfort, and more criteria) approach is desired here. It needs to be computed real-time to 1) publish information to end user within short time 2) be able to adjust path in real-time (based on real-time traffic conditions - from ITS).
I'm about to think about this problem for the next several months (maybe even throughout a PhD thesis).
Regards
Rafal

I don't think there is any special data structure that would cater for these specific needs, but you can still use normal data structures like a linked list and then make route calculations per given factor. You are going to need some kind of input into your app for the variables that affect the result, and then make calculations accordingly, i.e. depending on the input.
As for the waiting and so on, these are factors that are associated with a particular node, right? You can translate this factor into a weight adjustment for each of the branches attached to the node. For example, you can say that for every branch from Node X, if there is a wait of, say, m minutes at Node X, then you scale up the weight of the branch by [m / some base value * 100]% (just an example). In this way, you have factored in the other factors uniformly while maintaining a simple representation of the problem you want to solve.
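The scaling rule is small enough to write out directly in Python. The function name and the default base value of 10 minutes are placeholders, in the same "just an example" spirit as the formula itself.

```python
def scaled_weight(branch_weight, wait_minutes, base_value=10.0):
    """Scale a branch weight up by (wait_minutes / base_value * 100) percent."""
    return branch_weight * (1.0 + wait_minutes / base_value)


# A branch of weight 20 leaving a node with a 5-minute wait grows by 50%.
print(scaled_weight(20, 5))   # 30.0
```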

If I was tackling this problem, I'd probably start with an annotated graph. Each node on the graph would represent an intersection in the city, whether or not the public transit system stops there - this helps account for the need to walk, etc. On intersections with transit service, you annotate these with stop labels - the labels allowing you to look up the service schedule for the stop.
Then you have a choice to make. Do you need the best possible route, or merely a route? Are you displaying the routes in real time, or can solutions be calculated and cached?
If you need "real time" calculation, you'll probably want to go with a greedy algorithm of sorts; I think an A* algorithm would probably fit this problem domain fairly nicely.
If you need optimal solutions, you should look at dynamic programming solutions to the graph... optimal solutions will likely take much longer to calculate, but you only need to find them once, then they can be cached. Perhaps your A* algorithm could use pre-calculated optimal paths to inform its decisions about "similar" routes.

A horribly inefficient way that might work would be to store a copy of each intersection in the city for each minute of the day. A bus route from Elm St. and 2nd to Main St. and 25th would be represented as, say,
elm_st_and_2nd[12][30].edges :
    elm_st_and_1st[12][35]    # 5 minute walk to the next intersection
        time = 5 minutes
        transport = foot
    main_st_and_25th[13][10]  # 40 minute bus ride
        time = 40 minutes
        transport = bus
    elm_st_and_2nd[12][31]    # stay in one place for one minute
        time = 1 minute
        transport = stand still
Run your favorite pathfinding algorithm on this graph and pray for a good virtual memory implementation.
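For a sense of what building such a graph involves, here is a Python sketch that creates one copy of each place per minute of the day. The function name and input formats are assumptions; the example uses the intersection names from above, with a 40-minute bus ride leaving Elm St. and 2nd at 12:30.

```python
from collections import defaultdict


def build_time_expanded_graph(places, walks, bus_trips, minutes_per_day=24 * 60):
    """Return {(place, minute): [((place, minute), transport), ...]}.

    walks: (from_place, to_place, minutes) tuples, usable at any time.
    bus_trips: (from_place, depart_minute, to_place, arrive_minute) tuples.
    """
    edges = defaultdict(list)
    for minute in range(minutes_per_day - 1):
        for place in places:
            # Standing still for one minute.
            edges[(place, minute)].append(((place, minute + 1), "stand still"))
        for src, dst, duration in walks:
            if minute + duration < minutes_per_day:
                edges[(src, minute)].append(((dst, minute + duration), "foot"))
    for src, depart, dst, arrive in bus_trips:
        edges[(src, depart)].append(((dst, arrive), "bus"))
    return edges


graph = build_time_expanded_graph(
    places=["elm_st_and_2nd", "elm_st_and_1st", "main_st_and_25th"],
    walks=[("elm_st_and_2nd", "elm_st_and_1st", 5)],
    bus_trips=[("elm_st_and_2nd", 12 * 60 + 30, "main_st_and_25th", 13 * 60 + 10)],
)
for edge in graph[("elm_st_and_2nd", 12 * 60 + 30)]:
    print(edge)
```

With three places this already produces several thousand nodes, which illustrates why the approach is described as horribly inefficient for a whole city.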

You're answering the question yourself. Using A* or Dijkstra's algorithm, all you need to do is decide on a good cost per part of each route.
For the bus route, you're implying that you don't want the shortest, but the fastest route. The cost of each part of the route must therefore include the average travel speed of a bus in that part, and any waits at bus stops.
The algorithm for finding the most suitable route is then still the same as before. With A*, all the magic happens in the cost function...

You need to weight the legs differently. For example - on a rainy day I think someone might prefer to travel longer in a vehicle than walk in the rain. Additionally, someone who detests walking or is unable to walk might make a different/longer trip than someone who would not mind walking.
These edges are costs, but I think you can expand the notion/concept of costs and they can have different relative values.

The algorithm remains the same, you just increase the weight of each graph edge according to different scenarios (Bus schedules etc).
I put together a subway route finder as an exercise in graph path finding some time ago:
http://gfilter.net/code/pathfinderDemo.aspx

Related

How can one solve a network flow problem with storage tanks?

For quite some time now, I've been hacking away at this problem but never have managed to come up with an entirely satisfactory solution. It concerns network flow - where you have a graph of nodes which are imagined to have some kind of resource flowing between them, like water in pipes, or traffic on a road system, and so forth.
Such network flow problems seem to be usually given in terms of three types of nodes only: sources (i.e. resource is generated or at least emplaced into the network there), routers or junctions (splits or combines resource conservatively), and sinks (consumes, disposes, etc. of resource). And then we do something like ask how we can solve for the flows on the edges so as to try and figure out the best way to use what is available from the sources to meet the demand from the sinks, i.e. to compute the maximum flow.
But what I am interested in is how you deal with this when you add a fourth component into the mix: tanks, or parts which can "fill up" with resource to later discharge it. From the perspective of the network and depending on the amount of resource they contain, they can seemingly act like all three of the other components depending on their capacity and how they are hooked up - note that a tank can both have things feeding it and things drawing from it simultaneously, or have only feeders or only draws, so it can act in all three roles above. Moreover, depending on whether it contains content or empty space, it can likewise also change role - an empty tank cannot act as a source, obviously, nor can a full tank act as a sink, as it can't fit any more stuff into it.
For example, if the flow solver is given something like this:
then it should put a rate of 50 units/sec of flow on the left edge, 5 units/sec on the right edge, because the tank can absorb 45 units/sec.
But if the tank is hooked like this:
then the solver should put 45 units on the vertical edge as flowing out from the tank, and 5 units flowing from the source, to meet the total demand of 50 from the sink.
That is, in a graph involving a tank, the tank should "supplement" flow provided from sources to meet demand from sinks, or else should "absorb" excess flow that did not have corresponding demand. However, it must only do this while respecting what it can reach or what can reach it from the connections provided by the edges. Note here my drawings are perhaps oversimplified as they ignore the edge directions, but the intent is that the edge leading up from the tank in the second one is directed into the junction. Thus, the behavior in a different case where the source were to advertise +50 and the sink -5 should just be to route 5 U/s from the source to the sink, i.e. the usual max-flow, and the tank would not contribute any flow. If it had a bidirectional edge, then in this case it should absorb 45 U/s from the source, while in the original case behaving no different from the unidirectional case.
How can one create an algorithm to reliably generate such solutions, given only the graph and which nodes are tanks, junctions, sources, and sinks and what the supply from the sources and demand from the sinks are?

If you assume that your tanks have infinite capacity (they can absorb an infinite quantity at the 'produce' rate AND be drawn down for an infinite quantity at the 'consume' rate), then you can solve the problem using normal graph flow algorithms.
If the tanks have finite capacity, i.e. they change their behavior when they run dry or become full, then the solution changes with time, and the times depend on the initial levels of the tanks. If the tank capacities are large relative to the flow rates, the solutions will be steady state for significant periods. So you create multiple graphs, representing every possible combination of the three tank states (full, empty, or partial) for each tank, and solve each using graph theory. This will only be feasible if the number of tanks is modest.
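A short Python sketch of the enumeration step only; solving each resulting graph is left to whatever flow algorithm you use. The names are hypothetical. It shows why the approach needs a modest number of tanks: n tanks give 3 ** n graphs.

```python
from itertools import product

TANK_STATES = ("full", "empty", "partial")


def tank_roles(state):
    """Return (can_supply, can_absorb) for a tank in the given state."""
    can_supply = state != "empty"    # an empty tank cannot act as a source
    can_absorb = state != "full"     # a full tank cannot act as a sink
    return can_supply, can_absorb


def scenarios(tank_names):
    """Yield one {tank: state} mapping per combination of tank states."""
    for combo in product(TANK_STATES, repeat=len(tank_names)):
        yield dict(zip(tank_names, combo))


all_cases = list(scenarios(["T1", "T2"]))
print(len(all_cases))                # 9 graphs for two tanks
print(all_cases[0], [tank_roles(s) for s in all_cases[0].values()])
```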
If you have many tanks, and you are interested in the time behavior of your system, you will have to use a simulation approach.
There are many generic simulation packages available that can be configured to solve this problem. The challenge is to interpret the results, a task which requires good understanding of statistics.
You might also consider coding your own special purpose simulator. You do not mention your preferred coding language, but if you know C++ you can get a good start from https://github.com/JamesBremner/tankfill

Simulation Performance Metrics

This is a semi-broad question, but it's one that I feel on some level is answerable or at least approachable.
I've spent the last month or so making a fairly extensive simulation. In order to protect the interests of my employer, I won't state specifically what it does... but an analogy of what it does may be explained by... a high school dance.
A girl or boy enters the dance floor, and based on the selection of free dance partners, an optimal choice is made. After a period of time, two dancers finish dancing and are now free for a new partnership.
I've been making partner selection algorithms designed to maximize average match outcome while not sacrificing wait time for a partner too much.
I want a way to gauge / compare versions of my algorithms in order to make a selection of the optimal algorithm for any situation. This is difficult however since the inputs of my simulation are extremely large matrices of input parameters (2-5 per dancer), and the simulation takes several minutes to run (a fact that makes it difficult to test a large number of simulation inputs). I have a few output metrics, but linking them to the large number of inputs is extremely hard. I'm also interested in finding which algorithms completely fail under certain input conditions...
Any pro tips / online resources which might help me in defining input constraints / output variables which might give clarity on an optimal algorithm?

I might not understand what you exactly want. But here is my suggestion. Let me know if my solution is inaccurate/irrelevant and I will edit/delete accordingly.
Assume you have a certain metric (say compatibility of the pairs or waiting time). If you just have the average or total number for this metric over all the users, it is kind of useless. Instead you might want to find the distribution of this metric over all users. If nothing else, you should always keep track of the variance. Once you have the distribution, you can calculate the probability that a particular algorithm A is better than B for a certain metric.
If you do not have the distribution of the metric within an experiment, you can always run multiple experiments, and the number of experiments you need to run depends on the variance of the metric and difference between two algorithms.
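Both ideas can be sketched in a few lines of Python. The function names and sample numbers are made up. The probability estimate simply resamples the two sets of per-user values, and the run count uses a rough normal approximation at 95% confidence (z = 1.96), which is an assumption on my part rather than something prescribed above.

```python
import math
import random
import statistics


def prob_a_beats_b(samples_a, samples_b, draws=10_000, seed=0):
    """Estimate P(metric_A > metric_B) by resampling the two samples."""
    rng = random.Random(seed)
    wins = sum(rng.choice(samples_a) > rng.choice(samples_b)
               for _ in range(draws))
    return wins / draws


def runs_needed(stdev, difference, z=1.96):
    """Rough number of runs per algorithm to resolve a mean difference.

    The standard error of the difference of two means over n runs each,
    stdev * sqrt(2 / n), has to stay below difference / z.
    """
    return math.ceil(2 * (z * stdev / difference) ** 2)


wait_a = [3.0, 4.5, 2.0, 6.0, 3.5]   # per-dancer waiting times, algorithm A
wait_b = [5.0, 4.0, 7.5, 6.5, 5.5]   # same metric, algorithm B
print(statistics.mean(wait_a), statistics.stdev(wait_a))
print(prob_a_beats_b(wait_b, wait_a))    # chance that B makes a dancer wait longer
print(runs_needed(stdev=4, difference=2))
```

The larger the variance relative to the difference you care about, the more runs you need, which matters when each run takes several minutes.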

Neural Network Basics

I'm a computer science student and for this year's project, I need to create and apply a Genetic Algorithm to something. I think Neural Networks would be a good thing to apply it to, but I'm having trouble understanding them. I understand the general concepts, but none of the websites out there really explain the following, which is blocking my understanding:
How the decision is made for how many nodes there are.
What the nodes actually represent and do.
What part the weights and bias actually play in classification.
Could someone please shed some light on this for me?
Also, I'd really appreciate it if you have any similar ideas for what I could apply a GA to.
Thanks very much! :)

Your question is quite complex and I don't think a small answer will fully satisfy you. Let me try, nonetheless.
First of all, there must be at least three layers in your neural network (assuming a simple feedforward one). The first is the input layer and there will be one neuron per input. The third layer is the output one and there will be one neuron per output value (if you are classifying, there might be more than one if you want to assign a "belongs to" meaning to each neuron). The remaining layer is the hidden one, which will stand between the input and output. Determining its size is a complex task, as you can see in the following references:
comp.ai faq
a post on stack exchange
Nevertheless, the best way to proceed would be for you to state your problem more clearly (as well as industrial secrecy might allow) and let us think a little more about your context.

The number of input and output nodes is determined by the number of inputs and outputs you have. The number of intermediate nodes is up to you. There is no "right" number.
Imagine a simple network: inputs( age, sex, country, married ) outputs( chance of death this year ). Your network might have 2 "hidden values", one depending on age and sex, the other depending on country and married. You put weights on each. For example, Hidden1 = age * weight1 + sex * weight2. Hidden2 = country * weight3 + married * weight4. You then make another set of weights, Hidden3 and Hidden4 connecting to the output variable.
Then you get data from, say, the census, and run it through your neural network to find out what weights best match the data. You can use genetic algorithms to test different sets of weights. This is useful if you have so many edges that you could not try every possible weighting. You need to find good weights without exhaustively trying every possible set of weights, so a GA lets you "evolve" a good set of weights.
Then you test your weights on data from a different census to see how well it worked.
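Here is a toy Python sketch of that workflow: a network with two hidden values and a genetic algorithm evolving its six weights. Several things are assumptions added for illustration: the `tanh` and sigmoid squashing functions, the inputs scaled to 0..1, the GA settings, and above all the four data rows, which are invented and not census data.

```python
import math
import random

# (age, sex, country, married) scaled to 0..1, then the target chance of death.
DATA = [
    ((0.2, 0, 0.1, 1), 0.01),
    ((0.5, 1, 0.1, 0), 0.05),
    ((0.8, 0, 0.9, 1), 0.20),
    ((0.9, 1, 0.9, 0), 0.35),
]


def predict(weights, age, sex, country, married):
    """Forward pass: 4 inputs, 2 hidden values, 1 output."""
    w1, w2, w3, w4, w5, w6 = weights
    hidden1 = math.tanh(age * w1 + sex * w2)
    hidden2 = math.tanh(country * w3 + married * w4)
    return 1 / (1 + math.exp(-(hidden1 * w5 + hidden2 * w6)))  # squash to 0..1


def error(weights, data):
    """Mean squared error over (inputs, target) rows."""
    return sum((predict(weights, *x) - y) ** 2 for x, y in data) / len(data)


def evolve(data, population=30, generations=60, seed=1):
    """Evolve a weight vector; the best individuals always survive."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=lambda w: error(w, data))
        parents = pop[: population // 2]        # selection: keep the best half
        children = []
        while len(parents) + len(children) < population:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, 6)           # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(6)] += rng.gauss(0, 0.3)   # mutate one weight
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda w: error(w, data))


best = evolve(DATA)
print(error(best, DATA))
```

In a real project you would evaluate `best` on a separate data set, as described above, rather than on the rows it was evolved against.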

... my major barrier to understanding this though is understanding how the hidden layer actually works; I don't really understand how a neuron functions and what the weights are for...
Every node in the middle layer is a "feature detector" -- it will (hopefully) "light up" (i.e., be strongly activated) in response to some important feature in the input. The weights are what emphasize an aspect of the previous layer; that is, the set of input weights to a neuron correspond to what nodes in the previous layer are important for that feature.
If a weight connecting myInputNode to myMiddleLayerNode is 0, then you can tell that myInputNode is not important to whatever feature myMiddleLayerNode is detecting. If, though, the weight connecting myInputNode to myMiddleLayerNode is very large (either positive or negative), you know that myInputNode is quite important (if it's very negative it means "No, this feature is almost certainly not there", while if it's very positive it means "Yes, this feature is almost certainly there").
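A single middle-layer node can be sketched in Python as a weighted sum followed by a squashing function. The sigmoid and the specific weights are assumptions chosen to show the three cases: a zero weight, a large positive weight and a large negative weight.

```python
import math


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the previous layer, squashed to the range 0..1."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))


# Input 0 is ignored, input 1 argues for the feature, input 2 against it.
weights = [0.0, 6.0, -6.0]

print(neuron([5.0, 0.0, 0.0], weights))  # 0.5: input 0 has no influence
print(neuron([0.0, 1.0, 0.0], weights))  # close to 1: feature almost certainly there
print(neuron([0.0, 0.0, 1.0], weights))  # close to 0: feature almost certainly absent
```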
So a corollary of this is that you want the number of your middle-layer nodes to have a correspondence to how many features are needed to classify the input: too few middle-layer nodes and it will be hard to converge during training (since every middle-layer node will have to "double up" on its feature-detection) while too many middle-layer nodes may over-fit your data.
So... a possible use of a genetic algorithm would be to design the architecture of your network! That is, use a GA to set the number of middle-layer nodes and initial weights. Some instances of the population will converge faster and be more robust -- these could be selected for future generations. (Personally, I've never felt this was a great use of GAs since I think it's often faster just to trial-and-error your way into a decent NN architecture, but using GAs this way is not uncommon.)

You might find this wikipedia page on NeuroEvolution of Augmenting Topologies (NEAT) interesting. NEAT is one example of applying genetic algorithms to create the neural network topology.

The best way to explain an Artificial Neural Network (ANN) is to provide the biological process that it attempts to simulate - a neural network. The best example of one is the human brain. So how does the brain work (highly simplified for CS)?
The functional unit (for our purposes) of the brain is the neuron. It is a potential accumulator and "disperser". What that means is that after a certain amount of electric potential (think filling a balloon with air) has been reached, it "fires" (balloon pops). It fires electric signals down any connections it has.
How are neurons connected? Synapses. These synapses can have various weights (in real life due to stronger/weaker synapses from thicker/thinner connections). These weights allow a certain amount of a fired signal to pass through.
You thus have a large collection of neurons connected by synapses - the base representation for your ANN. Note that the input/output structures described by the others are an artifact of the type of problem to which ANNs are applied. Theoretically, any neuron can accept input as well. It serves little purpose in computational tasks however.
So now on to ANNs.
NEURONS: Neurons in an ANN are very similar to their biological counterpart. They are modeled either as step functions (that signal out "1" after a certain combined input signal, or "0" at all other times), or slightly more sophisticated firing sequences (arctan, sigmoid, etc) that produce a continuous output, though scaled similarly to a step. This is closer to the biological reality.
SYNAPSES: These are extremely simple in ANNs - just weights describing the connections between Neurons. They are used simply to weight the neurons that are connected to the current one, but still play a crucial role: synapses are the cause of the network's output. To clarify, the training of an ANN with a set structure and neuron activation function is simply the modification of the synapse weights. That is it. No other change is made in going from a "dumb" net to one that produces accurate results.
STRUCTURE:
There is no "correct" structure for a neural network. The structures are either
a) chosen by hand, or
b) allowed to grow as a result of learning algorithms (a la Cascade-Correlation Networks).
Assuming a hand-picked structure, it is actually chosen through careful analysis of the problem and the expected solution. Too few "hidden" neurons/layers, and your structure is not complex enough to approximate a complex function. Too many, and your training time rapidly grows unwieldy. For this reason, the selection of inputs ("features") and the structure of a neural net are, IMO, 99% of the problem. The training and usage of ANNs is trivial in comparison.
To now address your GA concern, it is one of many, many approaches used to train the network by modifying the synapse weights. Why? Because in the end, a neural network's output is simply an extremely high-order surface in N dimensions. ANY surface optimization technique can be used to solve for the weights, and GAs are one such technique. The simple backpropagation method is akin to a dimension-reduced gradient-based optimization technique.
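As a rough illustration of "training is just optimizing the weights", here is a sketch that fits a single sigmoid neuron to the logical AND function with plain gradient descent. The choice of log-loss, the learning rate and the epoch count are my assumptions; a GA or any other optimizer could replace the update loop.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, inputs):
    # weights[0] is a bias term; the rest pair up with the inputs.
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(total)

def train(samples, n_inputs, rate=0.5, epochs=2000):
    """Gradient descent on log-loss: only the weights ever change."""
    weights = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for inputs, target in samples:
            err = predict(weights, inputs) - target
            gradient[0] += err
            for i, x in enumerate(inputs):
                gradient[i + 1] += err * x
        weights = [w - rate * g for w, g in zip(weights, gradient)]
    return weights

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
trained = train(AND, 2)
```

The structure (one neuron, two inputs) and the activation function stay fixed throughout; the "dumb" net and the trained one differ only in the three numbers in trained.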

Algorithm for finding the best routes for food distribution in game

I'm designing a city building game and got into a problem.
Imagine Sierra's Caesar III game mechanics: you have many city districts with one market each. There are several granaries some distance away, connected by a directed weighted graph. The difference: people (here, cars) are units that form traffic jams (this is where the graph weights come from).
Note: in the Caesar game series, people harvested food and stockpiled it in several big granaries, whereas many markets (small shops) took food from the granaries and delivered it to the citizens.
The task: tell each district where it should get its food from, taking the least time and minimizing congestion on the city's roads.
Map example
Suppose that the yellow districts need 7, 7 and 4 apples respectively.
The bluish granaries have 7 and 11 apples respectively.
Suppose edge weights are proportional to their length. Then the solution should be something like the gray numbers indicated on the edges. E.g., the first district gets 4 apples from the 1st and 3 apples from the 2nd granary, while the last district gets 4 apples from only the 2nd granary.
Here, vertical roads are first occupied to the max, and then the remaining workers are sent to the diagonal paths.
Question
What practical and very fast algorithm should I use? I was looking at some papers (Congestion Games: Optimization in Competition etc.) describing congestion games, but could not get the big picture.

You want to look into the Max-flow problem. Seems like in this case it is a bipartite graph, which should make things easier to visualize.

This is a Multi-source Multi-sink Maximum Flow Problem which can easily be converted into a simple Maximum Flow Problem by creating a super source and a super sink as described in the link. There are many efficient solutions to Maximum Flow Problems.
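A minimal sketch of the super source / super sink trick, using Edmonds-Karp for the flow itself. The node names (S, T, G1, G2, D1..D3) and the road capacities are hypothetical, picked to mirror the apple example from the question; note that plain max flow ignores road lengths, so treat this as a starting point only.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: push flow along shortest augmenting paths."""
    # residual[u][v] is the capacity still available from u to v.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Find the bottleneck on the path, then update the residuals.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

roads = {
    "S": {"G1": 7, "G2": 11},           # super source: granary stock
    "G1": {"D1": 4, "D2": 3},           # hypothetical road capacities
    "G2": {"D1": 3, "D2": 4, "D3": 4},
    "D1": {"T": 7},                     # super sink: district demand
    "D2": {"T": 7},
    "D3": {"T": 4},
}
delivered = max_flow(roads, "S", "T")
```

If delivered equals the total demand, every district can be fed with the given stock and road capacities.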

One thing you could do, which would address the incremental update problem discussed in another answer and which might also be cheaper to compute, is forget about a globally optimal solution. Let each villager participate in something like ant colony optimization.
Consider preventing the people on the bottom-right-hand yellow node in your example from squeezing out those on the far-right-hand yellow node by allowing the people at the far-right-hand yellow node to bid up the "price" of buying resources from the right-hand blue node, which would encourage some of those from the bottom-right-hand yellow node to take the slightly longer walk to the left-hand blue node.

I agree with Larry and mathmike, it certainly seems like this problem is a specialization of network flow.
On another note, the problem may get easier if your final algorithm finds a spanning tree for each market to its resources (granaries), consumes those resources greedily based on shortest path first, then moves onto the next resource pile.
It may help to think about it in terms of using a road to max capacity first (maximizing road efficiency), rather than trying to minimize congestion.
This goes to the root of the problem - in general, it's easier to find close to optimal solutions in graph problems and in terms of game dev, close to optimal is probably good enough.
Edit: Wanted to also point out that mathmike's link to Wikipedia also talks about Maximum Flow Problem with Vertex Capacities where each of your granaries can be thought of as vertices with finite capacity.

Something you have to note is that your game is continuous. If you have a solution X at time t, and some small change occurs (e.g. the player builds another road, or one of the cities gains more population), the solution that the Max Flow algorithms give you may change drastically, but you'd probably want the solution at t+1 to be similar to X. A totally different solution at each time step is unrealistic (1 new road is built at the southern end of the map, and all routes are automatically re-calculated).
I would use some algorithm to calculate the initial solution (or when a major change happens, like an earthquake destroying 25% of the roads), but most of the time only update it incrementally. That means: define some form of valid transformation on a solution (e.g. 1 city tries to get 1 food unit from a different granary than it does now), try the update (simulate the expected congestion), and keep the updated solution if it's better than the existing one. Run this step N times after each game turn or some unit of time.
It's both computationally efficient (no need to run a full Max Flow every second) and will get you more realistic, smooth changes in behavior.
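A sketch of one such incremental step, under my own assumptions: a solution is a dict mapping (district, granary) to units carried, and the cost function (road length times load, plus a quadratic congestion term) is a made-up stand-in for whatever congestion simulation the game actually uses.

```python
def cost(assignment, length):
    """Hypothetical cost: travel grows with load, congestion with load squared."""
    return sum(length[road] * units + units ** 2
               for road, units in assignment.items())

def improve_once(assignment, length, stock):
    """Try moving one unit to another granary; keep the best improvement."""
    best, best_cost = assignment, cost(assignment, length)
    for (district, src), units in assignment.items():
        if units == 0:
            continue
        for (other, dst) in length:
            if other != district or dst == src:
                continue
            used = sum(u for (_, g), u in assignment.items() if g == dst)
            if used >= stock[dst]:
                continue  # that granary has nothing left to give
            trial = dict(assignment)
            trial[(district, src)] -= 1
            trial[(district, dst)] = trial.get((district, dst), 0) + 1
            trial_cost = cost(trial, length)
            if trial_cost < best_cost:
                best, best_cost = trial, trial_cost
    return best

length = {("D1", "G1"): 1, ("D1", "G2"): 2}
stock = {"G1": 10, "G2": 10}
state = {("D1", "G1"): 7, ("D1", "G2"): 0}
while True:
    candidate = improve_once(state, length, stock)
    if candidate == state:
        break
    state = candidate
```

In a game you would call improve_once a fixed number of times per turn rather than looping until nothing changes, so routes drift toward a better solution instead of jumping.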

It might be more fun to have a dynamic that models a behavior resulting in a good reasonable solution, rather than finding an ideal solution to drive the behavior. Suppose you plan each trip individually. If you're a driver and you need to get from point A to point B, how would you get there? You might consider a few things:
I know about typical traffic conditions at this hour and I'll try to find ways around roads that are usually busy. You might model this as an averaged traffic value at different times, as the motorists don't necessarily have perfect information about the current traffic, but may learn and identify trends over time.
I don't like long, confusing routes with a lot of turns. When planning a trip, you might penalize those with many edges.
If speed limits and traffic lights are included in your model, I'd want to avoid long stretches with low speed limits and/or a lot of traffic lights. I'd prefer freeways or highways for longer trips, even if they have more traffic.
There may be other interesting dynamics that evolve from considering the problem behaviorally rather than as a pure optimization. In real life, traffic rarely converges on optimal solutions, so a big part of the challenge in transportation engineering is coming up with incentives, penalties and designs that encourage a better solution from the natural dynamics playing out in the drivers' decisions.

Clustering Algorithm for Paper Boys

I need help selecting or creating a clustering algorithm according to certain criteria.
Imagine you are managing newspaper delivery persons.
You have a set of street addresses, each of which is geocoded.
You want to cluster the addresses so that each cluster is assigned to a delivery person.
The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
Each cluster should have about the same number of addresses. However, a cluster may have less addresses if a cluster's addresses are more spread out. (Worded another way: minimum number of clusters where each cluster contains a maximum number of addresses, and any address within cluster must be separated by a maximum distance.)
For bonus points, when the data set is altered (address added or removed), and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (ie. this rules out simple k-means clustering which is random in nature). Otherwise the delivery persons will go crazy.
So... ideas?
UPDATE
The street network graph, as described in Arachnid's answer, is not available.

I've written an inefficient but simple algorithm in Java to see how close I could get to doing some basic clustering on a set of points, more or less as described in the question.
The algorithm works on a list of (x,y) coords ps that are specified as ints. It takes three other parameters as well:
radius (r): given a point, what is the radius for scanning for nearby points
max addresses (maxA): what are the maximum number of addresses (points) per cluster?
min addresses (minA): minimum addresses per cluster
Set limitA=maxA.
Main iteration:
Initialize empty list possibleSolutions.
Outer iteration: for every point p in ps.
Initialize empty list pclusters.
A worklist of points wps=copy(ps) is defined.
Workpoint wp=p.
Inner iteration: while wps is not empty.
Remove the point wp in wps. Determine all the points wpsInRadius in wps that are at a distance < r from wp. Sort wpsInRadius ascendingly according to the distance from wp. Keep the first min(limitA, sizeOf(wpsInRadius)) points in wpsInRadius. These points form a new cluster (list of points) pcluster. Add pcluster to pclusters. Remove points in pcluster from wps. If wps is not empty, wp=wps[0] and continue inner iteration.
End inner iteration.
A list of clusters pclusters is obtained. Add this to possibleSolutions.
End outer iteration.
We have for each p in ps a list of clusters pclusters in possibleSolutions. Every pclusters is then weighted. If avgPC is the average number of points per cluster in possibleSolutions (global) and avgCSize is the average number of clusters per pclusters (global), then this is the function that uses both these variables to determine the weight:
private static WeightedPClusters weigh(List<Cluster> pclusters, double avgPC, double avgCSize)
{
    double weight = 0;
    for (Cluster cluster : pclusters)
    {
        // Number of points in this cluster.
        int ps = cluster.getPoints().size();
        // Penalize clusters whose size deviates from the global average.
        double psAvgPC = ps - avgPC;
        weight += psAvgPC * psAvgPC / avgCSize;
        // Penalize spread-out clusters: surface covered per point.
        weight += cluster.getSurface() / ps;
    }
    return new WeightedPClusters(pclusters, weight);
}
The best solution is now the pclusters with the least weight. We repeat the main iteration as long as we can find a better solution (less weight) than the previous best one with limitA=max(minA,(int)avgPC). End main iteration.
Note that for the same input data this algorithm will always produce the same results. Lists are used to preserve order and there is no randomness involved.
To see how this algorithm behaves, this is an image of the result on a test pattern of 32 points. If maxA=minA=16, then we find 2 clusters of 16 addresses.
(source: paperboyalgorithm at sites.google.com)
Next, if we decrease the minimum number of addresses per cluster by setting minA=12, we find 3 clusters of 12/12/8 points.
(source: paperboyalgorithm at sites.google.com)
And to demonstrate that the algorithm is far from perfect, here is the output with maxA=7, yet we get 6 clusters, some of them small. So you still have to guess too much when determining the parameters. Note that r here is only 5.
(source: paperboyalgorithm at sites.google.com)
Just out of curiosity, I tried the algorithm on a larger set of randomly chosen points. I added the images below.
Conclusion? This took me half a day, it is inefficient, the code looks ugly, and it is relatively slow. But it shows that it is possible to produce some result in a short period of time. Of course, this was just for fun; turning this into something that is actually useful is the hard part.
(source: paperboyalgorithm at sites.google.com)
(source: paperboyalgorithm at sites.google.com)

What you are describing is a (Multi)-Vehicle-Routing-Problem (VRP). There's quite a lot of academic literature on different variants of this problem, using a large variety of techniques (heuristics, off-the-shelf solvers etc.). Usually the authors try to find good or optimal solutions for a concrete instance, which then also implies a clustering of the sites (all sites on the route of one vehicle).
However, the clusters may be subject to major changes with only slightly different instances, which is what you want to avoid. Still, something in the VRP-Papers may inspire you...
If you decide to stick with the explicit clustering step, don't forget to include your distribution in all clusters, as it is part of each route.
For evaluating the clusters, using a graph representation of the street grid will probably yield more realistic results than connecting the dots on a white map (although both are TSP variants). If a graph model is not available, you can use the taxicab metric (|x_1 - x_2| + |y_1 - y_2|) as an approximation for the distances.
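The taxicab approximation is short enough to write out. This is only a sketch; the function names are mine, and the points are assumed to be (x, y) pairs in some projected coordinate system where adding x and y differences makes sense.

```python
def taxicab(a, b):
    """Manhattan distance |x_1 - x_2| + |y_1 - y_2| between two points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def route_length(stops, metric=taxicab):
    """Total length of a route that visits the stops in the given order."""
    return sum(metric(p, q) for p, q in zip(stops, stops[1:]))
```

For example, route_length([(0, 0), (3, 4), (3, 0)]) is 7 for the first leg plus 4 for the second.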

I think you want a hierarchical agglomeration technique rather than k-means. If you get your algorithm right you can stop it when you have the right number of clusters. As someone else mentioned, you can seed subsequent clusterings with previous solutions, which may give you a significant performance improvement.
You may want to look closely at the distance function you use, especially if your problem has high dimension. Euclidean distance is the easiest to understand but may not be the best, look at alternatives such as Mahalanobis.
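Here is a naive sketch of agglomerative clustering adapted to the question's constraints. The stopping rule is my assumption: instead of a target cluster count, merging stops when no pair of clusters can be joined without exceeding max_size addresses or max_diameter (complete linkage, plain Euclidean distance). It is cubic in the number of points, so it is only meant to show the idea.

```python
import math

def agglomerate(points, max_diameter, max_size):
    """Repeatedly merge the closest pair of clusters that still fits."""
    clusters = [[p] for p in points]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i]) + len(clusters[j]) > max_size:
                    continue
                # Complete linkage: the farthest pair decides the distance.
                d = max(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d <= max_diameter and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
```

Because ties are broken by list order and nothing is random, the same input always yields the same clusters.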
I'm presuming that your real problem has nothing to do with delivering newspapers...

Have you thought about using an economic/market based solution? Divide the set up by an arbitrary (but constant to avoid randomness effects) split into even subsets (as determined by the number of delivery persons).
Assign a cost function to each point by how much it adds to the graph, and give each extra point an economic value.
Iterate allowing each person in turn to auction their worst point, and give each person a maximum budget.
This probably matches fairly well how the delivery people would think in real life, as people will find swaps, or will say "my life would be so much easier if I didn't do this one or two". It is also pretty flexible (for example, it would allow one point miles away from any others to be given a premium fairly easily).

I would approach it differently: Considering the street network as a graph, with an edge for each side of each street, find a partitioning of the graph into n segments, each no more than a given length, such that each paperboy can ride a single continuous path from the start to the end of their route. This way, you avoid giving people routes that require them to ride the same segments repeatedly (eg, when asked to cover both sides of a street without covering all the surrounding streets).

This is a very quick and dirty method of discovering where your "clusters" lie. This was inspired by the game "Minesweeper."
Divide your entire delivery space up into a grid of squares. Note - it will take some tweaking of the size of the grid before this will work nicely. My intuition tells me that a square size roughly the size of a physical neighbourhood block will be a good starting point.
Loop through each square and store the number of delivery locations (houses) within each block. Use a second loop (or some clever method on the first pass) to store the number of delivery points for each neighbouring block.
Now you can operate on this grid in a similar way to photo manipulation software. You can detect the edges of clusters by finding blocks where some neighbouring blocks have no delivery points in them.
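The counting passes could look roughly like this. It is a sketch with names of my own choosing; cell is the side length of a grid square in the same units as the coordinates.

```python
from collections import Counter

def grid_counts(points, cell):
    """Count delivery points per grid square, keyed by (column, row)."""
    return Counter((int(x // cell), int(y // cell)) for x, y in points)

def neighbour_counts(counts):
    """For each occupied square, total points in its 8 surrounding squares."""
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
               if (dx, dy) != (0, 0)]
    return {(cx, cy): sum(counts.get((cx + dx, cy + dy), 0)
                          for dx, dy in offsets)
            for (cx, cy) in counts}

def isolated_squares(counts):
    """Occupied squares whose neighbours are all empty."""
    return [sq for sq, n in neighbour_counts(counts).items() if n == 0]
```

Squares with a low neighbour count mark the edges of a cluster, and a count of zero marks an isolated pocket of deliveries.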
Finally you need a system that combines number of deliveries made as well as total distance travelled to create and assign routes. There may be some isolated clusters with just a few deliveries to be made, and one or two super clusters with many homes very close to each other, requiring multiple delivery people in the same cluster. Every home must be visited, so that is your first constraint.
Derive a maximum allowable distance to be travelled by any one delivery person on a single run. Next do the same for the number of deliveries made per person.
The first ever run of the routing algorithm would assign a single delivery person, send them to any random cluster with not all deliveries completed, let them deliver until they hit their delivery limit or they have delivered to all the homes in the cluster. If they have hit the delivery limit, end the route by sending them back to home base. If they could safely travel to the nearest cluster and then home without hitting their max travel distance, do so and repeat as above.
Once the route is finished for the current delivery person, check if there are homes that have not yet had a delivery. If so, assign another delivery person, and repeat the above algorithm.
This will generate initial routes. I would store all the info - the location and dimensions of each square, the number of homes within a square and all of its direct neighbours, the cluster to which each square belongs, the delivery people and their routes - I would store all of these in a database.
I'll leave the recalc procedure up to you - but having all the current routes, clusters, etc in a database will enable you to keep all historic routes, and also try various scenarios to see how best to adapt to changes while creating the least possible changes to existing routes.

This is a classic example of a problem that deserves an optimized solution rather than trying to solve for "The OPTIMUM". It's similar in some ways to the "Travelling Salesman Problem", but you also need to segment the locations during the optimization.
I've used three different optimization algorithms to good effect on problems like this:
Simulated Annealing
Great Deluge Algorithm
Genetic Algorithms
Using an optimization algorithm, I think you've described the following "goals":
1. The geographic area for each paper boy should be minimized.
2. The number of subscribers served by each should be approximately equal.
3. The distance travelled by each should be about equal.
4. (And one you didn't state, but might matter) The route should end where it began.
Hope this gets you started!
* Edit *
If you don't care about the routes themselves, that eliminates goals 3 and 4 above, and perhaps allows the problem to be more tailored to your bonus requirements.
If you take demographic information into account (such as population density, subscription adoption rate and subscription cancellation rate) you could probably use the optimization techniques above to eliminate the need to rerun the algorithm at all as subscribers adopted or dropped your service. Once the clusters were optimized, they would stay in balance because the rates of each for an individual cluster matched the rates for the other clusters.
The only time you'd have to rerun the algorithm is when an external factor (such as a recession/depression) caused changes in the behavior of a demographic group.

Rather than a clustering model, I think you really want some variant of the Set Covering location model, with an additional constraint to cover the number of addresses covered by each facility. I can't really find a good explanation of it online. You can take a look at this page, but they're solving it using areal units and you probably want to solve it in either euclidean or network space. If you're willing to dig up something in dead tree format, check out chapter 4 of Network and Discrete Location by Daskin.

Good survey of simple clustering algos. There is more though:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html

Perhaps a minimum spanning tree of the customers, broken into sets based on locality to the paper boy. Use Prim's or Kruskal's algorithm to get the MST, with the distance between houses as the weight.
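One way to read this suggestion, sketched below: build the MST with Prim's algorithm, then drop the k-1 longest edges so the tree falls apart into k groups. Cutting the longest edges is my interpretation of "broken into sets", and straight-line distance stands in for the real distance between houses.

```python
import math

def prim_mst(points):
    """Return MST edges as (weight, i, j) over point indices, O(n^2)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known link from the tree to each point
    link = [None] * n       # which tree point that link comes from
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best[i])
        in_tree[u] = True
        if link[u] is not None:
            edges.append((best[u], link[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], link[v] = d, u
    return edges

def split_mst(points, k):
    """Drop the k-1 longest MST edges; return groups of point indices."""
    kept = sorted(prim_mst(points))[: len(points) - k]
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, a, b in kept:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Note that this does nothing to balance the group sizes, so it would still need a size limit or a rebalancing pass on top.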

I know of a pretty novel approach to this problem that I have seen applied to Bioinformatics, though it is valid for any sort of clustering problem. It's certainly not the simplest solution but one that I think is very interesting. The basic premise is that clustering involves multiple objectives. For one, you want to minimise the number of clusters, the trivial solution being a single cluster with all the data. The second standard objective is to minimise the amount of variance within a cluster, the trivial solution being many clusters each with only a single data point. The interesting solutions come about when you try to include both of these objectives and optimise the trade-off.
At the core of the proposed approach is something called a memetic algorithm that is a little like a genetic algorithm, which steve mentioned, however it not only explores the solution space well but also has the ability to focus in on interesting regions, i.e. solutions. At the very least I recommend reading some of the papers on this subject as memetic algorithms are an unusual approach, though a word of warning; it may lead you to read The Selfish Gene and I still haven't decided whether that was a good thing... If algorithms don't interest you then maybe you can just try and express your problem as the format requires and use the source code provided. Related papers and code can be found here: Multi Objective Clustering

This is not directly related to the problem, but something I've heard and which should be considered if this is truly a route-planning problem you have. This would affect the ordering of the addresses within the set assigned to each driver.
UPS has software which generates optimum routes for their delivery people to follow. The software tries to maximize the number of right turns that are taken during the route. This saves them a lot of time on deliveries.
For people that don't live in the USA the reason for doing this may not be immediately obvious. In the US people drive on the right side of the road, so when making a right turn you don't have to wait for oncoming traffic if the light is green. Also, in the US, when turning right at a red light you (usually) don't have to wait for green before you can go. If you're always turning right then you never have to wait for lights.
There's an article about it here:
http://abcnews.go.com/wnt/story?id=3005890

You can have k-means or expectation maximization keep the clusters as unchanged as possible by using the previous cluster as a clustering feature. Getting each cluster to have the same number of items seems a bit trickier. I can think of how to do it as a post-clustering step, by doing k-means and then shuffling some points until things balance, but that doesn't seem very efficient.

A trivial answer which does not get any bonus points:
One delivery person for each address.

You have a set of street addresses, each of which is geocoded.
You want to cluster the addresses so that each cluster is assigned to a delivery person.
The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
Each cluster should have about the same number of addresses. However, a cluster may have less addresses if a cluster's addresses are more spread out. (Worded another way: minimum number of clusters where each cluster contains a maximum number of addresses, and any address within cluster must be separated by a maximum distance.)
For bonus points, when the data set is altered (address added or removed), and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (ie. this rules out simple k-means clustering which is random in nature). Otherwise the delivery persons will go crazy.
As has been mentioned, a Vehicle Routing Problem is probably better suited... Although it strictly isn't designed with clustering in mind, it will optimize to assign based on the nearest addresses. Therefore your clusters will actually be the recommended routes.
If you provide a maximum number of deliverers and then try to reach the optimal solution, this should tell you the minimum that you require. This deals with point 2.
The same number of addresses can be obtained by providing a limit on the number of addresses to be visited, basically assigning a stock value (now it's a capacitated vehicle routing problem).
Adding time windows or hours that the delivery persons work helps reduce the load if addresses are more spread out (now a capacitated vehicle routing problem with time windows).
If you use a nearest neighbour algorithm then you can get identical results each time. Removing a single address shouldn't have too much impact on your final result, so that should deal with the last point.
I'm actually working on a C# class library to achieve something like this, and think it's probably the best route to go down, although not necessarily easy to implement.
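To illustrate the nearest neighbour idea with a capacity limit, here is a rough Python sketch (not the C# library mentioned above). The depot location, the per-route capacity and straight-line distance are all assumptions for the example.

```python
import math

def nearest_neighbour_routes(depot, addresses, capacity):
    """Greedy capacitated routes: always visit the closest unvisited address."""
    remaining = list(addresses)
    routes = []
    while remaining:
        route, position = [], depot
        while remaining and len(route) < capacity:
            nxt = min(remaining, key=lambda a: math.dist(position, a))
            remaining.remove(nxt)
            route.append(nxt)
            position = nxt
        routes.append(route)  # the deliverer returns to the depot here
    return routes
```

Ties are broken by input order, so the same address list always gives the same routes, and the number of routes returned is the number of deliverers needed at that capacity.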

I acknowledge that this will not necessarily provide clusters of roughly equal size:
One of the best current techniques in data clustering is Evidence Accumulation. (Fred and Jain, 2005)
What you do is:
Given a data set with n patterns:
1. Use an algorithm like k-means over a range of k, or use a set of different algorithms; the goal is to produce an ensemble of N partitions.
2. Create a co-association matrix C of size n x n.
3. For each partition p in the ensemble:
3.1 Update the co-association matrix: for each pattern pair (i, j) that belongs to the same cluster in p, set C(i, j) = C(i, j) + 1/N.
4. Use a clustering algorithm such as Single Link and apply the matrix C as the proximity measure. Single Link gives a dendrogram as its result, in which we choose the clustering with the longest lifetime.
I'll provide descriptions of SL and k-means if you're interested.
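The co-association step can be sketched as follows. It assumes the ensemble has already been produced elsewhere and that each partition is given as a list of cluster labels, one per pattern; that representation is my choice for the example.

```python
def co_association(partitions, n):
    """Fraction of partitions in which patterns i and j share a cluster."""
    matrix = [[0.0] * n for _ in range(n)]
    share = 1.0 / len(partitions)   # the 1/N vote of a single partition
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += share
    return matrix

# Two partitions of three patterns, e.g. from k-means runs with different seeds.
ensemble = [[0, 0, 1], [0, 1, 1]]
votes = co_association(ensemble, 3)
```

Entries close to 1 mean the two patterns almost always end up together, which is what the final Single Link pass then uses as its proximity measure.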

I would use a basic algorithm to create a first set of paperboy routes according to where they live, and current locations of subscribers, then:
when paperboys are:
Added: They take locations from one or more paperboys working in the same general area from where the new guy lives.
Removed: His locations are given to the other paperboys, using the closest locations to their routes.
when locations are:
Added : Same thing, the location is added to the closest route.
Removed: just removed from that boy's route.
Once a quarter, you could re-calculate the whole thing and change all the routes.
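The add/remove rules for locations could be sketched like this. Routes are assumed to be a dict from paperboy name to a non-empty list of (x, y) stops, and "closest route" is taken to mean the route with the nearest existing stop; both are my assumptions.

```python
import math

def closest_route(routes, location):
    """Name of the route that has the stop nearest to the location."""
    return min(routes,
               key=lambda name: min(math.dist(location, stop)
                                    for stop in routes[name]))

def add_location(routes, location):
    """Append a new subscriber to the closest existing route."""
    name = closest_route(routes, location)
    routes[name].append(location)
    return name

def remove_location(routes, location):
    """Drop a cancelled subscriber from whichever route holds it."""
    for stops in routes.values():
        if location in stops:
            stops.remove(location)
            return
```

Only one route changes per subscriber change, which keeps the other paperboys' routes stable between the quarterly recalculations.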
