Matching algorithm - algorithm

An odd question, more about logic than code; I hope it's OK to post it here.
I have a data structure that can be thought of as a graph.
Each node can support many links, but the number of links is limited to a per-node value.
All links are bidirectional, and each link has a cost. The cost depends on the Euclidean distance between the two nodes, the minimum value of a parameter across the two nodes, and a global modifier.
I wish to find the set of links that maximizes the total cost of the graph.
I am wondering whether there is a clever way to find such a matching rather than going through it by brute force, which is ugly, and which I'm not sure I could even do without spending 7 million years running it.
To clarify:
Global variable = T
many nodes N each have E,X,Y,L
L is the max number of links each node can have.
cost of link A,B = sqrt( min(a.E, b.E) ) x ( 1 + sqrt( sqrt( (a.X - b.X)^2 + (a.Y - b.Y)^2 ) ) / 75 + sqrt(T) / 10 )
total cost = sum over all links, and we wish to maximize this.
Average values for nodes are 40-50, but they can range over 20..600.
The average node linking factor is 3, with a range of 0-10.
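The link cost described above can be written out directly. This is a minimal Python sketch, assuming the formula reads as sqrt(min(E)) times (1 + sqrt(distance)/75 + sqrt(T)/10); the `Node` class and the `link_cost` name are illustrative and not part of the original post.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    e: float  # parameter E
    x: float  # X coordinate
    y: float  # Y coordinate
    l: int    # L, the maximum number of links this node can have


def link_cost(a: Node, b: Node, t: float) -> float:
    """Cost of the bidirectional link between a and b for global modifier t."""
    distance = math.hypot(a.x - b.x, a.y - b.y)  # Euclidean distance
    return math.sqrt(min(a.e, b.e)) * (
        1 + math.sqrt(distance) / 75 + math.sqrt(t) / 10
    )
```

The total cost of a candidate solution is then the sum of `link_cost` over its chosen links, subject to each node having at most `l` links.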

For the sake of completeness, for anybody else who looks at this question, I would suggest revisiting your graph theory algorithms:
Dijkstra
A*
Greedy
Depth-first / breadth-first search
Even dynamic programming (in some situations)
etc.
Somewhere in there is the correct solution for your problem. I would suggest looking at Dijkstra first.
I hope this helps someone.

If I understand the problem correctly, there is likely no polynomial solution. Therefore I would implement the following algorithm:
Find some solution by being greedy. To do that, sort all edges by cost and then go through them starting with the highest, adding an edge to your graph while possible, and skipping it when one of its nodes can't accept more edges.
Look at your edges and try to change them to achieve a higher cost by using a heuristic. The first that comes to my mind: cycle through all 4-tuples of nodes (A,B,C,D), and if your current graph has edges AB, CD but AC, BD would be better, then make the change.
Optionally, do the same thing with 6-tuples, or use other genetic algorithms (they are called that because they work by mutations).
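The greedy pass and the 4-tuple exchange described above can be sketched as follows. This is a rough Python illustration, not a definitive implementation: it assumes costs are given as a dict keyed by sorted node pairs, and the names `greedy_links`, `improve_by_swaps` and `total` are made up for the example.

```python
from itertools import combinations


def key(u, v):
    """Canonical (sorted) key for an undirected link."""
    return (u, v) if u < v else (v, u)


def total(chosen, cost):
    return sum(cost[e] for e in chosen)


def greedy_links(cost, capacity):
    """Take links from most to least expensive while both ends have room."""
    remaining = dict(capacity)
    chosen = set()
    for u, v in sorted(cost, key=cost.get, reverse=True):
        if remaining[u] > 0 and remaining[v] > 0:
            chosen.add((u, v))
            remaining[u] -= 1
            remaining[v] -= 1
    return chosen


def improve_by_swaps(chosen, cost):
    """Replace links AB, CD by AC, BD (or AD, BC) while that raises the total.

    Every node keeps its number of links, so capacities stay satisfied.
    """
    chosen = set(chosen)
    improved = True
    while improved:
        improved = False
        for e1, e2 in combinations(sorted(chosen), 2):
            (a, b), (c, d) = e1, e2
            if len({a, b, c, d}) < 4:
                continue
            current = cost[e1] + cost[e2]
            for f1, f2 in ((key(a, c), key(b, d)), (key(a, d), key(b, c))):
                if f1 in chosen or f2 in chosen:
                    continue
                if f1 not in cost or f2 not in cost:
                    continue
                if cost[f1] + cost[f2] > current:
                    chosen -= {e1, e2}
                    chosen |= {f1, f2}
                    improved = True
                    break
            if improved:
                break  # restart the scan on the modified link set
    return chosen
```

On four nodes with capacity 1 each and costs AB=10, AC=9, BD=9 (all other links 1), the greedy pass picks AB and CD for a total of 11, and the swap pass then exchanges them for AC and BD, total 18. The result is still only a local optimum in general.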

This is equivalent to the traveling salesman problem (and is therefore NP-Complete) since if you could solve this problem efficiently, you could solve TSP simply by replacing each cost with its reciprocal.
This means you can't expect to solve it exactly in reasonable time. On the other hand, it means that you can do exactly as I said (replace each cost with its reciprocal) and then use any of the known TSP approximation methods on this problem.

Seems like a max flow problem to me.

Would it work to greedily select the next most expensive option from a given start point (omitting jumps to already visited nodes) and stop once all nodes are visited? If you reach a dead end, backtrack to the previous spot that is not a dead end and continue selecting greedily from there. It would require some work, and probably something like a stack to keep your paths in. I think this would work quite effectively provided the costs are well ordered and non-negative.

Use genetic algorithms. They are designed for problems like the one you describe and can find good solutions far faster than an exhaustive search. Check for an AI library in your language of choice.

Related

Finding fastest path at a cost, less or equal to a specified

Here's visualisation of my problem.
I've been trying to use Dijkstra's algorithm on it; however, it hasn't worked.
The complication, as I see it, is that Dijkstra's algorithm throws away information that you need to keep around: if you are trying to get from A to E in
    B
   / \
  A   D - E
   \ /
    C
And ABD is shorter than ACD, Dijkstra's will forget that ACD was ever a possibility (it uses ACD as the canonical route from A to D). But if ABD has a higher cost than ACD, and ABDE is above the quota while ACDE is below, the now eliminated ACD was correct. The problem is that Dijkstra's algorithm assumes that if one path is at least as long as another, it is weakly dominated: there is no reason to prefer it. And in one dimension of comparison, paths are weakly ordered: given any two paths, one weakly dominates the other.
But here we have two dimensions of comparison, and so ordering does not hold: one path can be shorter, the other cheaper. Since we can only discard dominated paths, we must keep all paths that do not already exceed the budget and are not dominated. I have put a bit of work into implementing this approach; it looks doable, but I cannot find an argument for a worst-case bound below exponential complexity (although normal performance should be much better, since in a sane graph most paths are dominated).
You can also, as Billiska notes, use k-th shortest routes algorithms and then proceed through their results until you find one below the budget. That uses time O(m+ K*n*log(m/n)); but unless someone sees an upper bound on K such that K is guaranteed to include a path under the budget (if one exists), we need to set K to be the total number of paths, again yielding exponential complexity (although again a strategy of incrementally increasing K would likely yield a reasonable average runtime, at least if length and cost are reasonably correlated).
EDIT:
Complicating (perhaps fatally) the implementation of my proposed modification is that Dijkstra's algorithm relies on an ordering of the accessibility of nodes, such that we know that if we take the unexplored node to which we have the shortest path, we will never find a better route to it (since all other routes are already known to be longer). If that shortest route is also expensive, that need not hold; even after exploring a node, we must be prepared to update paths out of it on the basis of longer but cheaper routes into it. I suspect that this will prevent it from reaching polynomial time in the worst case.
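The "keep every non-dominated path" idea can be sketched as a label-setting search in which each queue entry carries both length and cost. This is a minimal Python sketch under stated assumptions: non-negative lengths and costs, a hypothetical `constrained_shortest_path` function, and made-up edge values for the A..E diagram. As the answer notes, the number of labels per node can still grow exponentially in the worst case.

```python
import heapq
from collections import defaultdict


def constrained_shortest_path(graph, source, target, budget):
    """Shortest path from source to target whose total cost is <= budget.

    graph maps node -> list of (neighbour, length, cost).
    Returns (length, cost, path) or None if no path fits the budget.
    """
    settled = defaultdict(list)  # node -> non-dominated (length, cost) labels
    heap = [(0, 0, source, (source,))]
    while heap:
        length, cost, node, path = heapq.heappop(heap)
        # Discard a label only if an earlier one is no longer AND no costlier.
        if any(l <= length and c <= cost for l, c in settled[node]):
            continue
        settled[node].append((length, cost))
        if node == target:
            return length, cost, list(path)
        for nxt, edge_length, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost <= budget:  # never extend a path past the budget
                heapq.heappush(
                    heap, (length + edge_length, new_cost, nxt, path + (nxt,))
                )
    return None


def undirected(edges):
    graph = defaultdict(list)
    for u, v, length, cost in edges:
        graph[u].append((v, length, cost))
        graph[v].append((u, length, cost))
    return graph


# Illustrative values: ABD is shorter than ACD but more expensive.
example = undirected([
    ("A", "B", 1, 5), ("B", "D", 1, 5),
    ("A", "C", 2, 1), ("C", "D", 2, 1),
    ("D", "E", 1, 1),
])
```

With a budget of 6 the search returns the longer but cheaper route A, C, D, E; with a budget of 20 it returns A, B, D, E.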
Basically you need to find the first shortest-path, check if it works, then find the second shortest-path, check if it works, and so on...
Dijkstra's algorithm isn't designed for such a task.
A quick Google search on this new definition of the problem
led me to a Stack Overflow question on finding the k shortest paths.
I haven't read into it yet, so don't ask me.
I hope this helps.
I think you can do it with Dijkstra, but you have to change the way you calculate the tentative distance in each step. Instead of taking only the distance into account, also consider the cost. The new distance should be a two-dimensional value (dist, cost); when you choose the minimal distance, take the entry with minimal dist among those with cost <= 6. That's it.
I hope this is correct.

Travelling salesman with repeat nodes & dynamic weights

Given a list of cities and the cost to fly between each city, I am trying to find the cheapest itinerary that visits all of these cities. I am currently using a MATLAB solution to find the cheapest route, but I'd now like to modify the algorithm to allow the following:
repeat nodes - repeat nodes should be allowed, since travelling via hub cities can often result in a cheaper route
dynamic edge weights - return/round-trip flights have a different (usually lower) cost to two equivalent one-way flights
For now, I am ignoring the issue of flight dates and assuming that it is possible to travel from any city to any other city.
Does anyone have any ideas how to solve this problem? My first idea was to use an evolutionary optimisation method like GA or ACO to solve point 2, and simply adjust the edge weights when evaluating the objective function based on whether the itinerary contains return/round-trip flights, but perhaps somebody else has a better idea.
(Note: I am using MATLAB, but I am not specifically looking for coded solutions, more just high-level ideas about what algorithms can be used.)
Edit - after thinking about this some more, allowing "repeat nodes" seems to be too loose of a constraint. We could further constrain the problem so that, although nodes can be repeatedly visited, each directed edge can only be visited at most once. It seems reasonable to ignore any itineraries which include the same flight in the same direction more than once.
I haven't tested it myself; however, I have read that implementing Simulated Annealing to solve the TSP (or variants of it) can produce excellent results. The key point here is that Simulated Annealing is very easy to implement and requires minimal tweaking, while approximation algorithms can take much longer to implement and are probably more error prone. Skiena also has a page dedicated to specific TSP solvers.
If you want the cost of the solution produced by the algorithm to be within 3/2 of the optimum, then you want the Christofides algorithm (its guarantee assumes the costs satisfy the triangle inequality). ACO and GA don't have a guaranteed cost.
Solving the TSP is NP-hard because of its subcycle elimination constraints; if you remove any of them (for your hub cities) you just make the problem easier.
But watch out: the TSP is similar to the assignment problem, in the sense that you could obtain invalid itineraries like:
Cities: New York, Boston, Dallas, Toronto
Path:
Boston - New York
New York - Boston
Dallas - Toronto
Toronto - Dallas
which is clearly wrong, since the itinerary splits into two separate loops instead of one tour through all cities.
The subcycle elimination constraints serve exactly this purpose. Including a 'hub city' sounds like you need to add weights to the nodes and build a hybrid between flow problems and TSP problems. That sounds pretty hard, but a first try may be: eliminate the subcycle constraints relative to your hub cities (and leave all the others). You can then link together the subcycles obtained for the hub cities.
Good luck
Firstly, what is approximate number of cities in your problem set? (Up to 100? More than 100?)
I have a fair bit of experience with GA (not ACO), and like epitaph says, it has a bit of gambling aspect. For some input, it might stop at a brutally inefficient solution. So, what I have done in the past is to use GA as the first option, compare the answer to some lower bound, and if that seems to be "way off", then run a second (usually a less efficient) algorithm.
Of course, I used plenty of terms that were not standard, so let us make sure that we agree what they would be in this context:
lower bound - of course, in this case, MST would be a lower bound.
"Way Off" - If triangle inequality holds, then an upper bound is UB = 2 * MST. A good "way off" in this context would be 2 * UB.
Second algorithm - In this case, both a linear programming based approach and Christofides would be good choices.
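The lower-bound check described above can be sketched with Prim's algorithm. This is a rough Python illustration, assuming a symmetric cost matrix; `mst_weight` and `looks_way_off` are hypothetical names, and the threshold of 2 * UB = 4 * MST is the one suggested in this answer.

```python
import math


def mst_weight(dist):
    """Total weight of a minimum spanning tree (Prim) over a cost matrix."""
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge connecting each node to the tree
    if n:
        best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return total


def looks_way_off(tour_cost, dist):
    """True if a tour costs more than 2 * UB, where UB = 2 * MST."""
    upper_bound = 2 * mst_weight(dist)
    return tour_cost > 2 * upper_bound
```

If the GA result fails this check, that is the signal to fall back to the second algorithm.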
If you limit the problem to round-trips (i.e. the salesman can only buy round-trip tickets), then it can be represented by an undirected graph, and the problem boils down to finding the minimum spanning tree, which can be done efficiently.
In the general case I don't know of a clever way to use efficient algorithms; GA or similar might be a good way to go.
Do you want a near-optimal solution, or do you want the optimal solution?
For the optimal solution, there's still good ol' brute force. Because requirement 1 allows repeat nodes, you'll have to make sure you search breadth-first, not depth-first. Otherwise you can end up in an infinite loop. You can progressively drop all routes that exceed your current minimum until all routes are exhausted and the minimal route is discovered.

A* heuristic, overestimation/underestimation?

I am confused about the terms overestimation/underestimation. I perfectly get how the A* algorithm works, but I am unsure of the effects of having a heuristic that overestimates or underestimates.
Is it overestimation when you take the square of the direct straight-line (bird's-eye) distance? And why would it make the algorithm incorrect? The same heuristic is used for all nodes.
Is it underestimation when you take the square root of the direct straight-line distance? And why is the algorithm still correct?
I can't find an article which explains it nice and clear so I hope someone here has a good description.
You're overestimating when the heuristic's estimate is higher than the actual final path cost. You're underestimating when it's lower (you don't have to underestimate, you just have to not overestimate; correct estimates are fine). If your graph's edge costs are all 1, then the examples you give would provide overestimates and underestimates, though the plain coordinate distance also works peachy in a Cartesian space.
Overestimating doesn't exactly make the algorithm "incorrect"; what it means is that you no longer have an admissible heuristic, which is a condition for A* to be guaranteed to produce optimal behavior. With an inadmissible heuristic, the algorithm can wind up doing tons of superfluous work examining paths that it should be ignoring, and possibly finding suboptimal paths because of exploring those. Whether that actually occurs depends on your problem space. It happens because the path cost is 'out of joint' with the estimate cost, which essentially gives the algorithm messed up ideas about which paths are better than others.
I'm not sure whether you will have found it, but you may want to look at the Wikipedia A* article. I mention (and link) mainly because it's almost impossible to Google for it.
From the Wikipedia A* article, the relevant part of the algorithm description is:
The algorithm continues until a goal node has a lower f value than any node in the queue (or until the queue is empty).
The key idea is that, with underestimation, A* will only stop exploring a potential path to the goal once it knows that the total cost of the path will exceed the cost of a known path to the goal. Since the estimate of a path's cost is always less than or equal to the path's real cost, A* can discard a path as soon as the estimated cost exceeds the total cost of a known path.
With overestimation, A* has no idea when it can stop exploring a potential path as there can be paths with lower actual cost but higher estimated cost than the best currently known path to the goal.
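To see the effect on a tiny case, here is a minimal A* sketch in Python. The graph, the heuristic tables and the `a_star` function are made up for illustration: S reaches G directly at cost 3, or via A at cost 1 + 1 = 2. The only difference between the two runs is the estimate for A.

```python
import heapq


def a_star(graph, start, goal, h):
    """Return (cost, path) of the first goal popped from the queue, or None."""
    heap = [(h(start), 0, start, [start])]  # entries are (f, g, node, path)
    best_g = {start: 0}
    while heap:
        f, g, node, path = heapq.heappop(heap)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry, a cheaper route to node was found later
        for nxt, weight in graph.get(node, []):
            new_g = g + weight
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(heap, (new_g + h(nxt), new_g, nxt, path + [nxt]))
    return None


graph = {"S": [("A", 1), ("G", 3)], "A": [("G", 1)]}
admissible = {"S": 2, "A": 1, "G": 0}.get      # never above the true remaining cost
overestimate = {"S": 2, "A": 10, "G": 0}.get   # claims A is 10 away from G
```

With `admissible`, the search returns the optimal path S, A, G at cost 2. With `overestimate`, node A looks so bad (f = 11) that the direct route S, G at cost 3 is popped first and returned, even though a cheaper path exists.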
Intuitive Answer
For A* to work correctly (always finding the 'best' solution, not just any), your estimation function needs to be optimistic.
Optimism here means that your expectations are always better than reality.
An optimist will try many things that might disappoint in the end, but they will find all the good opportunities.
A pessimist expects bad results, and so will not try many things. Because of this, they may miss some golden opportunities.
So for A*, being optimistic means to always underestimate the costs (i.e. "it's probably not that far"). When you do that, once you found a path, then you might still feel excited about several unexplored options, that could be even better.
That means you won't stop at the first solution, and still try those other ones. Most will probably disappoint (not be better), but it guarantees you will always find the best solution. Of course trying out more options takes more work (time).
A pessimistic A* will always overestimate cost (e.g. "that option is probably pretty bad"). Once it has found a solution and it knows the true cost of the path, every other path will seem worse (because estimates are always worse than reality), and it will never try any alternative once the goal is found.
The most effective A* is one that never over-estimates, but estimates either perfectly or just slightly too optimistically. Then you'll not be naive and try too many bad options.
A nice lesson for everyone. Always be slightly optimistic!
Short answer
@chaos's answer is a bit misleading imo (the word "can" should be highlighted):
Overestimating doesn't exactly make the algorithm "incorrect"; what it means is that you no longer have an admissible heuristic, which is a condition for A* to be guaranteed to produce optimal behavior. With an inadmissible heuristic, the algorithm can wind up doing tons of superfluous work
as @AlbertoPL is hinting:
You can find an answer quicker by overestimating, but you may not find the shortest path.
In the end (beside the mathematical optimum), the optimal solution strongly depends on whether you consider computing resources, runtime, special types of "Maps"/State Spaces, etc.
Long answer
As an example, I could think of a real-time application where a robot reaches the target faster by using an overestimating heuristic, because the time gained by starting earlier is bigger than the time gained by taking the shortest path but waiting longer for that solution to be computed.
To give you a better impression, I share some example results that I quickly created with Python. The results stem from the same A* algorithm; only the heuristic differs. Each node (grid cell) has edges to all eight neighbors except walls. Diagonal edges cost sqrt(2)=1.41
The first picture shows the paths returned by the algorithm for a simple example case. You can see some suboptimal paths from overestimating heuristics (red and cyan). On the other hand, there are multiple optimal paths (blue, yellow, green), and it depends on the heuristic which one is found first.
The different images show all expanded nodes when the target is reached. The color shows the estimated path cost using this node (considering the "already taken" path from start to this node as well; mathematically it's the cost so far plus the heuristic for this node). At any time the algorithm expands the node with lowest estimated total cost (described before).
1. Zero (blue)
Corresponds to the Dijkstra algorithm
Nodes expanded: 2685
Path length: 89.669
2. As the crow flies (yellow)
Nodes expanded: 658
Path length: 89.669
3. Ideal (green)
Shortest path without obstacles (if you follow the eight directions)
Highest possible estimate without overestimating (hence "ideal")
Nodes expanded: 854
Path length: 89.669
4. Manhattan (red)
Shortest path without obstacles (if you don't move diagonally; in other words: cost of "moving diagonally" is estimated as 2)
Overestimates
Nodes expanded: 562
Path length: 92.840
5. As the crow flies times ten (cyan)
Overestimates
Nodes expanded: 188
Path length: 99.811
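The five heuristics listed above might look like this in Python. This is a sketch under assumptions: the author's actual code is not shown, the function names are made up, and `dx`, `dy` are the coordinate differences between a cell and the target on the eight-connected grid with diagonal cost sqrt(2).

```python
import math

SQRT2 = math.sqrt(2)


def zero(dx, dy):
    """1. No information at all; A* then behaves like Dijkstra."""
    return 0.0


def crow_flies(dx, dy):
    """2. Straight-line (Euclidean) distance."""
    return math.hypot(dx, dy)


def ideal(dx, dy):
    """3. Shortest obstacle-free path using the eight directions."""
    dx, dy = abs(dx), abs(dy)
    return max(dx, dy) + (SQRT2 - 1) * min(dx, dy)


def manhattan(dx, dy):
    """4. Axis-only moves; prices a diagonal step at 2, so it overestimates."""
    return abs(dx) + abs(dy)


def crow_flies_times_ten(dx, dy):
    """5. Deliberately inflated straight-line distance."""
    return 10 * math.hypot(dx, dy)
```

For any offset the first three satisfy zero <= crow_flies <= ideal, and none of them exceeds the true remaining cost on this grid, which is why they all return the optimal length of 89.669.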
As far as I know, you want to typically underestimate so that you may still find the shortest path. You can find an answer quicker by overestimating, but you may not find the shortest path. Hence why overestimation is "incorrect", whereas underestimating can still provide the best solution.
I'm sorry that I cannot provide any insight as to the birdview lines...
Consider the evaluation function f(x)=g(x)+h(x), where g(x) is the real cost from the start node to the current node, and h(x) is the predicted cost from the current node to the goal. Assume the optimal cost is R. Then:
h(x) makes the most difference in the early stage of the search. Given three nodes A, B, C:
(*) => current pos: A
A -------> B - ... -> C
|_______________________| => the prediction range of h(x)
Once you step on B, the cost from A to B is known exactly, so the prediction h(x) doesn't include it anymore:
(*) => current pos: B
A -------> B - ... -> C
|____________| => the prediction range of h(x)
When we say under-estimate, it means that your h(x) will cause f(x) <= R for all x on the optimal way to the goal.
Over-estimation indeed makes the algorithm incorrect:
Assume R is 19, and that the costs 20 and 21 belong to paths that already reach the goal:
Front Rear
------------------------- => This is a priority queue PQ.
| 20 | 20 | 30 | ... | 99 |
^-------- => This is the "fake" optimal.
But say f(y)=g(y)+h(y), and y is indeed on the right path to achieve the optimal cost R. Since h(y) is over-estimated, f(y) is currently 30 in the PQ, so before we can update 30 to 19, the algorithm will already have popped 20 from the PQ and wrongly assumed that it was an "optimal" solution.

Hill climbing and single-pair shortest path algorithms

I have a bit of a strange question. Can anyone tell me where to find information about, or give me a little bit of an introduction to using shortest path algorithms that use a hill climbing approach? I understand the basics of both, but I can't put the two together. Wikipedia has an interesting part about solving the Travelling Sales Person with hill climbing, but doesn't provide a more in-depth explanation of how to go about it exactly.
For example, hill climbing can be applied to the traveling salesman problem. It is easy to find a solution that visits all the cities but will be very poor compared to the optimal solution. The algorithm starts with such a solution and makes small improvements to it, such as switching the order in which two cities are visited. Eventually, a much better route is obtained.
As far as I understand it, you should pick any path and then iterate through it and make optimisations along the way. For instance going back and picking a different link from the starting node and checking whether that gives a shorter path.
I am sorry - I did not make myself very clear. I understand how to apply the idea to Travelling Salesperson. I would like to use it on a shortest distance algorithm.
You could just randomly exchange two cities.
Your first path is: A B C D E F A with length 200
Now you change it by swapping C and D: A B D C E F A with length 350 - Worse!
Next step: A B C D F E A: length 150 - You improved your solution. ;-)
Hill climbing algorithms are really easy to implement but have several problems with local maxima! [A better approach based on the same idea is simulated annealing.]
Hill climbing is a very simple kind of evolutionary optimization; a much more sophisticated class of algorithms is genetic algorithms.
Another good metaheuristic for solving the TSP is ant colony optimization
Examples would be genetic algorithms or expectation maximization in data clustering. By iterating single steps, the algorithm tries to reach a better solution with every step. The problem is that it only finds a local maximum/minimum; it is never assured of finding the global maximum/minimum.
To solve the travelling salesman problem with a genetic algorithm we need:
Representation of the solution as the order of visited cities, e.g. [New York, Chicago, Denver, Salt Lake City, San Francisco]
Fitness function based on the travelled distance (the shorter the tour, the fitter the solution)
Selection of the best results is done by selecting items randomly depending on their fitness: the higher the fitness, the higher the probability that the solution is chosen to survive
Mutation would be switching two cities in a list, so that [A,B,C,D] becomes [A,C,B,D]
Crossing two possible solutions [B,A,C,D] and [A,B,D,C] results in [B,A,D,C], i.e. cut both lists in the middle and use the beginning of one parent and the end of the other parent to form the child
The algorithm is then:
initialization of the starting set of solutions
calculation of the fitness of every solution
repeat until the desired maximum fitness is reached or no more changes happen:
  selection of the best solutions
  crossing and mutation
  fitness calculation of every solution
Each execution of the algorithm may produce a different result, so it should be executed more than once.
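The steps above can be sketched in Python as follows. This is a rough illustration with several stated simplifications: cities are indices into a hypothetical `points` list, selection keeps the fitter half of the population instead of drawing randomly by fitness, and the crossover fills the second half with the other parent's remaining cities in their order, because a plain cut-and-join can repeat or drop cities. All function names and parameter defaults are made up.

```python
import math
import random


def tour_length(tour, points):
    """Length of the closed tour that visits points in the given order."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def crossover(parent1, parent2):
    """Beginning of parent1, then the missing cities in parent2's order."""
    head = parent1[: len(parent1) // 2]
    return head + [city for city in parent2 if city not in head]


def mutate(tour, rng):
    """Switch two randomly chosen cities."""
    i, j = rng.sample(range(len(tour)), 2)
    tour = tour[:]
    tour[i], tour[j] = tour[j], tour[i]
    return tour


def genetic_tsp(points, pop_size=30, generations=200, seed=0):
    rng = random.Random(seed)
    cities = list(range(len(points)))
    # Initialization: random permutations of the cities.
    population = [rng.sample(cities, len(cities)) for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness: shorter tours sort first and survive unchanged.
        population.sort(key=lambda t: tour_length(t, points))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            children.append(mutate(crossover(p1, p2), rng))
        population = survivors + children
    return min(population, key=lambda t: tour_length(t, points))
```

With A..D numbered 0..3, `crossover([1, 0, 2, 3], [0, 1, 3, 2])` gives `[1, 0, 3, 2]`, matching the [B,A,D,C] example. Running `genetic_tsp` with different `seed` values corresponds to executing the algorithm more than once.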
I'm not sure why you would want to use a hill-climbing algorithm, since Dijkstra's algorithm has polynomial complexity, O( | E | + | V | log | V | ), using Fibonacci heaps:
http://en.wikipedia.org/wiki/Dijkstra's_algorithm
If you're looking for a heuristic approach to the single-pair problem, then you can use A*:
http://en.wikipedia.org/wiki/A*_search_algorithm
but an improvement in efficiency is dependent on having an admissible heuristic estimate of the distance to the goal.
http://en.wikipedia.org/wiki/A*_search_algorithm
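For reference, a single-pair Dijkstra search fits in a few lines. This is a minimal Python sketch using the standard library's binary heap (`heapq`), so it runs in O((|E| + |V|) log |V|) rather than the Fibonacci-heap bound quoted above; the `dijkstra` name and the graph layout are assumptions for the example.

```python
import heapq


def dijkstra(graph, source, target):
    """Length of the shortest path from source to target, or None.

    graph maps node -> list of (neighbour, non-negative edge length).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d  # the first time the target is popped, d is final
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(nxt, float("inf")):
                dist[nxt] = candidate
                heapq.heappush(heap, (candidate, nxt))
    return None
```

Because this already finds the exact optimum in polynomial time, a hill-climbing search has nothing to gain on the plain shortest-path problem.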
To hillclimb the TSP you should have a starting route. Of course picking a "smart" route wouldn't hurt.
From that starting route you make one change and compare the result. If it's better (for the TSP, a shorter route) you keep the new one; if it's worse, you keep the old one. Repeat this until you reach a point from where you can't climb anymore, which becomes your best result.
Obviously, with TSP, you will more than likely hit a local maximum. But it is possible to get decent results.
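The procedure above can be sketched in Python as follows. It is a minimal illustration, assuming cities are indices into a hypothetical `points` list, the single change is a swap of two cities, and the first city stays fixed as the start; `hill_climb` and `tour_length` are made-up names.

```python
import math
from itertools import combinations


def tour_length(tour, points):
    """Length of the closed tour that visits points in the given order."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def hill_climb(tour, points):
    """Swap two cities whenever that shortens the tour; stop when nothing helps."""
    tour = list(tour)
    best = tour_length(tour, points)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(1, len(tour)), 2):
            tour[i], tour[j] = tour[j], tour[i]      # try one change
            length = tour_length(tour, points)
            if length < best - 1e-12:
                best = length                        # keep the new route
                improved = True
                break
            tour[i], tour[j] = tour[j], tour[i]      # undo, keep the old route
    return tour, best
```

For the corners of a unit square, starting from the crossing route `[0, 2, 1, 3]` (length about 4.83), a single swap gives `[0, 1, 2, 3]` with length 4, after which no swap improves the route. On larger inputs the stopping point is usually only a local optimum.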

What's the most insidious way to pose this problem?

My best shot so far:
A delivery vehicle needs to make a series of deliveries (d1,d2,...dn), and can do so in any order--in other words, all the possible permutations of the set D = {d1,d2,...dn} are valid solutions--but the particular solution needs to be determined before it leaves the base station at one end of the route (imagine that the packages need to be loaded in the vehicle LIFO, for example).
Further, the cost of the various permutations is not the same. It can be computed as the sum of the squares of the distances traveled between d(i-1) and di, where d0 is taken to be the base station, with the caveat that any segment that involves a change of direction costs 3 times as much (imagine this is going on on a railroad or a pneumatic tube, and backing up disrupts other traffic).
Given the set of deliveries D represented as their distance from the base station (so abs(di-dj) is the distance between two deliveries) and an iterator permutations(D) which will produce each permutation in succession, find a permutation which has a cost less than or equal to that of any other permutation.
Now, a direct implementation from this description might lead to code like this:
function cost(D) ...
function Best_order(D)
  for D1 in permutations(D)
    Found = true
    for D2 in permutations(D)
      Found = false if cost(D2) < cost(D1)
    return D1 if Found
Which is O(n*n!^2), i.e. pretty awful, especially compared to the O(n log(n)) someone with insight would find by simply sorting D.
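The gap between the two approaches can be checked on a small input. This is a Python sketch under stated assumptions: deliveries are non-negative distances from a base at 0, and a "change of direction" is read as a segment that travels the opposite way from the previous one. The function names are hypothetical.

```python
from itertools import permutations


def cost(order, base=0):
    """Sum of squared segment lengths; a segment that reverses direction costs 3x."""
    total = 0
    position = base
    previous_direction = 0
    for stop in order:
        step = stop - position
        direction = (step > 0) - (step < 0)
        segment = step * step
        if previous_direction and direction and direction != previous_direction:
            segment *= 3
        total += segment
        if direction:
            previous_direction = direction
        position = stop
    return total


def best_order_brute_force(deliveries):
    """Examine every permutation (a single pass, still O(n * n!))."""
    return list(min(permutations(deliveries), key=cost))


def best_order_sorted(deliveries):
    """The insight: visiting the stops in increasing distance is optimal."""
    return sorted(deliveries)
```

For the deliveries `[5, 2, 9, 1]`, both functions return `[1, 2, 5, 9]` with cost 1 + 1 + 9 + 16 = 27; any route that skips a stop on the way out merges segments, which only raises the sum of squares.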
My question: can you come up with a plausible problem description which would naturally lead the unwary into a worse (or differently awful) implementation of a sorting algorithm?
I assume you're using this question for an interview to see if the applicant can notice a simple solution in a seemingly complex question.
[This assumption is incorrect -- MarkusQ]
You give too much information.
The key to solving this is realizing that the points are in one dimension and that a sort is all that is required. To make this question more difficult hide this fact as much as possible.
The biggest clue is the distance formula. It introduces a penalty for changing directions. The first thing that comes to my mind is minimizing this penalty. To remove the penalty I have to order the deliveries in a certain direction, and this ordering is the natural sort order.
I would remove the penalty for changing directions, it's too much of a give away.
Another major clue is the input to the algorithm: a list of integers. Give them a list of permutations, or even all permutations. That sets them up to think that an O(n!) algorithm might actually be expected.
I would phrase it as:
Given a list of all possible permutations of n delivery locations, where each permutation of deliveries (d1, d2, ..., dn) has a cost defined by:
Return permutation P such that the cost of P is less than or equal to any other permutation.
All that really needs to be done is read in the first permutation and sort it.
If they construct a single loop to compare the costs ask them what the big-o runtime of their algorithm is where n is the number of delivery locations (Another trap).
This isn't a direct answer, but I think more clarification is needed.
Is di allowed to be negative? If so, sorting alone is not enough, as far as I can see.
For example:
d0 = 0
deliveries = (-1,1,1,2)
It seems the optimal path in this case would be 1 > 2 > 1 > -1.
Edit: This might not actually be the optimal path, but it illustrates the point.
You could rephrase it, having first found the optimal solution, as
"Give me a proof that the following combination is optimal for the following set of rules, where optimal means that the sum of all stage costs is the smallest possible, taking into account that all stages (A..Z) need to be present once and once only.
Combination:
A->C->D->Y->P->...->N
Stage costs:
A->B = 5,
B->A = 3,
A->C = 2,
C->A = 4,
...
...
...
Y->Z = 7,
Z->Y = 24."
That ought to keep someone busy for a while.
This reminds me of the Knapsack problem more than the Traveling Salesman. But the Knapsack is also an NP-hard problem, so you might be able to fool people into devising an overly complex solution using dynamic programming if they associate your problem with the Knapsack. The basic problem there is:
can a value of at least V be achieved
without exceeding the weight W?
Now the problem is that a fairly good solution can be found when each value (here, your distances) is unique:
The knapsack problem with each type of item j having a distinct value per unit of weight (vj = pj/wj) is considered one of the easiest NP-complete problems. Indeed empirical complexity is of the order of O((log n)^2) and very large problems can be solved very quickly, e.g. in 2003 the average time required to solve instances with n = 10,000 was below 14 milliseconds using commodity personal computers.
So you might want to state that several stops/packages might share the same vj, inviting people to think about the really hard solution to:
However in the degenerate case of multiple items sharing the same value vj it becomes much more difficult with the extreme case where vj = constant being the subset sum problem with a complexity of O(2^(N/2) N).
So if you replace the value per weight with value per distance, and state that several distances might actually share the same value (the degenerate case), some folks might fall into this trap.
Isn't this just the (NP-Hard) Travelling Salesman Problem? It doesn't seem likely that you're going to make it much harder.
Maybe phrasing the problem so that the actual algorithm is unclear - e.g. by describing the paths as single-rail railway lines so the person would have to infer from domain knowledge that backtracking is more costly.
What about describing the question in such a way that someone is tempted to do recursive comparisons - e.g. "can you speed up the algorithm by using the optimum max subset of your best (so far) results"?
BTW, what's the purpose of this - it sounds like the intent is to torture interviewees.
You need to be clearer on whether the delivery truck has to return to base (making it a round trip), or not. If the truck does return, then a simple sort does not produce the shortest route, because the square of the return from the furthest point to base costs so much. Missing some hops on the way 'out' and using them on the way back turns out to be cheaper.
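A small worked case illustrates the round-trip point. This is a Python sketch under the same reading of the cost rule as in the question (squared segment lengths, tripled when a segment reverses the previous direction); the `route_cost` function and the deliveries at distances 1 to 4 are made up for the example.

```python
def route_cost(order, base=0, return_to_base=False):
    """Sum of squared segment lengths; a segment that reverses direction costs 3x."""
    stops = list(order) + ([base] if return_to_base else [])
    total = 0
    position = base
    previous_direction = 0
    for stop in stops:
        step = stop - position
        direction = (step > 0) - (step < 0)
        segment = step * step
        if previous_direction and direction and direction != previous_direction:
            segment *= 3
        total += segment
        if direction:
            previous_direction = direction
        position = stop
    return total


sorted_trip = route_cost([1, 2, 3, 4], return_to_base=True)   # 4 + 3 * 16 = 52
split_trip = route_cost([2, 4, 3, 1], return_to_base=True)    # 4 + 4 + 3 + 4 + 1 = 16
```

Visiting 2 and 4 on the way out and 3 and 1 on the way back costs 16, against 52 for the sorted order followed by one long return leg, so a plain sort is no longer optimal once the return is charged.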
If you trick someone into a bad answer (for example, by not giving them all the information) then is it their foolishness or your deception that has caused it?
How great is the wisdom of the wise, if they heed not their ego's lies?

Resources