Bidirectional search or not? Speed considerations - algorithm

I'm implementing an algorithm which has to quickly decide whether a path exists between two cells in a 2D grid (for a maze-like game). It does not actually have to provide the path. This algorithm is run many thousands of times, so it must be fast.
The quirk is that the two cells are very close to each other (within a Manhattan distance of 2), so for most reasonable mazes the path is often trivial. Right now I have a plain breadth-first search, but I'm considering implementing a bidirectional variant. The problem is, of course, that in the case where a path does not exist, the bidirectional search will fail slower, because it searches two connected components instead of one, though if a path exists, it will find it faster (probably).
So my question is, does anyone have any experience with bidirectional search and how it behaves in the cases mentioned above? Is the speed difference actually quite marginal?

The intuition that bidirectional search [1] does more work than unidirectional search when no path exists does not hold in general. If your bidirectional algorithm is coded to alternate frequently between expanding nodes from the forward and backward searches (as it should), there is a chance that the bidirectional variant returns before the unidirectional one does even when there is no path between source and target. Suppose the input graph contains two components that are not connected, say V and W, with source node s in V, target node t in W, |V| = 1000 and |W| = 10. The unidirectional search will have to expand all 1000 nodes before its queue runs empty. In bidirectional search, only 10 nodes from W and about 10 nodes from V will be expanded, and then it terminates.
[1] Java implementation
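For reference, here is a minimal Python sketch (my own, not the Java implementation from [1]) of a bidirectional BFS on a grid of booleans, with (row, col) tuples for cells; it alternates by always expanding the smaller frontier and returns as soon as the frontiers meet or either one runs dry:

from collections import deque

def path_exists(grid, src, dst):
    # Bidirectional BFS on a 2D grid of booleans (True = passable).
    # Expands the smaller frontier one level at a time and stops as soon
    # as either frontier runs dry, so a small disconnected component on
    # either side ends the search quickly.
    if src == dst:
        return True
    rows, cols = len(grid), len(grid[0])
    seen_f, seen_b = {src}, {dst}
    frontier_f, frontier_b = deque([src]), deque([dst])
    while frontier_f and frontier_b:
        # Always advance the smaller frontier.
        if len(frontier_f) > len(frontier_b):
            frontier_f, frontier_b = frontier_b, frontier_f
            seen_f, seen_b = seen_b, seen_f
        for _ in range(len(frontier_f)):
            r, c = frontier_f.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc]:
                    if (nr, nc) in seen_b:      # the two frontiers met
                        return True
                    if (nr, nc) not in seen_f:
                        seen_f.add((nr, nc))
                        frontier_f.append((nr, nc))
    return False

Always expanding the smaller frontier is what makes the "small component" case cheap: the 10-node side empties after at most 10 expansions and the whole search stops.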

"the maze is slightly different each time (a different cell is made non-passable each time)"
in that case you can often do better by saving your flood-fill (breadth-first) distances.
consider a maze like this (from + to *)
XXXXXXX
X+   *X
X XXX X
X     X
XXXXXXX
which has flood fill distances
XXXXXXX
X+123*X
X1XXX7X
X23456X
XXXXXXX
blocking point Z gives
XXXXXXX
X+123*X
X1XXX7X
X23Z56X
XXXXXXX
and since the value at Z was 4, which is larger than the shortest path (3), you immediately know that Z does not affect the solution, with no further searching.
the other case, if you block at Y,
XXXXXXX
X+1Y3*X
X1XXX7X
X23456X
XXXXXXX
you know that any distance greater than 2 (the blocked value) is unreliable, so you need to recalculate those points. in this case, that means repeating the search on the longer path, but that is no more expensive than what you were doing anyway.
in short, if you are making small modifications, storing the flood-fill distances can save time (at the cost of memory).
this is only very general advice. i am not saying that it is always best to completely flood fill every cell when starting, for example. it may be that stopping on first success makes more sense, with further filling occurring later.
in other words, cache internal results during the search and be smart about invalidating the cache. then you can avoid the cost of duplicating work in areas of the maze that have not changed.
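for example, here is a rough python sketch (my own, not the poster's code) of caching the flood-fill distances: it answers straight from the cache when the newly blocked cell cannot lie on a shortest route, and otherwise just refills. the finer-grained approach described above would instead invalidate only the distances greater than the blocked cell's value.

from collections import deque

def flood_fill(grid, start):
    # BFS distances from start over passable cells (True = open).
    dist = {start: 0}
    q = deque([start])
    while q:
        r, c = q.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                q.append((nr, nc))
    return dist

def still_reachable(grid, dist, start, goal, blocked):
    # Reuse cached distances when a single cell is blocked.
    # If the blocked cell is unreachable or farther than the goal, the
    # shortest path is untouched; otherwise fall back to a fresh fill.
    if blocked not in dist or dist[blocked] > dist.get(goal, float("inf")):
        return goal in dist
    grid[blocked[0]][blocked[1]] = False
    new_dist = flood_fill(grid, start)
    grid[blocked[0]][blocked[1]] = True     # restore for the next query
    return goal in new_dist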

I implemented one of these and it almost doubled my search times. Instead of using a queue-based version of BFS in the bidirectional search, I used the version that is taught by Erik D. in his MIT classes, but I don't see how the queue version would make that much of a difference?
Another fast approach is link-cut trees. They are forests of (usually splay) trees and are used with dynamic graphs.

Related

Algorithms for Deducing a Timeline / Chronology

I'm looking for leads on algorithms to deduce the timeline/chronology of a series of novels. I've split the texts into days and created a database of relationships between them, e.g.: X is a month before Y, Y and Z are consecutive, date of Z is known, X is on a Tuesday, etc. There is uncertainty ('month' really only means roughly 30 days) and also contradictions. I can mark some relationships as more reliable than others to help resolve ambiguity and contradictions.
What kind of algorithms exist to deduce a best-fit chronology from this kind of data, assigning a highest-probability date to each day? At least time is 1-dimensional, but dealing with a complex relationship graph with inconsistencies seems non-trivial. I have a CS background so I can code something up, but some idea about the names of applicable algorithms would be helpful. I guess what I have is a graph with days as nodes and relationships as edges.
A simple, crude first approximation to your problem would be to store information like "A happened before B" in a directed graph with edges like "A -> B". Test the graph to see whether it is a Directed Acyclic Graph (DAG). If it is, the information is consistent in the sense that there is a consistent chronology of what happened before what else. You can get a sample linear chronology by printing a "topological sort" (topsort) of the DAG. If events C and D happened simultaneously or there is no information to say which came before the other, they might appear in the topsort as ABCD or ABDC. You can even get the topsort algorithm to print all possibilities (so both ABCD and ABDC) for further analysis using more detailed information.
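For a quick start, Python's standard library graphlib can do both the cycle test and the topological sort; the edge data below is a made-up toy example, not the asker's novels:

from graphlib import TopologicalSorter, CycleError

# Edges mean "happened before": A -> B reads "A is earlier than B".
before = {"A": {"B"}, "B": {"C", "D"}}          # hypothetical toy data

ts = TopologicalSorter()
for earlier, laters in before.items():
    for later in laters:
        ts.add(later, earlier)                  # `later` depends on `earlier`
try:
    print(list(ts.static_order()))              # e.g. ['A', 'B', 'C', 'D']
except CycleError as err:
    print("chronological contradiction:", err.args[1])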
If the graph you obtain is not a DAG, you can use an algorithm like Tarjan's algorithm to quickly identify "strongly connected components", which are areas of the graph which contain chronological contradictions in the form of cycles. You could then analyze them more closely to determine which less reliable edges might be removed to resolve contradictions. Another way to identify edges to remove to eliminate cycles is to search for "minimum feedback arc sets". That's NP-hard in general but if your strongly connected components are small the search could be feasible.
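If you prefer not to depend on a graph library, a compact Tarjan's algorithm is enough to list the contradictory groups; this is a plain sketch on a toy graph (the letters are placeholders):

def strongly_connected_components(graph):
    # Tarjan's algorithm. Returns the SCCs with more than one node, i.e. the
    # groups of events whose "before" edges form a contradiction cycle.
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:               # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            if len(component) > 1:
                sccs.append(component)

    nodes = set(graph) | {w for ws in graph.values() for w in ws}
    for v in nodes:
        if v not in index:
            visit(v)
    return sccs

# "A before B", "B before C", "C before A" is a contradiction; D is fine.
print(strongly_connected_components({"A": ["B"], "B": ["C"], "C": ["A", "D"]}))
# e.g. [['C', 'B', 'A']]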
Constraint programming is what you need. In propagation-based CP, you alternate between (a) making a decision at the current choice point in the search tree and (b) propagating the consequences of that decision as far as you can. Notionally you do this by maintaining a domain D of possible values for each problem variable x such that D(x) is the set of values for x which have not yet been ruled out along the current search path. In your problem, you might be able to reduce it to a large set of Boolean variables, x_ij, where x_ij is true iff event i precedes event j. Initially D(x) = {true, false} for all variables. A decision is simply reducing the domain of an undecided variable (for a Boolean variable this means reducing its domain to a single value, true or false, which is the same as an assignment). If at any point along a search path D(x) becomes empty for any x, you have reached a dead-end and have to backtrack.
If you're smart, you will try to learn from each failure and also retreat as far back up the search tree as required to avoid redundant search (this is called backjumping -- for example, if you identify that the dead-end you reached at level 7 was caused by the choice you made at level 3, there's no point in backtracking just to level 6 because no solution exists in this subtree given the choice you made at level 3!).
Now, given you have different degrees of confidence in your data, you actually have an optimisation problem. That is, you're not just looking for a solution that satisfies all the constraints that must be true, but one which also best satisfies the other "soft" constraints according to the degree of trust you have in them. What you need to do here is decide on an objective function assigning a score to a given set of satisfied/violated partial constraints. You then want to prune your search whenever you find the current search path cannot improve on the best previously found solution.
If you do decide to go for the Boolean approach, you could profitably look into SAT solvers, which tear through these kinds of problems. But the first place I'd look is at MiniZinc, a CP language which maps on to a whole variety of state of the art constraint solvers.
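To make the objective-function idea concrete before reaching for a real solver, here is a deliberately naive Python sketch that brute-forces every ordering of a tiny invented instance (the event names and weights are made up); a CP/SAT/MaxSAT solver effectively replaces the permutation loop with propagation, backjumping and pruning:

from itertools import permutations

# Hard constraints must hold; soft constraints carry a confidence weight.
events = ["X", "Y", "Z"]
hard = [("X", "Y")]                                  # X strictly before Y
soft = [(("Y", "Z"), 2.0), (("Z", "X"), 0.5)]        # weighted preferences

def score(order):
    # None if a hard constraint is violated, otherwise the total weight
    # of the soft constraints this ordering satisfies.
    pos = {e: i for i, e in enumerate(order)}
    if any(pos[a] >= pos[b] for a, b in hard):
        return None
    return sum(w for (a, b), w in soft if pos[a] < pos[b])

valid = [o for o in permutations(events) if score(o) is not None]
best = max(valid, key=score)
print(best, score(best))     # the ordering with the highest confidence score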
Best of luck!

Algorithm for Connect 4 Evaluation of Data Set

I am working on a connect 4 AI, and saw many people were using this data set, containing all the legal positions at 8 ply, and their eventual outcome.
I am using standard minimax with alpha/beta pruning as my search algorithm. It seems like this data set could be really useful for my AI. However, I'm trying to find the best way to implement it. I thought the best approach might be to process the list and use the board state as a hash key for the eventual result (win, loss, draw).
What is the best way to design an AI that uses a data set like this? Is my idea of hashing the board state and using it in a traditional search algorithm (e.g. minimax) on the right track, or is there a better way?
Update: I ended up converting the large move database to a plain text format, where 1 represented X and -1 represented O. Then I used a string of the board state and an integer representing the eventual outcome, and put them in a std::unordered_map (see Stack Overflow With Unordered Map for a problem I ran into). The performance of the map was excellent: it built quickly, and the lookups were fast. However, I never quite got the search right. Is the right way to approach the problem to just search the database when the number of turns in the game is less than 8, and then switch over to a regular alpha-beta search?
Your approach seems correct.
For the first 8 moves, use the alpha-beta algorithm, and use the look-up table to evaluate the value of each node at depth 8.
Once you have "exhausted" the table (exceeded 8 moves in the game), you should switch to the regular alpha-beta algorithm, which ends at terminal states (leaves in the game tree). A rough sketch of this switch appears at the end of this answer.
This is extremely helpful because:
Remember that the complexity of searching the tree is O(B^d), where B is the branching factor (number of possible moves per state) and d is the depth needed until the end of the game.
By using this approach you effectively decrease both B and d for the maximal waiting times (longest moves needed to be calculated) because:
Your maximal depth shrinks significantly to d-8 (only for the last moves), effectively decreasing d!
The branching factor itself tends to shrink in this game after a few moves (many moves become impossible or lead to defeat and should not be explored); this decreases B.
In the first move, you shrink the number of developed nodes as well, to B^8 instead of B^d.
So, because of these - the maximal waiting time decreases significantly by using this approach.
Also note: if you find the optimization insufficient, you can always expand your look-up table (to the first 9, 10, ... moves). Of course, this increases the needed space exponentially; it is a tradeoff you need to examine to choose what best serves your needs (maybe even storing the entire game in the file system, if main memory is not enough, should be considered).
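To show the mechanics of "trust the table at a fixed ply, search normally otherwise", here is a rough, self-contained Python sketch. It deliberately uses a trivial stand-in game (take 1 or 2 counters; whoever takes the last one wins) with a table built at ply 2, rather than Connect 4 with its 8-ply database; the structure of the search is the point, not the game:

import math

def alphabeta(state, ply, alpha, beta, maximizing, table, table_ply):
    # Alpha-beta that trusts a precomputed table of exact values at
    # `table_ply` plies from the start and searches normally otherwise.
    # Toy stand-in game: `state` counters remain; players alternately take
    # 1 or 2; whoever takes the last counter wins (+1 for the maximizer).
    if ply == table_ply and state in table:
        return table[state]                  # exact value from the "database"
    if state == 0:
        return -1 if maximizing else 1       # previous player took the last counter
    best = -math.inf if maximizing else math.inf
    for take in (1, 2):
        if take > state:
            break
        value = alphabeta(state - take, ply + 1, alpha, beta,
                          not maximizing, table, table_ply)
        if maximizing:
            best, alpha = max(best, value), max(alpha, value)
        else:
            best, beta = min(best, value), min(beta, value)
        if beta <= alpha:
            break                            # alpha-beta cutoff
    return best

# A "database" of exact values for every position reachable at ply 2,
# standing in for the 8-ply Connect 4 data set.
table = {n: alphabeta(n, 0, -math.inf, math.inf, True, {}, -1) for n in range(20)}
print(alphabeta(21, 0, -math.inf, math.inf, True, table, 2))   # -1: a first-player loss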

Good way to check whether an edge exists in an undirected graph

I am using adjacency list representation.
basically
A:[B,C,D] means A is connected to B,C and D
now I am trying to add a method (in python) to add edge in graph.
But before I add an edge, I want to check whether the two nodes are already connected.
So, for example, I want to add an edge between two nodes D and A (ignorant of the fact that A and D are connected).
So, since there is no key "D" in the hash/dictionary, it will return false.
Now, very naively, I can check for D and A and then A and D as well, but that's very scruffy.
Or, whenever I connect two nodes, I can always duplicate the entry,
i.e. when connecting A and E: for A:[E], also create E:[A],
but this is not very space efficient.
Basically I want to make this graph direction independent.
Is there any data structure that can help me solve this?
I am hoping that my question makes sense.
For an undirected graph you could use a simple edge list in which you store all the edges as pairs of nodes. This will save space but worsen performance; you can't have both at the same time, so you always have to decide on a tradeoff.
Otherwise you could use a triangular adjacency matrix but, to avoid wasting half of the space, you will have to store it in a particular way (by developing an efficient way to retrieve edge existence without wasting space). Are you sure it is worth it and it's not just premature optimization?
Adjacency lists are mostly fine, even if you have to store every undirected edge twice; how big is your graph?
Take a look at my answer: Graph representation benchmarking, so you can choose which representation you prefer.
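If the triangular-matrix idea above appeals to you, the usual trick is to pack the strict lower triangle into one flat array; here is a small Python sketch (the class name and sizes are mine), using the index i*(i-1)/2 + j for i > j:

class TriangularAdjacency:
    # Packs the strict lower triangle of an undirected adjacency matrix
    # into one flat list of booleans, so each edge is stored exactly once.
    def __init__(self, n):
        self.flags = [False] * (n * (n - 1) // 2)

    def _index(self, i, j):
        i, j = (i, j) if i > j else (j, i)      # order-independent lookup
        return i * (i - 1) // 2 + j

    def add_edge(self, i, j):
        self.flags[self._index(i, j)] = True

    def has_edge(self, i, j):
        return self.flags[self._index(i, j)]

g = TriangularAdjacency(5)
g.add_edge(0, 3)
print(g.has_edge(3, 0), g.has_edge(1, 2))       # True False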
You have run into a classic space vs. time tradeoff.
As you said, if you don't find D->A, you can search for A->D. This will at most double your execution time. Alternatively, when inserting A->D, also create D->A, but this comes at the cost of additional space.
Worst case, for the time tradeoff, you will do 2 lookups, which is still O(N) (faster with better data structures). For the space tradeoff, you will (in the worst case) create a link between every pair of nodes, which is roughly O(N^2) space. As such, I would just do the 2 lookups.
Assuming each contains() call is VERY expensive, and you want to avoid these at all costs, you can use a bloom filter to check whether an edge exists, and thereby reduce the number of contains() calls (a toy sketch follows the notes below).
The idea is: each node holds its own bloom filter, which indicates which edges are connected to it. Checking a bloom filter is fairly easy and cheap, and so is updating it when an edge is added.
If you checked the bloom filter - and it said "no" - you can safely add the edge - it does not exist.
However, bloom filters have False Positives - so, if the bloom filter said "the edge exists" - you will have to check the list if it is indeed there.
Notes:
(1) Removing edges will be a problem if using bloom filters.
(2) Bloom filters give you nice time/space trade off - as the number of false positives decrease as the size of the filter grows.
(3) However, when an edge does exist, no matter what the size of the filter is, you will always have to use the contains() method.
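For illustration only, a tiny per-node bloom filter can be built from nothing but hashlib; the bit count and number of hashes below are arbitrary toy values, not tuned recommendations:

import hashlib

class NeighborFilter:
    # A small bloom filter per node: call add() when an edge is inserted,
    # and might_contain() before falling back to the expensive contains().
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.field |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "check the real list".
        return all(self.field & (1 << pos) for pos in self._positions(item))

f = NeighborFilter()
f.add("D")
print(f.might_contain("D"), f.might_contain("Z"))   # True, almost certainly False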
Assuming your node names can be compared, you can simply always store the edges so that the first endpoint is less than the second endpoint. Then you only have one lookup to perform. This definitely works for strings.
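That convention is a one-liner in Python, for example with a set of normalized tuples (the helper name is mine):

def edge_key(u, v):
    # Store each undirected edge once, with the endpoints in sorted order.
    return (u, v) if u <= v else (v, u)

edges = set()
edges.add(edge_key("A", "D"))
print(edge_key("D", "A") in edges)   # True: one lookup, either direction

Compared with storing both directions, this keeps one entry per edge and still needs only a single lookup.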

Optimal selection election algorithm

Given a bunch of sets of people (similar to):
[p1,p2,p3]
[p2,p3]
[p1]
[p1]
Select 1 from each set, trying to minimize the maximum number of times any one person is selected.
For the sets above, the max number of times a given person MUST be selected is 2.
I'm struggling to come up with an algorithm for this. I don't think it can be done with a greedy algorithm; I'm thinking more along the lines of a dynamic programming solution.
Any hints on how to go about this? Or do any of you know any good websites about this stuff that I could have a look at?
This is neither dynamic nor greedy. Let's look at a different problem first -- can it be done by selecting every person at most once?
You have P people and S sets. Create a graph with S+P vertices, representing sets and people. There is an edge between person pi and set si iff pi is an element of si. This is a bipartite graph and the decision version of your problem is then equivalent to testing whether the maximum cardinality matching in that graph has size S.
As detailed on that page, this problem can be solved by using a maximum flow algorithm (note: if you don't know what I'm talking about, then take your time to read it now, as you won't understand the rest otherwise): first create a super-source, add an edge linking it to all people with capacity 1 (representing that each person may only be used once), then create a super-sink and add edges linking every set to that sink with capacity 1 (representing that each set may only be used once) and run a suitable max-flow algorithm between source and sink.
Now, let's consider a slightly different problem: can it be done by selecting every person at most k times?
If you paid attention to the remarks in the last paragraph, you should know the answer: just change the capacity of the edges leaving the super-source to k, indicating that each person may be used at most k times in this case.
Therefore, you now have an algorithm to solve the decision problem in which people are selected at most k times. It's easy to see that if you can do it with k, then you can also do it with any value greater than k, that is, it's a monotonic function. Therefore, you can run a binary search on the decision version of the problem, looking for the smallest k possible that still works.
Note: You could also get rid of the binary search by testing each value of k sequentially, and augmenting the residual network obtained in the last run instead of starting from scratch. However, I decided to explain the binary search version as it's conceptually simpler.
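For completeness, here is a self-contained Python sketch of the whole recipe: a plain Edmonds-Karp max flow plus a feasibility test per k. All identifiers are mine, and for brevity it scans k upward instead of binary searching, which is fine for small instances:

from collections import deque, defaultdict

def max_flow(cap, source, sink):
    # Edmonds-Karp: repeatedly push flow along shortest augmenting paths.
    # `cap` is a nested dict of residual capacities, modified in place.
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Walk back from the sink to find the bottleneck, then update residuals.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        flow += bottleneck

def feasible(sets, people, k):
    # Can we pick one person per set with nobody picked more than k times?
    cap = defaultdict(lambda: defaultdict(int))
    for p in people:
        cap["source"][("person", p)] = k         # each person usable <= k times
    for i, s in enumerate(sets):
        for p in s:
            cap[("person", p)][("set", i)] = 1   # p can cover set i
        cap[("set", i)]["sink"] = 1              # each set picks exactly one person
    return max_flow(cap, "source", "sink") == len(sets)

sets = [["p1", "p2", "p3"], ["p2", "p3"], ["p1"], ["p1"]]
people = {p for s in sets for p in s}
print(next(k for k in range(1, len(sets) + 1) if feasible(sets, people, k)))  # 2

Swapping the linear scan for a binary search, or reusing the residual network between values of k as the note above suggests, drops in without changing the rest.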

How to solve this linear programming problem?

I'm not so good at linear programming, so I'm posting this problem here.
I hope somebody can point me in the right direction.
It is not a homework problem, so don't misunderstand.
I have a 5x5 matrix (25 nodes). The distance between each node and its adjacent (neighbor) nodes is 1 unit. A node can be in one of two states: cache or access. If node i is a cache node, an access node j can access it at a cost of Dij x Aij (access cost). Dij is the Manhattan distance between nodes i and j, and Aij is the access frequency from node i to j.
In order to become a cache node, node i needs to fetch the cache from an existing cache node k at a cost of Dik x C (cache cost), where C is an integer constant called the caching frequency.
A is provided as a 25x25 matrix of integers giving the access frequency between every pair of nodes i and j. D is provided as a 25x25 matrix containing the Manhattan distances between every pair of nodes i and j.
Assuming there is one cache node in the matrix to start with, find the set of additional cache nodes and access nodes such that the total cost is minimized.
Total Cost = Total Cache Cost + Total Access Cost.
I've tackled a few problems that are something like this.
First, if you don't need an exact answer, I'd generally suggest looking into something like a genetic algorithm, or doing a greedy algorithm. It won't be right, but it won't generally be bad either. And it will be much faster than an exact algorithm. For instance, you can start with all points as cache points, then find the point which reduces your cost most when made a non-caching point. Continue until removing the next one makes the cost go up, and use that as your solution. This won't be best, but it will generally be reasonably good.
If you do need an exact answer, you will need to brute-force search a lot of data. Assuming that the initial cache point is specified, you'll have 2^24 = 16,777,216 possible sets of cache points to search. That is expensive.
The trick to doing it more cheaply (note, not cheaply, just more cheaply) is finding ways to prune your search. Take to heart the fact that if doing 100 times as much work on each set you look at lets you remove an average of 10 points from consideration as cache points, then your overall algorithm will visit 0.1% as many sets, and your code will run 10 times faster. Therefore it is worth putting a surprising amount of energy into pruning early and often, even if the pruning step is fairly expensive.
Often you want multiple pruning strategies. One of them is usually "the best we can do from here is worse than the best we have found previously." This works better if you've already found a pretty good solution. Therefore it is often worth a bit of effort to do some local optimization in your search for solutions.
Typically these optimizations won't change the fact that you are doing a tremendous amount of work. But they do let you do orders of magnitude less work.
My initial try at this would take advantage of the following observations.
Suppose that x is a cache point, and y is its nearest caching neighbor. Then you can always make some path from x to y cache "for free" if you just route the cache update traffic from x to y along that path. Therefore without loss of generality the set of cache points is connected on the grid.
If the minimum cost we could wind up with exceeds the current best cost we have found, we are not on our way to a globally optimal solution.
As soon as the sum of the access rate from all points at distance greater than 1 from the cache points plus the highest access frequency of a neighbor to the cache point that you can still use is less than the cache frequency, adding more cache points is always going to be a loss. (This would be an "expensive condition that lets us stop 10 minutes early.")
The highest access neighbor of the current set of cache points is a reasonable candidate for the next cache point to try. (There are several other heuristics you can try, but this one is reasonable.)
Any point whose total access frequency exceeds the cache frequency absolutely must be a caching point.
This might not be the best set of observations to use. But it is likely to be pretty reasonable. To take advantage of this you'll need at least one data structure you might not be familiar with. If you don't know what a priority queue is, then look around for an efficient one in your language of choice. If you can't find one, a heap is pretty easy to implement and works pretty well as a priority queue.
With that in mind, assuming that you have been given the information you've described and an initial cache node P, here is pseudo-code for an algorithm to find the best.
# Data structures to be dynamically maintained:
# AT[x, n] - how many accesses x needs that currently need to go distance n.
# D[x] - The distance from x to the nearest cache node.
# CA[x] - Boolean yes/no for whether x is a cache node.
# B[x] - Boolean yes/no for whether x is blocked from being a cache node.
# cost - Current cost
# distant_accesses - The sum of the total number of accesses made from more than
# distance 1 from the cache nodes.
# best_possible_cost - C * nodes_in_cache + sum(min(total accesses, C) for non-cache nodes)
# *** Sufficient data structures to be able to unwind changes to all of the above before
# returning from recursive calls (I won't specify accesses to them, but they need to
# be there)
# best_cost - The best cost found.
# best_solution - The best solution found.
initialize all of those data structures (including best)
create neighbors priority queue of neighbors of root cache node (ordered by accesses)
call extend_current_solution(neighbors)
do what we want with the best solution
function extend_current_solution (available_neighbors):
    if cost < best_cost:
        best_cost = cost
        best_solution = CA # current set of cache nodes.
    if best_cost < best_possible_cost:
        return # pruning time
    neighbors = clone(available_neighbors)
    while neighbors:
        node = remove best from neighbors
        if distant_accesses + accesses(node) < C:
            return # this is condition 3 above
        make node in cache set
            - add it to CA
            - update costs
            - add its immediate neighbors to neighbors
        call extend_current_solution(neighbors)
        unwind changes just made
        make node in blocked set
        call extend_current_solution(neighbors)
        unwind changes to blocked set
    return
It will take a lot of work to write this, and you'll need to be careful to maintain all data structures. But my bet is that - despite how heavyweight it looks - you'll find that it prunes your search space enough to run more quickly than your existing solution. (It still won't be snappy.)
Good luck!
Update
When I thought about this more, I realized that a better observation is to note that if you can cut the "not a cache node, not a blocked node" set into two pieces, then you can solve those pieces independently. Each of those subproblems is orders of magnitude faster to solve than the whole problem, so seek to make such a cut as soon as possible.
A good heuristic to do that is to follow the following rules:
While no edge has been reached:
Drive towards the closest edge. Distance is measured by how short the shortest path is along the non-cache, non-blocked set.
If two edges are equidistant, break ties according to the following preference order: (1, x), (x, 1), (5, x), (x, 5).
Break any remaining ties according to preferring to drive towards the center of an edge.
Break any remaining ties randomly.
While an edge has been reached and your component still has edges that could become cache pieces:
If you can immediately move into an edge and split the edge pieces into two components, do so. Both for "edge in cache set" and "edge not in cache set" you'll get 2 independent subproblems that are more tractable.
Else move on a shortest path towards the piece in the middle of your section of edge pieces.
If there is a tie, break it in favor of whatever makes the line from the added piece to the added cache element as close to diagonal as possible.
Break any remaining ties randomly.
If you fall through here, choose randomly. (You should have a pretty small subproblem at this point. No need to be clever.)
If you try this starting out with (3, 3) as a cache point, you'll find that in the first few decisions, 7/16 of the time you manage to cut the problem into two even pieces, 1/16 of the time you box in the cache point and finish, 1/4 of the time you manage to cut out a 2x2 block into a separate problem (making the overall solution run 16 times faster for that piece), and 1/4 of the time you wind up well on your way towards a solution that either gets boxed in (and quickly exhausted) or becomes a candidate solution with a lot of cache points that gets pruned for being on track to a bad solution.
I won't give pseudo-code for this variation. It will have a lot of similarities to what I had above, with a number of important details to handle. But I would be willing to bet money that in practice it will run orders of magnitude faster than your original solution.
The solution is a set, so this is not a linear programming problem. It is a special case of connected facility location. Bardossy and Raghavan have a heuristic that looks promising: http://terpconnect.umd.edu/~raghavan/preprints/confl.pdf
Is spiral cache an analogy to the solution? http://strumpen.net/ibm-rc24767.pdf
