Finding contained bordered regions from Excel imports - algorithm

I am importing massive amounts of data from Excel with various table layouts. I have good enough table detection routines and merged-cell handling, but I am running into a performance problem when it comes to dealing with borders. The bordered regions in some of these files have meaning.
Data Setup:
I am importing directly from Office Open XML using VB6 and MSXML. The data is parsed from the XML into a dictionary of cell data. This works wonderfully and is just as fast as using DoCmd.TransferSpreadsheet in Access, but returns much better results. Each cell contains a pointer to a style element, which in turn contains a pointer to a border element that defines the visibility and weight of each border (this is also how the data is structured inside OpenXML).
Challenge:
What I'm trying to do is find every region that is enclosed inside borders, and create a list of cells that are inside that region.
What I have done:
I initially created a BFS (breadth-first search) fill routine to find these areas. This works wonderfully and fast for "normal" sized spreadsheets, but gets far too slow for imports running into the thousands of rows. One complication is that a border in Excel can be stored either in the cell you are checking or as the opposing border of the adjacent cell. That's OK; I can consolidate that data on import to reduce the number of checks needed.
One thing I thought about doing is creating a separate graph that outlines the cells, using the borders as my edges, and then running a graph algorithm to find regions that way, but I'm having trouble figuring out how to implement it. I've used Dijkstra in the past and thought I could do something similar here: span out with no endpoint so as to search the entire graph, and if I encounter a closed node, I know I've just found an enclosed region. But how can I know whether the route I've found is the optimal one? I suppose I could flag the found closed node and run a separate check from it back to the previous node, ignoring that one edge.
This could work, but wouldn't be much better performance-wise on dense graphs. Can anyone suggest a better method? Thanks for taking the time to read this.

Your question is pretty complicated, but it sounds as though you need an algorithm to find the connected components of a graph (a connected component is a set of nodes that are all connected to one another but to no other nodes), which can be done in linear time by repeated traversals. Pseudocode:
FindComponents(G):
    For all vertices v in G:
        Let C be a mutable empty collection
        Traverse(G, C, v)
        If C is nonempty, then it is a connected component

Traverse(G, C, v):
    If v has not been visited:
        Mark v as visited
        Add v to C
        For each neighbor w of v in G:
            Traverse(G, C, w)

Iterative variant of Traverse:

Traverse(G, C, r):
    Let S be an empty stack
    Push r onto S
    While S is not empty:
        Pop the top element v of S
        If v is not marked as visited:
            Mark v as visited
            Add v to C
            For each neighbor w of v in G:
                Push w onto S
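Applied to the spreadsheet, a minimal sketch in Python: treat each cell as a node and connect two adjacent cells only when no border separates them (after consolidating borders onto one side, as described in the question). The connected components of that graph are exactly the enclosed regions plus one "outside" component. The has_border callback is a placeholder for however your consolidated border data is stored.

def find_regions(n_rows, n_cols, has_border):
    # has_border(cell, neighbor) -> True if a border separates the two
    # adjacent cells; cells are (row, col) pairs.
    visited = set()
    regions = []
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) in visited:
                continue
            # Stack-based Traverse from the pseudocode above.
            component, stack = [], [(r, c)]
            while stack:
                cell = stack.pop()
                if cell in visited:
                    continue
                visited.add(cell)
                component.append(cell)
                x, y = cell
                for nb in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                    if (0 <= nb[0] < n_rows and 0 <= nb[1] < n_cols
                            and not has_border(cell, nb)):
                        stack.append(nb)
            regions.append(component)
    return regions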

Related

Finding Contiguous Areas of Bits in 2D Bit Array

The Problem
I have a bit array which represents a 2-dimensional "map" of "tiles". This image provides a graphical example of the bits in the bit array:
I need to determine how many contiguous "areas" of bits exist in the array. In the example above, there are two such contiguous "areas", as illustrated here:
Tiles must be located directly N, S, E or W of a tile to be considered "contiguous". Diagonally-touching tiles do not count.
What I've Got So Far
Because these bit arrays can become relatively large (several MB in size), I have intentionally avoided using any sort of recursion in my algorithm.
The pseudo-code is as follows:
LET S BE SOURCE DATA ARRAY
LET C BE ARRAY OF IDENTICAL LENGTH TO SOURCE DATA USED TO TRACK "CHECKED" BITS

FOREACH INDEX I IN S
    IF C[I] THEN
        CONTINUE
    ELSE IF S[I] THEN
        EXTRACT_AREA(S, C, I)
    ELSE
        SET C[I]

EXTRACT_AREA(S, C, I):
    LET T BE TARGET DATA ARRAY FOR STORING BITS OF THE AREA WE'RE EXTRACTING
    LET F BE STACK OF TILES TO SEARCH NEXT
    PUSH I ONTO F
    SET T[I]
    WHILE F IS NOT EMPTY
        LET X = POP FROM F
        IF C[X] THEN
            CONTINUE
        ELSE
            SET C[X]
            IF S[X] THEN
                PUSH TILE NORTH OF X TO F
                PUSH TILE SOUTH OF X TO F
                PUSH TILE WEST OF X TO F
                PUSH TILE EAST OF X TO F
                SET T[X]
    RETURN T
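For reference, a runnable Python rendering of the same iterative fill (the bounds check on neighbours is an assumption the pseudocode leaves implicit):

def extract_areas(grid, n_rows, n_cols):
    """grid[r][c] is truthy for a set bit. Returns each contiguous
    area as a set of (row, col) coordinates. No recursion."""
    checked = [[False] * n_cols for _ in range(n_rows)]
    areas = []
    for r in range(n_rows):
        for c in range(n_cols):
            if checked[r][c] or not grid[r][c]:
                continue
            area, stack = set(), [(r, c)]
            while stack:
                x, y = stack.pop()
                if checked[x][y]:
                    continue
                checked[x][y] = True
                if grid[x][y]:
                    area.add((x, y))
                    # Push the four cardinal neighbours: N, S, W, E.
                    for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                        if 0 <= nx < n_rows and 0 <= ny < n_cols:
                            stack.append((nx, ny))
            areas.append(area)
    return areas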
What I Don't Like About My Solution
Just to run, it requires two times the memory of the bitmap array it's processing.
While extracting an "area", it uses three times the memory of the bitmap array.
Duplicates exist in the "tiles to check" stack - which seems ugly, but not worth avoiding the way I have things now.
What I'd Like To See
Better memory profile
Faster handling of large areas
Solution / Follow-Up
I re-wrote the solution to explore only the edges (per hatchet's suggestion).
This was very simple to implement - and eliminated the need to keep track of "visited tiles" completely.
Based on three simple rules, I can traverse the edges, track min/max x & y values, and complete when I've arrived at the start again.
Here's the demo with the three rules I used:
One approach would be a perimeter walk.
Given a starting point anywhere along the edge of the shape, remember that point.
Start the bounding box as just that point.
Walk the perimeter using a clockwise rule set - if the point used to get to the current point was above, then first look right, then down, then left to find the next point on the shape perimeter. This is kind of like the simple strategy of solving a maze where you continuously follow a wall and always bear to the right.
Each time you visit a new perimeter point, expand the bounding box if the new point is outside it (i.e. keep track of the min and max x and y).
Continue until the starting point is reached.
Cons: if the shape has lots of single pixel 'filaments', you'll be revisiting them as the walk comes back.
Pros: if the shape has large expanses of internal occupied space, you never have to visit them or record them like you would if you were recording visited pixels in a flood fill.
So, conserves space, but in some cases at the expense of time.
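Here is one hedged sketch of such a walk for a 4-connected region in Python (not necessarily the exact rule set from the demo above): keep a hand on the wall by trying a right turn first, then straight, then left, then back, and stop when the first transition out of the start point repeats. Starting from the shape's topmost-leftmost cell guarantees the walk begins on the perimeter.

# Headings in clockwise order: N, E, S, W, as (row, col) offsets.
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def perimeter_bounding_box(filled, start):
    """Bounding box of the 4-connected shape containing `start`, found
    by walking its perimeter. `filled(r, c)` must return False outside
    the grid; `start` should be the shape's topmost-leftmost cell."""
    (r, c), heading = start, 1          # begin heading East
    min_r = max_r = r
    min_c = max_c = c
    first_move = None
    while True:
        for turn in (1, 0, 3, 2):       # try right, straight, left, back
            h = (heading + turn) % 4
            nr, nc = r + DIRS[h][0], c + DIRS[h][1]
            if filled(nr, nc):
                break
        else:                           # no occupied neighbour at all:
            break                       # single-cell shape
        if ((r, c), h) == first_move:   # the walk has closed on itself
            break
        if first_move is None:
            first_move = ((r, c), h)
        r, c, heading = nr, nc, h
        min_r, max_r = min(min_r, r), max(max_r, r)
        min_c, max_c = min(min_c, c), max(max_c, c)
    return (min_r, min_c, max_r, max_c)

Note that this sketch ends the walk only when it leaves the start cell in the same direction as the first step, not merely when it revisits the start cell, which sidesteps the premature-stopping problem discussed in the edit below.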
Edit
As is so often the case, this problem is known, named, and has multiple algorithmic solutions. The problem you described is called Minimum Bounding Rectangle. One way to solve this is using Contour Tracing. The method I described above is in that class, and is called Moore-Neighbor Tracing or Radial Sweep. The links I've included for them discuss them in detail and point out a problem I hadn't caught: sometimes you'll revisit the start point before you have traversed the entire perimeter. If your start point is, for instance, somewhere along a single-pixel 'filament', you will revisit it before you're done, and unless you consider this possibility, you'll stop prematurely. The website I linked to talks about ways to address this stopping problem. Other pages at that website also talk about two other algorithms: Square Tracing and Theo Pavlidis's Algorithm. One thing to note is that these consider diagonals contiguous, whereas you don't, but that should just be something that can be handled with minor modifications to the basic algorithms.
An alternative approach to your problem is Connected-component labeling. For your needs though, this may be a more time expensive solution than you require.
Additional resource:
Moore Neighbor Contour Tracing Algorithm in C++
I actually got a question like this in an interview once.
You can pretend the array is a graph in which adjacent nodes are connected. My algorithm involves scanning to the right until you find a marked node; when you find one, do a breadth-first search, which runs in O(n) and avoids recursion. When the BFS returns, keep scanning from where you left off, and if a node has already been marked by one of the previous BFS passes, you obviously don't need to search from it again. I wasn't sure if you wanted to actually return the number of objects found, but it's easy to keep count by incrementing a counter each time you hit the first square of a new area.
Generally when you do a flood fill type algorithm you are placed in a spot and asked to fill. Since this is finding all the filled regions one way you would want to optimize it is to avoid rechecking the already marked nodes from previous BFS's, unfortunately at the moment I cannot think of a way to do that.
One hacky way to reduce memory consumption would be to store a short[][] (or byte[][]) instead of a boolean[][], and then use this scheme to avoid allocating a whole second 2D array:
unmarked = 0, marked = 1, checked and unmarked = 2, checked and marked = 3
This way you can read the full status of an entry from a single value and avoid keeping a second array.
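A small sketch of that packing in Python, using bit flags in a flat bytearray (the layout and names are illustrative, not from the answer):

MARKED, CHECKED = 1, 2   # bit flags packed into one byte per tile

def count_areas(cells, n_rows, n_cols):
    """cells: bytearray of length n_rows * n_cols, entries 0 (unmarked)
    or MARKED. Counts contiguous marked areas in place, with no
    separate 'visited' array."""
    count = 0
    for i in range(len(cells)):
        if cells[i] & CHECKED or not cells[i] & MARKED:
            continue
        count += 1
        stack = [i]
        while stack:
            j = stack.pop()
            if cells[j] & CHECKED:
                continue
            cells[j] |= CHECKED
            if cells[j] & MARKED:
                r, c = divmod(j, n_cols)
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= nr < n_rows and 0 <= nc < n_cols:
                        stack.append(nr * n_cols + nc)
    return count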

How to find the neighbors of a graph efficiently

I have a program that creates graphs as shown below.
The algorithm starts at the green node and traverses the graph. Assume that a node (a linked-list-style node with four references: Left, Right, Up and Down) has been added to the graph, depicted by the red dot in the image. In order to integrate the newly created node with its neighbors, I need to find the four neighboring objects and link them so that the graph's connectivity is preserved.
Following is what I need to clarify
Assume that all yellow-colored nodes are null and that I do not keep another data structure mapping the nodes. What is the most efficient way to find whether the neighbors of a newly created node exist? I know the basic graph-search algorithms like DFS and BFS, and the shortest-path algorithms, but I do not think any of these are efficient enough, because the graph can have about 10,000 nodes, and running a graph search (starting from the green node) every time a node is added seems computationally expensive.
If the graph search is not avoidable, what is the best alternative structure? I thought of a large multi-dimensional array, but this wastes memory and cannot represent negative indexes, and the graph in the image can grow in any direction. My idea for that is to write a separate class with an array-based data structure that supports negative indexes. Before taking this option, however, I would like to know whether I can solve the problem without resorting to a new structure and save a lot of rework.
Thank you for any feedback and reading this question.
I'm not sure if I understand you correctly. Do you want to
Check that there is a path from (0,0) to (x1,y1)
or
Check if any of the neighbors of (x1,y1) are in the graph? (Even if there is no path from (0,0) to any of these neighbors.)
I assume that you are looking for a path (otherwise you wouldn't use a linked list), which implies that you can't store points which have no path to (0,0).
Also, you mentioned that you don't want to use any other data structure beside / instead of your 2D linked-list.
You can't avoid full graph search. BFS and DFS are the classic algorithms. I don't think that you care about the shortest path - any path would do.
Other approaches you may consider are A* (simple explanation here) or one of its variants (look here).
An alternative data structure would be a set of nodes (each node a pair <x, y>, of course). You can easily run four checks to see whether any of a node's neighbors are already in the set. It would take O(n) space and O(log n) time for both check and add. If your programming language does not support pairs as set elements, you can use a single integer (x * (Ymax + 1) + y) instead.
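A minimal sketch of that set idea in Python (whose built-in set is hash-based, so check and add are expected O(1) rather than O(log n)):

nodes = set()   # occupied grid points, stored as (x, y) pairs

def add_point(x, y):
    """Insert a point and return whichever of its four neighbors exist."""
    nodes.add((x, y))
    candidates = ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
    return [p for p in candidates if p in nodes]

add_point(0, 0)
add_point(1, 0)
print(add_point(0, 1))   # -> [(0, 0)]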
Your data structure can be made to work, but probably not efficiently. And it will be a lot of work.
With your current data structure you can use an A* search (see https://en.wikipedia.org/wiki/A*_search_algorithm for a basic description) to find a path to the point, which necessarily finds a neighbor. Then pretend that you've got a little guy at that point, put his right hand on the wall, then have him find his way clockwise around the point. When he gets back, he'll have found the rest.
What do I mean by find his way clockwise? For example, suppose that you went Down from the neighbor to get to his point. Then your guy should face the first of Right, Up, and Left in which he has a neighbor. If he can go Right, he will; then he will try the directions Down, Right, Up, and Left. (Just imagine trying to walk through the maze yourself with your right hand on the wall.)
This way lies insanity.
Here are two alternative data structures that are much easier to work with.
You can use a quadtree. See http://en.wikipedia.org/wiki/Quadtree for a description. With this inserting a node is logarithmic in time. Finding neighbors is also logarithmic. And you're only using space for the data you have, so even if your graph is very spread out this is memory efficient.
Alternately you can create a class for a type of array that takes both positive and negative indices, then a 2-D class built on top of it that does the same. Under the hood, such a class is a regular array plus an offset, so the array can effectively start at any number, positive or negative. If you ever try to insert data before the offset, you pick a new offset below that index by a fixed fraction of the array's length, allocate a new array, and copy the data from the old one to the new. Inserts and neighbor lookups are then usually O(1), but this can be very wasteful of memory.
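A rough 1-D sketch of that offset trick in Python (the 2-D version composes two of these; the growth margin is an arbitrary choice):

class OffsetArray:
    """Array addressable by any integer index, positive or negative,
    implemented as a plain list plus an offset."""

    def __init__(self):
        self.data = []
        self.offset = 0          # physical index = logical index - offset

    def __setitem__(self, i, value):
        j = i - self.offset
        if j < 0:
            # Re-base: move the offset below i by a margin, then copy.
            margin = max(len(self.data) // 2, 4)
            new_offset = i - margin
            self.data = [None] * (self.offset - new_offset) + self.data
            self.offset = new_offset
            j = i - self.offset
        if j >= len(self.data):
            self.data.extend([None] * (j - len(self.data) + 1))
        self.data[j] = value

    def __getitem__(self, i):
        j = i - self.offset
        return self.data[j] if 0 <= j < len(self.data) else None

a = OffsetArray()
a[-3] = "x"
a[5] = "y"
print(a[-3], a[5], a[0])   # -> x y None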
You can use a spatial index like a quadtree or an R-tree.

Algorithm for falling grid items

I do not know how to describe the goal succinctly, which may be why I haven't been able to find an applicable algorithm despite ample searching, but a picture shows it clearly:
Given the state of items in the grid at the left, does anyone know of an algorithm for efficiently finding the ending positions shown in the grid at right? In this case all the items have "fallen" "down", but the direction of course is arbitrary. The point is just that:
There are a collection of items of arbitrary shapes, but all composed of contiguous squares
Items cannot overlap
All items should move the maximum distance in a given direction until they are touching a wall, or they are touching another item which [...is touching another item ad infinitum...] is touching a wall.
This is not homework, I'm not a student. This is for my own interest in geometry and programming. I haven't mentioned the language because it doesn't matter. I can implement whatever algorithm in the language I'm using for the specific project I'm working on. A useful answer could be described in words or code; it's the ideas that matter.
This problem could probably be abstracted into some kind of graph (in the mathematical sense) of dependencies and slack space, so perhaps an algorithm aimed at minimizing lag time could be adapted.
If you don't know the answer but are about to try to make up an algorithm on the spot, just remember that there can be circular dependencies, such as with the interlocking pink (backwards) "C" and blue "T" shapes. Parts of T are below C, and parts of C are below T. This would be even more tricky if interlocking items were locked through a "loop" of several pieces.
Some notes for an applicable algorithm: All the following are very easy and fast to do because of the way I've built the grid object management framework:
Enumerate the individual squares within a piece
Enumerate all pieces
Find the piece, if any, occupying a specific square in the overall grid
A note on the answer:
maniek hinted it first, but bloops has provided a brilliant explanation. I think the absolute key is the insight that all pieces moving the same amount maintain their relationship to each other, and therefore those relationships don't have to be considered.
An additional speed-up for a sparsely populated board would be to shift all pieces to eliminate rows that are completely empty. It is very easy to count empty rows and to identify pieces on one side ("above") an empty row.
Last note: I did in fact implement the algorithm described by bloops, with a few implementation-specific modifications. It works beautifully.
The Idea
Define the set of frozen objects inductively as follows:
An object touching the bottom is frozen.
An object lying on a frozen object is frozen.
Intuitively, exactly the frozen objects have reached their final place. Call the non-frozen objects active.
Claim: All active objects can fall one unit downwards simultaneously.
Proof: Of course, an active object will not hit another active object, since their relative positions do not change. An active object will also not hit a frozen object: if that were so, then the active object was in fact frozen, because it was lying on a frozen object. This contradicts our assumption.
Our algorithm's very high-level pseudo-code would be as follows:
while (there are active objects):
    move active objects downwards simultaneously until one of them hits a frozen object
    update the status (active/frozen) of each object
Notice that at least one object becomes frozen in each iteration of the while loop. Also, every object becomes frozen exactly once. These observations would be used while analyzing the run-time complexity of the actual algorithm.
The Algorithm
We use the concept of time to improve the efficiency of most operations. Time is measured starting from 0, and every unit movement of the active objects takes 1 unit of time. Observe that, when we are at time t, the displacement of every object still active at time t is exactly t units downward.
Note that in each column, the relative ordering of the cells is fixed. One of the implications of this is that each cell can directly stop at most one other cell from falling. This observation can be used to efficiently predict the time of the next collision, and we can get away with 'processing' every cell at most once.
We index the columns starting from 1 and increasing from left to right; and the rows with height starting from 1. For ease of implementation, introduce a new object called bottom - which is the only object which is initially frozen and consists of all the cells at height 0.
Data Structures
For an efficient implementation, we maintain the following data structures:
An associative array A containing the final displacement of each cell. If a cell is active, its entry should be, say, -1.
For each column k, we maintain the set S_k of the initial row numbers of the active cells in column k. We need to be able to support successor queries and deletions on this set. We could use a van Emde Boas tree and answer every query in O(log log H), where H is the height of the grid; or we could use a balanced binary search tree, which performs these operations in O(log N), where N is the number of cells in column k.
A priority queue Q which stores the active cells, keyed by the expected time of their next collision. Again, we could go for either a vEB tree for a query time of O(log log H), or a regular priority queue with O(log N) time per operation.
Implementation
The detailed pseudo-code of the algorithm follows:
Populate the S_k's with active cells
Initialize Q to be an empty priority queue
For every cell b in bottom:
    Push Q[b] = 0
While (Q is not empty):
    (x, t) = Q.extract_min()   // the active cell x collides at time t
    Object O = parent_object(x)
    For every cell y in O:
        A[y] = t               // freeze cell y at displacement t
        k = column(y)
        S_k.delete(y)
        a = S_k.successor(y)   // the only active cell that can collide with y
        if (a != nil):
            // expected time of the collision between a and y;
            // note that both their positions are currently t + (their original height)
            coll_t = t + height(a) - height(y) - 1
            Push/update Q[a] = coll_t
The final position of any object can be obtained by querying A for the displacement of any cell belonging to that object.
Running Time
We process and freeze every cell exactly once. We perform a constant number of lookups while freezing every cell. We assume parent_object lookup can be done in constant time. The complexity of the whole algorithm is O(N log N) or O(N log log H) depending on the data structures we use. Here, N is the total number of cells across all objects.
And now something completely different :)
Each piece that rests on the ground is fixed. Each piece that rests on a fixed piece is fixed. The rest can move: move the unfixed pieces one square down, and repeat until nothing can move.
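A direct, unoptimized Python rendering of that rule (representing each piece as a set of (row, col) cells is an assumption; the worst case is O(cells x total fall distance)):

def settle(pieces, n_rows):
    """pieces: list of sets of (row, col) cells; row indices grow
    downward, and row n_rows - 1 sits on the ground. Drops every
    piece as far as the rules allow."""
    while True:
        occupied = {cell: i for i, piece in enumerate(pieces)
                    for cell in piece}
        fixed, changed = set(), True
        # Inductive freezing: on the ground, or resting on a fixed piece.
        while changed:
            changed = False
            for i, piece in enumerate(pieces):
                if i in fixed:
                    continue
                for r, c in piece:
                    below = occupied.get((r + 1, c))
                    if r + 1 == n_rows or (below is not None
                                           and below != i
                                           and below in fixed):
                        fixed.add(i)
                        changed = True
                        break
        if len(fixed) == len(pieces):
            return pieces
        # All still-active pieces can fall one unit simultaneously.
        for i in range(len(pieces)):
            if i not in fixed:
                pieces[i] = {(r + 1, c) for r, c in pieces[i]}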
Okay, so this appears to be as follows:
We proceed in steps. In each step we build a directed graph whose vertices are the objects and whose edges are as follows:
1) If x and y are two objects, then add an edge x -> y if x cannot move until y moves. Note that we can have both the x -> y and y -> x edges.
2) Some objects can no longer move because they are at the bottom, so we color their vertices blue. The remaining vertices are red.
3) In the directed graph, find all the strongly connected components using Kosaraju's algorithm, Tarjan's algorithm, etc. (If you are not familiar with SCCs, they are an extremely powerful technique; refer to Kosaraju's algorithm.) Once we have the SCCs, we shrink each one to a single vertex, i.e. we replace the SCC by one vertex while preserving all edges external to the SCC. If any vertex in the SCC is blue, we color the new vertex blue, else red. This captures the fact that if one object in an SCC cannot move, then none can.
4) The graph we get is a directed acyclic graph, so we can do a topological sort. Traverse the vertices in increasing topological order and, whenever you see a red vertex, move the objects it represents.
Continue these steps until no vertex can move in step 4.
If two objects A and B overlap, then we say that they are inconsistent relative to each other. For the proof of correctness, argue the following lemmas:
1) "If I move an SCC, then none of the objects in it cause inconsistencies among themselves."
2) "When I move an object in step 4, I do not cause inconsistencies."
The challenge for you now will be to formally prove correctness and to find suitable data structures to solve this efficiently. Let me know if you need any help.
I haven't worked out all the details, but I think the following is a somewhat systematic approach:
Looking at the whole picture as a graph, what you need is a topological sort of all the vertices, i.e. the items. Item A should come before item B in the sorted order if any part of A is below any part of B. Then, once you have the items sorted topologically, you can just iterate through them in this order and determine the positions, since all the items below the current one already have fixed positions.
In order to be able to do the topologic sort, you need an acyclic graph. You can then use some of the algorithms for Strongly Connected Components to find all the cycles and compress them into single vertices. Then you can perform a topsort on the resulting graph.
To find the positions of the pieces within a SCC: first consider it as one big piece and determine where it will end up. This will determine some fixed pieces that cannot move anymore. Remove them and repeat the same procedure for the rest of the pieces in this SCC (if any) to find out their final positions.
The third part is the only one that seems computationally intensive if you have a very complicated structure for the pieces, but it should still be faster than trying to move the pieces one square of the grid at a time.
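As a concrete sketch of the SCC machinery both answers lean on, here is Kosaraju's algorithm in Python (the blocking-edge convention x -> y meaning "x cannot fall until y falls" follows the answers above; the recursion depth is fine for modest piece counts):

from collections import defaultdict

def condense(vertices, edges):
    """Kosaraju's algorithm. edges maps a vertex to its successors.
    Returns (sccs, dag): the strongly connected components, and the
    edge set of the condensed acyclic graph between component ids."""
    order, seen = [], set()

    def dfs1(v):                 # first pass: record finish order
        seen.add(v)
        for w in edges.get(v, ()):
            if w not in seen:
                dfs1(w)
        order.append(v)

    for v in vertices:
        if v not in seen:
            dfs1(v)

    rev = defaultdict(list)      # reversed graph for the second pass
    for v in vertices:
        for w in edges.get(v, ()):
            rev[w].append(v)

    comp = {}

    def dfs2(v, label):          # second pass: one DFS per component
        comp[v] = label
        for w in rev[v]:
            if w not in comp:
                dfs2(w, label)

    n = 0
    for v in reversed(order):
        if v not in comp:
            dfs2(v, n)
            n += 1
    sccs = [[] for _ in range(n)]
    for v, label in comp.items():
        sccs[label].append(v)
    dag = {(comp[v], comp[w]) for v in vertices
           for w in edges.get(v, ()) if comp[v] != comp[w]}
    return sccs, dag

Kosaraju discovers components in topological order of the condensation (sources first), so iterating over sccs in reverse should process each combined piece only after everything it is blocked by.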
EDITED several times. I think this is all you need to do:
Find all the pieces that can only fall mutually dependent on each other, and combine them into an equivalent larger piece (e.g. the T and backwards C in your picture).
Iterate through all the pieces, moving each piece the maximum distance down before it hits something. Repeat until nothing moves.

Combinatorial optimization

Suppose we have a connected and undirected graph: G=(V,E).
Definition of connected-set: a group of points belonging to V of G forms a valid connected-set iff every point in this group is within T - 1 edges of every other point in the same group, where T is the number of points in the group.
Please note that a connected-set is just the vertex set of a connected subgraph of G: the points without the edges.
And we have an arbitrary function F defined on connected-sets, i.e. given an arbitrary connected-set CS, F(CS) gives us a real value.
Two connected-sets are said to be disjoint if their union is not a connected-set.
For a visual explanation, please see the graph below:
In the graph, the red, black, and green point sets are all valid connected-sets; the green set is disjoint from the red set, but the black set is not disjoint from the red one.
Now the question:
We want to find a bunch of disjoint connected-sets from G so that:
(1) every connected-set has at least K points (K is a global parameter);
(2) the sum of their function values, i.e. Σ F(CS), is maximized.
Is there any efficient algorithm to tackle such a problem, other than an exhaustive search?
Thanks!
For example, the graph can be a planar graph in the 2D Euclidean plane, and the function value F of a connected-set CS can be defined as the area of the minimum bounding rectangle of all the points in CS (the minimum bounding rectangle is the smallest rectangle enclosing all the points in the CS).
If you can define your function and prove it is a submodular function (a property analogous to convexity in continuous optimization), then there are very efficient (strongly polynomial) algorithms that will solve your problem, e.g. Minimum Norm Point.
To prove that your function is submodular, you only need to prove the following for all sets A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B).
There are several available implementations of the Minimum Norm Point algorithm, e.g. the Matlab Toolbox for Submodular Function Optimization.
I doubt there is an efficient algorithm, since for a complete graph, for instance, you cannot solve the problem without knowing the value of F on every subgraph (unless you have assumptions on F, monotonicity for instance).
Nevertheless, I'd go for a non-deterministic algorithm. Try simulated annealing, with the transitions being:
Remove a point from a set (if it stays connected)
Move a point from a set to another (if they stay connected)
Remove a set
Add a set with one point
Good luck, this seems to be a difficult problem.
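A generic simulated-annealing skeleton for this kind of search in Python (propose and score are placeholders for the move set above and Σ F(CS); all names here are illustrative):

import math
import random

def anneal(initial, propose, score, steps=100_000, t0=1.0, cooling=0.9995):
    """propose(state) -> a fresh neighboring candidate state;
    score(state)   -> the objective Σ F(CS) to maximize."""
    state, s = initial, score(initial)
    best, best_s = state, s
    t = t0
    for _ in range(steps):
        cand = propose(state)    # e.g. add/remove/move a point or set
        cs = score(cand)
        # Always accept improvements; accept worsenings with
        # Boltzmann probability exp((cs - s) / t).
        if cs >= s or random.random() < math.exp((cs - s) / t):
            state, s = cand, cs
            if s > best_s:
                best, best_s = state, s
        t *= cooling             # geometric cooling schedule
    return best, best_s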
For such a general F, it is not an easy task to draft an optimized algorithm, far from the brute force approach.
For instance, since we want to find a bunch of CS where F(CS) is maximized, should we assume we actually want to find max(Σ F(CS)) over all valid collections of CS, or the highest single F value among all possible CS, max(F(csi))? We don't know for sure.
Also, F being arbitrary, we cannot estimate the probability that F(cs+p1) > F(cs) implies F(cs+p1+p2) > F(cs).
However, we can still discuss it:
It seems we can deduce from the problem that we can treat each CS independently, meaning that if n = F(cs1), adding any cs2 (disjoint from cs1) will have no impact on the value n.
It seems also believable, and this is where we should be able to get some gain, that the calculation of F can be made starting from any point of a CS, and, in general, if CS = cs1+cs2, F(CS) = F(cs1+cs2) = F(cs2+cs1).
Then we want to inject memoization into the algorithm in order to speed up the process when a CS is grown little by little in order to find max(F(cs)). (With F fully general, the dynamic programming approach, for instance starting from a CS made of all points and then shrinking it little by little, doesn't seem to hold much interest.)
Ideally, we could start with a CS made of one point and extend it by one point at a time, checking and storing F values for each subset. Each test would first check whether the F value already exists, so as not to recalculate it; then repeat the process from another point, etc., to find the subsets that maximize F. For a large number of points, this is extremely slow.
A more reasonable approach would be to try random points and grow the CS up to a given size, then try another area distinct from the biggest CS obtained at the previous stage. One could try to assess the probability mentioned above, and direct the algorithm in a certain way depending on the result.
But, again due to the lack of properties of F, we can expect an exponential space requirement for the memoization (like storing F(p1, ..., pn) for all subsets), and exponential complexity.
I would use dynamic programming. You can start out rephrasing your problem as a node coloring problem:
Your goal is to assign a color to each node. (In other words you are looking for a coloring of the nodes)
The available colors are black and white.
In order to judge a coloring you have to examine the set of "maximal connected sets of black nodes".
A set of black nodes is called connected if the induced subgraph is connected
A connected set of black nodes is called maximal if none of the nodes in the set has a black neighbor in the original graph that is not contained in the set.
Your goal is to find the coloring that maximizes ΣF(CS). (Here you sum over the "maximal connected sets of black nodes")
Some extra constraints, as specified in your original post, also apply.
Perhaps you could look for an algorithm that does something like the following
Pick a node
Try to color the chosen node white
Look for a coloring of the remaining nodes that maximizes ΣF(CS)
Try to color the chosen node black
Look for a coloring of the remaining nodes that maximizes ΣF(CS)
Each time you color a node white, you can examine whether or not the graph has become "decomposable" (I made up this word; it is not official):
A partially colored graph is called "decomposable" if it contains a pair of non-white nodes that are not connected by any path that avoids white nodes.
If your partially colored graph is decomposable then you can split your problem in to two sub-problems.
EDIT: I added an alternative idea and deleted it again. :)

How can I find islands in a randomly generated hexagonal map?

I'm programming a Risk-like game in CodeIgniter and jQuery. I've come up with a way to create randomly generated maps by making a full layout of tiles and then deleting random ones. However, this sometimes produces what I call islands.
In Risk, you can only attack one space over. So if one player happens to have an island all to themselves, they would never be able to lose.
I'm trying to find a way to check the map before the game begins to see whether it has islands.
I have already come up with a function to find out how many adjacent spaces there are to each space, but am not sure how to implement it in order to find islands.
Each missing spot is also identified as "water."
I'm not allowed to use image tags:
http://imgur.com/xwWzC.gif
There's a standard name for this problem but off the top of my head the following might work:
Pick any tile at random
Color it
Color its neighbours
Color its neighbours' neighbours
Color its neighbours' neighbours' neighbours, etc.
When you're done (i.e. when all reachable neighbours are colored), loop through the list of all tiles to see whether any are still uncolored (if so, they're part of an island).
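A sketch of that coloring pass in Python (the axial-coordinate scheme, in which each hex tile has six neighbours at fixed offsets, is an assumption about the map representation):

# The six neighbours of a hex tile in axial coordinates.
HEX_NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1))

def find_islands(land):
    """land: set of (q, r) axial coordinates of land tiles. Colors
    outward from an arbitrary tile; whatever stays uncolored belongs
    to islands (disconnected components)."""
    if not land:
        return set()
    colored, stack = set(), [next(iter(land))]
    while stack:
        tile = stack.pop()
        if tile in colored:
            continue
        colored.add(tile)
        q, r = tile
        for dq, dr in HEX_NEIGHBORS:
            if (q + dq, r + dr) in land:
                stack.append((q + dq, r + dr))
    return land - colored   # empty set means no islands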
How do you do the random generation? Probably the best way is to solve this at generation time: if you notice that a tile you've just created is impossible to get to, you can resolve it by adding an appropriate connecting tile.
Though we'll need to know how you do the generation.
Here's your basic depth-first traversal starting at a random tile, in Python (all_tiles and tile.neighbors() are assumed to come from your map code; using a list as a stack is what makes the traversal depth-first):

import random

visited = set()
stack = []
r = random.choice(all_tiles)
stack.append(r)
while stack:
    current = stack.pop()
    visited.add(current)
    for neighbor in current.neighbors():
        if neighbor not in visited:
            stack.append(neighbor)
if visited == set(all_tiles):
    print("No islands")
else:
    print("Islands exist, e.g. at", (set(all_tiles) - visited).pop())
This hopefully provides another solution. Instead of "island" I'm using the term "disconnected component" since it only matters whether all tiles are reachable from all others (if there are disconnected components then a player cannot win via conquest if his own territories are all in one component).
Iterate over all 'land' tiles (easy enough to do) and for each tile generate a node in a graph.
For each vertex, join it with an undirected edge to the vertices representing its neighbour tiles (a maximum of 6).
Pick any vertex and run depth-first search (or breadth-first) from it.
If the set of vertices found by the DFS is equal to the set of all vertices then there are no disconnected components, otherwise a disconnected component (island) exists.
This should (I think) run in time O(n) where n is the number of land tiles.
Run a blurring kernel over your data set, treating the hex grid as an image (it is, sort of):
value(x, y) = average of all tiles around (x, y)
This will erode beaches slightly and eliminate islands.
It is left as an exercise for the student to run an edge-detection kernel over the resulting dataset to populate the beach tiles.
