Optimizing Conway's 'Game of Life' - algorithm

To experiment, I've (long ago) implemented Conway's Game of Life (and I'm aware of this related question!).
My implementation worked by keeping 2 arrays of booleans, representing the 'last state', and the 'state being updated' (the 2 arrays being swapped at each iteration). While this is reasonably fast, I've often wondered about how to optimize this.
One idea, for example, would be to precompute at iteration N the zones that could be modified at iteration (N+1) (so that if a cell does not belong to such a zone, it won't even be considered for modification at iteration (N+1)). I'm aware that this is very vague, and I never took time to go into the details...
Do you have any ideas (or experience!) of how to go about optimizing (for speed) Game of Life iterations?

I am going to quote my answer from the other question, because the chapters I mention have some very interesting and fine-tuned solutions. Some of the implementation details are in C and/or assembly, yes, but for the most part the algorithms can work in any language:
Chapters 17 and 18 of Michael Abrash's Graphics Programmer's Black Book are one of the most interesting reads I have ever had. It is a lesson in thinking outside the box. The whole book is great really, but the final optimized solutions to the Game of Life are incredible bits of programming.

There are some super-fast implementations that (from memory) represent cells of 8 or more adjacent squares as bit patterns and use that as an index into a large array of precalculated values to determine in a single machine instruction if a cell is live or dead.
Check out here:
Also XLife:
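A minimal Python sketch of the table-lookup idea, for illustration only. It packs one 3x3 neighbourhood into a 9-bit index and looks the next state up in a 512-entry table. The fast implementations mentioned above pack many cells per machine word and use much larger tables; the names `build_table`, `TABLE` and `step` are made up for this sketch, and cells outside the grid are assumed dead.

```python
def build_table():
    """Precompute the next state for every 3x3 neighbourhood packed into 9 bits.

    Bit 4 is the centre cell; the other eight bits are its neighbours.
    """
    table = [0] * 512
    for bits in range(512):
        centre = (bits >> 4) & 1
        neighbours = bin(bits).count("1") - centre
        table[bits] = 1 if neighbours == 3 or (centre and neighbours == 2) else 0
    return table


TABLE = build_table()


def step(grid):
    """One generation on a list-of-lists grid of 0/1; off-grid cells count as dead."""
    h, w = len(grid), len(grid[0])

    def cell(r, c):
        return grid[r][c] if 0 <= r < h and 0 <= c < w else 0

    new = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            bits = 0
            # Pack the neighbourhood row by row; the centre ends up in bit 4.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    bits = (bits << 1) | cell(r + dr, c + dc)
            new[r][c] = TABLE[bits]  # the rule itself is a single lookup
    return new
```

The rule evaluation is reduced to one array access; the packing loop is where a real implementation would instead slide a bit window along the row.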

You should look into Hashlife, the ultimate optimization. It uses the quadtree approach that skinp mentioned.

As mentioned in Abrash's Black Book, one of the simplest and most straightforward ways to get a huge speedup is to keep a change list.
Instead of iterating through the entire cell grid each time, keep a copy of all the cells that you change.
This will narrow down the work you have to do on each iteration.
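A rough Python sketch of a change list, under the assumption that the world is stored as a set of live `(row, col)` coordinates rather than a boolean array. This is not Abrash's code, just the idea: a cell can only flip in the next generation if it, or one of its neighbours, flipped in the previous one. The names `neighbours` and `step` are illustrative.

```python
def neighbours(cell):
    """The eight cells surrounding (row, col)."""
    r, c = cell
    return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if dr or dc]


def step(live, changed):
    """Advance one generation, re-examining only cells near last tick's changes.

    live:    set of live (row, col) cells
    changed: set of cells whose state flipped in the previous generation
    Returns (new_live, new_changed).
    """
    # Only changed cells and their neighbours can possibly flip now.
    candidates = set(changed)
    for cell in changed:
        candidates.update(neighbours(cell))

    new_live = set(live)
    new_changed = set()
    for cell in candidates:
        n = sum(1 for nb in neighbours(cell) if nb in live)
        alive = cell in live
        will_live = n == 3 or (alive and n == 2)
        if will_live != alive:
            new_changed.add(cell)
            if will_live:
                new_live.add(cell)
            else:
                new_live.discard(cell)
    return new_live, new_changed
```

For the first generation, pass the initial live cells as `changed` as well (everything "changed" relative to an empty board). Once a pattern settles into still lifes, `new_changed` becomes empty and a step costs almost nothing.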

The algorithm itself is inherently parallelizable. Using the same double-buffered method in an unoptimized CUDA kernel, I'm getting around 25ms per generation in a 4096x4096 wrapped world.

Which algorithm is most efficient depends mainly on the initial state.
If the majority of cells are dead, you can save a lot of CPU time by skipping empty regions instead of calculating everything cell by cell.
In my opinion, it makes sense to check for completely dead regions first when your initial state is something like "random, but with a chance of life lower than 5%".
I would just divide the matrix into halves along each axis and start by checking the bigger pieces first.
So if you have a field of 10,000 * 10,000, you'd first accumulate the states of the upper-left quarter of 5,000 * 5,000.
If the sum of states in the first quarter is zero, you can ignore that quarter completely and check the upper-right 5,000 * 5,000 for life next.
If its sum of states is > 0, you divide that second quarter into 4 pieces again and repeat the check for life on each of these subspaces.
You could go down to subframes of 8*8 or 10*10 (not sure what makes the most sense here).
Whenever you find life, you mark these subspaces as "has life".
Only spaces which "have life" need to be divided into smaller subspaces; the empty ones can be skipped.
When you have finished assigning the "has life" attribute to all possible subspaces, you end up with a list of subspaces, which you now simply extend by +1 in each direction (with empty cells) and apply the regular (or modified) Game of Life rules to.
You might think that dividing a 10,000*10,000 space into subspaces of 8*8 is a lot of tasks, but accumulating their state values is in fact much, much less computing work than applying the GoL algorithm to each cell plus its 8 neighbours, comparing the count, and storing the new state for the next iteration somewhere...
But like I said above, for a random initial state with 30% population this won't make much sense, as there will not be many completely dead 8*8 subspaces to find (let alone dead 256*256 subspaces).
And of course, the best way to optimise will, last but not least, depend on your language.
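The subdivision described above could look roughly like this in Python. It is a sketch under simplifying assumptions: the grid is a square list of lists whose side is a power of two, and each level rescans its cells instead of caching sums, which a serious version would avoid. `live_blocks` and `min_size` are names invented here.

```python
def live_blocks(grid, r0, c0, size, min_size=8):
    """Recursively collect (row, col, size) squares that contain at least one live cell.

    Squares are split into quarters until their side is <= min_size.
    Completely dead squares are dropped without being subdivided.
    """
    has_life = any(
        grid[r][c]
        for r in range(r0, r0 + size)
        for c in range(c0, c0 + size)
    )
    if not has_life:
        return []  # dead region: skip it and everything inside it
    if size <= min_size:
        return [(r0, c0, size)]  # small enough: mark as "has life"
    half = size // 2
    blocks = []
    for dr in (0, half):
        for dc in (0, half):
            blocks += live_blocks(grid, r0 + dr, c0 + dc, half, min_size)
    return blocks
```

Each returned block would then be extended by one cell in every direction before the normal rules are applied, as described above.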

Two ideas:
(1) Many configurations are mostly empty space. Keep a linked list (not necessarily in order, that would take more time) of the live cells, and during an update, only update around the live cells (this is similar to your vague suggestion, OysterD :)
(2) Keep an extra array which stores the # of live cells in each row of 3 positions (left-center-right). Now when you compute the new dead/live value of a cell, you need only 4 read operations (top/bottom rows and the center-side positions), and 4 write operations (update the 3 affected row summary values, and the dead/live value of the new cell). This is a slight improvement from 8 reads and 1 write, assuming writes are no slower than reads. I'm guessing you might be able to be more clever with such configurations and arrive at an even better improvement along these lines.
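Idea (1) can be sketched in a few lines of Python. This version swaps the linked list for a set of live coordinates (an assumption made here for brevity) and tallies neighbour counts only around live cells, so empty space costs nothing.

```python
from collections import Counter


def step(live):
    """One generation on a sparse world given as a set of live (row, col) cells."""
    # Every live cell adds 1 to the count of each of its eight neighbours.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if dr or dc
    )
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}
```

The world is unbounded in this representation; a bounded or wrapped grid would need the coordinates clipped or taken modulo the size.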

If you don't want anything too complex, then you can use a grid to slice it up, and if that part of the grid is empty, don't try to simulate it (please view Tyler's answer). However, you could do a few optimizations:
Set different grid sizes depending on the amount of live cells, so if there's not a lot of live cells, that likely means they are in a tiny place.
When you randomize it, don't use the grid code until the user changes the data: I've personally tested randomizing it, and even after a long amount of time, it still fills most of the board (unless for a sufficiently small grid, at which point it won't help that much anymore)
If you are showing it to the screen, don't use rectangles for pixel size 1 and 2: instead set the pixels of the output. Any higher pixel size and I find it's okay to use the native rectangle-filling code. Also, preset the background so you don't have to fill the rectangles for the dead cells (not live, because live cells disappear pretty quickly)

Don't exactly know how this can be done, but I remember some of my friends had to represent this game's grid with a Quadtree for an assignment. I'm guessing it's really good for optimizing the space of the grid, since you basically only represent the occupied cells. I don't know about execution speed though.

It's a two dimensional automaton, so you can probably look up optimization techniques. Your notion seems to be about compressing the number of cells you need to check at each step. Since you only ever need to check cells that are occupied or adjacent to an occupied cell, perhaps you could keep a buffer of all such cells, updating it at each step as you process each cell.
If your field is initially empty, this will be much faster. You probably can find some balance point at which maintaining the buffer is more costly than processing all the cells.

There are table-driven solutions for this that resolve multiple cells in each table lookup. A google query should give you some examples.

I implemented this in C#:
All cells have a location, a neighbor count, a state, and access to the rule.
Put all the live cells in array B in array A.
Have all the cells in array A add 1 to the neighbor count of their neighbors.
Have all the cells in array A put themselves and their neighbors in array B.
All the cells in Array B Update according to the rule and their state.
All the cells in Array B set their neighbors to 0.
Ignores cells that don't need to be updated
4 arrays: a 2d array for the grid, an array for the live cells, and an array
for the active cells.
Can't process rule B0.
Processes cells one by one.
Cells aren't just booleans
Possible improvements:
Cells also have an "Updated" value, they are updated only if they haven't updated in the current tick, removing the need of array B as mentioned above
Instead of array B being the ones with live neighbors, array B could be the cells without, and those check for rule B0.


How to save a matrix in C++ in a non-linear way

I have to program an optimized multi-thread implementation of the Levenshtein distance problem. It can be computed using dynamic programming with a matrix, the wikipedia page on Levenshtein distance covers that well enough.
Now, I can compute diagonal elements concurrently. That is all alright.
My problem now comes with caches. Matrices in C++ are normally stored in memory row by row, correct? Well, that is not good for me, as I need 2 elements of the previous row and 1 element of the current row to compute my result, which is horrible cache-wise. The cache will hold the current row (or part of it), then I ask for the previous one, which it will probably not hold anymore.
Then for another one, I need a different part of the diagonal, so yet again, I ask for completely different rows and the cache will not have those ready for me.
Therefore, I would like to store my matrix in memory in blocks or maybe diagonals. That will result in fewer cache misses and make my implementation faster again.
How do you do that? I tried searching the internet, but I could never find anything that would show me the way. Is it possible to tell C++ how to order that type in memory?
EDIT: As some of you seem confused about the nature of my question. I want to save a matrix (does not matter if I will make it a 2D array or any other way) in a custom way into the MEMORY. Normally, a 2D array will save row after row, I need to work with diagonals therefore caches will miss a lot on the huge matrices I will work at (possibly millions of rows and columns).
I believe you may have a mis-perception of (CPU) cache.
It's true that CPU caching is linear - that is, if you access an address in memory, it will bring into the cache some previous and some successive memory locations - which is like "guessing" that subsequent accesses will involve 1-dimensional-close elements. However, this is true on the micro-level. A CPU's cache is made up of a large number of small "lines" (64 Bytes on all cache levels in recent Intel CPUs). The locality is limited to the line; different cache lines can come from completely different places in memory.
Thus, if you "need two elements of the previous row and one element of the current row" of your matrix, then the cache should work very well for you: Some of the cache will hold elements of the previous row, and some will hold elements of the current row. And when you advance to the next element, the cache overall will usually contain the matrix elements you need to access. Just make sure your order of iteration agrees with the order of progression within the cache line.
Also, in some cases you could be faced with a situation where different threads are thrashing the same cache lines due to the mapping from main memory into the cache. Without getting into details, that is something you need to think about (but again, has nothing to do with 2D vs 1D data).
Edit: As geza notes, if your matrix' lines are long, you will still be reading each memory location twice with the straightforward approach: Once as the current-line, then again as the previous-line, since each value will be evicted from the cache before it's used as a previous-line value. If you want to avoid this, you can iterate over tiles of your matrix, whose size (length x width x sizeof(element)) fits into the L1 cache (along with whatever else needs to be there). You can also consider storing your data in tiles, but I don't think that would be too useful.
Preliminary comment: "Levenshtein distance" is edit distance (under the common definition). This is a very common problem; you probably don't even need to bother writing a solution yourself. Look for existing code.
Now, finally, for a proper answer... You don't actually need to have a matrix at all, and you certainly don't need to "save" it: it's enough to keep merely a "front" of your dynamic programming matrix rather than the whole thing.
But what "front" shall you choose, and how do you advance it? I suggest you use anti-diagonals as your front, and given each anti-diagonal, compute concurrently the next anti-diagonal. Thus it'll be {(0,0)}, then {(0,1),(1,0)}, then {(0,2),(1,1),(2,0)} and so on. Each anti-diagonal requires at most two earlier anti-diagonals - and if we keep the values of each anti-diagonal consecutively in memory, then the access pattern going up the next anti-diagonal is a linear progression along the previous anti-diagonals - which is great for the cache (see my other answer).
So, to "concurrentize" the computation, give each thread a bunch of consecutive anti-diagonal elements to compute; that should do the trick. And at any time you will only keep 3 anti-diagonals in memory: the one you're working on and the two previous ones. You can cycle between three such buffers so you don't re-allocate memory all the time (but then make sure to pre-allocate buffers with the maximum anti-diagonal length).
This whole thing should work basically the same for the non-square case.
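Here is a small single-threaded Python sketch of the three-buffer anti-diagonal scheme, just to show the indexing. The question is about C++ and threads; this only illustrates the data layout, and the choice to index each buffer by the row number `i` is an assumption made for simplicity. The inner loop over `i` has no dependencies between iterations, so that is the part that could be split across threads.

```python
def levenshtein(a, b):
    """Edit distance computed anti-diagonal by anti-diagonal with three rotating buffers."""
    n, m = len(a), len(b)
    # Each buffer holds one anti-diagonal, indexed by row i (column is j = d - i).
    prev2 = [0] * (n + 1)  # anti-diagonal d - 2
    prev1 = [0] * (n + 1)  # anti-diagonal d - 1
    cur = [0] * (n + 1)    # anti-diagonal d, being filled

    for d in range(n + m + 1):
        lo, hi = max(0, d - m), min(n, d)
        for i in range(lo, hi + 1):  # independent iterations: parallelisable
            j = d - i
            if i == 0:
                cur[i] = j
            elif j == 0:
                cur[i] = i
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[i] = min(
                    prev1[i - 1] + 1,     # D[i-1][j]   (deletion)
                    prev1[i] + 1,         # D[i][j-1]   (insertion)
                    prev2[i - 1] + cost,  # D[i-1][j-1] (substitution or match)
                )
        # Rotate buffers instead of reallocating.
        prev2, prev1, cur = prev1, cur, prev2

    return prev1[n]  # the last anti-diagonal holds only D[n][m]
```

All reads walk linearly through `prev1` and `prev2`, which is the cache-friendly access pattern described above.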
I'm not absolutely sure, but I think a matrix is stored as one long array, row after row, and is mapped to a matrix with pointer arithmetic, so you always refer to the same base address and calculate the offset in memory where your value is located.
Otherwise you can implement it easily as this type and implement operator[int, int] for your matrix

Are there any good techniques for keeping nearly-sorted data nearly-sorted?

Short Version:
I'm looking for a technique to keep nearly-sorted data in nearly-sorted order over time, despite the values changing slightly.
Here's the scenario:
In the world of 3D graphics, it is often beneficial to order your objects from front-to-back before drawing. As your scene changes or your view of the scene changes, this data may require re-sorting, however it will usually be very close to the sorted order (i.e. it won't change very much between frames). It's also not critical that the data be exactly in sorted order. The worst thing that will happen is that a polygon will be rendered and then completely hidden. It's a small performance hit, but not the end of the world.
With this in mind, is it possible to sort the data once ahead of time and then apply a minimal patch to the data once per frame to ensure that the data stays mostly sorted? In this scenario, the data would be considered mostly sorted if most of the objects were in ascending order. That is, 1 object that is 10 steps away from its proper location is much better (10x better) than 10 objects that are 1 step away from their proper location.
It's also worth noting that the data could continue to be patched on a semi regular basis, as the data is typically rendered 30 times per second (or so). As long as the calculation was efficient, it could continue to be done over time until the changes stop and the list was completely sorted.
Existing Idea:
My knee jerk reaction to this problem is:
1. Apply an n log n sort to the data when it is loaded, and on large changes (which I can track pretty easily).
2. When the data starts changing slowly (e.g. when the scene is rotated), apply a single (linear) pass of some sort on the data to swap backwards neighbors and try to maintain sort order (I think this is basically shell sort - maybe there is a better algorithm to use for this single pass).
3. Keep doing a single pass of the partial sort each frame until the changes stop and the data is completely sorted.
4. Go back to step 2 and wait for more changes.
There are a variety of sorts that run in O(n) time if the input is mostly sorted, and O(n log n) if the data is not sorted. It sounds like you can use that pretty easily. Timsort is one such sort and, I believe, is the default sort now in both python and java. Smoothsort is another one that is fairly easy to implement yourself.
From your description it sounds like the sort order changes without you changing the data itself. E.g. you change the camera, so the sort order should change, even though you have not modified any polygons.
If so, you can't detect sort order changes directly when they happen. If you could, I would create buckets for the list of polygons, and resort buckets when 'enough' polygons in that bucket have been touched.
But I'm betting your system doesn't work that way. The sort is determined by the view port. In that case polygons at the front of the sort matter much more than ones at the end.
So I'd segment the poly list into fifths or something like that. Front to back, so that the first fifth is the part closest to the camera. I'd completely sort the first segment every frame. I'd divide the second segment into sub segments - say 5 again - and sort each sub segment every frame, such that every 5 frames the second fifth is completely sorted. Segment the third through fifth segments into 15 sub segments and do those every 5 frames each, such that the rest get sorted completely every 75 frames. At 60 fps you'd have the display list completely resorted a little more than once per second.
The nice thing about prioritizing the front of the list, is
1. Polys at the front are going to tend to be larger on the screen, and will fail depth test more often. Bad orders at the end of the list will more often than not just not matter.
2. The front of the list is more susceptible to sort changes due to camera changes.
Also choose those segment ranges with a little overlap, so that polygons can migrate to their correct segment in 2 sorts.
@OP: Thinking about it a little more. You are probably more concerned with having the sorting cost stay bounded - instead of exploding with scene complexity. Especially since a very complex scene should - surprisingly - be less susceptible to bad sorts (because generally the polys get smaller).
You could define a fixed amount of sorting you are willing to do per frame. Use say 50% of the budget for as much of the front of the list as you can afford, 25% of the budget to sort the next region and 25% to spend equally on the rest.
Say you budget 1000 polys sorted per frame, and you have 10000 polys in the scene. Sort the first 500 polys every frame. Sort 250 polys every tenth frame for the next region. So 501-750 on frame 1, 751-1000 on frame 2 etc. And then divide the rest of the list into 250 frame segments and sort them round robin for however many frames you need to.
This keeps the sorting cost fixed as the scene gets more and less complex, and it is easy to tune: you just adjust the sorting budget to what you can afford.
I'll suggest a solution that borrows from a number of others here. Of course we start with a full sort of the objects on initialisation.
What I would do is always perform, say, 10 linear-time runs over your objects for every frame (with early termination if you find out that your objects are already completely sorted). Each run can be, say, one pass of bubble sort with a shell sort-style gap over the whole array: for all i from 0 to n-gap-1, compare A[i] and A[i+gap], and exchange them if they are not sorted. You can use a fixed sequence of gaps, or maybe better, let it vary between frames; either way, if you do sufficiently many frames where the objects do not change, you'll have a fully sorted sequence. You could even mix different types of sub-algorithms to do your runs, as long as each iteration improves the 'sortedness'.
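A possible Python sketch of these per-frame runs. The gap sequence `(4, 2, 1, 1)` and the function names are placeholders chosen for illustration, not tuned values. Each run is one bubble pass with a gap, and a gap-1 pass that swaps nothing means the list is fully sorted, so the frame can stop early.

```python
def gap_pass(items, gap, key=lambda x: x):
    """One linear pass: compare items[i] with items[i + gap] and swap if out of order.

    Returns True if anything was swapped.
    """
    swapped = False
    for i in range(len(items) - gap):
        if key(items[i]) > key(items[i + gap]):
            items[i], items[i + gap] = items[i + gap], items[i]
            swapped = True
    return swapped


def patch_per_frame(items, gaps=(4, 2, 1, 1), key=lambda x: x):
    """Run a fixed number of linear passes; call once per frame.

    Returns True once the list is known to be fully sorted.
    """
    for gap in gaps:
        swapped = gap_pass(items, gap, key)
        if gap == 1 and not swapped:
            return True  # a clean gap-1 pass proves sortedness: stop early
    return False
```

Every swap removes at least one inversion and never creates one, so calling `patch_per_frame` on unchanged data for enough frames ends with a sorted list. For scene objects, `key` would be something like a depth accessor.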
You can add Rafael Baptista's idea of prioritizing the front of the scene easily by doing one extra run on the front segment, or choosing to divide the gap by two for the front half, or something like that.
It doesn't work out as neatly as the problem you've supposed because all you have to do is turn the camera 90 degrees and the basis for being sorted is on a different axis entirely. (X and Y axis are independent, for example -- looking down the X axis will cause the sort order to not rely on the X axis, and looking down the Y axis will cause the sort order to not rely on the Y axis.) Even a 5 degree turn can cause far away "close" (as far as Z-order is concerned) things to be suddenly "far".
Let's be honest -- generating the draw calls for the objects is normally going to take much more time than sorting them, especially if you have an optimized sorting algorithm for your scenario and your game is of modern visual complexity.
Sorting can be practically O(n), especially with histogram-based algorithms or radix-style algorithms. (Yes, radix sort applies to integers, so you'd have to scale your world coordinates to integers, but normally that's more than good enough unless you have a gigantic world.)
That being said, since you're already doing O(n) ops for everything you're drawing, resorting per frame isn't going to be a huge problem, especially with both high and low level optimization.
Another common way of addressing this issue is with a scene graph, but for your purposes it ends up essentially being a re-sort per frame. However, you can build frustum culling, shadow culling, and level of detail calculations into the scene graph traversal.
If you're looking for approximations, instead of doing a z-distance sort do a true distance sort and update the sort order more often for close by objects and less often for further objects (depending on distance the camera has traveled). This can work because if you're further away from an object, moving doesn't cause the angle to the viewer to change as often which, in turn, means the old sorting data is more likely to be valid. I'm not a fan of this because I like algorithms which allow my game to teleport across the map without any issues. (Mind you, streaming assets from disk becomes the real issue for teleporting.)
Shell sort is good for lists with few unique values and some scenarios that "need short code and do not use the call stack".
In your case, you need something called an adaptive sort, meaning an algorithm that "takes advantage of existing order in its input".
If your space is tight, you can just use Straight Insertion Sort, which is adaptive and in place.
Otherwise you can try Timsort and Smoothsort as @RunningWild suggested; they are both adaptive sort algorithms.
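For reference, a plain Python sketch of straight insertion sort that also reports how many element shifts it performed. The shift counter is added here only to make the adaptivity visible: the work is proportional to the number of out-of-order pairs, so nearly-sorted input costs close to one linear pass.

```python
def insertion_sort(items, key=lambda x: x):
    """Sort items in place; return the number of element shifts performed."""
    shifts = 0
    for i in range(1, len(items)):
        item = items[i]
        k = key(item)
        j = i - 1
        # Slide larger elements one slot to the right until item fits.
        while j >= 0 and key(items[j]) > k:
            items[j + 1] = items[j]
            j -= 1
            shifts += 1
        items[j + 1] = item
    return shifts
```

In Python itself, the built-in `list.sort` and `sorted` already use Timsort, so they are adaptive without any extra work.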

Predator-prey simulation

I'm trying to implement a model of predator-prey.
It is an agent-based model. Every few milliseconds there is a new move. On the field there are two types of creatures: predators and prey. The behavior of each of them is given by the following rules:
Prey just moves to an unoccupied cell
Every few steps, a creature leaves offspring in its old cell
Life expectancy is limited by the number of moves
A predator moves to a cell with prey. If there are no such cells, it moves to any free neighboring cell
I have a problem with the choice of prey move.
For example, I have preys in cells 5 and 9.
Each of them can move to cell 6.
How can I resolve this conflict?
Use asynchronous updating. Iterate through the prey in random order, having them decide in turn to which cell they should move.
This is a common approach in simulations. It has an additional benefit in that it eliminates limit cycles in the dynamics.
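A minimal Python sketch of asynchronous updating for the prey, assuming a bounded rectangular grid, at most one prey per cell, and moves to any of the eight neighbouring cells. Predators, breeding and ageing are left out; `move_prey` and its parameters are names made up for this example.

```python
import random


def move_prey(prey_positions, width, height, rng=random):
    """Move each prey once, in random order; positions are updated in place.

    prey_positions: list of (x, y) tuples, one per prey, all distinct.
    Because each prey sees the moves already made this tick, two prey
    can never end up in the same cell.
    """
    occupied = set(prey_positions)
    order = list(range(len(prey_positions)))
    rng.shuffle(order)  # a fresh random order every tick avoids systematic bias

    for idx in order:
        x, y = prey_positions[idx]
        free = [
            (x + dx, y + dy)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx or dy)
            and 0 <= x + dx < width
            and 0 <= y + dy < height
            and (x + dx, y + dy) not in occupied
        ]
        if free:  # a boxed-in prey simply stays put
            target = rng.choice(free)
            occupied.remove((x, y))
            occupied.add(target)
            prey_positions[idx] = target
```

In the example from the question, whichever of the two prey is drawn first may take cell 6; the other then sees it as occupied and picks a different cell.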
How long does 'moving' take? If you move one, then after the prey has moved, you move the next one, there is no conflict. The prey will simply see the space is already occupied and move elsewhere.
If moving takes time, you might say the prey keep an eye on each other and see if some other prey is trying to move somewhere (like people watch cars in traffic). Then you would change the status of the target field to 'reserved for 5' when the prey from 5 is trying to move there. Then the prey from 9 can see this and decide if they want to collide with 5 (could be interesting :P) or avoid 5.
It depends on the game logic. If prey can share a cell, simply use an indicator that shows the prey count. If you are using a 2D array to represent the current field state, you can use codes like these:
-1 - predator
n - preys
n >= 0, (n = 0 - cell is empty, n = 1 cell contains 1 prey and so on).
Otherwise (if prey can't share a cell), use a turn-based strategy. Save all your prey in an array or give a number to each prey. In that case the prey's moves are represented by a simple loop (pseudocode):
for each prey in preys
    move logic
where move logic is the algorithm describing how your prey moves.
Quite a few ways, depending on if you're deciding & moving as two steps or one, etc:
Keep track of each prey's intended moves, and prevent other prey from occupying those.
Check if another prey is already occupying the destination, and do nothing if so.
Remove one of the preys at random if they both try to occupy the same location.
Re-evaluate the move options if the destination is occupied.
There's not really a 'right' way to do it.
See this related question and my answer.
It describes a good collision detection mechanism.
Avoid O(n^2) complexity for collision detection

Shuffle and deal a deck of card with constraints

Here are the facts first.
In the game of bridge there are 4 players named North, South, East and West.
All 52 cards are dealt, with 13 cards to each player.
There is an Honour counting system: Ace=4 points, King=3 points, Queen=2 points and Jack=1 point.
I'm creating a "Card dealer" with constraints where for example you might say that the hand dealt to north has to have exactly 5 spades and between 13 to 16 Honour counting points, the rest of the hands are random.
How do I accomplish this without affecting the "randomness" in the best way and also having effective code?
I'm coding in C# and .Net but some idea in Pseudo code would be nice!
Since somebody already mentioned my Deal 3.1, I'd like to point out some of the optimizations I made in that code.
First of all, to get the most flexible constraints, I wanted to add a complete programming language to my dealer, so you could generate whole libraries of constraints with different types of evaluators and rules. I used Tcl for that language, because I was already learning it for work, and, in 1994 when Deal 0.0 was released, Tcl was the easiest language to embed inside a C application.
Second, I needed the constraint language to run fairly fast. The constraints are running deep inside the loop. Quite a lot of code in my dealer is little optimizations with lookup tables and the like.
One of the most surprising and simple optimizations was to not deal cards to a seat until a constraint is checked on that seat. For example, if you want north to match constraint A and south to match constraint B, and your constraint code is:
match constraint A to north
match constraint B to south
Then only when you get to the first line do you fill out the north hand. If it fails, you reject the complete deal. If it passes, next fill out the south hand and check its constraint. If it fails, throw out the entire deal. Otherwise, finish the deal and accept it.
I found this optimization when doing some profiling and noticing that most of the time was spent in the random number generator.
There is one fancy optimization, which can work in some instances, called "smart stacking."
deal::input smartstack south balanced hcp 20 21
This generates a "factory" for the south hand which takes some time to build but which can then very quickly fill out the one hand to match these criteria. Smart stacking can only be applied to one hand per deal at a time, because of conditional probability problems. [*]
Smart stacking takes a "shape class" - in this case, "balanced," a "holding evaluator", in this case, "hcp", and a range of values for the holding evaluator. A "holding evaluator" is any evaluator which is applied to each suit and then totaled, so hcp, controls, losers, and hcp_plus_shape, etc. are all holding evaluators.
For smartstacking to be effective, the holding evaluator needs to take a fairly limited set of values. How does smart stacking work? That might be a bit more than I have time to post here, but it's basically a huge set of tables.
One last comment: If you really only want this program for bidding practice, and not for simulations, a lot of these optimizations are probably unnecessary. That's because the very nature of practicing makes it unworthy of the time to practice bids that are extremely rare. So if you have a condition which only comes up once in a billion deals, you really might not want to worry about it. :)
[Edit: Add smart stacking details.]
Okay, there are exactly 8192=2^13 possible holdings in a suit. Group them by length and honor count:
Holdings(length,points) = { set of holdings with this length and honor count }
Holdings(3,7) = {AK2, AK3,...,AKT,AQJ}
and let
h(length,points) = |Holdings(length,points)|
Now list all shapes that match your shape condition (spades=5):
Note that the collection of all possible hand shapes has size 560, so this list is not huge.
For each shape, list the ways you can get the total honor points you are looking for by listing the honor points per suit. For example,
Shape Points per suit
5-4-4-0 10-3-0-0
5-4-4-0 10-2-1-0
5-4-4-0 10-1-2-0
5-4-4-0 10-0-3-0
5-4-4-0 9-4-0-0
Using our sets Holdings(length,points), we can compute the number of ways to get each of these rows.
For example, for the row 5-4-4-0 10-3-0-0, you'd have h(5,10) * h(4,3) * h(4,0) * h(0,0) ways.
So, pick one of these rows at random, with relative probability based on the count, and then, for each suit, choose a holding at random from the correct Holdings() set.
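The counting step can be sketched in Python as follows. This is not Deal's actual code (which is C and Tcl); it just builds the h(length, points) table by brute force over all 2^13 holdings and multiplies the per-suit counts for one row. `holdings_table` and `row_count` are names invented for this sketch.

```python
RANKS = "AKQJT98765432"
POINTS = {"A": 4, "K": 3, "Q": 2, "J": 1}


def holdings_table():
    """h[(length, points)] = number of single-suit holdings with that length and honour count."""
    h = {}
    for mask in range(1 << 13):  # every subset of the 13 ranks in a suit
        cards = [RANKS[i] for i in range(13) if (mask >> i) & 1]
        key = (len(cards), sum(POINTS.get(card, 0) for card in cards))
        h[key] = h.get(key, 0) + 1
    return h


def row_count(h, shape, points):
    """Number of hands matching one row: product of h(length, points) over the four suits."""
    total = 1
    for length, pts in zip(shape, points):
        total *= h.get((length, pts), 0)
    return total


h = holdings_table()
ways = row_count(h, (5, 4, 4, 0), (10, 3, 0, 0))
```

Rows would then be drawn with probability proportional to their `row_count`, and within the chosen row each suit's holding is picked uniformly from the matching Holdings() set.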
Obviously, the wider the range of hand shapes and points, the more rows you will need to pre-compute. A little more code, you can still do this with some cards pre-determined - if you know where the ace of spades or west's whole hand or whatever.
[*] In theory, you can solve these conditional probability issues for smart stacking with multiple hands, but the solution to the problem would make it effective only for extremely rare types of deals. That's because the number of rows in the factory's table is roughly the product of the number of rows for stacking one hand times the number of rows for stacking the other hand. Also, the h() table has to be keyed on the number of ways of dividing the n cards amongst hand 1, hand 2, and other hands, which changes the number of values from roughly 2^13 to 3^13 possible values, which is about two orders of magnitude bigger.
Since the numbers are quite small here, you could just take the heuristic approach: Randomly deal your cards, evaluate the constraints and just deal again if they are not met.
Depending on how fast your computer is, it might be enough to do this:
Repeat
    do a random deal
Until the board meets all the constraints
As with all performance questions, the thing to do is try it and see!
edit I tried it and saw:
done 1000000 hands in 12914 ms, 4424 ok
This is without giving any thought to optimisation - and it produces 342 hands per second meeting your criteria of "North has 5 spades and 13-16 honour points". I don't know the details of your application but it seems to me that this might be enough.
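The deal-and-retry loop is short enough to sketch in Python (the question asks about C#, but the structure carries over directly). The card encoding, `north_ok` and `deal_with_constraint` are assumptions of this sketch, and no timing claims are made for it.

```python
import random

SUITS = "SHDC"
RANKS = "AKQJT98765432"
POINTS = {"A": 4, "K": 3, "Q": 2, "J": 1}
DECK = [rank + suit for suit in SUITS for rank in RANKS]  # e.g. "AS" = ace of spades


def hcp(hand):
    """Honour points: A=4, K=3, Q=2, J=1."""
    return sum(POINTS.get(card[0], 0) for card in hand)


def north_ok(hand):
    """The example constraint: exactly 5 spades and 13 to 16 honour points."""
    spades = sum(1 for card in hand if card[1] == "S")
    return spades == 5 and 13 <= hcp(hand) <= 16


def deal_with_constraint(accept=north_ok, rng=random, max_tries=1_000_000):
    """Shuffle and deal until North's hand passes `accept`; the other hands stay random."""
    deck = DECK[:]
    for _ in range(max_tries):
        rng.shuffle(deck)
        north = deck[:13]
        if accept(north):
            return {"N": north, "E": deck[13:26], "S": deck[26:39], "W": deck[39:]}
    raise RuntimeError("no deal met the constraint within max_tries")
```

Because every shuffle is uniform and deals are only ever accepted or rejected whole, the accepted deals are uniformly distributed among all deals that satisfy the constraint.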
I would go for this flow, which I think does not affect the randomness (other than by pruning solutions that do not meet constraints):
List in your program all possible combinations of "valued" cards whose total Honour points count is between 13 and 16. Then pick randomly one of these combinations, removing the cards from a fresh deck.
Count how many spades you already have among the valued cards, and pick randomly among the remaining spades of the deck until you meet the count.
Now pick from the deck as many non-spade, non-valued cards as you need to complete the hand.
Finally pick the other hands among the remaining cards.
You can write a program that generates the combinations of my first point, or simply hardcode them while accounting for color symmetries to reduce the number of lines of code :)
Since you want to practise bidding, I guess you will likely be having various forms of constraints (and not just 1S opening, as I guess for this current problem) coming up in the future. Trying to come up with the optimal hand generation tailored to the constraints could be a huge time sink and not really worth the effort.
I would suggest you use rejection sampling: Generate a random deal (without any constraints) and test if it satisfies your constraints.
In order to make this feasible, I suggest you concentrate on making the random deal generation (without any constraints) as fast as you can.
To do this, map each deal to a 12-byte integer (the total number of bridge deals fits in 12 bytes). Generating a random 12-byte integer takes just three 4-byte random number calls. Since the number of deals does not exactly fill 12 bytes, you will have a bit of extra processing to do here (for example, discarding values that are out of range), but I expect it won't be much.
Richard Pavlicek has an excellent page (with algorithms) to map a deal to a number and back.
See here: http://www.rpbridge.net/7z68.htm
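To make the 12-byte claim concrete, here is a small Python check together with a sketch of drawing a uniform deal index from three 32-bit random values. The names NUM_DEALS and random_deal_index are illustrative, and decoding the index into an actual deal (the mapping described on the linked page) is left out.

```python
import math
import random

# Number of distinct bridge deals: 52! / (13!)^4
NUM_DEALS = math.factorial(52) // math.factorial(13) ** 4
BITS = NUM_DEALS.bit_length()  # number of bits needed to hold any deal index


def random_deal_index(rng):
    """Uniform integer in [0, NUM_DEALS), built from three 32-bit draws."""
    while True:
        value = 0
        for _ in range(3):
            value = (value << 32) | rng.getrandbits(32)
        if value < NUM_DEALS:  # discard out-of-range values to stay uniform
            return value
```

NUM_DEALS needs 96 bits, so it just fits in 12 bytes, and roughly two out of three 96-bit draws land in range, so the retry loop is cheap.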
I would also suggest you look at the existing bridge hand dealing software (like Deal 3.1, which is freely available) too. Deal 3.1 also supports doing double dummy analysis. Perhaps you could make it work for you without having to roll one of your own.
Hope that helps.

Mahjong - Arrange tiles to ensure at least one path to victory, regardless of layout

Regardless of the layout being used for the tiles, is there any good way to divvy out the tiles so that you can guarantee the user that, at the beginning of the game, there exists at least one path to completing the puzzle and winning the game?
Obviously, depending on the user's moves, they can cut themselves off from winning. I just want to be able to always tell the user that the puzzle is winnable if they play well.
If you randomly place tiles at the beginning of the game, it's possible that the user could make a few moves and not be able to do any more. The knowledge that a puzzle is at least solvable should make it more fun to play.
Place all the tiles in reverse (i.e. lay out the board starting in the middle, working outwards)
To tease the player further, you could do it visibly but at very high speed.
Play the game in reverse.
Randomly lay out pieces pair by pair, in places where you could slide them into the heap. You'll need a way to know where you're allowed to place pieces in order to end up with a heap that matches some preset pattern, but you'd need that anyway.
I know this is an old question, but I came across this when solving the problem myself. None of the answers here are quite perfect, and several of them have complicated caveats or will break on pathological layouts. Here is my solution:
Solve the board (forward, not backward) with unmarked tiles. Remove two free tiles at a time. Push each pair you remove onto a "matched pair" stack. Often, this is all you need to do.
If you run into a dead end (numFreeTiles == 1), just reset your generator :) I have found I usually don't hit dead ends, and so far I have seen a max retry count of 3 for the 10-or-so layouts I have tried. Once I hit 8 retries, I give up and just randomly assign the rest of the tiles. This allows me to use the same generator both for setting up the board and for the shuffle feature, even if the player screwed up and made a 100% unsolvable state.
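A minimal Python sketch of this forward-solving generator, under a simplified board model that is my assumption: every tile sits on an integer (layer, row, column) cell, is covered only by the cell directly above it, and is free when it is uncovered and open on its left or right. Real layouts use half-tile offsets. The function names are hypothetical, and on giving up this version returns None instead of assigning the remaining tiles randomly.

```python
import random


def free_tiles(board):
    """Positions that are uncovered and open on the left or the right."""
    return [
        (z, y, x) for (z, y, x) in board
        if (z + 1, y, x) not in board
        and ((z, y, x - 1) not in board or (z, y, x + 1) not in board)
    ]


def removal_order(layout, rng):
    """Remove random free pairs from an unmarked board; None on a dead end."""
    board = set(layout)
    pairs = []  # the "matched pair" stack
    while board:
        free = free_tiles(board)
        if len(free) < 2:
            return None  # dead end: a single free tile is left
        a, b = rng.sample(free, 2)
        board -= {a, b}
        pairs.append((a, b))
    return pairs


def generate(layout, faces, rng, max_retries=8):
    """Give each removed pair one face from `faces`; retry on dead ends."""
    for _ in range(max_retries):
        pairs = removal_order(layout, rng)
        if pairs is not None:
            shuffled = list(faces)
            rng.shuffle(shuffled)
            assignment = {}
            for face, (a, b) in zip(shuffled, pairs):
                assignment[a] = assignment[b] = face
            return assignment
    return None  # caller decides on a fallback
```

Here faces holds one entry per pair, so a face that exists four times in the set appears twice in the list.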
Another solution when you hit a dead end is to back out (pop off the stack, replacing tiles on the board) until you can take a different path. Take a different path by making sure you match pairs that will remove the original blocking tile.
Unfortunately, depending on the board, this may loop forever. If you end up removing a pair that resembles a "no outlet" road, where all subsequent "roads" are a dead end, and there are multiple dead ends, your algorithm will never complete. I don't know if it is possible to design a board where this would be the case, but if so, there is still a solution.
To solve that bigger problem, treat each possible board state as a node in a DAG, with each selected pair being an edge on that graph. Do a random traversal, until you find a leaf node at depth 72. Keep track of your traversal history so that you never repeat a descent.
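One way to sketch that traversal in Python is a randomized depth-first search that remembers every board state already known to be a dead end, so no descent is repeated. It uses the same simplified cell model as an assumption (a tile is covered only by the cell directly above it), and the names search and dead are illustrative.

```python
import random
from itertools import combinations


def free_tiles(board):
    """Positions that are uncovered and open on the left or the right."""
    return [
        (z, y, x) for (z, y, x) in board
        if (z + 1, y, x) not in board
        and ((z, y, x - 1) not in board or (z, y, x + 1) not in board)
    ]


def search(board, rng, dead, path):
    """Random depth-first walk over board states (frozensets of positions)."""
    if not board:
        return True  # every tile removed: `path` holds a full solution
    if board in dead:
        return False  # this state was already explored without success
    moves = list(combinations(sorted(free_tiles(board)), 2))
    rng.shuffle(moves)
    for a, b in moves:
        path.append((a, b))
        if search(board - {a, b}, rng, dead, path):
            return True
        path.pop()  # back out and try a different pair
    dead.add(board)
    return False
```

Start it with search(frozenset(layout), rng, set(), path) and an empty list for path. For a 144-tile board the recursion goes at most 72 levels deep, but the set of dead states can grow large, which is where the hybrid idea of only switching to full marking after the first dead end comes in.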
Since dead ends are more rare than first-try solutions in the layouts I have used, what immediately comes to mind is a hybrid solution. First try to solve it with minimal memory (store selected pairs on your stack). Once you've hit the first dead end, degrade to doing full marking/edge generation when visiting each node (lazy evaluation where possible).
I've done very little study of graph theory, though, so maybe there's a better solution to the DAG random traversal/search problem :)
Edit: You could actually use any of my solutions w/ generating the board in reverse, à la the Oct 13th 2008 post. You still have the same caveats, because you can still end up with dead ends. Generating a board in reverse has more complicated rules, though. E.g., you are guaranteed to fail your setup if you don't start at least SOME of your rows w/ the first piece in the middle, such as in a layout w/ 1 long row. Picking a completely random (legal) first move in a forward-solving generator is more likely to lead to a solvable board.
The only thing I've been able to come up with is to place the tiles down in matching pairs as kind of a reverse Mahjong Solitaire game. So, at any point during the tile placement, the board should look like it's in the middle of a real game (ie no tiles floating 3 layers up above other tiles).
If the tiles are placed in matching pairs in a reverse game, it should always result in at least one forward path to solve the game.
I'd love to hear other ideas.
I believe the best answer has already been pushed up: creating a set by solving it "in reverse" - i.e. starting with a blank board, then adding a pair somewhere, add another pair in a solvable position, and so on...
If you prefer a "Big Bang" approach (generating the whole set randomly at the beginning), are a very macho developer or just feel masochistic today, you could represent all the pairs you can take out from the given set and how they depend on each other via a directed graph.
From there, you'd only have to get the transitive closure of that set and determine if there's at least one path from at least one of the initial legal pairs that leads to the desired end (no tile pairs left).
Implementing this solution is left as an exercise to the reader :D
Here are the rules I used in my implementation.
When building the heap, for each fret (tile) of a pair separately, find cells (places) which satisfy all of the following:
all cells below it at lower levels are already filled
the place for the second fret does not block the first, given that the first fret has already been put on the board
both places are "at the edges" of the already built heap:
EITHER the cell has at least one neighbour on its left or right side
OR it is the first fret in its row (all cells to its right and left are free, recursively)
These rules do not guarantee that a build will always succeed: it sometimes leaves the last 2 free cells blocking each other, and the build (or at least the last few frets) has to be retried.
In practice, the "turtle" layout was built in no more than 6 retries.
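The rules above can be sketched roughly like this in Python. This is my reading of them under simplifying assumptions: cells are integer (level, row, column) triples, each row of the layout is contiguous, and a fret is covered only by the cell directly above it. All function names are made up for illustration.

```python
import random


def is_free(board, pos):
    """Forward-game test: nothing on top, and the left or right side is open."""
    z, y, x = pos
    return ((z + 1, y, x) not in board
            and ((z, y, x - 1) not in board or (z, y, x + 1) not in board))


def placeable(board, pos):
    """Can an empty cell take a fret under the rules listed above?"""
    z, y, x = pos
    if z > 0 and (z - 1, y, x) not in board:
        return False  # the cell underneath must be filled first
    if not any(pz == z and py == y for pz, py, _ in board):
        return True  # first fret in its row
    left, right = (z, y, x - 1) in board, (z, y, x + 1) in board
    return left != right  # extend the row at one end, never plug a gap


def build_in_reverse(layout, rng):
    """Fill the layout pair by pair; None if no legal place is left for a fret."""
    cells, board, pairs = set(layout), set(), []
    while board != cells:
        firsts = sorted(p for p in cells - board if placeable(board, p))
        if not firsts:
            return None
        a = rng.choice(firsts)
        board.add(a)
        # The second fret must not block the first one.
        seconds = sorted(p for p in cells - board
                         if placeable(board, p) and is_free(board | {p}, a))
        if not seconds:
            return None
        b = rng.choice(seconds)
        board.add(b)
        pairs.append((a, b))
    return pairs


def build_with_retries(layout, rng, max_retries=50):
    """Retry whole builds until one succeeds, or give up with None."""
    for _ in range(max_retries):
        pairs = build_in_reverse(layout, rng)
        if pairs is not None:
            return pairs
    return None
```

Removing the returned pairs in reverse order is a valid way to clear the board, so giving both cells of each pair the same face yields a solvable game.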
Most existing games seem to restrict the first ("first in row") frets to somewhere in the middle. This yields more convenient configurations, with no frets at the edges of very long rows staying on the board until the player's last moves. However, "middle" is different for different configurations.
Good luck :)
If you've found an algorithm that builds a solvable heap in one pass, please let me know.
You have 144 tiles in the game, and each of the 144 tiles has a block list.
(the top tile on a stack has an empty block list)
All valid moves require that their "current_vertical_block_list" be empty. This can be a 144x144 matrix, so about 20k of memory, plus a LEFT and a RIGHT block list, also 20k each.
Generate a valid move table from (remaining_tiles) AND ((empty CURRENT VERTICAL BLOCK LIST) AND ((empty CURRENT LEFT BLOCK LIST) OR (empty CURRENT RIGHT BLOCK LIST)))
Pick 2 random tiles from the valid move table, record them
Update the (current tables Vert, left and right), record the Tiles removed to a stack
Now we have a list of moves that constitute a valid game. Assign matching tile types to each of the 72 moves.
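As a sketch of how the block lists and the valid move table might look in Python, each row of the 144x144 matrix can be held as an integer bitmask. This assumes tiles on integer (layer, row, column) cells that are blocked only by their direct neighbours, and it masks fixed block lists with the set of remaining tiles instead of updating "current" tables after every move. The names are illustrative.

```python
def build_block_lists(layout):
    """Per tile index, bitmasks of the tiles blocking it from above/left/right."""
    index = {pos: i for i, pos in enumerate(sorted(layout))}
    above, left, right = [0] * len(index), [0] * len(index), [0] * len(index)
    for (z, y, x), i in index.items():
        for table, neighbour in ((above, (z + 1, y, x)),
                                 (left, (z, y, x - 1)),
                                 (right, (z, y, x + 1))):
            if neighbour in index:
                table[i] |= 1 << index[neighbour]
    return index, above, left, right


def valid_moves(remaining, above, left, right):
    """Tiles still on the board with nothing above and at least one side open."""
    return [
        i for i in range(len(above))
        if remaining >> i & 1
        and not above[i] & remaining
        and (not left[i] & remaining or not right[i] & remaining)
    ]
```

Removing tiles a and b is then just clearing two bits in remaining, after which valid_moves gives the next table to pick a random pair from.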
For challenging games, track when each tile becomes available. Find sets of four that are (early early early late) and (late late late early); since the board is blank, you can split them into 1 EE, 1 LL and 2 LE blocks. Of the 2 LE blocks, find an EARLY that blocks ANY other EARLY (except one right-blocking a left-side piece).
Once you've got a valid game, play around with the ordering.
Solitaire? Just a guess, but I would assume that your computer would need to beat the game (or come close to it) to determine this.
Another option might be to have several preset layouts (that allow winning) mixed in with your current level.
To some degree you could try making sure that one of the 4 tiles is no more than X layers below another X.
Most games I see have the shuffle command for when someone gets stuck.
I would try a mix of things and see what works best.
