Why is branch prediction quite accurate? - cpu

Why is branch prediction accurate? Can we generally think of it at a high level in terms of how certain branches of our code execute 99% of time, while the rest is special cases and exception handling?
My question may be a little vague, but I am only interested in a high-level view of this. Let me give you an example.
Say you have a function with a parameter
void execute(Input param) {
    assertNotEmpty(param);
    (...)
}
I execute my function only if the parameter isn't empty, and 99% of the time it will indeed be non-empty. Can I then think of, say, a neural-network-based branch predictor as having seen this instruction flow countless times (such assertions are quite common), so that it simply learns that the parameter is non-empty most of the time and predicts the branch accordingly?
Can we then think of our code this way: the cleaner and more predictable it is, or even the more conventional, the easier we make things for the branch predictor?
Thanks!

A short history of how branches are predicted:
When Great-Granny was programming
there was no prediction and no pre-fetch. Soon she started pre-fetching the next instruction while executing the current one. Most of the time this was correct and improved the clocks per instruction by one, and otherwise nothing was lost. This already had an average misprediction rate of only 34% (59%-9%, H&P AQA p.81).
When Granny was programming
The problem was that CPUs were getting faster and added a Decode stage to the pipeline, making it Fetch -> Decode -> Execute -> Write back. With 5 instructions between branches, 2 fetches were lost every 5 instructions whenever a backward branch was taken or a forward branch was not taken. A quick study showed that most backward conditional branches were loops and were taken, while most forward branches were not taken, as they mostly guarded bad cases. With profiling, the misprediction rate comes down to 3%-24%.
The advent of the dynamic branch predictor with the saturating counter
made life easier for the programmer. From the observation that most branches do what they did last time, a table of counters indexed by the low bits of the branch address tells whether the branch is predicted taken or not, and the Branch Target Buffer provides the address to fetch. This local predictor lowers the misprediction rate to 1%-18%.
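As a rough illustration of the saturating-counter idea, here is a minimal Python sketch. The class name, the table size, the initial counter value and the branch address 0x400123 are all assumptions made for the example, not details of any real CPU. It simulates a branch that is taken 99 times and then not taken once, as in the question.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low branch-address bits."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Counter values 0..3: 0-1 predict "not taken", 2-3 predict "taken".
        # Starting at 1 (weakly not taken) is an arbitrary choice.
        self.table = [1] * (1 << index_bits)

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0


def misprediction_rate(predictor, pc, outcomes):
    """Predict, compare with the real outcome, then train the counter."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses / len(outcomes)


# Hypothetical branch: taken 99 times, then not taken once, repeated 100 times.
outcomes = ([True] * 99 + [False]) * 100
rate = misprediction_rate(TwoBitPredictor(), 0x400123, outcomes)
print(f"misprediction rate: {rate:.4f}")
```

In this toy run the counter stays saturated towards "taken", so the predictor misses the rare not-taken case each time but recovers immediately afterwards, because a single miss only moves the counter one step.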
This is all well and good, but some branches depend on how other, earlier branches behaved. So if we keep a history of the last H branches, recording taken or not taken as 1 and 0, we get 2^H different predictors selected by the history. In practice the history bits are XORed with the low bits of the branch address, using the same array as in the previous version.
The PRO of this is that the predictor can quickly learn patterns; the CON is that if there is no pattern, the branch will overwrite the bits of other branches. The PRO outweighs the CON, as locality matters more than branches that are not in the current (inner) loop. This global predictor brings the misprediction rate down to 1%-11%.
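The XOR-with-history scheme can be sketched in the same style. This is a simplified model under assumed parameters (10 index bits, 8 history bits, a hypothetical branch address), meant only to show how a history-indexed table can learn a pattern that a single counter cannot.

```python
class GsharePredictor:
    """2-bit counters indexed by (branch address XOR global history)."""

    def __init__(self, index_bits=10, history_bits=8):
        self.mask = (1 << index_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.history = 0                      # last outcomes, newest in bit 0
        self.table = [1] * (1 << index_bits)  # 0-1: not taken, 2-3: taken

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the new outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask


def run(predictor, pc, outcomes):
    """Return one bool per outcome: True where the prediction was wrong."""
    misses = []
    for taken in outcomes:
        misses.append(predictor.predict(pc) != taken)
        predictor.update(pc, taken)
    return misses


# A strictly alternating branch: taken, not taken, taken, ...
alternating = [True, False] * 500
misses = run(GsharePredictor(), 0x400123, alternating)
print("misses in the last 100 predictions:", sum(misses[-100:]))
```

Once the history register has filled up, each of the two recurring history patterns selects its own counter, so the alternating branch is predicted correctly even though it is taken only half the time.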
That is great, but in some cases the local predictor beats the global predictor, so we want both. XORing the local branch history with the address improves the local prediction, making it a two-level predictor as well, just with local instead of global branch history. Adding a third saturating counter for each branch that tracks which predictor was right lets us select between them. This tournament predictor improves the misprediction rate by around 1 percentage point compared with the global predictor.
Now to your case: one branch in 100 goes the other direction.
Let's examine the local two-level predictor. When we get to the one odd case, the last H executions of this branch have all gone in the same direction, let's say taken, making the history all 1s, so the branch predictor will have selected a single entry in the local predictor table, and that entry will be saturated to taken. This means it will always cause a mispredict on the one odd case, and the next call, where the branch is taken again, will most likely be predicted correctly (barring aliasing in the branch table entry). So the local branch predictor can't help here, as a 100-bit history would require a predictor with 2^100 entries.
Maybe the global predictor catches the case, then. In the last 99 cases the branch was taken, so the predictor entries for those 99 cases will have been updated under the different behaviours of the last H branches, moving them towards predicting taken. So if the last H branches behave independently of the current branch, all the entries in the global branch prediction table will predict taken, and you will get a mispredict.
But if a combination of previous branches, say the 3rd, 7th and 12th, behaved such that the right combination of taken/not taken foreshadowed the opposite behaviour, the prediction entry for that combination would correctly predict the behaviour of the branch. The problem here is that if this entry is updated only seldom over the run of the program, and other branches alias to it with their own behaviour, then it might fail to predict anyway.
Let's assume the global predictor actually predicts the right outcome based on the pattern of previous branches. Then you will most likely be misled by the tournament predictor, which says the local predictor is "always" right, and the local predictor will always mispredict your case.
Note 1: The "always" should be taken with a small grain of salt, as other branches might pollute your branch table entries by aliasing to the same entry. The designers have tried to make this less likely by having 8K different entries and creatively rearranging the low address bits of the branch.
Note 2: Other schemes might be able to solve this, but it is unlikely, as it's 1 in 100.

There are a couple of reasons that allow us to develop good branch predictors:
Bi-modal distribution - the outcome of branches is often bimodally distributed, i.e. an individual branch is often highly biased towards taken or untaken. If the distribution of most branches were uniform, it would be impossible to devise a good prediction algorithm.
Dependency between branches - in real-world programs, there is a significant amount of dependency between distinct branches, that is, the outcome of one branch affects the outcome of another branch. For example:
if (var1 == 3) // b1
    var1 = 0;
if (var2 == 3) // b2
    var2 = 0;
if (var1 != var2) // b3
    ...
The outcome of branch b3 here depends on the outcome of branches b1 and b2. If both b1 and b2 are untaken (that is their conditions evaluate to true and var1 and var2 are assigned 0) then branch b3 will be taken. The predictor that looks at a single branch only has no way to capture this behavior. Algorithms that examine this inter-branch behavior are called two-level predictors.
You didn't ask for any particular algorithms so I won't describe any of them, but I'll mention the 2-bit prediction buffer scheme that works reasonably well and is quite simple to implement (essentially, one keeps track of outcomes of a particular branch in a cache and makes decision based on the current state in the cache). This scheme was implemented in the MIPS R10000 processor and the results showed prediction accuracy of ~90%.
I'm not sure about application of NNs to branch-prediction - it does seem possible to design an algorithm based on NNs. However, I believe it wouldn't have any practical usage as: a) it would be too complex to implement in hardware (so it'd take too many gates and introduce a lot of delay); b) it wouldn't have significant improvement on predictor's performance compared to traditional algorithms that are much easier to implement.

Many languages provide mechanisms to tell the compiler which branch outcome is the most expected. This helps the compiler organise the code to maximise correct branch predictions. One example is GCC's __builtin_expect, often wrapped in likely/unlikely macros.

Related

Statistical Analysis of Runtime Measurements of a Parallel Algorithm

Problem Introduction
Assume we have a parallel algorithm f(<params>) running on P cores, where
<params>: Parameters for algorithm
P: Number of cores it runs on (i.e. threads, cores, processors)
We further assume that our implementation actually consists of three parts:
A - Distribution: We distribute the input to all processors
B - Run the algorithm: We run f(<params>) ("on each processor")
C - Collection: We collect the computed data from all processors
After fixing <params> and P like input size, number of processors etc. the algorithm itself is deterministic i.e. we can write down an exact cDAG for it.
I'm now trying to answer the question: "For a given set of parameters, what is the execution time for a given system?"
With "given system" I mean e.g. "my computer" or "the university super computer" because obviously, the runtime does depend on the system it runs on and obviously the system itself does introduce non-determinism because you never really know the state of the system.
So in short: While the algorithm might be deterministic, runtime measurements aren't. (but e.g. communication measurements would be deterministic.) So we need to do a proper statistical analysis. And this is where I'm unsure.
Measuring Runtime: Basic idea
We are interested in how long part "B - Run the Algorithm" takes. Since the algorithm actually runs on P cores, we'll make a measurement on each core and so get P values; let's call those P values P_measurements. Some cores might finish before others, so which value represents the runtime of the whole algorithm? I think a good choice is to simply take the value of the core that took the longest, i.e. max(P_measurements).
Now there are two things that need consideration here:
We have to repeat the measurement n times since it's a non-deterministic value
Once we have those n*P values, we need to know how to properly summarize them.
(An additional concern would be how to communicate those results in the end, but that's not part of this question.)
Measuring Runtime: Statistical Analysis
So here's what I'd do and this is also the part where I'm very unsure.
1. We measure the runtime of f(<params>) on each of the P cores. We get P_measurements.
2. We take max(P_measurements).
3. We repeat 1. & 2. n times and end up with maxes, where maxes is a list of the n values max(P_measurements).
4. We check whether maxes is normally distributed using a Q-Q plot. If not, we normalize. We do expect it to be right-skewed.
5. Now we take the median of maxes. (If we normalized, we use the normalized values.)
6. We compute the standard deviation, the population mean and the 95% confidence interval.
7. We might want to say that all values have an error of e.g. 5%, so we check whether all the values lie within +-5% of the population mean, i.e. the confidence interval should be rather "thin".
8. We got ourselves some nice runtime measurement.
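The summarization steps above could be sketched as follows, using only Python's standard library. The function name, the dictionary keys and the sample timings are made up for illustration. The interval here uses a normal approximation, which is an assumption that differs from the t-distribution mentioned in the question; for small n the t quantile (for example from scipy.stats.t.ppf) would give a wider interval. The Q-Q plot check and any normalization are not shown.

```python
import statistics


def summarize_runs(runs, confidence=0.95):
    """runs: list of n repetitions, each a list of the P per-core times."""
    # Steps 1-3: one value per repetition, the slowest core.
    maxes = [max(p_measurements) for p_measurements in runs]
    n = len(maxes)
    mean = statistics.mean(maxes)
    stdev = statistics.stdev(maxes)  # sample standard deviation
    # Normal-approximation quantile (assumption; see the note above).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev / n ** 0.5
    return {
        "maxes": maxes,
        "median": statistics.median(maxes),
        "mean": mean,
        "stdev": stdev,
        "ci": (mean - half_width, mean + half_width),
    }


# Made-up timings in seconds: n = 3 repetitions on P = 4 cores.
runs = [
    [1.02, 1.10, 0.98, 1.05],
    [1.01, 1.07, 1.12, 1.03],
    [0.99, 1.04, 1.08, 1.20],
]
summary = summarize_runs(runs)
print(summary["median"], summary["ci"])
```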
Clarifications:
Step 4. was necessary because computing the CI in step 6 uses the t-distribution and because later on I want to measure a different implementation of the same algorithm. So I'll have to compare two values and for that I need to do e.g. a t-test. So I need to make sure, the prerequisites for the t-test are met, which are: iid & normally distributed. Iid is assumed.
Question
I am very unsure whether what I did is statistically sound, especially steps 1-3. I'm not sure if I can do that kind of summarization (just taking the max) here. I know that we might get an outlier that's "especially" high, but since we only measure on supercomputers we can assume the noise to be low, and since we take the median in the end, any outliers shouldn't have a big impact.
I hope for good input since it's a rather complex topic and I'm very interested in doing it right. I mostly followed the following paper, which I can recommend: http://spcl.inf.ethz.ch/Teaching/2020-dphpc/hoefler-scientific-benchmarking.pdf
But even with the paper, I'm not used to doing statistical analysis and thus would just like to get some input from people who actually know this stuff. :)

Logoot CRDT: interleaving of data on concurrent edits to the same spot?

I want to implement Logoot for eventually-convergent P2P text editing and I've run into a bit of a problem.
My understanding of Logoot is that the intervals between objects (lines of text in the original paper, but could be characters or words) can be divided infinitely on account of an unbounded identifier. This means that the position of an object is determined not by its neighbors as in WOOT (which would require tombstones) but by a fixed numerical point along the length of the string. Combined with a unique site identifier, this also gives us a total order and enables eventual convergence.
However... doesn't this cause a problem when concurrent edits are made to the same spot? If two disconnected clients start writing new sentences at the same cursor position and then merge, their sentences have a good chance of interleaving.
Below is a whiteboard example of what I'm talking about:
As you can see, both site B and site C divide the interval between "I" and "conquered" according to the rules of Logoot, giving us random points between the positions of (20,A) and (25,A). But nothing orders these points relative to each other, causing them to mix when merged. Meanwhile, neighbor-based algorithms can account for this issue since the causality chain of each object is preserved.
The above is a baby example, but in the more general case, imagine if two users wanted to insert a different sentence between two existing sentences. If one of the users happened to be offline, they shouldn't come back to a garbled mess! Clearly, to preserve intent, one sentence should follow the other.
Am I missing something in my reading of the paper, or is this an inherent downside to Logoot?
(Also, why is there a recorded clock value that's seemingly unused in the algorithm? The paper even points out that each object's identifier is necessarily unique without the clock.)
You're correct, this is a real anomaly in Logoot and LSEQ. Whether or not it constitutes an intention violation depends on what your definition of intention is. An extension to the definition requiring that contiguous sequences remain contiguous unless they are split by a causally subsequent operation would make intuitive sense.
The clock is unnecessary. Most likely the authors used the (site, clock) pair or Lamport timestamp as their UUIDs out of convention. One site can never create two identical positions, so clocks will never need to be compared. (Assuming messages are received from a site in order, which is required for other aspects of Logoot/LSEQ too.)
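To make the interleaving anomaly concrete, here is a small Python sketch. It is a deliberate simplification and not the allocation rule from the Logoot paper: positions are exact fractions and each new position is the midpoint of the gap, where Logoot would pick a random point. The inserted words and the helper name insert_run are hypothetical; only the endpoints (20, A) and (25, A) come from the question.

```python
from fractions import Fraction


def insert_run(left, right, site, words):
    """Allocate one position per word, each between the previous word and `right`."""
    atoms = []
    for word in words:
        pos = (left + right) / 2  # simplification: midpoint, not a random point
        atoms.append(((pos, site), word))
        left = pos  # the next word goes after the one just inserted
    return atoms


left, right = Fraction(20), Fraction(25)
doc = [((left, "A"), "I"), ((right, "A"), "conquered")]

# Two disconnected sites each type two words into the same gap.
site_b = insert_run(left, right, "B", ["came", "and"])
site_c = insert_run(left, right, "C", ["saw", "then"])

# Merging is just sorting by (position, site), the total order on identifiers.
merged = sorted(doc + site_b + site_c)
text = [word for _, word in merged]
print(" ".join(text))
```

Each site's own words stay in order, but nothing keeps one site's run contiguous relative to the other's, so the two runs interleave after the merge.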

How to run MCTS on a highly non-deterministic system?

I'm trying to implement an MCTS algorithm for the AI of a small game. The game is an RPG simulation. The AI should decide what moves to play in battle. It's a turn-based battle (FF6-7 style). There is no movement involved.
I won't go into details, but we can safely assume that we know with certainty which move the player will choose in any given situation when it is their turn to play.
Games end when one party has no unit left alive (4v4). That can take any number of turns (and may also never end). There are a lot of RNG elements in the damage computation and skill processing (attacks can hit or miss, crit or not, there are lots of effects that can "proc" or not, buffs can have a % chance to happen, etc.).
Units have around 6 skills each to give an idea of the branching factor.
I've built a preliminary version of the MCTS that gives poor results for now. I'm having trouble with a few things:
One of my main issues is how to handle the non-deterministic outcomes of my moves. I've read a few papers about this but I'm still in the dark.
Some suggest determinizing the game information and running an MCTS tree on that, repeating the process N times to cover a broad range of possible game states, and using that information to make the final decision. In the end, it multiplies the computing time by a huge factor, since we have to compute N MCTS trees instead of one. I cannot rely on that, since over the course of a fight I've got thousands of RNG elements: computing 2^1000 MCTS trees when I already struggle with one is not an option :)
I had the idea of adding X children for the same move, but it does not seem to lead to a good answer either. It smooths the RNG curve a bit, but can shift it in the opposite direction if the value of X is too big or too small compared to the percentage of a particular RNG. And since I've got multiple RNGs per move (hit chance, crit chance, percentage to proc something, etc.), I cannot find a decent value of X that satisfies every case. It's more of a bad band-aid than anything else.
Likewise, adding 1 node per RNG tuple {hit or miss, crit or not, proc1 or not, proc2 or not, etc.} for each move should cover every possible situation, but it has some heavy drawbacks: with only 5 RNG mechanisms that means 2^5 nodes to consider for each move, which is way too much to compute. If we manage to create them all, we could assign each a probability (derived from the probability of each RNG element in the node's tuple) and use that probability during the selection phase. This should work overall but be really hard on the CPU :/
I also cannot "merge" them into one single node, since I've got no way of accurately averaging the player's/monsters' stat values across two different game states, and averaging the move's result during the move processing itself is doable but requires a lot of simplifications that are a pain to code and will hurt accuracy really fast anyway.
Do you have any ideas how to approach this problem?
Some other aspects of the algorithm are eluding me:
I cannot do a full playout until an end state because A) it would take a lot of my computing time and B) some battles may never end (by design). I've got 2 solutions (that I can mix):
- Do a random playout for X turns
- Use an evaluation function to try and score the situation.
Even if I consider only health points, I'm failing to find a good evaluation function that returns a reliable value for a given situation (between 1 and 4 units for the player and the same for the monsters; I know their current/max HP values). What bothers me is that the fights can vary greatly in length and disparity of power. That means that sometimes a 0.01% change in HP matters (in a long fight against a boss, for example) and sometimes it is just insignificant (when the player farms a zone that is low-level compared to them).
The disparity of power and HP variance between fights means that my bias parameter in the UCB selection process is hard to fix. I'm currently using something very low, like 0.03. Anything > 0.1 and the exploration factor is so high that my tree is constructed depth by depth :/
For now I'm also using a biased way to choose moves during my simulation phase: it selects the move that the player would choose in the situation and random ones for the AI, leading to a simulation biased in favor of the player. I've tried using a purely random one for both, but it seems to give worse results. Do you think having a biased simulation phase works against the purpose of the algorithm? I'm inclined to think it would just give a pessimistic view to the AI and would not impact the end result too much. Maybe I'm wrong, though.
Any help is welcome :)
I think this question is way too broad for StackOverflow, but I'll give you some thoughts:
Handling stochastic or probabilistic outcomes in tree searches is usually called expectimax search. You can find a good summary and pseudo-code for Expectimax Approximation with Monte-Carlo Tree Search in chapter 4, but I would recommend using a normal minimax tree search with the expectimax extension. There are a few modifications like Star1, Star2 and Star2.5 for a better runtime (similar to alpha-beta pruning).
It boils down to having not only decision nodes, but also chance nodes. The probability of each possible outcome should be known, and the value of each outcome is multiplied by its probability; summing these products gives the expected value of the chance node.
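A chance node of this kind might be sketched as below. The event names, probabilities and damage numbers are invented for the example, and the RNG events are assumed to be independent, which may not hold for a real skill system.

```python
from itertools import product


def chance_outcomes(rng_events):
    """Yield (probability, outcome) for every combination of independent events.

    rng_events maps an event name to its probability of happening.
    """
    names = list(rng_events)
    for flags in product([True, False], repeat=len(names)):
        prob = 1.0
        for name, happened in zip(names, flags):
            p = rng_events[name]
            prob *= p if happened else 1.0 - p
        yield prob, dict(zip(names, flags))


def expected_value(rng_events, evaluate):
    """Value of a chance node: probability-weighted sum over its children."""
    return sum(prob * evaluate(outcome)
               for prob, outcome in chance_outcomes(rng_events))


# Hypothetical attack: 80% to hit, 25% to crit; 10 damage, doubled on a crit.
def damage(outcome):
    if not outcome["hit"]:
        return 0
    return 20 if outcome["crit"] else 10


ev = expected_value({"hit": 0.8, "crit": 0.25}, damage)
print(f"expected damage: {ev:.2f}")
```

In a full search, evaluate would recurse into the child decision node instead of returning an immediate damage number.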
2^5 nodes per move is high, but not impossibly high, especially for a low number of moves and a shallow search. Even a search of depth 1-3 should give you some results. In my Tetris AI, there are ~30 different possible moves to consider and I calculate the result of the three following pieces (for each possibility) to select my move. This is done in 2 seconds. I'm sure you have much more time for calculation, since you're waiting for user input.
If the player's move is obvious to you, shouldn't it also be obvious to your AI?
You don't need to consider a single value (HP); you can have several factors that are weighted differently to calculate the expected value. Coming back to my Tetris AI, there are 7 factors (bumpiness, highest piece, number of holes, ...) that are calculated, weighted and added together. To get the weights, you could use different methods; I used a genetic algorithm to find the combination of weights that resulted in the most lines cleared.

What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:
ETC = currTime + (totalSize - sizeDone) / currAvg
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of prediction is important, the way to go about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
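A linear combination of the two estimators could look like the sketch below. The function names are hypothetical and the weight of 0.5 is a placeholder: in practice it would be fitted on recorded transfers as described above. Here recent_speed is assumed to be the average speed over the last n seconds, in the same size units per second.

```python
def remaining_overall(elapsed, size_done, total_size):
    """Remaining seconds, assuming the average speed so far continues."""
    return elapsed * (total_size - size_done) / size_done


def remaining_recent(recent_speed, size_done, total_size):
    """Remaining seconds, assuming the recent speed continues."""
    return (total_size - size_done) / recent_speed


def remaining_blended(elapsed, size_done, total_size, recent_speed, weight=0.5):
    """Linear combination of both estimates; `weight` is a placeholder value."""
    overall = remaining_overall(elapsed, size_done, total_size)
    recent = remaining_recent(recent_speed, size_done, total_size)
    return weight * overall + (1.0 - weight) * recent


# Made-up numbers: 50 of 200 units copied in 100 s, recently 1 unit per second.
estimate = remaining_blended(elapsed=100, size_done=50, total_size=200,
                             recent_speed=1.0)
print(f"about {estimate:.0f} s remaining")
```

Adding the result to the current time gives the ETC in the sense used in the question.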
An excellent resource for studying statistical learning methods is The Elements of
Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing
noise (random variations) and other inaccuracies, and produce values
that tend to be closer to the true values of the measurements and
their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
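For a feel of what such a filter involves, here is a minimal scalar Kalman filter in Python that tracks a slowly varying copy speed. The class name and the noise variances are assumptions chosen for the example; this simple version keeps them fixed, so they would have to be tuned (or estimated from data) for a real transfer.

```python
class ScalarKalman:
    """One-dimensional Kalman filter for a value assumed to drift slowly."""

    def __init__(self, initial, initial_var=1.0,
                 process_var=1e-3, measurement_var=1.0):
        self.x = initial          # current estimate (e.g. speed in MB/s)
        self.p = initial_var      # variance of the estimate
        self.q = process_var      # how much the true value may drift per step
        self.r = measurement_var  # how noisy a single measurement is

    def update(self, z):
        # Predict: the state is modelled as constant, so only uncertainty grows.
        self.p += self.q
        # Correct: blend prediction and measurement using the Kalman gain.
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= 1.0 - k
        return self.x


# Made-up measurements jumping between 40 and 60 MB/s.
kf = ScalarKalman(initial=0.0)
for i in range(500):
    speed = kf.update(40.0 if i % 2 == 0 else 60.0)

remaining_mb = 1200.0  # hypothetical amount left to copy
print(f"speed ~ {speed:.1f} MB/s, ~ {remaining_mb / speed:.0f} s remaining")
```

The ratio of process_var to measurement_var controls how quickly the estimate follows a real change in speed versus how strongly it smooths out noise.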
Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and then it is hard to distinguish between meaningful ETCs and meaningless ETCs, especially for inexperienced users.
I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Apart from the statistical approach, one simple way to get a good estimate of the current speed while smoothing out noise or spikes is to use a weighted average.
You already experimented with the sliding window; the idea here is to take a fairly large sliding window, but instead of a plain average, to give more weight to more recent measurements, since they are more indicative of the trend (a bit like a derivative).
Example: Suppose you have 10 previous windows (most recent x0, least recent x9), then you could compute the speed:
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + 1 * x9) / (55 * window-time)
When you have a good assessment of the likely speed, you are close to getting a good estimated time.
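The weighted window can be written as a short Python function. It assumes each x is the amount copied during one window and normalizes by the sum of the weights (55 for ten windows) times the window length, so that a constant copy rate yields exactly that rate. The function name and sample values are made up.

```python
def weighted_speed(window_amounts, window_time):
    """Linearly weighted average speed.

    window_amounts[0] is the amount copied in the most recent window,
    window_amounts[-1] the oldest; weights run from n down to 1.
    """
    n = len(window_amounts)
    weights = range(n, 0, -1)
    total_weight = n * (n + 1) // 2  # 55 for n = 10
    weighted = sum(w * x for w, x in zip(weights, window_amounts))
    return weighted / (total_weight * window_time)


# Ten 2-second windows; the copy sped up in the three most recent ones.
amounts = [150, 150, 150, 100, 100, 100, 100, 100, 100, 100]
print(f"{weighted_speed(amounts, 2.0):.1f} units/s")
```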
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have demonstrated that users reacted very badly to slow-down and very positively to speed-up. Therefore, a good progress bar / estimated time should be conservative in the estimates presented (reserving time for a potential slow-down) at first.
A simple way to get that is to have a factor that is a percentage of the completion, that you use to tweak the estimated remaining time. For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, the cubic function x^3 produces a nice speed-up toward the completion time. Other functions could use an exponential form such as (e^x - 1) / (e - 1), etc.
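As a small sketch of this presentation trick, the cubic factor can be plugged into the formula as follows. The function names are hypothetical, and whether such a strong understatement early on is desirable is a design decision to test with real users.

```python
def cubic_factor(x):
    """Maps [0, 1] onto [0, 1], stays at or below x, and reaches 1 at x = 1."""
    return x ** 3


def presented_completion(real_completion, factor=cubic_factor):
    """Completion shown to the user: conservative early, catching up at the end."""
    return real_completion * factor(real_completion)


for real in (0.0, 0.4, 0.8, 1.0):
    print(f"real {real:.0%} -> shown {presented_completion(real):.1%}")
```

Because the displayed value lags early on and converges only near the end, the bar appears to accelerate as the copy finishes.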

Optimizing Conway's 'Game of Life'

To experiment, I've (long ago) implemented Conway's Game of Life (and I'm aware of this related question!).
My implementation worked by keeping 2 arrays of booleans, representing the 'last state', and the 'state being updated' (the 2 arrays being swapped at each iteration). While this is reasonably fast, I've often wondered about how to optimize this.
One idea, for example, would be to precompute at iteration N the zones that could be modified at iteration (N+1) (so that if a cell does not belong to such a zone, it won't even be considered for modification at iteration (N+1)). I'm aware that this is very vague, and I never took time to go into the details...
Do you have any ideas (or experience!) of how to go about optimizing (for speed) Game of Life iterations?
I am going to quote my answer from the other question, because the chapters I mention have some very interesting and fine-tuned solutions. Some of the implementation details are in C and/or assembly, yes, but for the most part the algorithms can work in any language:
Chapters 17 and 18 of
Michael Abrash's Graphics
Programmer's Black Book are one of
the most interesting reads I have ever
had. It is a lesson in thinking
outside the box. The whole book is
great really, but the final optimized
solutions to the Game of Life are
incredible bits of programming.
There are some super-fast implementations that (from memory) represent cells of 8 or more adjacent squares as bit patterns and use that as an index into a large array of precalculated values to determine in a single machine instruction if a cell is live or dead.
Check out here:
http://dotat.at/prog/life/life.html
Also XLife:
http://linux.maruhn.com/sec/xlife.html
You should look into Hashlife, the ultimate optimization. It uses the quadtree approach that skinp mentioned.
As mentioned in Abrash's Black Book, one of the simplest and most straightforward ways to get a huge speedup is to keep a change list.
Instead of iterating through the entire cell grid each time, keep a copy of all the cells that you change.
This will narrow down the work you have to do on each iteration.
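One way to sketch the change-list idea in Python is shown below. It is not Abrash's implementation: it stores live cells as a set of (x, y) tuples on an unbounded grid, which is an assumption made to keep the example short. A cell can only change state if it or one of its neighbours changed in the previous generation, so only those cells are re-evaluated.

```python
NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
              if (dx, dy) != (0, 0)]


def live_neighbours(live, cell):
    x, y = cell
    return sum((x + dx, y + dy) in live for dx, dy in NEIGHBOURS)


def step(live, changed):
    """Advance one generation, re-evaluating only cells near the last changes.

    live:    set of (x, y) cells that are currently alive
    changed: set of cells whose state flipped in the previous generation
    Returns (new_live, new_changed).
    """
    candidates = set(changed)
    for x, y in changed:
        candidates.update((x + dx, y + dy) for dx, dy in NEIGHBOURS)

    births, deaths = set(), set()
    for cell in candidates:
        n = live_neighbours(live, cell)
        if cell in live:
            if n not in (2, 3):
                deaths.add(cell)
        elif n == 3:
            births.add(cell)
    return (live - deaths) | births, births | deaths


# For the first generation every live cell counts as "changed".
blinker = {(0, 1), (1, 1), (2, 1)}
gen1, changed1 = step(blinker, blinker)
gen2, changed2 = step(gen1, changed1)
print(sorted(gen1), sorted(gen2))
```

For still lifes the change list becomes empty after one generation, so later steps do no work at all for those regions.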
The algorithm itself is inherently parallelizable. Using the same double-buffered method in an unoptimized CUDA kernel, I'm getting around 25ms per generation in a 4096x4096 wrapped world.
Which algorithm is most efficient depends mainly on the initial state.
If the majority of cells are dead, you can save a lot of CPU time by skipping empty regions instead of calculating everything cell by cell.
In my opinion it makes sense to check for completely dead regions first when your initial state is something like "random, but with a chance of life lower than 5%".
I would just divide the matrix into halves along each axis and check the bigger regions first.
So if you have a field of 10,000 * 10,000, you'd first accumulate the states of the upper left quarter of 5,000 * 5,000.
If the sum of states in that first quarter is zero, you can ignore it completely and check the upper right 5,000 * 5,000 for life next.
If its sum of states is > 0, you divide this second quarter into 4 pieces again and repeat the check for life on each of these subspaces.
You could go down to subframes of 8*8 or 10*10 (not sure what makes the most sense here).
Whenever you find life, you mark that subspace as "has life".
Only subspaces which "have life" need to be divided into smaller subspaces; the empty ones can be skipped.
When you have finished assigning the "has life" attribute to all possible subspaces, you end up with a list of subspaces, which you now simply extend by 1 cell in each direction (with empty cells) and apply the regular (or modified) Game of Life rules to them.
You might think that dividing a 10,000*10,000 space into subspaces of 8*8 is a lot of tasks, but accumulating their state values is in fact much, much less computing work than applying the GoL algorithm to each cell plus its 8 neighbours, comparing the count, and storing the new state for the next iteration somewhere.
But like I said above, for a random initial state with 30% population this won't make much sense, as there will not be many completely dead 8*8 subspaces to find (let alone dead 256*256 subspaces).
And of course, the best optimisation will, last but not least, depend on your language.
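The subdivision described above could look roughly like this in Python. This is only a sketch under my own assumptions: the grid is a list of rows of 0/1, cells outside the grid count as dead (no wrapping), and the names `live_tiles` and `step_tiles` as well as the default tile size of 8 are illustrative.

```python
def live_tiles(grid, x0, y0, x1, y1, min_size=8):
    """Return half-open boxes (x0, y0, x1, y1) that contain at least one live cell."""
    if not any(any(row[x0:x1]) for row in grid[y0:y1]):
        return []                       # region is completely dead: skip it
    if x1 - x0 <= min_size and y1 - y0 <= min_size:
        return [(x0, y0, x1, y1)]       # small enough: mark as "has life"
    # Halve each axis that is still larger than the minimum tile size.
    xs = [x0, (x0 + x1) // 2, x1] if x1 - x0 > min_size else [x0, x1]
    ys = [y0, (y0 + y1) // 2, y1] if y1 - y0 > min_size else [y0, y1]
    tiles = []
    for ya, yb in zip(ys, ys[1:]):
        for xa, xb in zip(xs, xs[1:]):
            tiles += live_tiles(grid, xa, ya, xb, yb, min_size)
    return tiles


def step_tiles(grid, min_size=8):
    """One generation, evaluating only live tiles extended by 1 cell per side."""
    h, w = len(grid), len(grid[0])
    new = [[0] * w for _ in range(h)]
    todo = set()
    for x0, y0, x1, y1 in live_tiles(grid, 0, 0, w, h, min_size):
        for y in range(max(y0 - 1, 0), min(y1 + 1, h)):
            for x in range(max(x0 - 1, 0), min(x1 + 1, w)):
                todo.add((x, y))
    for x, y in todo:
        # Sum the 3x3 block (clipped at the edges), then remove the cell itself.
        n = sum(grid[yy][xx]
                for yy in range(max(y - 1, 0), min(y + 2, h))
                for xx in range(max(x - 1, 0), min(x + 2, w))) - grid[y][x]
        new[y][x] = 1 if n == 3 or (grid[y][x] and n == 2) else 0
    return new
```

Note that this sketch re-sums a live region at every level of the subdivision; whether that overhead pays off depends on how sparse the board is, as the answer says.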
Two ideas:
(1) Many configurations are mostly empty space. Keep a linked list (not necessarily in order, that would take more time) of the live cells, and during an update, only update around the live cells (this is similar to your vague suggestion, OysterD :)
(2) Keep an extra array which stores the # of live cells in each row of 3 positions (left-center-right). Now when you compute the new dead/live value of a cell, you need only 4 read operations (top/bottom rows and the center-side positions), and 4 write operations (update the 3 affected row summary values, and the dead/live value of the new cell). This is a slight improvement from 8 reads and 1 write, assuming writes are no slower than reads. I'm guessing you might be able to be more clever with such configurations and arrive at an even better improvement along these lines.
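Here is a small Python sketch of idea (2), assuming a wrapped grid stored as a list of rows of 0/1. The name `step_row_sums` is illustrative. As a simplification, it rebuilds the summary array every generation instead of updating it incrementally as described above, so it shows the 4-read neighbor count but not the write pattern.

```python
def step_row_sums(grid):
    """One generation on a wrapped grid using per-cell horizontal 3-sums."""
    h, w = len(grid), len(grid[0])
    # row3[y][x] = left + center + right for the cell at (x, y).
    # row[x - 1] wraps for x == 0 because Python indexes -1 from the end.
    row3 = [[row[x - 1] + row[x] + row[(x + 1) % w] for x in range(w)]
            for row in grid]
    new = [[0] * w for _ in range(h)]
    for y in range(h):
        above, below, row = row3[y - 1], row3[(y + 1) % h], grid[y]
        for x in range(w):
            # 4 reads: the two row summaries plus the left and right cells.
            n = above[x] + below[x] + row[x - 1] + row[(x + 1) % w]
            new[y][x] = 1 if n == 3 or (row[x] and n == 2) else 0
    return new
```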
If you don't want anything too complex, then you can use a grid to slice it up, and if that part of the grid is empty, don't try to simulate it (please view Tyler's answer). However, you could do a few optimizations:
Set different grid sizes depending on the number of live cells: if there aren't many live cells, they are likely confined to a small area.
When you randomize the board, don't use the grid code until the user changes the data: I've personally tested randomizing it, and even after a long time it still fills most of the board (except on a sufficiently small grid, at which point the grid code won't help much anyway).
If you are showing it to the screen, don't use rectangles for pixel size 1 and 2: instead set the pixels of the output. Any higher pixel size and I find it's okay to use the native rectangle-filling code. Also, preset the background so you don't have to fill the rectangles for the dead cells (not live, because live cells disappear pretty quickly)
I don't know exactly how this can be done, but I remember some of my friends had to represent this game's grid with a quadtree for an assignment. I'm guessing it's really good for optimizing the space used by the grid, since you basically only represent the occupied cells. I don't know about execution speed, though.
It's a two dimensional automaton, so you can probably look up optimization techniques. Your notion seems to be about compressing the number of cells you need to check at each step. Since you only ever need to check cells that are occupied or adjacent to an occupied cell, perhaps you could keep a buffer of all such cells, updating it at each step as you process each cell.
If your field is initially empty, this will be much faster. You probably can find some balance point at which maintaining the buffer is more costly than processing all the cells.
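One common way to keep such a buffer, sketched in Python: store only the coordinates of live cells in a set and count neighbors for the cells next to them. This assumes an unbounded plane rather than a fixed grid, and the name `step_live_set` is just for illustration.

```python
from collections import Counter


def step_live_set(live):
    """live: set of (x, y) live cells. Returns the next generation's set."""
    # Every live cell adds 1 to the count of each of its 8 neighbors, so only
    # cells adjacent to at least one live cell ever appear in `counts`.
    counts = Counter((x + dx, y + dy)
                     for x, y in live
                     for dx in (-1, 0, 1)
                     for dy in (-1, 0, 1)
                     if dx or dy)
    # Birth on exactly 3 neighbors, survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}
```

The work per generation is proportional to the number of live cells, not to the size of the field, which is why a mostly empty field benefits the most.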
There are table-driven solutions for this that resolve multiple cells in each table lookup. A google query should give you some examples.
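To give a rough idea of what a table-driven step can look like, here is a Python sketch that resolves 2 cells per lookup. The layout is my own assumption, not taken from any particular implementation: a window of 3 rows by 4 columns is packed into 12 bits (row-major), and a 4096-entry table maps it to the next state of the 2 center cells. The grid is assumed to be wrapped with an even width.

```python
def build_table():
    """Map each 3x4 window (12 bits, row-major) to its 2 resolved center cells."""
    table = []
    for bits in range(1 << 12):
        def cell(r, c):
            return (bits >> (r * 4 + c)) & 1
        out = 0
        for i, c in enumerate((1, 2)):      # the two center columns
            n = sum(cell(r, cc) for r in range(3)
                    for cc in (c - 1, c, c + 1)) - cell(1, c)
            if n == 3 or (cell(1, c) and n == 2):
                out |= 1 << i
        table.append(out)
    return table


TABLE = build_table()


def step_table(grid):
    """One generation on a wrapped grid with even width, 2 cells per lookup."""
    h, w = len(grid), len(grid[0])
    new = [[0] * w for _ in range(h)]
    for y in range(h):
        rows = (grid[y - 1], grid[y], grid[(y + 1) % h])
        for x in range(0, w, 2):
            # Pack columns x-1 .. x+2 of the three rows into a 12-bit index.
            bits = 0
            for r, row in enumerate(rows):
                for c in range(4):
                    bits |= row[(x - 1 + c) % w] << (r * 4 + c)
            out = TABLE[bits]
            new[y][x] = out & 1
            new[y][x + 1] = out >> 1
    return new
```

In this form the bit packing costs more than the lookup saves. The technique is more likely to pay off when the grid is already stored bit-packed, so that the index can be built with a few shifts and masks.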
I implemented this in C#:
All cells have a location, a neighbor count, a state, and access to the rule.
Move all the live cells from array B into array A.
Have all the cells in array A add 1 to the neighbor count of their neighbors.
Have all the cells in array A put themselves and their neighbors in array B.
All the cells in array B update according to the rule and their state.
All the cells in array B reset their neighbor count to 0.
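The steps above can be sketched in Python as follows. This is not the C# implementation itself, only an approximation under assumptions: the grid is wrapped, `state` and `count` are plain 2D lists instead of cell objects, the rule is the standard B3/S23, and array B is a set so that a cell is never processed twice. The name `tick` is illustrative.

```python
OFFSETS = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if dx or dy]


def tick(state, count, live):
    """Advance one generation and return the new list of live cells (array A).

    state, count: 2D lists for a wrapped grid; live: list of (x, y) live cells.
    count must be all zeros on entry and is all zeros again on return.
    """
    h, w = len(state), len(state[0])
    active = set()                          # array B: cells that may change
    for x, y in live:
        active.add((x, y))
        for dx, dy in OFFSETS:
            nx, ny = (x + dx) % w, (y + dy) % h
            count[ny][nx] += 1              # live cell bumps each neighbor
            active.add((nx, ny))
    next_live = []
    for x, y in active:
        n = count[y][x]
        # Safe to update in place: all counts were finished in the loop above.
        state[y][x] = 1 if n == 3 or (state[y][x] and n == 2) else 0
        count[y][x] = 0                     # reset for the next tick
        if state[y][x]:
            next_live.append((x, y))
    return next_live
```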
Pros:
Ignores cells that don't need to be updated
Cons:
4 arrays: a 2d array for the grid, an array for the live cells, and an array for the active cells.
Can't process rule B0.
Processes cells one by one.
Cells aren't just booleans
Possible improvements:
Cells also have an "Updated" value; they are updated only if they haven't been updated in the current tick, removing the need for array B as mentioned above.
Instead of array B being the cells with live neighbors, array B could be the cells without, and those check for rule B0.

Resources