Cellular automata on GPU with WGSL - parallel-processing

I am writing a physics simulation that behaves like a cellular automaton. Each step depends on the previous one; more precisely, each cell needs its own state and the state of its direct neighbors to compute its new state.
I am using WGSL (WebGPU). For the moment I issue one dispatch per step (to ensure synchronization between steps), but this is quite slow. I tried to run the steps in a loop directly in the shader, but I am unable to synchronize all workgroups between steps.
I tried using storageBarrier and workgroupBarrier, which does not work (synchronization does not occur). Nonetheless, if I run only two successive steps with one barrier between them, performance doubles, meaning I am losing most of the time in the dispatch itself. The result is almost correct (some synchronization did not happen, but it did not affect the result much).
I read that it is impossible to synchronize all workgroups in a single dispatch with the current WGSL specification. But then I don't understand why workgroupBarrier and storageBarrier exist.
How can I force all workgroups to synchronize between each step of the cellular automaton?
More generally, I guess I am not the first person to write a cellular automaton on the GPU with this direct-neighbor dependency:
How do I write a fast cellular automaton on the GPU?
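The step-to-step dependency described above can be illustrated with a small CPU sketch in Python (not WGSL). It assumes a 1D automaton with wrap-around and a made-up rule (XOR of a cell with its two neighbors); the names `step`, `run`, and `run_in_place` are hypothetical. Each call to `step` stands in for one dispatch: it reads only from `src` and writes only to `dst`, and the buffers swap roles afterwards. The in-place variant shows what goes wrong when a step reads cells that the same step has already overwritten.

```python
def step(src, dst):
    """One synchronous update: read only from src, write only to dst."""
    n = len(src)
    for i in range(n):
        # Illustrative rule (assumption): XOR of left neighbor, self, right neighbor.
        dst[i] = src[(i - 1) % n] ^ src[i] ^ src[(i + 1) % n]


def run(cells, steps):
    """Ping-pong between two buffers; each loop iteration models one dispatch."""
    src = list(cells)
    dst = [0] * len(src)
    for _ in range(steps):
        step(src, dst)
        src, dst = dst, src  # swap roles instead of copying
    return src


def run_in_place(cells, steps):
    """Unsynchronized variant: later cells see neighbors already updated."""
    c = list(cells)
    n = len(c)
    for _ in range(steps):
        for i in range(n):
            c[i] = c[(i - 1) % n] ^ c[i] ^ c[(i + 1) % n]
    return c


start = [0, 0, 1, 0, 0]
synced = run(start, 1)          # [0, 1, 1, 1, 0]
unsynced = run_in_place(start, 1)  # [0, 1, 0, 0, 0]
```

The two results differ after a single step, which is the same kind of error that appears when workgroups are not synchronized between steps.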


How is Monte Carlo Tree Search implemented in practice

I understand, to a certain degree, how the algorithm works. What I don't fully understand is how the algorithm is actually implemented in practice.
I'm interested in understanding what the optimal approaches would be for a fairly complex game (maybe chess), e.g. a recursive approach? async? concurrent? parallel? distributed? data structures and/or database(s)?
-- What type of limits would we expect to see on a single machine? (could we run concurrently across many cores... gpu maybe?)
-- If each branch results in a completely new game being played, (this could reach the millions) how do we keep the overall system stable? & how can we reuse branches already played?
recursive approach? async? concurrent? parallel? distributed? data structures and/or database(s)
In MCTS, there's not much of a point in a recursive implementation (which is common in other tree search algorithms like the minimax-based ones), because you always go "through" a game in a sequence from the current game state (root node) to the game states you choose to evaluate (terminal game states, unless you choose to go with a non-standard implementation using a depth limit on the play-out phase and a heuristic evaluation function). The much more obvious implementation using while loops is just fine.
If it's your first time implementing the algorithm, I'd recommend just going for a single-threaded implementation first. It is a relatively easy algorithm to parallelize, though; there are multiple papers on that. You can simply run multiple simulations (where simulation = selection + expansion + playout + backpropagation) in parallel. You can try to make sure everything gets updated cleanly during backpropagation, but you can also simply decide not to use any locks or blocking at all: there's already enough randomness in the simulations that losing information from a couple of them here and there due to naively implemented parallelization really doesn't hurt too much.
As for data structures, unlike algorithms like minimax, you actually do need to explicitly build a tree and store it in memory (it is built up gradually as the algorithm is running). So, you'll want a general tree data structure with Nodes that have a list of successor / child Nodes, and also a pointer back to the parent Node (required for backpropagation of simulation outcomes).
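A minimal sketch of such a tree node, assuming Python and a single-perspective reward (in a two-player game the reward would normally be flipped or tracked per player at each level). The field names `visits` and `total_reward` and the helpers `add_child` and `backpropagate` are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass(eq=False)
class Node:
    """One game state in the search tree (illustrative layout)."""
    state: Any                              # game state reached at this node
    parent: Optional["Node"] = None         # back-pointer used by backpropagation
    children: List["Node"] = field(default_factory=list)
    visits: int = 0                         # simulations that passed through here
    total_reward: float = 0.0               # sum of simulation outcomes

    def add_child(self, state: Any) -> "Node":
        child = Node(state=state, parent=self)
        self.children.append(child)
        return child


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Follow parent pointers up to the root with a plain while loop."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


root = Node(state="root")
leaf = root.add_child("a").add_child("b")
backpropagate(leaf, 1.0)   # updates leaf, its parent, and the root
```

The parent pointer is what lets backpropagation run as a loop instead of unwinding a recursive call stack.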
What type of limits would we expect to see on a single machine? (could we run concurrently across many cores... gpu maybe?)
Running across many cores can be done yes (see point about parallelization above). I don't see any part of the algorithm being particularly well-suited for GPU implementations (there are no large matrix multiplications or anything like that), so GPU is unlikely to be interesting.
If each branch results in a completely new game being played, (this could reach the millions) how do we keep the overall system stable? & how can we reuse branches already played?
In the most commonly described implementation, the algorithm creates only one new node to store in memory per iteration/simulation, in the expansion phase (the first node encountered after the selection phase). All other game states generated in the play-out phase of the same simulation do not get any nodes stored in memory at all. This keeps memory usage in check: your tree only grows relatively slowly (at a rate of 1 node per simulation). It does mean you get slightly less reuse of previously simulated branches, because you don't store everything you see in memory. You can choose to implement a different strategy for the expansion phase (for example, create new nodes for all game states generated in the play-out phase). You'll have to monitor memory usage carefully if you do this, though.
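The one-node-per-simulation scheme can be sketched as follows. This is a toy single-threaded version, assuming a made-up game (a pile of stones, each player removes 1 to 3, whoever takes the last stone wins) and UCT with an exploration constant of 1.4 for selection; all names are hypothetical. Note how the play-out only updates local variables and never allocates nodes.

```python
import math
import random


class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones = stones    # stones left in the pile
        self.player = player    # player to move at this node (0 or 1)
        self.parent = parent
        self.move = move        # move that led to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0         # wins for the player who moved INTO this node


def uct_child(node, c=1.4):
    return max(node.children,
               key=lambda ch: ch.wins / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))


def count_nodes(node):
    return 1 + sum(count_nodes(ch) for ch in node.children)


def mcts(stones, iterations, rng):
    root = Node(stones, player=0)
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and non-terminal.
        while not node.untried and node.children:
            node = uct_child(node)
        # Expansion: store exactly ONE new node per simulation.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, 1 - node.player, node, move)
            node.children.append(child)
            node = child
        # Play-out: random moves; these states are never stored in the tree.
        stones_left, to_move = node.stones, node.player
        winner = 1 - node.player if stones_left == 0 else None
        while stones_left > 0:
            stones_left -= rng.randint(1, min(3, stones_left))
            if stones_left == 0:
                winner = to_move
            to_move = 1 - to_move
        # Backpropagation: plain while loop over parent pointers.
        while node is not None:
            node.visits += 1
            if winner != node.player:   # the player who moved into this node won
                node.wins += 1
            node = node.parent
    return root


root = mcts(stones=10, iterations=200, rng=random.Random(0))
tree_size = count_nodes(root)   # at most iterations + 1
```

Storing every play-out state instead would mean creating a node inside the play-out loop, which is where memory growth would need watching.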

A couple of CUDA-performance questions

This is the first time I have asked a question here, so thanks very much in advance, and please forgive my ignorance. I have also only just started CUDA programming.
Basically, I have a bunch of points, and I want to calculate all the pairwise distances. Currently my kernel function just holds on to one point, iteratively reads in all the other points (from global memory), and performs the calculation. Here are my points of confusion:
I'm using a Tesla M2050 with 448 cores. But my current parallel version (kernel<<<128,16,16>>>) achieves much higher parallelism than that (about 600x faster than kernel<<<1,1,1>>>). Is this possibly due to multithreading or to pipelining, or do those actually refer to the same thing?
I want to further improve the performance, so I tried using shared memory to hold some input points for each multiprocessor block. But the new code is just as fast. What's the possible cause? Could it be related to the fact that I launch too many threads?
Or is it because I have an if-statement in the code? The thing is, I only consider and count the short distances, so I have a statement like (if dist < 200). How much should I worry about this one?
A million thanks!
Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA.
Algorithmic optimizations: changes to addressing, algorithm cascading (11.84x speedup, combined!)
Code optimizations: loop unrolling (2.54x speedup, combined)
Having an extra statement such as the if does indeed cause problems, although it will be the last thing you want to optimize, if only because you need to know the layout of your code before baking in the size assumptions!
The problem you are working on sounds like the famous n-body problem; see Fast N-Body Simulation with CUDA.
An additional performance increase can be achieved if you can avoid doing a pairwise computation, for example when the elements are too far apart to have an effect on each other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the grid into boxes, have each element put itself into a box via division, and then only evaluate pairwise relations between neighboring boxes. This can be called O(n*m).
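The box method can be sketched on the CPU as follows. This is a plain Python illustration rather than CUDA, assuming 2D points and a box side equal to the cutoff distance; the function name `close_pairs` is hypothetical. Because the box side equals the cutoff, any two points closer than the cutoff must lie in the same box or in adjacent boxes.

```python
from collections import defaultdict
from itertools import product
import math


def close_pairs(points, cutoff):
    """Return sorted index pairs (i, j), i < j, whose distance is < cutoff."""
    # Each point puts itself into a box of side `cutoff` via division.
    boxes = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        boxes[(math.floor(x / cutoff), math.floor(y / cutoff))].append(idx)

    pairs = []
    for (bx, by), members in boxes.items():
        # Only the 3x3 block of boxes around this one can hold close points.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in boxes.get((bx + dx, by + dy), ()):
                for i in members:
                    if i < j and math.dist(points[i], points[j]) < cutoff:
                        pairs.append((i, j))
    return sorted(pairs)


pts = [(0, 0), (150, 0), (1000, 1000), (1100, 1000), (199, 0)]
result = close_pairs(pts, 200)
# result == [(0, 1), (0, 4), (1, 4), (2, 3)]
```

On a GPU the same idea is usually implemented by sorting points by box index so that each box's points are contiguous in memory, but that part is not shown here.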
(1) The GPU runs many more threads in parallel than there are cores. This is because each core is pipelined. Operations take around 20 cycles on compute capability 2.0 (Fermi) architectures. So for each clock cycle, the core starts work on a new operation, returns the finished result of one operation, and moves all the other (around 18) operations one more step towards completion. So, to saturate the GPU, you might need something like 448 * 20 = 8960 threads.
(2) It's probably because your values are getting cached in the L1 and L2 caches.
(3) It depends on how much work you're doing inside the if conditional. The GPU must run all 32 threads in a warp through all the code inside the if even if the condition is true for only a single one of those threads. If there is a lot of code in the conditional compared to the rest of your kernel, and relatively few threads go through that code path, you are likely to end up with low compute throughput.

How are massive cellular automata simulated?

Take the redstone from Minecraft as an example - it's basically a 15 state cellular automaton with the following base rule:
Redstone -> Redstone, powered of level Max(neighbours)-1
and additional rules for various connected elements
Repeater, inactive -> Repeater, active, level 2 if its input is powered
Repeater, active, level 2 -> Repeater, active, level 1
Repeater, active, level 1 -> Repeater, inactive
Redstone, unpowered -> Redstone, powered if there is a neighbouring Repeater, level 1 or another source
(I've written more about how Minecraft stuff can be implemented using CAs: http://madflame991.blogspot.com/2011/10/cellular-automata-in-minecraft.html)
Now, my questions are: How would the game manage to update HUGE redstone contraptions? What data structure does it use? Is it really implemented as a cellular automaton? If not, then what's your best guess?
P.S. I'm not asking anyone to take a peek at the actual source code, but just to speculate on how this technical thingie is achieved.
...and I'm posting this here, on SO, and not on gamedev because it's a CA question and not a gamedev related question.
Another possible approach to simulating mind-bogglingly massive cellular automata (e.g. Game of Life in Game of Life) is to detect patterns (glider, glider generator, etc.), predict their future evolution, and only compute the parts that are unknown (glider evolution).
Hashlife (1) may be what you are looking for: it accelerates computations over really huge spaces.
The obvious way to do this is to divide the world into chunks (hey, Minecraft does that already!) and assign each chunk to a server. Each server is responsible for processing updates to that chunk, and for communicating with the servers responsible for neighboring chunks, propagating state to them.
In the case of a cellular automaton like this, each chunk would have to communicate the current state of its edge cells to all neighboring chunks, and vice versa, before it can increment the time step. Note that the relative communication overhead decreases with larger chunks, since the chunk area grows with O(n^2), while the perimeter only grows with O(n).
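The area-versus-perimeter argument can be made concrete with a few lines of arithmetic. The sketch below assumes square n x n chunks in 2D and counts the border cells whose state has to be sent to neighbors each step; the function name `edge_fraction` is hypothetical.

```python
def edge_fraction(n):
    """Fraction of cells in an n x n chunk that lie on its border."""
    # Border cells = all cells minus the (n-2) x (n-2) interior, i.e. 4n - 4 for n >= 2.
    edge = n * n - max(n - 2, 0) ** 2
    return edge / (n * n)


for n in (16, 64, 256):
    print(n, round(edge_fraction(n), 4))
# 16 0.2344
# 64 0.0615
# 256 0.0156
```

Quadrupling the chunk side cuts the share of cells that must be communicated by roughly a factor of four.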
In reality, I suspect you'll find that it's not nearly that synchronous, and each chunk simulates the redstone inside it asynchronously, transmitting updates to neighboring chunks only when an event happens, and without trying to stay in sync with everyone else.

Can raymarching be accelerated under an SIMD architecture?

The answer would seem to be no, because raymarching is highly conditional i.e. each ray follows a unique execution path, since on each step we check for opacity, termination etc. that will vary based on the direction of the individual ray.
So it would seem that SIMD would largely not be able to accelerate this; rather, MIMD would be required for acceleration.
Does this make sense? Or am I missing something(s)?
As stated already, you could probably get a speedup from implementing your vector math using SSE instructions (be aware of the effects discussed here - also for the other approach). This approach would allow the code to stay concise and maintainable.
I assume, however, that your question is about "packet traversal" (or something like it), in other words processing multiple scalar values at once, each belonging to a different ray:
In principle it should be possible by deferring the shading to another pass. The SIMD packet could be repopulated with a new ray once the bare marching pass terminates and the temporary result is stored as input for the shading pass. This would allow you to parallelize a certain, case-dependent percentage of your code, exploiting all four SIMD lanes.
Tiling the image and indexing the rays within it in Morton order might be a good idea too, in order to avoid cache pressure (unless your geometry is strictly procedural).
You won't know whether it pays off unless you try. My guess is that, if it does, the amount of speedup might not be worth the complication of the code for just four lanes.
Have you considered using a SIMT architecture such as a programmable GPU? A somewhat up-to-date programmable graphics board allows you to perform raymarching at interactive rates (see it happen in your browser here).
Over the last few days I built a software-based raymarcher for a Menger sponge. At the moment it does not use SIMD, and I also used no special algorithm. I just trace from -1 to 1 in X and Y, which are U and V for the destination texture. Then I take a camera position and a destination, which I use to calculate the increment vector for the raymarch.
After that I perform a constant number of iterations, in which only one branch decides whether there's an intersection with the fractal volume. So if my camera eye is E and my direction vector is D, I have to find the smallest t. If I have found that or reached a maximal distance, I break out of the loop. At the end I have t, and from that I calculate the fragment color.
In my opinion it should be possible to parallelize these operations with SSE1/2, because one can resolve the branch by nulling the corresponding field in the vector (__m64 / __m128), so that further SIMD operations have no effect on it. It really depends on what you raymarch/-cast, but if you just calculate a fragment color from a function (like my fractal curve here) and don't access memory non-linearly, there are some tricks to make it possible.
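The masking idea can be sketched without intrinsics. The following Python emulates a four-lane packet with plain lists (real code would hold `t` and the mask in `__m128` registers), and it assumes a unit sphere's signed distance function as a stand-in for the fractal. The names `march_packet`, `sphere_sdf`, and `normalize` are hypothetical. Finished lanes are not branched around; their step is simply replaced by zero.

```python
import math

LANES = 4  # emulates one 128-bit register holding four floats


def sphere_sdf(x, y, z, radius=1.0):
    """Distance from a point to a sphere centered at the origin (stand-in scene)."""
    return math.sqrt(x * x + y * y + z * z) - radius


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def march_packet(origin, dirs, max_steps=64, eps=1e-4, t_max=10.0):
    """March LANES rays in lockstep; returns (t, hit) lists, one entry per lane."""
    t = [0.0] * LANES
    hit = [False] * LANES
    active = [True] * LANES           # per-lane mask instead of a per-ray branch
    for _ in range(max_steps):
        if not any(active):           # the whole packet has finished
            break
        for lane in range(LANES):     # would be a single SIMD operation
            d = dirs[lane]
            p = [origin[k] + t[lane] * d[k] for k in range(3)]
            dist = sphere_sdf(*p)
            newly_hit = active[lane] and dist < eps
            hit[lane] = hit[lane] or newly_hit
            advance = active[lane] and not newly_hit
            t[lane] += dist if advance else 0.0    # masked add: zero for done lanes
            active[lane] = advance and t[lane] <= t_max
    return t, hit


origin = (0.0, 0.0, -3.0)
dirs = [normalize(v) for v in [(0, 0, 1), (0, 1, 0), (0, 0, -1), (0.1, 0, 1)]]
t, hit = march_packet(origin, dirs)
print(hit)  # [True, False, False, True]
```

The packet keeps iterating until its slowest lane is done, which is why the benefit depends on how similar the rays in a packet are.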
Sure, this answer contains speculation, but I will keep you informed when I've parallelized this routine.
Only insofar as SSE, for instance, lets you do operations on vectors in parallel.

What algorithm can I implement to speed up some Cellular Automata simulations?

I am writing an ncurses-based C.A. simulator for (nearly) any kind of C.A. that uses the Moore or von Neumann neighborhoods.
With the current approach (hardcoded and most obvious: running state functions), the simulation runs pretty well, until the screen is filled with 'on' (or otherwise active) cells.
So my question is:
Are there any efficient algorithms for handling at least life-like rules?
or Generations, Weighted life/generations...
It's generally nice to only run update passes in areas of the grid that had activity in the previous time step. If you keep a boolean lattice of "did I change this time?" for each pass, you only need to update cells within a radius of one of those marked "on" in the change lattice.
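A minimal sketch of this change-tracking idea, assuming Conway's Life rules on an unbounded grid and using Python sets of coordinates in place of a boolean lattice; the function name `step` and its arguments are hypothetical. A cell can only flip if it or one of its Moore neighbors flipped in the previous step, so only those cells are re-evaluated.

```python
from itertools import product

NEIGHBOURS = [d for d in product((-1, 0, 1), repeat=2) if d != (0, 0)]


def step(alive, changed):
    """Advance Life one step, visiting only cells near last step's changes.

    alive:   set of (x, y) live cells
    changed: set of cells whose state flipped in the previous step
             (for the very first step, pass the set of live cells)
    """
    # Candidates: every changed cell plus its Moore neighbourhood.
    candidates = set(changed)
    for (x, y) in changed:
        candidates.update((x + dx, y + dy) for dx, dy in NEIGHBOURS)

    new_alive = set(alive)
    new_changed = set()
    for (x, y) in candidates:
        n = sum((x + dx, y + dy) in alive for dx, dy in NEIGHBOURS)
        was = (x, y) in alive
        now = n == 3 or (n == 2 and was)
        if now != was:
            new_changed.add((x, y))
            if now:
                new_alive.add((x, y))
            else:
                new_alive.discard((x, y))
    return new_alive, new_changed


blinker = {(0, 0), (1, 0), (2, 0)}
gen1, changed1 = step(blinker, blinker)   # vertical bar; 4 cells flipped
gen2, changed2 = step(gen1, changed1)     # back to the horizontal bar

block = {(0, 0), (1, 0), (0, 1), (1, 1)}
still, no_change = step(block, block)     # still life: nothing flips
```

Once a region settles into a still life, its cells drop out of `changed` and cost nothing in later steps; a screen full of oscillating cells, however, gains little from this.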
I think writing state machines is not so much an algorithm design problem as a problem of how to write clean, "bug free" code. What you are probably looking for is an implementation of a cellular automaton / state chart.
http://www.state-machine.com/ //<- no this is not coincidence
You might also try Stackless Python: http://stackless.com/. It can be used for state machines or CA. There is a tutorial at http://members.verizon.net/olsongt/stackless/why_stackless.html#the-factory that implements a factory process simulation in Stackless.
You could look into the HashLife algorithm and try to adapt its concept to whatever automata you are working on.
