How is Monte Carlo Tree Search implemented in practice - algorithm

I understand, to a certain degree, how the algorithm works. What I don't fully understand is how the algorithm is actually implemented in practice.
I'm interested in understanding what optimal approaches would be for a fairly complex game (maybe chess). i.e. recursive approach? async? concurrent? parallel? distributed? data structures and/or database(s)?
-- What type of limits would we expect to see on a single machine? (could we run concurrently across many cores... gpu maybe?)
-- If each branch results in a completely new game being played, (this could reach the millions) how do we keep the overall system stable? & how can we reuse branches already played?

recursive approach? async? concurrent? parallel? distributed? data structures and/or database(s)
In MCTS there's not much point in a recursive implementation (which is common in other tree search algorithms, like the minimax-based ones), because you always move "through" a game in sequence, from the current game state (root node) down to the game states you choose to evaluate (terminal game states, unless you go with a non-standard implementation using a depth limit on the play-out phase and a heuristic evaluation function). The much more obvious implementation using while loops works just fine.
If it's your first time implementing the algorithm, I'd recommend just going for a single-threaded implementation first. It is a relatively easy algorithm to parallelize though; there are multiple papers on that. You can simply run multiple simulations (where a simulation = selection + expansion + playout + backpropagation) in parallel. You can try to make sure everything gets updated cleanly during backpropagation, but you can also simply decide not to use any locks / blocking at all; there's already enough randomness in all the simulations anyway, so if you lose information from a couple of simulations here and there due to naively implemented parallelization, it really doesn't hurt too much.
As for data structures, unlike algorithms like minimax, you actually do need to explicitly build a tree and store it in memory (it is built up gradually as the algorithm is running). So, you'll want a general tree data structure with Nodes that have a list of successor / child Nodes, and also a pointer back to the parent Node (required for backpropagation of simulation outcomes).
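To make that concrete, here's a minimal single-threaded sketch in Python. The Game interface used here (copy, legal_moves, apply, is_terminal, result) is a placeholder of my own, not from any library, and the two-player perspective handling is deliberately simplified (see the comment in the backpropagation step):

    import math
    import random

    class Node:
        def __init__(self, state, parent=None, move=None):
            self.state = state                        # game state at this node
            self.parent = parent                      # needed for backpropagation
            self.move = move                          # move that led here
            self.children = []
            self.untried = list(state.legal_moves())  # moves not yet expanded
            self.visits = 0
            self.value = 0.0

        def ucb1_child(self, c=1.414):
            # UCB1: exploit average value, explore rarely-visited children.
            return max(self.children,
                       key=lambda ch: ch.value / ch.visits
                       + c * math.sqrt(math.log(self.visits) / ch.visits))

    def mcts(root_state, iterations):
        root = Node(root_state.copy())
        for _ in range(iterations):
            node, state = root, root_state.copy()
            # Selection: walk down while fully expanded and non-terminal.
            while not node.untried and node.children:
                node = node.ucb1_child()
                state.apply(node.move)
            # Expansion: add exactly one new node per simulation.
            if node.untried:
                move = node.untried.pop(random.randrange(len(node.untried)))
                state.apply(move)
                node.children.append(Node(state.copy(), parent=node, move=move))
                node = node.children[-1]
            # Playout: play randomly to the end; these states get no nodes.
            while not state.is_terminal():
                state.apply(random.choice(state.legal_moves()))
            # Backpropagation: follow parent pointers up to the root.
            # NOTE: in a two-player game you must credit the result from the
            # perspective of the player to move at each node (e.g. negate it
            # at alternating levels); that detail is omitted here.
            result = state.result()
            while node is not None:
                node.visits += 1
                node.value += result
                node = node.parent
        # Recommend the most-visited root move.
        return max(root.children, key=lambda ch: ch.visits).move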
What type of limits would we expect to see on a single machine? (could we run concurrently across many cores... gpu maybe?)
Running across many cores can be done yes (see point about parallelization above). I don't see any part of the algorithm being particularly well-suited for GPU implementations (there are no large matrix multiplications or anything like that), so GPU is unlikely to be interesting.
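For completeness, the simplest multi-core scheme is usually called "root parallelization": each worker builds its own independent tree from the current root state, and the per-move visit counts are merged at the end. A sketch, assuming a hypothetical search(state, n) helper that runs n simulations like the sketch above but returns the root children's visit counts as a {move: visits} dict:

    from collections import Counter
    from multiprocessing import Pool

    def _worker(args):
        state, n = args
        return search(state, n)   # hypothetical helper, see note above

    def parallel_mcts(root_state, iterations, workers=4):
        # Each process searches independently; no locks are needed because
        # the trees are never shared.
        jobs = [(root_state, iterations // workers)] * workers
        with Pool(workers) as pool:
            results = pool.map(_worker, jobs)
        totals = Counter()
        for counts in results:
            totals.update(counts)
        return max(totals, key=totals.get)   # most-visited move overall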
If each branch results in a completely new game being played, (this could reach the millions) how do we keep the overall system stable? & how can we reuse branches already played?
In the most commonly described implementation, the algorithm creates only one new node to store in memory per iteration/simulation, in the expansion phase (the first node encountered after the selection phase). All other game states generated in the play-out phase of the same simulation do not get nodes in memory at all. This keeps memory usage in check: your tree grows relatively slowly, at a rate of one node per simulation. It does mean you get slightly less reuse of previously simulated branches, because you don't store everything you see in memory. You can choose to implement a different strategy for the expansion phase (for example, create new nodes for all game states generated in the play-out phase), but you'll have to carefully monitor memory usage if you do.

Related

Estimating strong scaling efficiency when single-node run is not possible

I have implemented an OpenMP/MPI hybrid parallel algorithm, and would like to measure its strong-scaling parallel efficiency. For this, I would have to calculate speed-up S=t(1)/t(N), and then the efficiency E=S/N.
Background: Having done some analysis, I was able to show that the peak efficiency of the algorithm could be expected at a problem size, at which the single node of my benchmark cluster cannot house the data required.
Possible solutions: I can either:
calculate speed-up using the smallest node-count, at which the data can be housed e.g. at 4 nodes => S=t(4)/t(N), or,
calculate the theoretical single-node time-to-solution t(1) by extrapolation, and then use that value as reference.
Questions:
Which approach is better and why?
If I use the first approach, can I, strictly speaking, even refer to it as strong-scaling parallel efficiency, seeing as it doesn't conform to the definition provided above?
Bonus question: When we measure t(1), should we run the algorithm with simulated communication calls (i.e. by calling mpirun -n 1 ./my_benchmark_program), or should we rather call a version of the program which performs no communication at all (i.e. ./my_openmp_only_benchmark_program)?
I hope this post is clear, please ask for clarification if it isn't. Any help will be greatly appreciated. Thanks in advance.
There are various problems with the classical definition of speedup if you are using MPI. The single processor case involves no communication, while the two-processor one does, so there is overhead in the t(2) case and it will always be less than twice as fast. This is even worse if you have a multicore/multinode setup, where up to 16 (or so) processes will run on a single node, so t(17) will suddenly be much slower because it starts involving a second node.
This means you cannot simply apply the textbook formulas. You need to explain how you are doing your scalability study. For instance: one process per node until the number of processes equals the number of nodes, then start putting multiple processes on each node, et cetera.
The fact that the single-process case does not fit in memory is then a minor hiccup: you start with a base case of multiple processes, document that fact, and give your reasoning for the base case you actually used.
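For what it's worth, one clean way to write the first approach down (using the 4-node base case from the question, with b = 4) is to report relative speed-up and relative efficiency:

    S_rel(N) = t(b) / t(N)
    E_rel(N) = b * t(b) / (N * t(N))

and state b explicitly. Under perfect scaling from b to N, t(N) = b*t(b)/N, so S_rel(N) = N/b and E_rel(N) = 1. Labelling the numbers "relative to a b-node baseline" also addresses your second question: it is still a strong-scaling study, just with a shifted reference point.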

Does the Barnes Hut Tree need to be recreated each iteration of a loop

I'm coding an application which is required to perform an n-body simulation between a few hundred particles which are constantly in motion. The application has real time requirements and thus the algorithm performing the simulation needs to be fast.
I've done a fair amount of research on the matter and have come to the conclusion that the Barnes Hut algorithm would be most suitable for my needs, it seems very efficient for large particle sets.
http://arborjs.org/docs/barnes-hut gave a very clear explanation on how the algorithm worked, but as the title implies, I'd like to know whether the tree needs to be recreated for each iteration, considering that the particles used in the simulation are always dynamically in motion. And if the tree does need to be recreated, how does one do it in the most efficient (in terms of processing power and memory) way.
Usually with motion based indexes there is no "update" for the index after movement has occurred and you must rebuild the entire index.
The Barnes Hut Tree is the same and will have to be rebuilt. Here is an example I found online with a code outline of the process.
This is one of the reasons so much effort has gone into build optimizations for things like the KD-tree, and I'm sure the Barnes-Hut tree has the same. Also, I'm sure there is research on dynamic updating, but most of the time those implementations are much harder than simply rebuilding.
Pretty late answering this question; I guess I had never worked on this problem then. But now that I have been looking into it for some time, I can share some insight. What @greedybuddha is saying is mostly right, but there are tricks and techniques to avoid recreating the tree entirely every timestep. As usual, these techniques come with their own overheads (memory footprint etc.) and necessary trade-offs. Also, it may not be possible to apply them in all situations.
Allocate enough memory upfront to handle an octree of a given depth, and deallocate it only when all timesteps have finished. For example, suppose that for all the n-body input data sets you are going to use, you know your octree will never go beyond a depth of, say, 10. In that case it is easy to figure out the maximum number of nodes your octree might have (usually it's a geometric progression sum, (8^(d+1) - 1)/7 for depth d). With a bit of bookkeeping (valid child indexes of each node) for every octree node, it is easy to reuse this buffer to fill in octree nodes without allocating/deallocating them every timestep (a sketch of this pooling idea follows these notes). The likely overhead is that you might waste a lot of memory, but you gain performance by not doing allocation/deallocation every timestep.
The usual way of bounding the number of octree levels is to allow multiple bodies per leaf node (the last level of the octree, i.e. the smallest cell size of the octree grid).
Create a new octree only when needed: it is possible to imagine a situation where the octree doesn't change for some series (burst) of timesteps, then it changes, and the pattern repeats. This can happen only when body positions change insignificantly over a timestep burst, such that the octree structure remains the same during that burst. And in what situation will bodies move slowly? When the chosen timestep size is pretty small and the forces exerted on the bodies plus their initial momentum are small. How to dynamically detect such bursts is a difficult problem; this is a trickier technique and I don't know an easy way to do it. It requires some insight into the timestep granularity, the initial velocity/acceleration of the bodies, and the kind of forces the bodies are dealing with.
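A minimal sketch of the preallocation idea (the first technique above) in Python; the node layout and sizing are my own illustration, not from any particular Barnes-Hut implementation:

    class OctreeNode:
        __slots__ = ("children", "mass", "cx", "cy", "cz", "body_count")
        def __init__(self):
            self.children = [None] * 8   # child slots (pool refs or None)
            self.reset()
        def reset(self):
            for i in range(8):
                self.children[i] = None
            self.mass = 0.0
            self.cx = self.cy = self.cz = 0.0   # centre of mass
            self.body_count = 0

    class NodePool:
        def __init__(self, capacity):
            # A full octree of depth d has (8**(d + 1) - 1) // 7 nodes (the
            # geometric sum); in practice far fewer get touched, so you can
            # usually size the pool much smaller than the worst case.
            self.nodes = [OctreeNode() for _ in range(capacity)]
            self.next_free = 0
        def alloc(self):
            node = self.nodes[self.next_free]   # reuse, never allocate
            node.reset()
            self.next_free += 1
            return node
        def rewind(self):
            # Start of each timestep: "frees" every node at once.
            self.next_free = 0

Each timestep you call pool.rewind() and rebuild the tree with pool.alloc() calls, so there is no allocation/deallocation churn between timesteps.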

How to Compare the Running Times of Two Data Structures' Operations

I want to compare the performance of two search trees of integers (an AVL tree vs a red-black tree). So how should I design/engineer the tests to accomplish this? For instance, let's consider the insert operation: what steps should I follow in order to state that, on average, this operation is faster in the RB case? Should I time inserting just one element (assuming the trees are pre-populated), or should I time a sequence of insertions? Also, what considerations should I take into account to measure CPU time accurately?
Thanks in advance.
This is a really broad question, and as such, I don't think you should be hoping for anybody to get on here and give you the one final correct answer regarding how to measure performance. That being said...
First, you should develop a suite of tests. Two popular techniques exist for doing this: monitor a real-world sequence of operations done by an application (so, find some open source application that uses either an AVL or RB tree, and add some code to print out the sequence of operations it performs) or create such a stream of operations analytically (or synthetically) to target any number of cases (the average usage, particular kinds of abnormal or otherwise unusual usage, random usage, etc.). The more of these kinds of traces you get to test, the better.
Once you have your set of traces to test, you need to develop a driver to do the evaluation. The driver should be simple, the same for both AVL and RB trees (I think that in this case, this shouldn't be a problem; both present the same interface to users, differing only in terms of internal implementation details). The driver should be able to reproduce the usage recorded in your trace sets efficiently and cause the traced operations to be carried out on your data structures. One thing I like to do is to include a third "dummy" candidate that does nothing; this way, I can see how much of an influence the processing of traces is exerting on overall performance.
Each trace should be executed many, many times. You can formalize this somewhat (to reduce statistical uncertainty to within known bounds), but a rule of thumb is that the order of your error will shrink according to 1/sqrt(n), where n is the number of trials. In other words, by running each trace 10,000 times instead of 100 times, you will get errors in the average that are 10x smaller. Record all values; things to look for are the mean, median, mode(s), etc. For each run, try to keep the system conditions the same; no other programs running, etc. To help eliminate spurious results due to external factors changing, you can cull the bottom and top 10% of outliers...
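As a concrete illustration of the driver + repetition + culling ideas, here is a minimal sketch in Python; the trace format (a list of (operation, key) pairs) and the tree interface are placeholders of my own:

    import statistics
    import time

    class DummyTree:
        """Does nothing: measures the overhead of trace processing itself."""
        def insert(self, key): pass
        def delete(self, key): pass
        def search(self, key): pass

    def run_trace(tree, trace):
        for op, key in trace:
            getattr(tree, op)(key)   # e.g. ("insert", 42) -> tree.insert(42)

    def benchmark(make_tree, trace, trials=1000):
        times = []
        for _ in range(trials):
            tree = make_tree()                 # fresh structure per trial
            start = time.perf_counter()
            run_trace(tree, trace)
            times.append(time.perf_counter() - start)
        times.sort()
        k = len(times) // 10                   # cull top and bottom 10%
        trimmed = times[k:len(times) - k] if k else times
        return (statistics.mean(trimmed),
                statistics.median(trimmed),
                statistics.pstdev(trimmed))

    # Usage, assuming AVLTree and RBTree expose insert/delete/search:
    # print(benchmark(AVLTree, trace))
    # print(benchmark(RBTree, trace))
    # print(benchmark(DummyTree, trace))       # trace-processing baseline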
Now, simply compare the data sets. Perhaps what you care most about is the average time the trace takes? Perhaps the worst? Maybe what you really care about is consistency; is the standard deviation big or small? You should have enough data to compare the results for a given trace executed on both test structures; and for different traces, it might make more sense to look at different figures (for instance, if you created a synthetic benchmark that should be the worst case for RB trees, you might ask how badly RB and AVL trees did, whereas you might not care about this for another trace representing the best case for AVL trees, etc.)
Timing on the CPU can be a challenge in its own right. You'll need to ensure that the resolution of your timer is sufficient for measuring your events. clock() and gettimeofday() functions - and others - are popular choices for recording the time of events. If your traces finish too quickly, you can measure the aggregate time for several trials (so that if your timer supports microsecond resolution and your traces finish in 10 microseconds, you can measure 100 executions of the trace instead of 1 and get time values on the order of a millisecond, which should be accurate).
Another potential pitfall is providing the same execution environment each time. In between trace runs, at the very least, you might consider techniques for ensuring that you start with a clean cache. Either that, or don't time the first execution, or understand that this result might be culled when you eliminate outliers. It might be safer to just reset the cache (by manipulating every element of some large array, for instance in between executions of traces), since code A might benefit from having some of the values in cache while code B might suffer.
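A crude version of that array trick in Python, for illustration; the 64 MB size is an assumption and should be comfortably larger than your last-level cache:

    # Touch one byte per cache line of a large buffer to evict prior data.
    SCRUB = bytearray(64 * 1024 * 1024)

    def flush_cache():
        total = 0
        for i in range(0, len(SCRUB), 64):   # assumes 64-byte cache lines
            total += SCRUB[i]
        return total   # returning the sum keeps compilers from eliding the loop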
These are a few of the things you might consider when doing your own performance evaluation. Other tools - like PAPI and other profilers, for instance - can measure certain events - cache hits/misses, instructions, etc. - and this information can allow for much richer comparisons than simple comparisons of wall-clock run time.
Measuring CPU time accurately can be very tricky depending on your particular programming language, implementation, etc. For example, with Java's JIT compilation, the results can be extremely different depending on how much you've run the code before now!
Can you give more detail about your situation?

Database for brute force solving board games

A few years back, researchers announced that they had completed a brute-force comprehensive solution to checkers.
I have been interested in another similar game that should have fewer states, but is still quite impractical to run a complete solver on in any reasonable time frame. I would still like to make an attempt, as even a partial solution could give valuable information.
Conceptually I would like to have a database of game states that has every known position, as well as its succeeding positions. One or more clients can grab unexplored states from the database, calculate possible moves, and insert the new states into the database. Once an endgame state is found, all states leading up to it can be updated with minimax information to build a decision tree. If intelligent decisions are made about which probable branches to explore, I can build information for the most important branches, and then gradually build up to completion over time.
Ignoring the merits of this idea, or the feasibility of it, what is the best way to implement such a database? I made a quick prototype in SQL Server that stored a string representation of each state. It worked, but my solver client ran very, very slowly, as it pulled out one state at a time and calculated all moves. I feel like I need to do larger chunks in memory, but the search space is definitely too large to store it all in memory at once.
Is there a database system better suited to this kind of job? I will be doing many many inserts, a lot of reads (to check if states (or equivalent states) already exist), and very few updates.
Also, how can I parallelize it so that many clients can work on solving different branches without duplicating too much work. I'm thinking something along the lines of a program that checks out an assignment, generates a few million states, and submits it back to be integrated into the main database. I'm just not sure if something like that will work well, or if there is prior work on methods to do that kind of thing as well.
In order to solve a game, what you really need to know per state in your database is its game-theoretic value, i.e. whether it's a win for the player whose turn it is to move, a loss, or a forced draw. You need two bits to encode this information per state.
You then find as compact an encoding as possible for the set of game states for which you want to build your end-game database; let's say your encoding takes 20 bits. It's then enough to have an array of 2^21 bits on your hard disk, i.e. 2^18 bytes. When you analyze an end-game position, you first check whether the corresponding value is already set in the database; if not, calculate all its successors, calculate their game-theoretic values recursively, and then calculate the game-theoretic value of the original node using min/max and store it in the database. (Note: if you store win/loss/draw data in two bits, you have one bit pattern left over to denote 'not known'; e.g. 00 = not known, 11 = draw, 10 = player to move wins, 01 = player to move loses.)
For example, consider tic-tac-toe. There are nine squares; each can be empty, "X" or "O". This naive analysis gives you 3^9 = 2^14.26 states, i.e. 15 bits per state, so you would have an array of 2^16 bits.
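To illustrate the two-bits-per-state array with the tic-tac-toe numbers above (the class and helper names here are my own):

    # Flat table storing 2 bits per state, indexed by a base-3 encoding of
    # the nine squares (0 = empty, 1 = "X", 2 = "O").
    UNKNOWN, LOSS, WIN, DRAW = 0, 1, 2, 3   # 00 = not known, as described

    class TwoBitTable:
        def __init__(self, num_states):
            self.bits = bytearray((num_states * 2 + 7) // 8)
        def get(self, idx):
            byte, shift = divmod(idx * 2, 8)
            return (self.bits[byte] >> shift) & 0b11
        def set(self, idx, value):
            byte, shift = divmod(idx * 2, 8)
            self.bits[byte] = (self.bits[byte] & ~(0b11 << shift)) | (value << shift)

    def ttt_index(squares):
        # squares: list of 9 values in {0, 1, 2}
        idx = 0
        for s in squares:
            idx = idx * 3 + s
        return idx

    table = TwoBitTable(3**9)
    table.set(ttt_index([0] * 9), DRAW)   # perfect play from the empty board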
You undoubtedly want a task queue service of some sort, such as RabbitMQ - probably in conjunction with a database which can store the data once you've calculated it. Alternately, you could use a hosted service like Amazon's SQS. The client would consume an item from the queue, generate the successors, and enqueue those, as well as recording the outcome of the item it just consumed in the database. If the state is an end-state, it can propagate scoring information up to parent elements by consulting the database.
Two caveats to bear in mind:
The number of items in the queue will likely grow exponentially as you explore the tree, with each work item causing several more to be enqueued. Be prepared for a very long queue.
Depending on your game, it may be possible for there to be multiple paths to the same game state. You'll need to check for and eliminate duplicates, and your database will need to be structured so that it's a graph (possibly with cycles!), not a tree.
The first thing that popped into my mind is the Linda-style of a shared 'whiteboard', where different processes can consume 'problems' off the whiteboard, add new problems to the whiteboard, and add 'solutions' to the whiteboard.
Perhaps the Cassandra project is the more modern version of Linda.
There have been many attempts to parallelize problems across distributed computer systems; Folding@home provides a framework that executes binary blob 'cores' to solve protein folding problems. Distributed.net might have started the modern incarnation of distributed problem solving, and might have clients that you can start from.

Algorithms for realtime strategy wargame AI

I'm designing a realtime strategy wargame where the AI will be responsible for controlling a large number of units (possibly 1000+) on a large hexagonal map.
A unit has a number of action points which can be expended on movement, attacking enemy units or various special actions (e.g. building new units). For example, a tank with 5 action points could spend 3 on movement then 2 in firing on an enemy within range. Different units have different costs for different actions etc.
Some additional notes:
The output of the AI is a "command" to any given unit
Action points are allocated at the beginning of a time period, but may be spent at any point within the time period (this is to allow for realtime multiplayer games). Hence "do nothing and save action points for later" is a potentially valid tactic (e.g. a gun turret that cannot move waiting for an enemy to come within firing range)
The game is updating in realtime, but the AI can get a consistent snapshot of the game state at any time (thanks to the game state being one of Clojure's persistent data structures)
I'm not expecting "optimal" behaviour, just something that is not obviously stupid and provides reasonable fun/challenge to play against
What can you recommend in terms of specific algorithms/approaches that would allow for the right balance between efficiency and reasonably intelligent behaviour?
If you read Russell and Norvig, you'll find a wealth of algorithms for every purpose, updated to pretty much today's state of the art. That said, I was amazed at how many different problem classes can be successfully approached with Bayesian algorithms.
However, in your case I think it would be a bad idea for each unit to have its own Petri net or inference engine... there's only so much CPU and memory and time available. Hence, a different approach:
While in some ways perhaps a crackpot, Stephen Wolfram has shown that it's possible to program remarkably complex behavior on a basis of very simple rules. He bravely extrapolates from the Game of Life to quantum physics and the entire universe.
Similarly, a lot of research on small robots is focusing on emergent behavior or swarm intelligence. While classic military strategy and practice are strongly based on hierarchies, I think that an army of completely selfless, fearless fighters (as can be found marching in your computer) could be remarkably effective if operating as self-organizing clusters.
This approach would probably fit a little better with Erlang's or Scala's actor-based concurrency model than with Clojure's STM: I think self-organization and actors would go together extremely well. Still, I could envision running through a list of units at each turn, and having each unit evaluating just a small handful of very simple rules to determine its next action. I'd be very interested to hear if you've tried this approach, and how it went!
EDIT
Something else that was on the back of my mind but that slipped out again while I was writing: I think you can get remarkable results from this approach if you combine it with genetic or evolutionary programming; i.e. let your virtual toy soldiers wage war on each other as you sleep, let them encode their strategies and mix, match and mutate their code for those strategies; and let a refereeing program select the more successful warriors.
I've read about some startling successes achieved with these techniques, with units operating in ways we'd never think of. I have heard of AIs working on these principles having had to be intentionally dumbed down in order not to frustrate human opponents.
First you should aim to make your game turn-based at some level for the AI (i.e. even if the game is not entirely turn-based, you may be able to model it that way, e.g. in an RTS by breaking time into discrete intervals and treating those as turns). Second, you should determine how much information the AI should work with: whether the AI is allowed to cheat and know every move of its opponent (thereby making it stronger), or whether it should know less. Third, you should define a cost function over states, the idea being that a higher cost means a worse state for the computer to be in. Fourth, you need a move generator that produces all valid states the AI can transition to from a given state (this may be homogeneous [state-independent] or heterogeneous [state-dependent]).
The thing is, the cost function will be greatly influenced by what exactly you define the state to be. The more information you encode in the state the better balanced your AI will be but the more difficult it will be for it to perform, as it will have to search exponentially more for every additional state variable you include (in an exhaustive search.)
If you provide a definition of a state and a cost function your problem transforms to a general problem in AI that can be tackled with any algorithm of your choice.
Here is a summary of what I think would work well:
1. Evolutionary algorithms may work well if you put enough effort into them, but they will add a layer of complexity that creates room for bugs, among other things that can go wrong. They will also require extreme amounts of tweaking of the fitness function etc. I don't have much experience working with these, but if they are anything like neural networks (which I believe they are, since both are heuristics inspired by biological models) you will quickly find they are fickle and far from consistent. Most importantly, I doubt they add any benefits over the option I describe in 3.
2. With the cost function and state defined, it would technically be possible for you to apply gradient descent (assuming the state function is differentiable and the domain of the state variables is continuous); however, this would probably yield inferior results, since the biggest weakness of gradient descent is getting stuck in local minima. To give an example, this method would be prone to something like always attacking the enemy as soon as possible, because there is a non-zero chance of annihilating them. Clearly, this may not be desirable behaviour for a game; however, gradient descent is a greedy method and doesn't know better.
3. This option would be my highest recommendation: simulated annealing. Simulated annealing would (IMHO) have all the benefits of 1 without the added complexity, while being much more robust than 2. In essence SA is just a random walk amongst the states, so in addition to the cost and states you will have to define a way to randomly transition between states (a minimal skeleton is sketched after this list). SA is also not prone to getting stuck in local minima, while producing very good results quite consistently. The only tweaking required with SA is the cooling schedule, which decides how fast SA will converge. The greatest advantage of SA, I find, is that it is conceptually simple and empirically produces results superior to most other methods I have tried. Information on SA can be found here with a long list of generic implementations at the bottom.
3b. (Edit: added much later) SA and the techniques I listed above are general AI techniques and not really specialized to AI for games. In general, the more specialized the algorithm, the more chance it has at performing better. See the No Free Lunch Theorem [2]. Another extension of 3 is something called parallel tempering, which dramatically improves the performance of SA by helping it avoid local optima. Some of the original papers on parallel tempering are quite dated [3], but others have been updated [4].
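For reference, a minimal SA skeleton in Python; the state, cost, and neighbour functions are the game-specific parts you would supply (e.g. state = a candidate set of unit commands for this turn):

    import math
    import random

    def simulated_annealing(initial_state, cost, neighbour,
                            t_start=1.0, t_end=0.001, steps=10000):
        state, state_cost = initial_state, cost(initial_state)
        best, best_cost = state, state_cost
        for k in range(steps):
            # Exponential cooling schedule; this is the main knob to tune.
            t = t_start * (t_end / t_start) ** (k / steps)
            cand = neighbour(state)
            cand_cost = cost(cand)
            delta = cand_cost - state_cost
            # Always accept improvements; accept worsening moves early on.
            if delta < 0 or random.random() < math.exp(-delta / t):
                state, state_cost = cand, cand_cost
                if state_cost < best_cost:
                    best, best_cost = state, state_cost
        return best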
Regardless of what method you choose in the end, it's going to be very important to break your problem down into states and a cost function as I said earlier. As a rule of thumb I would start with 20-50 state variables, as your state search space is exponential in the number of these variables.
This question is huge in scope. You are basically asking how to write a strategy game.
There are tons of books and online articles for this stuff. I strongly recommend the Game Programming Wisdom series and AI Game Programming Wisdom series. In particular, Section 6 of the first volume of AI Game Programming Wisdom covers general architecture, Section 7 covers decision-making architectures, and Section 8 covers architectures for specific genres (8.2 does the RTS genre).
It's a huge question, and the other answers have pointed out amazing resources to look into.
I've dealt with this problem in the past and found the simple-behavior-manifests-complexly/emergent behavior approach a bit too unwieldy for human design unless approached genetically/evolutionarily.
I ended up instead using abstracted layers of AI, similar to the way armies work in real life. Units would be grouped with nearby units of the same type into squads, which are grouped with nearby squads to create a mini battalion of sorts. More layers could be used here (group battalions in a region, etc.), but ultimately at the top there is the high-level strategic AI.
Each layer can only issue commands to the layers directly below it. The layer below it will then attempt to execute the command with the resources at hand (ie, the layers below that layer).
Examples of commands issued to a single unit are "go here" and "shoot at this target". A higher-level command issued to a higher layer would be "secure this location", which that layer would process and turn into the appropriate commands for the lower layers.
The highest-level master AI is responsible for very broad strategic decisions, such as "we need more ____ units" or "we should aim to move towards this location".
The army analogy works here; commanders and lieutenants and chain of command.
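To illustrate the layered-command idea, here's a toy sketch in Python; all class and command names are illustrative, not from any particular engine:

    class Unit:
        def command(self, order):
            # Lowest layer: orders like ("move_to", pos) are acted on directly.
            print("unit executing", order)

    class Squad:
        def __init__(self, units):
            self.units = units
        def command(self, order):
            # Translate a squad-level order into per-unit orders.
            kind, location = order
            if kind == "secure":
                for u in self.units:
                    u.command(("move_to", location))

    class StrategicAI:
        def __init__(self, squads):
            self.squads = squads
        def think(self, snapshot):
            # Broad decisions only; the details are delegated downwards.
            target = snapshot["contested_location"]
            for squad in self.squads:
                squad.command(("secure", target))

    ai = StrategicAI([Squad([Unit(), Unit()])])
    ai.think({"contested_location": (10, 4)})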
