Meaning of merge, phi, effectphi and dead in v8 terminology

Meaning of merge, phi, effectphi and dead in v8 terminology - v8

I’m trying to read the v8 source code (in particular the compiler part of it) to understand better the optimisation and reduction procedures (in order to look for bugs).
I ran into a few terms that are used in the comments but seem to be unexplained. The comment is this:
// Check if this is a merge that belongs to an unused diamond, which means
// that:
//
// a) the {Merge} has no {Phi} or {EffectPhi} uses, and
// b) the {Merge} has two inputs, one {IfTrue} and one {IfFalse}, which are
// both owned by the Merge, and
// c) and the {IfTrue} and {IfFalse} nodes point to the same {Branch}.
What do the terms Merge, Phi and EffectPhi mean? Also, does marking a node as “dead” mean that it will treated as redundant?
Thanks in advance.
The link to the above code is this:
https://chromium.googlesource.com/v8/v8.git/+/refs/heads/master/src/compiler/common-operator-reducer.cc

V8 developer here. As background knowledge, it helps to know that V8's "Turbofan" compiler uses the "SSA" (static single assignment) and "sea of nodes" concepts. There are various compiler textbooks and research papers that explain these in great detail. To answer your questions in short:
A "Merge" node merges two control nodes, i.e. two branches of control flow. You can think of it as the "opposite" of a Branch, or the equivalent of a Phi for control nodes. Control nodes are the mechanism that Turbofan's sea-of-nodes design uses to make sure nodes aren't reordered across certain control flow boundaries.
A "Phi" node merges the two (or more) possibilities for a value that have been computed by different branches. See https://en.wikipedia.org/wiki/Static_single_assignment_form for more.
An "EffectPhi" is a special version of a Phi node that's used for nodes on the "effect chain". The effect chain is the mechanism Turbofan uses to make sure that nodes' external effects (like memory loads and stores) aren't visibly reordered.
A "dead" node is one that's unreachable and can be eliminated. So it's "redundant" in the sense of "superfluous/unnecessary", but not in the sense of "the same as another node".

Related

How is Monte Carlo Tree Search implemented in practice

I understand, to a certain degree, how the algorithm works. What I don't fully understand is how the algorithm is actually implemented in practice.
I'm interested in understanding what optimal approaches would be for a fairly complex game (maybe chess). i.e. recursive approach? async? concurrent? parallel? distributed? data structures and/or database(s)?
-- What type of limits would we expect to see on a single machine? (could we run concurrently across many cores... gpu maybe?)
-- If each branch results in a completely new game being played, (this could reach the millions) how do we keep the overall system stable? & how can we reuse branches already played?

recursive approach? async? concurrent? parallel? distributed? data structures and/or database(s)
In MCTS, there's not much of a point in a recursive implementation (which is common in other tree search algorithms like the minimax-based ones), because you always go "through" a game in sequences from current game state (root node) till game states you choose to evaluate (terminal game states, unless you choose to go with a non-standard implementation using a depth limit on the play-out phase and a heuristic evaluation function). The much more obvious implementation using while loops is just fine.
If it's your first time implementing the algorithm, I'd recommend just going for a single-threaded implementation first. It is a relatively easy algorithm to parallelize though, there are multiple papers on that. You can simply run multiple simulations (where simulation = selection + expansion + playout + backpropagation) in parallel. You can try to make sure everything gets updated cleanly during backpropagation, but you can also simply decide to not use any locks / blocking etc. at all, there's already enough randomness in all the simulations anyway so if you lose information from a couple of simulations here and there due to naively-implemented parallelization it really doesn't hurt too much.
As for data structures, unlike algorithms like minimax, you actually do need to explicitly build a tree and store it in memory (it is built up gradually as the algorithm is running). So, you'll want a general tree data structure with Nodes that have a list of successor / child Nodes, and also a pointer back to the parent Node (required for backpropagation of simulation outcomes).
What type of limits would we expect to see on a single machine? (could we run concurrently across many cores... gpu maybe?)
Running across many cores can be done yes (see point about parallelization above). I don't see any part of the algorithm being particularly well-suited for GPU implementations (there are no large matrix multiplications or anything like that), so GPU is unlikely to be interesting.
If each branch results in a completely new game being played, (this could reach the millions) how do we keep the overall system stable? & how can we reuse branches already played?
In the most commonly-described implementation, the algorithm creates only one new node to store in memory per iteration/simulation in the expansion phase (the first node encountered after the Selection phase). All other game states generated in the play-out phase of the same simulation do not get any nodes to store in memory at all. This keeps memory usage in check, it means your tree only grows relatively slowly (at a rate of 1 node per simulation). It does mean you get slightly less re-usage of previously-simulated branches, because you don't store everything you see in memory. You can choose to implement a different strategy for the expansion phase (for example, create new nodes for all game states generated in the play-out phase). You'll have to carefully monitor memory usage if you do this though.

Algorithms for Minimum resource requirements

I have a question for which I have made some solutions, but I am not happy with the scalability. I'm looking for input of some different approaches / algorithms to solving it.
Problem:
Software can run on electronic controllers (ECUs) and requires
different resources to run a given feature. It may require a given
amount of storage or RAM or a digital or Analog Input or Output for
instance. If we have multiple features and multiple controller options
we want to find the combination that minimizes the hardware
requirements (cost). I'll simplify the resources to letters to
simplify the understanding.
Example 1:
Feature1(A)
ECU1(A,B,C)
First a trivial example. Lets assume that a feature requires 1 unit of resource A, and ECU has 1 unit of resources A, B and C available, it is obvious that the feature will fit in the ECU with resources B & C left over.
Example 2:
Feature2(A,B)
ECU2(A|B,B,C)
In this example, Feature 2 requires resources A and B, and the ECU has 3 resources, the first of which can be A or B. In this case, you can again see that the feature will fit in the ECU, but only if check in a certain order. If you assign F(A) to E(A|B), then F(B) to E(B) it works, but if you assign F(B) to E(A|B) then there is no resource left on the ECU for F(A) so it doesn't appear to fit. This would lead one to the observation that we should prefer non-OR'd resources first to avoid such a conflict.
An example of the above could be a an analog input could also be used as a digital input for instance.
Example 3
Feature3(A,B,C)
ECU3(A|B|C, B|C, A|C)
Now things are a little bit more complicated, but it is still quite obvious to a person that the feature will fit into the ECU.
My problems are simply more scaled up versions of these examples (i.e. multiple features per ECU with more ECUs to choose from.
Algorithms
GA
My first approach to this was to use a genetic algorithm. For a given set of features i.e. F(A,B,C,D), and a list of currently available ECUs find which single or combination of ECUs fit the requirements.
ECUs would initially be randomly selected and features checked they fitted and added to them. If a feature didn't fit another ECU was added to the architecture. A population of these architectures was created and ranked based on lowest cost of housing all the features. Architectures could then be mated in successive generations with mutations and such to improve fitness.
This approached worked quite well, but tended to get stuck in local minima (not the cheapest option) based on a golden example I had worked by hand.
Combinatorial / Permutations
My next approach was to work out all of the possible permutations (the ORs from above) for an ECU to see if the features fit.
If we go back to example 2 and expand the ORs we get 2 permutations;
Feature2(A,B)
ECU2(A|B,B,C) = (A,B,C), (B,B,C)
From here it is trivial to check that the feature fits in the first permutation, but not the second.
...and for example 3 there are 12 permutations
Feature3(A,B,C)
ECU3(A|B|C, B|C, A|C) = (A,B,A), (B,B,A), (C,B,A), (A,C,A), (B,C,A), (C,C,A), (A,B,C), (B,B,C), (C,B,C), (A,C,C), (B,C,C), (C,C,C)
Again it is trivial to check that feature 3 fits in at least one of the permutations (3rd, 5th & 7th).
Based on this approach I was also able to get a solution also, but I have ECUs with so many OR'd inputs that I have millions of ECU permutations which drastically increased the run time (minutes). I can live with this, but first wanted to see if there was a better way to skin the cat, apart from Parallelizing this approach.
So that is the problem...
I have more ideas on how to approach it, but assume that there is a fancy name for such a problem or the name of the algorithm that has been around for 20+ years that I'm not familiar with and I was hoping someone could point me in that direction to either some papers or the names of relevant algorithms.
The obvious remark of simply summing the feature resource requirements and creating a new monolithic ECU is not an option. Lastly, no, this is not in any way associated with any assignment or problem given by a school or university.
Sorry for the long question, but hopefully I've sufficiently described what I am trying to do and this peaks the interest of someone out there.
Sincerely, Paul.

Looks like individual feature plug can be solved as bipartite matching.
You make bipartite graph:
left side corresponds to feature requirements
right side corresponds to ECU subnodes
edges connect each left and right side vertixes with common letters
Let me explain by example 2:
Feature2(A,B)
ECU2(A|B,B,C)
How graph looks:
2 left vertexes: L1 (A), L2 (B)
3 right vertexes: R1 (A|B), R2 (B), R3 (C)
3 edges: L1-R1 (A-A|B), L2-R1 (B-A|B), L2-R2 (B-B)
Then you find maximal matching for unordered bipartite graph. There are few well-known algorithms for it:
https://en.wikipedia.org/wiki/Matching_(graph_theory)
If maximal matching covers every feature vertex, we can use it to plug feature.
If maximal matching does not cover every feature vertex, we are short of resources.
Unfortunately, this approach works like greedy algorithms. It does not know of upcoming features and does not tweak solution to fit more features later. Partially optimization for simple cases can work like you described in question, but in general it's dead end - only algorithm that accounts for every feature in whole feature set can make overall effective solution.
You can try to add several features to one ECU simultaneously. If you want to add new feature to given ECU, you can try all already assigned features plus candidate feature. In this case local optimum solution will be found for given feature set (if it's possible to plug them all to one ECU).

I've not enough reputation to comment, so here's what i wanted to propose for your problem:
Like GA there are some other Random Based approaches too e.g. Bayesian Apporaoch , Decision Tree etc.
In my opinion Decision Tree will suit your problem as it, against some input dataset/attributes, shows a path to each class(in your case ECUs) that helps to select right class/ECU. Train your system with some sample data sets so that it can decide right ECU for your actual data set/Features.
Check Decision Trees - Machine Learning for more information. Hope it helps!

Representing parallel operations in flowcharts

I need to do a flowchart of a hydraulic system featuring a temperature regulation module. However, temperature regulation is only activated during one part of the cycle. During this part, the system continues to perform other operations.
I want to represent that in my flowchart diagram. I need a way to tell when the parallel algorithm begins and when it ends within the main flowchart. How do I do that ?

Add two new flowchart nodes/operators fork and join.
Fork takes one control flow coming in, and produces two going out, similar to the diamond decision node in regular flowcharts. Unlike the decision node, the fork node passes control to both of its children, meaning that "parallel execution starts here".
Join takes two control flows in, and produces one control flow out. It means,
"wait for execution on both input branches to complete, then pass control to the output branch". This means "parallel execution stops here".
Now your flowchart can have serial and parallel sections; you can even have the parallel sections generate additional, nested parallelism.
Now you need a graphical representation of fork and join.
We represent the control flow graph of real computer programs with essentially a flowchart ("actions" and "decisions"). Because of the similarity of "fork" to "decision" (one input, two outputs) we chose to draw "fork" as an upward facing triangle (one input at the top, two outputs at the triangle base endpoints), and "join" as a downward facing triangle, with two inputs at the triangle base endpoints and one output at the peak of the downward facing triangle.
You can see automatically generated examples of this for C and COBOL programs. You might not think these contain parallelism... but, in fact, the langauge semantics of many languages is often unspecified, allowing parallel execution in effect (C function arguments according to the standard can be evaluated in arbitrary order, for example). We model this nondeterminism as "parallelism" because the net effect is the same.
People building industrial control software also need to express this. They
simply split the control flow line going from one flow graph item to another. See Sequential function charts in this document.
The most general "control flow" notation I know are Colored Petri Nets. These can model not just control flow, but data flow, synchronization, and arithmetic, which means you can model systems with very complex flows. You can model a regular flowchart directly with a CPN. CPNs also generalize finite state machines. What that means for programmers, is that if you don't know about CPNs, you should learn about these now. And you discover that "flowcharts" (as CPNs) are now useful again in discussing systems whose parts run asynchronously.

How do IaaS nodes communicate to form a cluster?

How do nodes communicate with each other, or how do they become aware of each other (in a decentralized manner) in an IaaS environment? As an example: this article about Akka on Google's IaaS describes a 1500+ decentralized cluster intercommunicating randomly. What is the outline of this process?

It would be quite long to explain how Akka cluster works in detail, but I can try to give an overview.
The membership set in Akka is esentially a highly specialized CRDT. Since talking about Vector Clocks itself would be a lengthy discussion, I will use the analogy of git-like repositories.
You can imagine every Akka node maintaining its own repository where HEAD points to the current state of the cluster (known by that node). When a node introduces a change, it branches off, and starts to propagate the change to other nodes (this part is what is more or less random).
There are certain changes which we call monotonic which in the git analogy would mean that the branch is trivially mergeable. Those changes are just merged by other nodes as they receive them and they will then propagate the merge commit to others and eventually everything stabilizes (HEAD points to the same content).
There are other kind of changes that are not trivial to merge (non-monotonic). The process then is that a node first sends around a proposal: "I want to make this non-trivial change C". This is needed because the other nodes need to be aware of this pending "complex" change and prepare themselves. This is disseminated among the nodes until everyone receives it. Now we are at the state where "Everyone knows that someone proposed to make the change C", but this is not enough, since no one is actually aware that there is an agreement yet.
Therefore there is another "round", where nodes start to propagate the information "I, node Y, are aware of the fact that change C has been proposed". Eventually one or more nodes become aware that there is an agreement (this is more or less a distributed acknowledgement protocol). So the state now is "At least one node knows that every node knows that the change C has been proposed". This is (partly) what we refer to as convergence. At this point the node (or nodes) that are aware of the agreement will make the merge and propagate it.
Please note that I highly simplified the explanation here, obviously the devil (and scaling) is in the details :)

How to handle multiple optimal edit paths implementing Needleman-Wunsche algorithm?

Trying to implement Needleman-Wunsche algorithm for biological sequences comparison. In some circumstances there exist multiple optimal edit paths.
What is the common practice in bio-seq-compare tools handling this? Any priority/preferences among substitute/insert/deletion?
If I want to keep multiple edit paths in memory, any data structure is recommended? Or generally, how to store paths with branches and merges?
Any comments appreciated.

If two paths are have identical scores, that means that the likelihood of them is the same no matter which kinds of operations they used. Priority for substitutions vs. insertions or deletions has already been handled in getting that score. So if two scores are the same, common practice is to break the tie arbitrarily.
You should be able to handle this by recording all potential cells that you could have arrived at the current one from in your traceback matrix. Then, during traceback, start a separate branch whenever you come to a branching point. In order to allow for merges too, store some additional data about each cell (how will depend on what language you're using) indicating how many different paths left from it. Then, during traceback, wait at a given cell until that number of paths have arrived back at it, and then merge them into one. You can either be following the different branches with true parallel processing, or by just alternating which one you are advancing.

Unless you have an a reason to prefer one input sequence over the other in advance it should not matter.
Otherwise you might consider seq_a as the vertical axis and seq_b as the horizontal axis then always choose to step in your preferred direction if there is a tie to break ... but I'm not convincing myself there is any difference to the to alignment assuming one favors one of the starting sequences over the other

As a lot of similar algorithms, Needleman-Wunsche one is just a task of finding the shortest way into a graph (square grid in this case). So I would use A* for defining a sequence & store the possible paths as a dictionary with nodes passes.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio