Data structure for program scope? - data-structures

I'm walking the AST of a program in a made-up language, and I'm trying to emulate scope: when the visitor enters a function, for example, it pushes a new scope, and when it finishes visiting the function, it pops that scope. One important detail is that pushing a new scope sets a pointer, currentScope, to the scope we're currently looking at. When we pop the scope, currentScope is set back to the "outer" scope:
class Scope:
    outer : Scope
    inner : Scope
This will happen over multiple passes, and it's important that the first pass constructs the general tree of scopes.
The question I'm asking though is how can I traverse this tree in the same order it was created?
For example:
{ // global scope
    { // a
        { // aa
        }
        { // ab
        }
    }
    { // b
    }
}
When I pass over the exact same set of nodes again, in theory they will give me the same tree of scopes, but I want to preserve the data that each pass collects and stores in each scope. In other words, when the second or third pass happens over the AST, when we visit a, currentScope = a, and when we visit aa, then currentScope = aa. Is this possible? I'm really confused by this idea; the whole recursive-y aspect is really messing with my head and I can't seem to figure out how to do this.
Here's what I've tried:
class Scope
    outer : Scope
    inner : Scope
    siblings : []Scope
    Scope(outer):
        this.outer = outer
push_idx = 0
push_scope()
    // set global scope
    if current is null
        global = new Scope(null)
        current = global
        return
    if current.inner is not null:
        // first pass over the AST
        if current_pass == 0:
            new_scope = new Scope(current)
            current.siblings.push(new_scope)
            current = new_scope
            return
        current = current.siblings[push_idx++]
    else:
        new_scope = new Scope(current)
        current.inner = new_scope
        current = current.inner
pop_scope()
    push_idx = 0
    current = current.outer
Though the order doesn't seem correct, and I'm fairly certain this is the wrong approach to this.

A data structure that's often used to track scope inside a compiler is a spaghetti stack, which is essentially a linked list data structure where each scope is a node storing a pointer to its parent scope. Whenever you enter a scope, you create a new node, point it to the enclosing scope, then store the node somewhere in the AST associated with that scope. As you walk the AST, your AST walker stores a pointer to the current scope node. When you enter a scope, you create a new scope node as described above. When you leave a scope, you change the pointer to point to the parent of the current scope. This ends up building a large inverted tree structure where each scope can trace its scope chain up to the root scope - the spaghetti stack.
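A minimal sketch of this idea in Python, assuming every AST node that opens a scope can carry a reference to its scope. The names `Scope`, `Block`, and `ScopeWalker` are hypothetical and only illustrate the technique. The first pass creates each scope and stores it on the node; later passes find the stored scope and reuse it, so the traversal order and the collected data are preserved.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scope:
    name: str
    parent: Optional["Scope"] = None          # link toward the root scope
    symbols: dict = field(default_factory=dict)


@dataclass
class Block:
    """Hypothetical AST node that opens a scope."""
    name: str
    children: list = field(default_factory=list)
    scope: Optional[Scope] = None             # filled in on the first pass


class ScopeWalker:
    def __init__(self):
        self.current = None
        self.visited = []                     # order in which scopes were entered

    def push_scope(self, node):
        if node.scope is None:                # first pass: create and remember
            node.scope = Scope(node.name, parent=self.current)
        self.current = node.scope             # later passes: reuse the stored scope
        self.visited.append(self.current.name)

    def pop_scope(self):
        self.current = self.current.parent

    def visit(self, node):
        self.push_scope(node)
        for child in node.children:
            self.visit(child)
        self.pop_scope()


tree = Block("global", [Block("a", [Block("aa"), Block("ab")]), Block("b")])

first = ScopeWalker()
first.visit(tree)
tree.children[0].scope.symbols["x"] = "int"   # data recorded during pass one

second = ScopeWalker()
second.visit(tree)                            # same scopes, same order
```

Because the scope lives on the AST node, no sibling index or pass counter is needed: visiting `a` on any pass sets the current scope to the scope created for `a`.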

"Scope" is really a region of the program where all the identifiers in that region have constant meaning.
If your language has pure nested lexical scopes, you can model the set of scopes with a tree ("spaghetti" stack if you like), where each leaf contains a mapping from symbols introduced in that scope to their corresponding type information. This is what is classically taught in compiler classes.
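For the purely nested case, name resolution can be sketched in Python as a walk up the chain of lexical parents. `LexicalScope`, `declare`, and `lookup` are illustrative names, and the type information is reduced to plain strings for brevity.

```python
class LexicalScope:
    def __init__(self, parent=None):
        self.parent = parent      # lexical parent, None for the root scope
        self.symbols = {}         # name -> type information

    def declare(self, name, type_info):
        self.symbols[name] = type_info

    def lookup(self, name):
        scope = self
        while scope is not None:  # the innermost declaration wins
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None               # not declared in any enclosing scope


global_scope = LexicalScope()
global_scope.declare("x", "int")

inner = LexicalScope(parent=global_scope)
inner.declare("y", "string")

shadow = LexicalScope(parent=inner)
shadow.declare("x", "float")      # shadows the global x
```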
But with more complex scoping rules (namespaces, using constructs, ...) you may in general need a graph whose leaves are the individual scopes, with graph arcs representing relations between the scopes. Yes, one of those relations is usually "lexical parent". Others may include "inherits from", etc. You may also find that a name in a leaf mapping need not map to a type; it may in fact be an access path to an arbitrary other (leaf) scope in the graph.
(I build generic program analysis tool infrastructure [see bio]. We defined a graph-style symbol table API to support all the different scoping rules we have encountered. An interesting class of arc is "inherits from with priority N" for arbitrary integer N; this lets us easily model the ordered multiple inheritance offered by C++.)

Perhaps you should give some thought to a segment tree:
each segment represents a scope (beginning of scope | end of scope).
The tree structure would follow the code hierarchy.
The tree's leaves would be the keywords in each scope.
Good luck!

Related

Converting collections performance

In my code I am working with different types of collections and often convert one to another. I do this easily by calling the toList, toVector, toSet, and toArray functions.
Now I am interested in the performance of these operations. I found information about the performance of length, head, tail, and apply in the documentation. What actually happens when I call these functions (toList, toVector, toSet, toArray) on the List, Set, Array, and Vector implementations in Scala?
P.S. The question is only about the standard Scala collections that are immutable.
Well, my advice would be: look at the source code yourself! For instance, toSet is built on the generic to method, which is defined as follows in the TraversableOnce trait (annotations are mine):
def to[Col[_]](implicit cbf: CanBuildFrom[Nothing, A, Col[A @uV]]): Col[A @uV] = {
  val b = cbf()  // generic way to build the collection; for a List, it would start from an empty List
  b ++= seq      // add all the elements
  b.result()     // transform the result to the target collection
}
So it means that the toSet method has a performance of O(N) since you traverse all the list once! I believe that all the collections inheriting this trait are using this implementation.

How to build a Control Flow Graph (CFG) from a JSON object (AST)

I want to build a control flow graph (CFG) from an AST given in JSON format. TouchDevelop automatically creates this AST for each script. Since TouchDevelop is not an object-oriented language, can I still use the Visitor pattern? Any useful pointers would be appreciated.
Update1: My problem is that I don't understand where to start. From what I've read online, I am supposed to use the Visitor pattern to walk the AST, visiting each node and collecting information. From there, I can build a CFG and then do data flow analysis. But there are two issues:
1) AFAIK, I need an object-oriented programming model to use the Visitor pattern (I might be wrong), which TouchDevelop does NOT have.
2) The AST given below is not in the AST format I find on the internet. It's in JSON format. I think I could parse the JSON to convert it into the desired AST structure, but I am not so sure.
Source code of a sample script
meta version "v2.2,nothing";
meta name "DivideByZero";
//
meta platform "current";
action main() {
  (5 / 0)→post_to_wall;
}
Resulting AST (JSON formatted) is given below:
{
  "type":"app",
  "version":"v2.2,nothing",
  "name":"DivideByZero",
  "icon":null,
  "color":null,
  "comment":"",
  "things":[
    {
      "type":"action",
      "name":"main",
      "isEvent":false,
      "outParameters":[
      ],
      "inParameters":[
      ],
      "body":[
        {
          "type":"exprStmt",
          "tokens":[
            {
              "type":"operator",
              "data":"("
            },
            {
              "type":"operator",
              "data":"5"
            },
            {
              "type":"operator",
              "data":"/"
            },
            {
              "type":"operator",
              "data":"0"
            },
            {
              "type":"operator",
              "data":")"
            },
            {
              "type":"propertyRef",
              "data":"post to wall"
            }
          ]
        }
      ],
      "isPrivate":false
    }
  ]
}
I haven't found a reference for the TouchDevelop scripting language yet, so I don't know what you can and can't do with it.
You don't necessarily have to use the visitor pattern. The visitor pattern is the technique used when your abstract syntax tree is described by instances of nodes from a class hierarchy. The conversion from AST to CFG is more general than that. An abstract syntax tree is an abstract data type, a special case of a tree. Like any other abstract data type, it can be represented in many ways. It doesn't matter how you represent it; the only thing you need to do is iterate over this tree, and the iteration method available depends on the language you are using. This should answer your question 2: a JSON string may be a representation of an AST. The AST is an abstract data type, while the JSON string is an implementation of this abstract data type.
In JSON, you can have values, arrays, or sets of (key, value) associations. I can probably assume that your AST nodes are the sets of (key, value) associations. I assume as well that each of these nodes has a key named type, which allows you to identify what kind of node it is.
If I am correct, this answers the question of why you don't need the visitor pattern. A visitor pattern allows us to extract the type of each node (this is what is called "double dispatch"). But here you don't need it, since the type of each node is encoded in the type field.
Typically, the conversion from AST to CFG is done with a set of functions: one function for each type of node in the AST. Each of these functions needs to write the part of the CFG associated with the node it takes as a parameter. It will recursively call the conversion functions for the child nodes. (This is what a visitor pattern would do in the case of an OO AST.)
For instance, you'll have a function ConvertNode. This function will read the type field of a node and call the corresponding conversion function with the node. Your root node has type app, so the ConvertNode function will dispatch to the ConvertApp function. ConvertApp will read some fields like name, iterate over the things array, and call ConvertNode for each of these nodes. Then again, ConvertNode will dispatch the call to the appropriate function.
The order in which those conversion functions are called follows the AST structure exactly. How the CFG is created as you iterate over the tree depends on the input language. Each conversion function may return a constructed node or transition of your CFG to allow the caller to reuse it. Or the caller might pass a node or transition as a parameter to allow the called function to continue the construction from there. You are free to choose the appropriate way to build the CFG and to break the general rules: there may be clever ways to simplify the construction.
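The dispatch described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `Cfg` class, the `convert_*` function names, and the choice to treat every statement as one straight-line CFG node are hypothetical, and the JSON is a trimmed copy of the AST from the question. Each converter receives the predecessor CFG node and returns the last node it produced, which is one of the two calling conventions mentioned above.

```python
import json

SOURCE = """
{"type": "app", "name": "DivideByZero", "things": [
  {"type": "action", "name": "main", "body": [
    {"type": "exprStmt", "tokens": [
      {"type": "operator", "data": "("},
      {"type": "operator", "data": "5"},
      {"type": "operator", "data": "/"},
      {"type": "operator", "data": "0"},
      {"type": "operator", "data": ")"},
      {"type": "propertyRef", "data": "post to wall"}]}]}]}
"""


class Cfg:
    def __init__(self):
        self.labels = []          # node id -> label
        self.edges = []           # (from id, to id)

    def add_node(self, label):
        self.labels.append(label)
        return len(self.labels) - 1

    def add_edge(self, src, dst):
        self.edges.append((src, dst))


def convert_app(node, cfg, pred):
    for thing in node["things"]:
        convert_node(thing, cfg, None)   # each action gets its own sub-graph
    return None


def convert_action(node, cfg, pred):
    last = cfg.add_node("entry " + node["name"])
    for stmt in node["body"]:
        last = convert_node(stmt, cfg, last)
    exit_id = cfg.add_node("exit " + node["name"])
    cfg.add_edge(last, exit_id)
    return exit_id


def convert_expr_stmt(node, cfg, pred):
    label = " ".join(token["data"] for token in node["tokens"])
    this = cfg.add_node(label)
    cfg.add_edge(pred, this)             # straight-line flow from the predecessor
    return this


CONVERTERS = {
    "app": convert_app,
    "action": convert_action,
    "exprStmt": convert_expr_stmt,
}


def convert_node(node, cfg, pred):
    """Dispatch on the node's "type" field, as a visitor would on its class."""
    return CONVERTERS[node["type"]](node, cfg, pred)


cfg = Cfg()
convert_node(json.loads(SOURCE), cfg, None)
```

Branching or looping statement types would add their own converter that creates several nodes and edges; the dispatch table stays the same.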

One-way sync of two hierarchies

I'm hoping to write an algorithm to synchronize two hierarchical structures. These structures could be object graphs, data stored in relational database tables, etc (even two different structures, so long as they have comparable keys). The synchronization will be one-way, i.e., one structure will be the prototype, and the other will be modified to match.
Let's say we have a sync function. It would need to accept the following:
objA -- the prototype
objB -- the object to be modified
keyA -- key generating function for objA
keyB -- key generating function for objB
addB -- function to create an objB (returns id of new objB)
setB -- function to update objB
remB -- function to delete an objB
parB -- id of objB's parent -- this is passed to addB for context
So we have this:
let sync (objA:'a) (objB:'b) (keyA:'a -> 'k) (keyB:'b -> 'k)
         (addB:'p * 'a -> 'p) (setB:'a * 'b -> unit) (remB:'b -> unit)
         (parB:'p) = ...
Now here's where I'm having trouble. 'a and 'b are hierarchical, so the function needs to know which properties of 'a and 'b it should traverse (once it compares their keys and decides they match thus far and should be further traversed). For these "child" properties, it needs all the same arguments passed to sync, but for their respective types.
This is when it became apparent this is a data structure problem. How can I chain together this information such that the root object can be passed to sync and it can traverse the graphs downward? My initial thought was to incorporate all of the arguments into a class, which would have a children property (a ResizeArray of the same type). But with various properties having different types, I couldn't figure out a way to make it work, short of throwing types out the window and making most or all of the type arguments obj.
So here are my questions:
Is there a well-established method for doing this already? (I haven't been able to find anything.)
What data structure might I use to encapsulate the data necessary to make this work?
I've tried my best to explain this thoroughly, but if anything remains unclear, please ask, and I'll try to provide better information.
I'm sure this is oversimplifying it but here's my idea.
If this is a DAG you could do a breadth-first traversal of objA. When you enqueue a node from objA include objB and any other information you need (tuple). Then when you dequeue you fix up objB.
You could use a discriminated union to handle different child types in your enqueueing.
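A rough Python sketch of the breadth-first idea, under simplifying assumptions that are not in the question: both hierarchies are trees of dictionaries with hypothetical `key`, `value`, and `children` fields, one key function serves both sides, and `add_b`, `set_b`, and `rem_b` take the parent node rather than a parent id. The queue holds (prototype node, target node) tuples, as suggested above.

```python
from collections import deque


def node(key, value, children=()):
    return {"key": key, "value": value, "children": list(children)}


def sync(root_a, root_b, key, add_b, set_b, rem_b):
    """Breadth-first, one-way sync: make root_b's subtree match root_a's."""
    queue = deque([(root_a, root_b)])
    while queue:
        node_a, node_b = queue.popleft()
        set_b(node_a, node_b)                        # update the matched pair
        existing = {key(child): child for child in node_b["children"]}
        wanted = {key(child) for child in node_a["children"]}
        for child_b in list(node_b["children"]):
            if key(child_b) not in wanted:
                rem_b(node_b, child_b)               # not present in the prototype
        for child_a in node_a["children"]:
            child_b = existing.get(key(child_a))
            if child_b is None:
                child_b = add_b(node_b, child_a)     # missing in the target
            queue.append((child_a, child_b))         # fixed up when dequeued


def add_b(parent_b, child_a):
    created = node(child_a["key"], None)
    parent_b["children"].append(created)
    return created


def set_b(node_a, node_b):
    node_b["value"] = node_a["value"]


def rem_b(parent_b, child_b):
    parent_b["children"].remove(child_b)


a = node("root", 1, [node("x", 2, [node("x1", 3)]), node("y", 4)])
b = node("root", 0, [node("x", 9), node("z", 5)])
sync(a, b, lambda n: n["key"], add_b, set_b, rem_b)
```

In a statically typed language, the tuple placed in the queue is where a discriminated union over the different child types would go; Python's dynamic typing hides that detail here.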
Generate diffgrams from the two data structures and map the transforms to the transformed problem.

State of object after std::move construction

Is it legal/proper C++0x to leave an object that has been moved from, for the purpose of move construction, in a state where it can only be destroyed? For instance:
class move_constructible {...};
int main()
{
    move_constructible x;
    move_constructible y(std::move(x));
    // From now on, x can only be destroyed. Any other method will result
    // in a fatal error.
}
For the record, I'm trying to wrap in a C++ class a C struct with a pointer member that is always supposed to point to some allocated memory area. The whole C library API relies on this assumption. But this requirement prevents me from writing a truly cheap move constructor, since for x to remain a valid object after the move it would need its own allocated memory area. I've written the destructor so that it first checks for a NULL pointer before calling the corresponding cleanup function from the C API, so that at least the struct can be safely destroyed after the move.
Yes, the language allows this. In fact it was one of the purposes of move semantics. It is however your responsibility to ensure that no other methods get called and/or provide proper diagnostics. Note, usually you can also use at least the assignment operator to "revive" your variable, such as in the classical example of swapping two values.

OO Design Question -- Parent/Child(ren) -- Circular?

I'm fairly new to the OO design process, so please bear with me....
I have two entities that I need to model as classes, call them Parent and Child (it's close enough to the actual problem domain). One Parent will have one or more Children -- I have no interest, in this application, in childless Parents.
Where my brain is going out to lunch is on the fact that I need to be able to find either from the other. In my database I can implement this with a normal foreign key relationship, and the set-based nature of SQL makes it easy to find all Children for a given Parent, or the Parent for a given Child. But as objects...?
I think that the Parent should carry a collection (list, whatever) of Children. I also think that each Child should carry a reference to its Parent. The circular nature of the references, however, is making my head hurt.
Am I:
On the right track?
Completely off base? If so, what should I do differently?
This will almost certainly be implemented in VB.NET, but I'm a ways from cutting code yet.
Edit after 8 answers:
Thanks all. It was hard to pick just one answer to accept.
To clarify a couple of things that were questioned in the answers:
Parent and Child are very different entities -- there's no inheritance relationship at all. I chose the names that I did because they're really very close to the real-world problem domain, and I now see that it's a source of confusion from an OO perspective.
The hierarchy is only one level deep -- Children will never have Children within the application.
Thanks again.
The circular references are fine and absolutely standard when creating a tree structure. HTML's Document Object Model (DOM), for example, has the parent and child properties on every node in a DOM tree:
interface Node {
    // ...
    readonly attribute Node parentNode;
    readonly attribute NodeList childNodes;
    // ...
}
Sounds like you're on the right track to me. As per your domain model, parents have children and children have parents. You may need to reference each from the other.
There is nothing wrong with circular references, you just have to be careful about what you do with them. Where you'll run into trouble is managing your entities on the server side in an automated fashion when you load them from the database. For example, you fetch a Child object from the database with a query. Do you include the parent information? Do you include the parent's children?
ORM tools like Lightspeed or Microsoft's Entity Framework generally deal with this using "lazy loading" directives. They'll fetch what you need at first (so, when you fetch a Child, it just gets the Child properties and the parent's ID). If later you dereference the Parent, it goes and fetches the Parent properties and instantiates the Parent object. If later still you access its Children collection, it then goes and fetches the relevant child information and creates Child objects for that collection. Until you need them, though, it doesn't populate them.
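The lazy-loading behaviour can be illustrated with a small Python sketch. It does not use or imitate any real ORM API: `LazyChild`, `fake_load_parent`, and the `queries` list are hypothetical stand-ins, with the loader function playing the role of the database query.

```python
class LazyChild:
    """Hypothetical entity: holds only the parent's ID until the parent is needed."""

    def __init__(self, child_id, parent_id, load_parent):
        self.child_id = child_id
        self.parent_id = parent_id
        self._load_parent = load_parent   # stand-in for a database query
        self._parent = None

    @property
    def parent(self):
        if self._parent is None:          # first access triggers the fetch
            self._parent = self._load_parent(self.parent_id)
        return self._parent


queries = []                              # records each simulated fetch


def fake_load_parent(parent_id):
    queries.append(parent_id)
    return {"id": parent_id, "name": "parent-%d" % parent_id}


child = LazyChild(child_id=7, parent_id=1, load_parent=fake_load_parent)
before = list(queries)                    # nothing has been fetched yet
first = child.parent                      # fetches the parent
second = child.parent                     # served from the cached object
```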
I think it's reasonable to want to be able to traverse the object graph in this way. It's hard to know if you have a justifiable reason for it from your post, but I don't think the references in and of themselves prove a bad design.
I believe you're on the right track. Why is the circular nature of the references making your head hurt? What is the fundamental issue you're having with a Parent having references to its children, and a Child having a reference to its parent?
Are you talking about a class hierarchy, where the parent class knows about its child classes?
You should avoid this at all costs.
By default, a child class knows all about a parent class, because it is an instance of the parent class. But to have a parent class know about its child classes requires that the child class also know all about every other child class. This creates a dependency between one child and every other child of that class. This is an unmaintainable scenario that will cause problems in the future -- if you can even get it to compile or run, which in many languages will not be the case.
That said, it sounds to me like you're not trying to do a class hierarchy, but a collection hierarchy, i.e. a tree. In that case, yes, you're on the right track; it's a common paradigm. The parent node has a collection of child nodes, and the child node has a reference to the parent node.
The thing is? They're all the same class! Here's a very simple example in C#:
using System.Collections.Generic;

public class Node
{
    public readonly Node Parent; // null Parent indicates root node
    public readonly List<Node> Children = new List<Node>();

    public Node(Node parent)
    {
        Parent = parent;
    }

    public Node()
    {
        Parent = null; // root node
    }

    public void AddChild(Node node)
    {
        Children.Add(node);
    }
}
I have a feeling this is what you're really after. Using this paradigm, you would then sub-class Node for whatever nefarious purposes you might have.
If I understand correctly, objects of P contain an array of objects P->c[] representing children, any node P with no children is a leaf, and each P contains P->P' (the parent).
The solution you specify, with Parents containing references to children and vice versa, eliminates the need to traverse the tree to obtain the ancestry of a given child or the children of a node. This is really just a tree, on which you can perform all kinds of linking and run algorithms to traverse and enumerate it. Which is fine!
I suggest reading the trees chapter in The Art of Computer Programming for an excellent and in-depth look at tree structures and efficient ways to enumerate parentage and children.
If the children must have a parent I usually just require a parent type instance in the child constructor.
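A minimal Python sketch of that constructor approach, with illustrative class and attribute names. Registering the child with its parent inside the constructor is an added assumption that keeps both directions of the reference consistent.

```python
class Parent:
    def __init__(self, name):
        self.name = name
        self.children = []             # filled in as Child objects are created


class Child:
    def __init__(self, parent, name):
        self.parent = parent           # a Child cannot exist without its Parent
        self.name = name
        parent.children.append(self)   # keep both directions in step


household = Parent("household")
first = Child(household, "first")
second = Child(household, "second")
```

Either side can now be found from the other: `first.parent` gives the Parent, and `household.children` lists its Children.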
Sounds to me like you're on the path to a bad design. Your architecture should never have circular references.
You should probably re-examine why your children need a reference back to the parent and vice versa. I would lean toward the parent having a collection of children. You can then add functionality to the parent to check to see if a child object is a child of the instance.
A better explanation of the goal might be a little more helpful as well...
EDIT
I read up a little more (and listened to comments)...and it turns out I'm quite in the wrong. Circular references do in fact have their place as long as you're careful with them and don't let them get out of hand.
