Building CFGs from the AST - transformation

As far as I know, after the parser builds the AST, this structure is transformed into an intermediate-level IR, for example three-address code or some other form. Then, in order to perform some analyses, this IR is transformed into a control flow graph. My question is: is it possible to go from the AST representation to the CFG without going through another IR, and then perform, for example, data flow analysis over the CFG with successful results?

You can't construct a CFG without first doing scope resolution and then name resolution.
You need scope resolution to determine the "scope" of implicit control transfers, e.g., the boundaries of an if-then-else, a try statement, blocks containing break statements, return statements, etc. This is generally fairly easy because scopes tend to follow the syntactic structure of the language, which you already have in the AST.
You also need scope resolution to determine where identifiers are defined, if your language allows control transfers to named entities ("goto <label>", "call <subroutine>"). You can't know which label a goto targets without knowing which scope contains the goto, and how scopes control the lookup of the named label.
With scope resolution you can implement name resolution; this lets you assign each name to the defining scope that contains it, and attach the definition to each use of the name (e.g., knowing that "goto x" refers to the x in a specific scope, and thus to line 750 where x is defined in that scope).
Once you have name resolution (so you can look up the definition of x in "goto x"), you can construct a control flow graph.
You can do all three of these using attribute grammars, which are essentially computations directly over the AST. So, no, you don't need anything other than the AST to implement these. (You can learn more about attribute computations at my SO answer describing them.) Of course, anything you can do with a formal attribute computation you can also do by simply writing lots of recursive procedures that walk over the tree and compute the equivalent result; that's what attribute grammars are compiled to in practice.
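For concreteness, here is what the "recursive procedures over the tree" version looks like for sequencing and if-then-else. This is only a sketch with made-up node classes, not production code; break/goto handling would thread extra inherited context through the recursion exactly as an attribute grammar would.

import java.util.*;

// Toy AST node types (hypothetical; just enough to show the recursion).
abstract class Stmt {}
class Simple extends Stmt { String text; Simple(String t) { text = t; } }
class Seq extends Stmt { Stmt first, second; Seq(Stmt a, Stmt b) { first = a; second = b; } }
class IfThenElse extends Stmt {
    Stmt cond, thenPart, elsePart;
    IfThenElse(Stmt c, Stmt t, Stmt e) { cond = c; thenPart = t; elsePart = e; }
}

// Builds CFG successor edges with a single recursive walk over the AST.
class CfgBuilder {
    final Map<Stmt, Set<Stmt>> successors = new HashMap<>();

    private void addEdge(Stmt from, Stmt to) {
        successors.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    // Wires every node in 'preds' to the entry of 's'; returns the exits of 's'.
    Set<Stmt> build(Stmt s, Set<Stmt> preds) {
        if (s instanceof Seq) {
            Seq q = (Seq) s;
            return build(q.second, build(q.first, preds));
        }
        if (s instanceof IfThenElse) {
            IfThenElse ite = (IfThenElse) s;
            Set<Stmt> afterCond = build(ite.cond, preds);
            Set<Stmt> exits = new HashSet<>(build(ite.thenPart, afterCond));
            exits.addAll(build(ite.elsePart, afterCond));
            return exits;                       // control flow merges after the if
        }
        // Simple statement: single entry, single exit.
        for (Stmt p : preds) addEdge(p, s);
        return Set.of(s);
    }
}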
Here you can find some details on how to extract the control flow graph by attribute computation.
One messy glitch is computed gotos, i.e., a goto whose target is a label-valued expression (GCC allows this). To handle these right, you need to compute data flow for label variables (which technically requires you to construct a CFG first :-( ). You can avoid doing that data flow by computing a conservative answer: a computed goto can go to any label whose address is taken in the function in which the goto is found. You can compute this with an attribute grammar, too.
For languages which have only structured control flow, you can actually implement data flow analysis directly by attribute grammar.
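As a toy illustration of that last claim (a sketch only, with invented node classes, not anything from the answer above), here is "definite assignment" for a structured mini-language computed by one recursion over the AST, with set intersection at the if/else merge:

import java.util.*;

// Another toy AST, restricted to structured control flow.
abstract class Node {}
class Assign extends Node { String var; Assign(String v) { var = v; } }
class Block extends Node { List<Node> body; Block(Node... b) { body = List.of(b); } }
class If extends Node { Node thenPart, elsePart; If(Node t, Node e) { thenPart = t; elsePart = e; } }

class DefiniteAssignment {
    // Returns the variables definitely assigned after 'n', given those assigned
    // before it. 'before' flows down (inherited), the result flows up (synthesized);
    // no explicit CFG is ever materialized.
    static Set<String> after(Node n, Set<String> before) {
        if (n instanceof Assign) {
            Set<String> out = new HashSet<>(before);
            out.add(((Assign) n).var);
            return out;
        }
        if (n instanceof Block) {
            Set<String> state = before;
            for (Node child : ((Block) n).body) state = after(child, state);
            return state;
        }
        // If: a variable is definitely assigned only when both branches assign it.
        If i = (If) n;
        Set<String> out = new HashSet<>(after(i.thenPart, before));
        out.retainAll(after(i.elsePart, before));
        return out;
    }

    public static void main(String[] args) {
        Node prog = new Block(new Assign("x"), new If(new Assign("y"), new Assign("z")));
        System.out.println(after(prog, new HashSet<>()));   // prints [x]
    }
}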

Related

What does it mean to "express this puzzle as a CSP"?

What is meant by the following, for the attached image?
By labelling each cell with a variable, express the puzzle as a CSP. Hint:
recall that a CSP is composed of three parts.
I initially thought to just add a variable to each cell, like A, B, C, etc., and then constrain those cells, but I do not believe that is correct. I do not want the answer, just an explanation of what is required in terms of a CSP.
In my opinion, a CSP is best divided into two parts:
State the constraints. This is called the modeling part or model.
Search for solutions using enumeration predicates like labeling/2.
These parts are best kept separate by using a predicate which we call core relation and which has the following properties:
It posts the constraints, i.e., it expresses part (1) above.
Its last argument is the list of variables that still need to be labeled.
By convention, its name ends with an underscore _.
Having this distinction in place allows you to:
try different search strategies without the need to recompile your code
reason about termination properties of the core relation in isolation of any concrete (and often very costly) search.
I can see how some instructors may decompose part (1) into:
1a. stating the domains of the variables, using for example in/2 constraints
1b. stating the other constraints that hold among the variables.
In my view, this distinction is artificial, because in/2 constraints are constraints like all the other constraints in the modeling part; some instructors may nevertheless teach this separately, partly for historical reasons dating back to the time when CSP systems were not as dynamic as they are now.
Nowadays, you can typically post additional domain restrictions any time you like and freely mix in/2 constraints with other constraints in any order.
So, the parts that are expected from you are likely: (a) state in/2 constraints, (b) state further constraints and (c) use enumeration predicates to search for concrete solutions. It also appears that you already have the right idea about how to solve this concrete CSP with this method.

Design patterns/advice on building a rule engine

I need to build an app (Ruby) that allows the user to select one or more patterns and, when those patterns are matched, to carry out a set of actions.
While doing my research I've discovered the (new to me) field of rule-based systems; I've spent some time reading about it and it seems to be exactly the kind of functionality I need.
The app will be integrated with different web services and would allow rules like this one:
When Highrise contact is added and Zendesk ticket is created do add email to database
I had two ideas for building this. The first is to build some kind of DSL to be able to specify the rule conditions and build them on the fly from the user input.
The second is to build rule classes, each one having a pattern/matcher method and an action method. The pattern would evaluate the expression and return true or false, and the action would be executed if the match is positive.
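To make that second idea concrete, here is a minimal sketch (in Java rather than Ruby, purely for illustration; all names are invented, and the linear matching loop is the naive alternative to a RETE-style engine):

import java.util.List;
import java.util.Map;

// A rule couples a pattern (the condition) with an action to run on a match.
// 'event' stands for whatever data arrives from the integrated web services.
interface Rule {
    boolean matches(Map<String, Object> event);   // the pattern/matcher
    void execute(Map<String, Object> event);      // the action
}

// Example rule: "when a Highrise contact is added, store its email".
class ContactAddedRule implements Rule {
    public boolean matches(Map<String, Object> event) {
        return "highrise".equals(event.get("source"))
            && "contact_added".equals(event.get("type"));
    }
    public void execute(Map<String, Object> event) {
        System.out.println("store email: " + event.get("email"));
    }
}

// A naive engine: evaluate every rule against every event.
// (Commercial engines use RETE instead of this linear scan.)
class RuleEngine {
    void run(List<Rule> rules, List<Map<String, Object>> events) {
        for (Map<String, Object> event : events)
            for (Rule rule : rules)
                if (rule.matches(event)) rule.execute(event);
    }
}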
The rules will need to be persisted and then evaluated periodically.
Can anyone shed some light on this design or point somewhere where I can get more information on this?
Thanks
In commercial rule engines, e.g. Drools, FlexRule..., the pattern matching is handled by the RETE algorithm. Some of them also provide multiple engines for different kinds of logic, e.g. procedural, validation, inference, flow, workflow..., and they also provide DSL customization.
Rule sequencing and execution are handled based on the agenda and activations that can be defined on the engine, and a conflict resolution strategy helps you find the proper activation to fire.
I recommend using a commercial product hosted as a service, and a simple JSON/XML format to communicate with the rule server and execute your rules. This will probably give you a better result than creating your own. However, if you are interested in creating your own pattern-matching engine, consider the RETE algorithm and the agenda and activation mechanisms of complex production systems.
In the RETE algorithm you should consider implementing at least positive and negative conditions. In implementing RETE you need to implement beta and alpha memories, as well as join nodes that support left and right activations.
Do you think you could use a graph-based representation for your problem? I'm pretty sure that your problem can be treated as a graph-based one.
If yes, why don't you use a graph transformation system to define and apply your rules? The one I would recommend is GrGen.NET. The use of GrGen.NET builds on five steps:
Definition of the metamodel: Here, you define your building blocks, i.e. the types of graph nodes and graph edges.
Definition of the ruleset: This is where you can put your pattern detecting rules. Moreover, you can create rule encapsulating procedures to manipulate your graph-based data structure.
Compilation: Based on the previous two steps, a C#-assembly (DLL) is created. There should be a way to access such a DLL from Ruby.
Definition of a rule sequence: Rule sequences contain the structure in which individual rules are executed. Typically, it's a logical structure in which the rules are concatenated.
Graph transformation: Applying a rule sequence (via the generated DLL) results in the transformation of a graph, which can subsequently be exported, saved or further manipulated.
You can find a very good manual of GrGen.NET here: http://www.info.uni-karlsruhe.de/software/grgen/GrGenNET-Manual.pdf

How to store equivalences in a connected components labeling algorithm in Fortran

I have to implement a connected components labeling algorithm in Fortran. I have a clear idea of how to scan the matrix, but what about storing and recovering equivalence classes? I guess that in many other programming languages this is an easy task, but I have to do it in Fortran. How can I do it?
First edit: Following the pseudocode on Wikipedia for the connected components algorithm, what I have no idea how to do in Fortran is
linked[label] = union(linked[label], L)
Here are some fragments of an answer. It looks like you need to implement a data structure which represents a set of labels. The first decision you have to make is how to model a label. I see 3 obvious approaches:
Use integers.
Use character variables of length 1 (or 2 or whatever you want).
Define a type with whatever components you want it to have.
The second decision is how to implement a set of labels. I see 3 obvious approaches:
Use an array of labels (array of integers, array of character(len=2), array of type(label), it doesn't matter) whose size is fixed at compile time. You have to be fairly certain that the size you hard-wire is always going to be large enough. This is not a very appealing approach; I should probably not have mentioned it.
Use an array of labels whose size is set at run-time. This means using an allocatable array. You'll have to figure out how to set this to the right size at run-time, if it is possible at all.
Implement a type representing a set of labels. This type might, for example model a set as a linked list. But that is not the only way to model the set, the type might model the set of labels as an array, and do some fancy footwork to re-size the array if required. By defining a type, of course, you give yourself the freedom to change the internal representation of the set without modifying the code which uses the functionality exposed by the set type.
Depending on the choices you have made it should be quite straightforward to implement a union function to add a new label to an existing set of labels.
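To make the union function concrete, here is a sketch of approach 3 with the set stored in a growable array. It is written in Java purely for illustration; in Fortran the int array becomes an allocatable integer array, the re-size becomes an allocate plus move_alloc, and the whole thing becomes a derived type with type-bound procedures.

import java.util.Arrays;

// A set of integer labels backed by a growable array.
class LabelSet {
    private int[] labels = new int[4];   // storage, grown on demand
    private int count = 0;               // number of labels actually stored

    boolean contains(int label) {
        for (int i = 0; i < count; i++)
            if (labels[i] == label) return true;
        return false;
    }

    // union(set, L): add label L if it is not already present.
    void union(int label) {
        if (contains(label)) return;
        if (count == labels.length)                      // the "fancy footwork":
            labels = Arrays.copyOf(labels, 2 * count);   // re-size the array
        labels[count++] = label;
    }

    int[] toArray() { return Arrays.copyOf(labels, count); }
}

An array of such sets, indexed by label, then plays the role of linked[label] in the Wikipedia pseudocode.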
Note though, that there are many other ways to tackle this problem. You might, for example, start with a set of already-defined component labels and drop from the set the ones you don't need to use.
Since you seem to be new to Fortran, here's a list of language features you need to be familiar with to implement the foregoing.
How much of the Fortran 2003 standard your compiler implements.
Defining, and using, derived types.
Allocatable arrays, allocating arrays, moving allocations.
Arrays of derived types.
Type-bound procedures.
Pointers, and targets.

Languages with native / syntactical / inline graph support?

The graph is arguably the most versatile and valuable data structure of all. I can store single variables, lists, hashes etc., and of course graphs, with it.
Given this, are there any languages that offer inline / native graph support and syntax? I can create variables, arrays, lists and hashes inline in Ruby, Python and Javascript, but if I want a graph, I have to either manage the representation myself with a matrix / list, or select a library, and use the graph through method calls.
Why on earth is this still the case in 2010? And, practically, are there any languages out there which offer inline graph support and syntax?
The main problem with what you are asking is that a more general solution is not the best one for a specific problem. It's just average for all of them, not the best for any.
OK, you can store a list in a graph (as a degenerate case), but why would you do something like that? And how would you store a hashmap inside a graph? Why would you need such a structure?
And do not forget that a graph implementation must be chosen according to the operations you are going to perform on it; otherwise it would be like using a hashtable to store a list of values, or a list instead of a tree to store an ordered collection. You can use an adjacency matrix, an edge list or adjacency lists, each implementation with its own strengths and weaknesses.
Then, graphs can have many more properties than other collections of data: cyclic, acyclic, directed, undirected, bipartite, and so on; and for any specific case you can implement them in a different way (assuming some hypotheses about the graph you need). So having them in native syntax would be overkill, since you would need to configure them anyway (and the language would have to provide many implementations/optimizations).
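A tiny sketch of the point about representations: even the same two operations force different trade-offs depending on the encoding you pick (illustrative Java, no library):

import java.util.*;

// Adjacency lists: compact for sparse graphs, neighbour iteration is cheap,
// but an edge-existence test costs O(degree).
class ListGraph {
    Map<Integer, List<Integer>> adj = new HashMap<>();
    void addEdge(int u, int v) { adj.computeIfAbsent(u, k -> new ArrayList<>()).add(v); }
    boolean hasEdge(int u, int v) { return adj.getOrDefault(u, List.of()).contains(v); }
}

// Adjacency matrix: O(1) edge test, but O(n^2) memory and a fixed vertex count.
class MatrixGraph {
    boolean[][] adj;
    MatrixGraph(int n) { adj = new boolean[n][n]; }
    void addEdge(int u, int v) { adj[u][v] = true; }
    boolean hasEdge(int u, int v) { return adj[u][v]; }
}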
If everything is already made you remove the fun of developing :)
By the way just look for a language that allows you to write your own graph DSL and live with it!
Gremlin, a graph-based programming language: https://github.com/tinkerpop/gremlin/wiki
GrGen.NET (www.grgen.net) is a programming language for graph transformation plus an environment including a graphical debugger. You can define your graph model, the rewrite rules, and rule control with some nice special purpose languages and use the generated assemblies/C# code from any .NET language you like or from the supplied shell.
To understand why normal languages don't offer such a convenient/built-in interface to graphs, just take a look at the amount of code written for that project: the compiler alone is several man-years of work. That's a price tag too hefty for a feature/data structure only a minority of programmers ever need - so it's not included in general purpose programming languages.

Pattern name for flippable data structure?

I'm trying to think of a naming convention that accurately conveys what's going on within a class I'm designing. On a secondary note, I'm trying to decide between two almost-equivalent user APIs.
Here's the situation:
I'm building a scientific application, where one of the central data structures has three phases: 1) accumulation, 2) analysis, and 3) query execution.
In my case, it's a spatial modeling structure, internally using a KDTree to partition a collection of points in 3-dimensional space. Each point describes one or more attributes of the surrounding environment, with a certain level of confidence about the measurement itself.
After adding (a potentially large number of) measurements to the collection, the owner of the object will query it to obtain an interpolated measurement at a new data point somewhere within the applicable field.
The API will look something like this (the code is in Java, but that's not really important; the code is divided into three sections, for clarity):
// SECTION 1:
// Create the aggregation object, and get the zillion objects to insert...
ContinuousScalarField field = new ContinuousScalarField();
Collection<Measurement> measurements = getMeasurementsFromSomewhere();
// SECTION 2:
// Add all of the zillion objects to the aggregation object...
// Each measurement contains its xyz location, the quantity being measured,
// and a numeric value for the measurement. For example, something like
// "68 degrees F, plus or minus 0.5, at point 1.23, 2.34, 3.45"
for (Measurement m : measurements) {
field.add(m);
}
// SECTION 3:
// Now the user wants to ask the model questions about the interpolated
// state of the model. For example, "what's the interpolated temperature
// at point (3, 4, 5)
Point3d p = new Point3d(3, 4, 5);
Measurement result = field.interpolateAt(p);
For my particular problem domain, it will be possible to perform a small amount of incremental work (partitioning the points into a balanced KDTree) during SECTION 2.
And there will be a small amount of work (performing some linear interpolations) that can occur during SECTION 3.
But there's a huge amount of work (constructing a kernel density estimator and performing a Fast Gauss Transform, using Taylor series and Hermite functions, but that's totally beside the point) that must be performed between sections 2 and 3.
Sometimes in the past, I've just used lazy-evaluation to construct the data structures (in this case, it'd be on the first invocation of the "interpolateAt" method), but then if the user calls the "field.add()" method again, I have to completely discard those data structures and start over from scratch.
In other projects, I've required the user to explicitly call an "object.flip()" method to switch from "append mode" into "query mode". The nice thing about a design like this is that the user has better control over the exact moment when the hard-core computation starts. But it can be a nuisance for the API consumer to keep track of the object's current mode. And besides, in the standard use case, the caller never adds another value to the collection after starting to issue queries; data aggregation almost always fully precedes query preparation.
How have you guys handled designing a data structure like this?
Do you prefer to let an object lazily perform its heavy-duty analysis, throwing away the intermediate data structures when new data comes into the collection? Or do you require the programmer to explicitly flip the data structure from append-mode into query-mode?
And do you know of any naming convention for objects like this? Is there a pattern I'm not thinking of?
ON EDIT:
There seems to be some confusion and curiosity about the class I used in my example, named "ContinuousScalarField".
You can get a pretty good idea for what I'm talking about by reading these wikipedia pages:
http://en.wikipedia.org/wiki/Scalar_field
http://en.wikipedia.org/wiki/Vector_field
Let's say you wanted to create a topographical map (this is not my exact problem, but it's conceptually very similar). So you take a thousand altitude measurements over an area of one square mile, but your survey equipment has a margin of error of plus-or-minus 10 meters in elevation.
Once you've gathered all the data points, you feed them into a model which not only interpolates the values, but also takes into account the error of each measurement.
To draw your topo map, you query the model for the elevation of each point where you want to draw a pixel.
As for the question of whether a single class should be responsible for both appending and handling queries, I'm not 100% sure, but I think so.
Here's a similar example: HashMap and TreeMap classes allow objects to be both added and queried. There aren't separate interfaces for adding and querying.
Both classes are also similar to my example, because the internal data structures have to be maintained on an ongoing basis in order to support the query mechanism. The HashMap class has to periodically allocate new memory, re-hash all objects, and move objects from the old memory to the new memory. A TreeMap has to continually maintain tree balance, using the red-black-tree data structure.
The only difference is that my class will perform optimally if it can perform all of its calculations once it knows the data set is closed.
If an object has two modes like this, I would suggest exposing two interfaces to the client. If the object is in append mode, then you make sure that the client can only ever use the IAppendable implementation. To flip to query mode, you add a method to IAppendable such as AsQueryable. To flip back, call IQueryable.AsAppendable.
You can implement IAppendable and IQueryable on the same object, and keep track of the state in the same way internally, but having two interfaces makes it clear to the client what state the object is in, and forces the client to deliberately make the (expensive) switch.
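Sketched as Java interfaces, mirroring the names used above (the exact signatures are of course up to you):

// Append-only view: clients holding this cannot query.
interface IAppendable {
    void add(Measurement m);
    IQueryable asQueryable();    // the deliberate, expensive flip into query mode
}

// Query-only view: clients holding this cannot add.
interface IQueryable {
    Measurement interpolateAt(Point3d p);
    IAppendable asAppendable();  // flip back; derived structures get discarded
}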
I generally prefer to have an explicit change, rather than lazily recomputing the result. This approach makes the performance of the utility more predictable, and it reduces the amount of work I have to do to provide a good user experience. For example, if this occurs in a UI, where do I have to worry about popping up an hourglass, etc.? Which operations are going to block for a variable amount of time, and need to be performed in a background thread?
That said, rather than explicitly changing the state of one instance, I would recommend the Builder Pattern to produce a new object. For example, you might have an aggregator object that does a small amount of work as you add each sample. Then instead of your proposed void flip() method, I'd have an Interpolator interpolator() method that gets a copy of the current aggregation and performs all your heavy-duty math. Your interpolateAt method would be on this new Interpolator object.
If your usage patterns warrant, you could do simple caching by keeping a reference to the interpolator you create, and return it to multiple callers, only clearing it when the aggregator is modified.
This separation of responsibilities can help yield more maintainable and reusable object-oriented programs. An object that can return a Measurement at a requested Point is very abstract, and perhaps a lot of clients could use your Interpolator as one strategy implementing a more general interface.
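A sketch of that split, reusing the question's Measurement and Point3d types (the class bodies are hypothetical):

import java.util.ArrayList;
import java.util.List;

// Cheap to add to; hands out a fresh Interpolator on demand.
class Aggregator {
    private final List<Measurement> samples = new ArrayList<>();

    void add(Measurement m) {
        samples.add(m);                                     // small incremental work only
    }

    Interpolator interpolator() {
        return new Interpolator(new ArrayList<>(samples));  // copy, then heavy math
    }
}

// Built once from a snapshot of the samples; afterwards it only answers queries.
class Interpolator {
    Interpolator(List<Measurement> samples) {
        // construct the KD-tree, kernel density estimator, etc. (omitted here)
    }

    Measurement interpolateAt(Point3d p) {
        // cheap per-query interpolation against the prebuilt structures
        throw new UnsupportedOperationException("omitted in this sketch");
    }
}

The caching mentioned above then amounts to keeping the last Interpolator in a field of the Aggregator and clearing it inside add().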
I think that the analogy you added is misleading. Consider an alternative analogy:
Key[] data = new Key[...];
data[idx++] = new Key(...); /* Fast! */
...
Arrays.sort(data); /* Slow! */
...
boolean contains = Arrays.binarySearch(data, datum) >= 0; /* Fast! */
This can work like a set, and actually, it gives better performance than Set implementations (which are implemented with hash tables or balanced trees).
A balanced tree can be seen as an efficient implementation of insertion sort. After every insertion, the tree is in a sorted state. The predictable time requirements of a balanced tree are due to the fact the cost of sorting is spread over each insertion, rather than happening on some queries and not others.
The rehashing of hash tables does result in less consistent performance, which makes them inappropriate for certain applications (perhaps a real-time microcontroller). But even the rehashing operation depends only on the load factor of the table, not on the pattern of insertion and query operations.
For your analogy to hold strictly, you would have to "sort" (do the hairy math) your aggregator with each point you add. But it sounds like that would be cost prohibitive, and that leads to the builder or factory method patterns. This makes it clear to your clients when they need to be prepared for the lengthy "sort" operation.
Your objects should have one role and responsibility. In your case should the ContinuousScalarField be responsible for interpolating?
Perhaps you might be better off doing something like:
IInterpolator interpolator = field.GetInterpolator();
Measurement measurement = interpolator.InterpolateAt(...);
I hope this makes sense, but without fully understanding your problem domain it's hard to give you a more coherent answer.
"I've just used lazy-evaluation to construct the data structures" -- Good
"if the user calls the "field.add()" method again, I have to completely discard those data structures and start over from scratch." -- Interesting
"in the standard use case, the caller never adds another value to the collection after starting to issue queries" -- Whoops, false alarm, actually not interesting.
Since lazy eval fits your use case, stick with it. That's a very heavily used model because it is so delightfully reliable and fits most use cases very well.
The only reason for rethinking this is (a) the use case change (mixed adding and interpolation), or (b) performance optimization.
Since use case changes are unlikely, you might consider the performance implications of breaking up interpolation. For example, during idle time, can you precompute some values? Or with each add is there a summary you can update?
Also, a highly stateful (and not very meaningful) flip method isn't so useful to clients of your class. However, breaking interpolation into two parts might still be helpful to them -- and help you with optimization and state management.
You could, for example, break interpolation into two methods.
public void interpolateAt( Point3d p );
public Measurement interpolatedMeasurement();
This borrows the relational database Open and Fetch paradigm. Opening a cursor can do a lot of preliminary work, and may start executing the query, you don't know. Fetching the first row may do all the work, or execute the prepared query, or simply fetch the first buffered row. You don't really know. You only know that it's a two part operation. The RDBMS developers are free to optimize as they see fit.
Do you prefer to let an object lazily perform its heavy-duty analysis,
throwing away the intermediate data structures when new data comes
into the collection? Or do you require the programmer to explicitly
flip the data structure from append-mode into query-mode?
I prefer using data structures that allow me to incrementally add to them with "a little more work" per addition, and to incrementally pull out the data I need with "a little more work" per extraction.
Perhaps if you do some "interpolate_at()" call in the upper-right corner of your region, you only need to do calculations involving the points in that upper-right corner,
and it doesn't hurt anything to leave the other 3 quadrants "open" to new additions.
(And so on down the recursive KDTree).
Alas, that's not always possible -- sometimes the only way to add more data is to throw away all the previous intermediate and final results, and re-calculate everything again from scratch.
The people who use the interfaces I design -- in particular, me -- are human and fallible.
So I don't like using objects where those people must remember to do things in a certain way, or else things go wrong -- because I'm always forgetting those things.
If an object must be in the "post-calculation state" before getting data out of it,
i.e. some "do_calculations()" function must be run before the interpolateAt() function gets valid data,
I much prefer letting the interpolateAt() function check if it's already in that state,
running "do_calculations()" and updating the state of the object if necessary,
and then returning the results I expected.
Sometimes I hear people describe such a data structure as "freezing" the data, or "crystallizing" the data, or "compiling" it, or "putting the data into an immutable data structure".
One example is converting a (mutable) StringBuilder or StringBuffer into an (immutable) String.
I can imagine that for some kinds of analysis, you expect to have all the data ahead of time,
and pulling out some interpolated value before all the data has been put in would give wrong results.
In that case,
I'd prefer to set things up such that the "add_data()" function fails or throws an exception
if it (incorrectly) gets called after any interpolateAt() call.
I would consider defining a lazily-evaluated "interpolated_point" object that doesn't really evaluate the data right away, but only tells the program that the data at that point will be required sometime in the future.
The collection isn't actually frozen, so it's OK to continue adding more data to it,
up until the point when something actually extracts the first real value from some "interpolated_point" object,
which internally triggers the "do_calculations()" function and freezes the object.
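That lazy handle might look roughly like this (a sketch; freezeAndComputeIfNeeded() is an invented method standing in for the internal "do_calculations()" trigger):

// A lazy handle on a future interpolated value; names are invented for illustration.
class InterpolatedPoint {
    private final ContinuousScalarField field;
    private final Point3d point;

    InterpolatedPoint(ContinuousScalarField field, Point3d point) {
        this.field = field;
        this.point = point;
    }

    // Nothing heavy happens until the first real extraction.
    Measurement get() {
        field.freezeAndComputeIfNeeded();  // hypothetical: runs do_calculations()
                                           // once and freezes the collection
        return field.interpolateAt(point);
    }
}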
It might speed things up if you know not only all the data, but also all the points that need to be interpolated, all ahead of time.
Then you can throw away data that is "far away" from the interpolated points,
and only do the heavy-duty calculations in regions "near" the interpolated points.
For other kinds of analysis, you do the best you can with the data you have, but when more data comes in later, you want to use that new data in your later analysis.
If the only way to do that is to throw away all the intermediate results and recalculate everything from scratch, then that's what you have to do.
(And it's best if the object automatically handled this, rather than requiring people to remember to call some "clear_cache()" and "do_calculations()" function every time).
You could have a state variable. Have a method for starting the high level processing, which will only work if the STATE is in SECTION-1. It will set the state to SECTION-2, and then to SECTION-3 when it is done computing. If there's a request to the program to interpolate a given point, it will check if the state is SECTION-3. If not, it will request the computations to begin, and then interpolate the given data.
This way, you accomplish both - the program will perform its computations at the first request to interpolate a point, but can also be requested to do so earlier. This would be convenient if you wanted to run the computations overnight, for example, without needing to request an interpolation.
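Sketched in Java against the question's types (the method names and the exact policy are just illustrative):

// One possible shape for the state-variable idea.
enum Phase { SECTION_1_ACCUMULATING, SECTION_2_COMPUTING, SECTION_3_READY }

class StatefulField {
    private Phase phase = Phase.SECTION_1_ACCUMULATING;

    void add(Measurement m) {
        // one possible policy: refuse additions once computation has started
        if (phase != Phase.SECTION_1_ACCUMULATING)
            throw new IllegalStateException("cannot add after computation has started");
        // ... store the measurement ...
    }

    // May be called explicitly (e.g. overnight) or implicitly by interpolateAt().
    void compute() {
        if (phase == Phase.SECTION_3_READY) return;
        phase = Phase.SECTION_2_COMPUTING;
        // ... heavy-duty math ...
        phase = Phase.SECTION_3_READY;
    }

    Measurement interpolateAt(Point3d p) {
        if (phase != Phase.SECTION_3_READY) compute();   // first query triggers the work
        // ... cheap interpolation using the prebuilt structures ...
        return null;                                     // placeholder in this sketch
    }
}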

Resources