How to generate code for AST tree parsed from a fictive language? - ruby

I read the article over at http://parsingintro.sourceforge.net/ and decided to try to rewrite it as an exercise in Ruby. Two reasons made me do this: I wanted to learn more about how to code Ruby (background in Java, PHP, C and some Python) and I wanted to learn more about parsers / compilers.
I have all the code posted at https://github.com/parse/boatcaptain. The AST is being generated; unfortunately, the author of the article doesn't get into concepts such as code generation and optimizations.
Can anyone help me by pointing me in the right direction on how to turn this AST into "code"? This is the AST that is generated
I wrote a calculator in Java a few years ago; it uses a lot of the same terminology and techniques as I used in this parser. But in the calculator I had methods for eval()-ing my "classes" and therefore getting output. Should I aim for doing something similar here? Source for calculator: https://github.com/parse/Uppsala-University-Courses/blob/master/ImpOOP-Calculator/src/Calculator.java
I would love feedback on my way of writing Ruby as well; I believe I still write Ruby like I would write Python, missing some nice advantages of Ruby.

Code Generation in its most basic form is simply traversing your intermediate form - the AST - and emitting corresponding instructions in your target language.
Firstly you'll need to choose a target language. What platform do you want your input file to run on? The main options open to you are:
A source-to-source translator
A compiler to native code
A compiler to bytecode (to be run just-in-time on a VM)
The choice of target language can determine the amount of work you'll have to put in to map between languages. Mapping object-oriented classes down to ASM could/would be tricky, for example. Mapping inherently procedural code to stack-based code could also prove a challenge.
Whichever language you choose, the problem will no doubt boil down to the following procedure: visit the nodes of your tree, and depending on their type, emit the corresponding instruction.
Say you come across the following node in your AST (as in the one you linked to):
      =
delta     /
      alpha beta
Seeing as it's an 'assignment' node, the code generator knows it has to evaluate the RHS of the tree before storing that value in the LHS, 'delta'. So we follow the RHS node down, and see it is a division operation. We then know we have to evaluate both the LHS and RHS of this node before dividing them and storing the result in 'delta'.
So now we move down the LHS, see it's a variable, and emit a 'load' instruction for 'alpha'. We go back up and then down the RHS, and likewise emit a 'load' for 'beta'. We then walk back up the tree (taking both alpha and beta with us), emit the divide instruction on both operands, store that result in a temporary, pass it up the tree to the assignment emitter, and let that store it in 'delta'.
So the resulting code for this snippet might be:
load alpha
load beta
tmp = div alpha beta
store delta tmp
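The walk described above can be sketched in a few lines. This is only a minimal illustration in Python (not Ruby), and everything in it is an assumption made for the example: the node classes Var, BinOp and Assign, the CodeGen class, and the mnemonics of the made-up target language are not taken from the linked repository.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str          # "/" in the example; "+", "-", "*" work the same way
    left: "Node"
    right: "Node"


@dataclass
class Assign:
    target: str
    value: "Node"


Node = Union[Var, BinOp, Assign]

# Hypothetical mnemonics for the made-up target language.
MNEMONICS = {"+": "add", "-": "sub", "*": "mul", "/": "div"}


class CodeGen:
    def __init__(self) -> None:
        self.lines: List[str] = []
        self.temps = 0

    def emit(self, node: Node) -> str:
        """Emit code for node and return the name that holds its value."""
        if isinstance(node, Var):
            self.lines.append(f"load {node.name}")
            return node.name
        if isinstance(node, BinOp):
            # Evaluate both operands first, then emit the operation itself.
            left = self.emit(node.left)
            right = self.emit(node.right)
            self.temps += 1
            tmp = f"tmp{self.temps}"
            self.lines.append(f"{tmp} = {MNEMONICS[node.op]} {left} {right}")
            return tmp
        if isinstance(node, Assign):
            # The RHS is emitted before the store into the LHS.
            value = self.emit(node.value)
            self.lines.append(f"store {node.target} {value}")
            return node.target
        raise TypeError(f"unknown node type: {type(node).__name__}")


def generate(node: Node) -> List[str]:
    gen = CodeGen()
    gen.emit(node)
    return gen.lines


tree = Assign("delta", BinOp("/", Var("alpha"), Var("beta")))
code = generate(tree)
print("\n".join(code))
```

Each node type gets its own branch, so supporting a new construct means adding one more case. The same structure would carry over to Ruby with a visit method per node class.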
As for pre-existing Ruby Code Generator libraries, I'm not aware of any, sorry. I hope this answer wasn't too general or simplistic for you.

If you write a compiler in pure Prolog, will it work as a decompiler also?

If you write a compiler in pure Prolog (no extra-logical bits), will it work as a decompiler also?
(A book I was reading opined on this, but I wonder if anyone has actually tried it)
I once wrote the equivalent of cdecl.org as a reversible program. It was a bit tricky, but I demonstrated that it could be done. (Somewhere in a pile of papers is the source code; one of these days, I hope to publish it on github.) The code was 2 or 3 times as compact as some existing code that used tools such as yacc/lex (bison/flex).
For something like cdecl -- where you're translating between char ** const * const x and declare x as const pointer to const pointer to pointer to char, compiling/decompiling makes sense. But what does it mean to translate from arbitrary machine code to source code? Even translating between some IR and source code doesn't seem to make a lot of sense.
This question needs to be much more precise, as we don't know what a "compiler" is (an extraneous-information-dumping transformation from a graph - the program in language 1 - to another graph - the algorithmically equivalent graph in language 2, I suppose). It is also not clear what "no extra-logical bits" implies. If you get rid of these, what kind of compilers can you still build?
Seen this way, compilation looks like pure deduction (Prolog running forward, or CHR) while decompilation looks like possibly very hard search (you will get a program among the gazillion possible ones, but it won't be pleasant to look at and will in no way resemble the one you had earlier). Someone who has a toolbox of theorems fresh in their mind can certainly say more.
But I would say not automagically, no. For one, there will be no guarantee that an infinite "recursion on the left" loop won't appear when "decompiling".
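To make the "deduction one way, search the other way" point concrete, here is a toy sketch in Python rather than Prolog. It assumes a deliberately lossy "compiler" that folds a constant expression into a single push instruction. The names compile_expr, sources and decompile are invented for this illustration and say nothing about how a real Prolog compiler would behave.

```python
from itertools import product

LITERALS = (1, 2, 3)
OPS = ("+", "*")


def evaluate(expr):
    """Evaluate a nested ("op", left, right) tuple or an int literal."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b


def compile_expr(expr):
    """Toy compiler: deterministic, but it throws the source shape away."""
    return ("push", evaluate(expr))


def sources(depth):
    """Enumerate every expression tree up to the given nesting depth."""
    if depth == 0:
        return list(LITERALS)
    smaller = sources(depth - 1)
    combos = [(op, l, r) for op, l, r in product(OPS, smaller, smaller)]
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(smaller + combos))


def decompile(target, depth=1):
    """'Decompile' by brute-force search over candidate sources."""
    return [e for e in sources(depth) if compile_expr(e) == target]


target = compile_expr(("+", 1, 3))
candidates = decompile(target)
print(target, candidates)
```

Going forward there is exactly one result. Going backward, even this tiny language gives several sources for the same output, and the list grows quickly as the search depth increases.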

How to apply Grammatical Evolution string to a solution

I am learning Grammatical Evolution, but one thing that I can't seem to grasp is how to use the strings that are evolved from grammar into solving an actual problem. Is it converted into a neural network or converted into an equation, or something else? How does it receive inputs and print out outputs?
Grammatical Evolution (GE) makes a distinction between genotype and phenotype (the genotype–phenotype distinction), which means an evolved genotype is not a solution by itself, but maps to a solution.
Mutations and crossover are performed on genotypes, but to evaluate the fitness, a genotype must first be transformed into a phenotype. In Grammatical Evolution this means generating a string that conforms to the chosen grammar. This solution string is then executed, and the result of the execution is evaluated to estimate the fitness of the solution.
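As a rough sketch of that genotype-to-phenotype step, the snippet below uses the commonly described modulo rule: each integer codon picks a production for the leftmost non-terminal. The grammar, the function names and the choice to consume a codon on every expansion are assumptions made for this example. Real GE systems differ in these details.

```python
# Hypothetical toy grammar: non-terminals map to lists of productions.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"], ["<const>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["x"]],
    "<const>": [["1"], ["2"]],
}


def map_genotype(genotype, start="<expr>", max_wraps=2):
    """Expand the leftmost non-terminal using codon % number_of_rules."""
    symbols = [start]
    output = []
    used = 0
    limit = len(genotype) * (max_wraps + 1)  # codons may be reused ("wrapping")
    while symbols:
        symbol = symbols.pop(0)
        if symbol not in GRAMMAR:
            output.append(symbol)  # terminal: goes straight into the phenotype
            continue
        if used >= limit:
            return None  # mapping failed: ran out of codons
        rules = GRAMMAR[symbol]
        codon = genotype[used % len(genotype)]
        used += 1
        symbols = list(rules[codon % len(rules)]) + symbols
    return " ".join(output)


def fitness(phenotype, cases):
    """Sum of absolute errors over (x, expected) pairs; lower is better."""
    error = 0
    for x, expected in cases:
        # eval is tolerable here only because the grammar controls the text.
        error += abs(eval(phenotype, {"__builtins__": {}}, {"x": x}) - expected)
    return error


phenotype = map_genotype([0, 1, 7, 3, 2, 5])
print(phenotype, fitness(phenotype, [(1, 2), (3, 6)]))
```

Here the inputs arrive as values bound to x, and the output is whatever the expression evaluates to. A genotype whose mapping never terminates returns None and would simply be given the worst fitness.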
How to actually execute the generated solution?
It highly depends on the implementation of a GE system.
If it generates solutions in some real programming language, they should be compiled and/or executed with the corresponding toolchain, run with some test input, and the output evaluated to estimate the fitness.
If a GE system is able to execute a solution internally, no external toolchain is involved. It might be convenient to generate a syntax tree-like structure according to the grammar (instead of unstructured text), because it's quite easy to execute such a structure.
How to execute a syntax tree?
There exists an entire class of so-called tree-walk interpreters: not very performant, but reasonably simple to implement. Usually such an interpreter first parses a source text and builds a syntax tree, then executes it; but in a GE system it is possible to generate a syntax tree directly, so no parsing is involved.
I can suggest the "A Tree-Walk Interpreter" part of the freely available book "Crafting Interpreters" as a good example of constructing such an interpreter.
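For a feel of how small such an interpreter can be, here is a minimal sketch in Python. The tuple-based node layout ("const", "var", "binop") is an assumption made for this example, not a format prescribed by any GE system or by the book.

```python
import operator

BINARY = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(node, env):
    """Recursively evaluate a syntax tree; env maps input names to values."""
    kind = node[0]
    if kind == "const":
        return node[1]
    if kind == "var":
        return env[node[1]]
    if kind == "binop":
        _, op, left, right = node
        return BINARY[op](evaluate(left, env), evaluate(right, env))
    raise ValueError(f"unknown node kind: {kind!r}")


# The tree for x * x + 1, as a GE system might generate it directly.
tree = ("binop", "+", ("binop", "*", ("var", "x"), ("var", "x")), ("const", 1))
outputs = [evaluate(tree, {"x": x}) for x in (0, 1, 2)]
print(outputs)
```

The inputs are passed in through env and the output is the return value, so the fitness function only has to compare outputs with the expected results.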

When to reuse functions?

I have a function in my program that generates random strings.
func randString(s []rune, l int) string
s is a slice of runes containing the possible characters in the string. I pass
in a rune slice of both capital and lowercase alphabetic characters. l
determines the length of the string. This works great. But I also need to
generate random hex strings for html color codes.
It seems all sources say that it's good programming practice to reuse code. So I
made another []rune that held [0-9a-f] and fed that into randString. That
was before I realized that the stdlib already includes formatting verbs for int
types that suit me perfectly.
In practice, is it better to reuse my randString function or code a separate
(more efficient) function? I would generate a single random int and Sprintf it
rather than having to loop and generate 6 random ints which randString does.
1) If there is an exact solution in the standard library, you should almost always choose to use that.
Because:
The standard library is tested. So it does what it says (or what we expect it to do). Even if there is a bug in it, it will be discovered (by you or by others) and will get fixed without your work/effort.
The standard library is written as idiomatic Go. Chances are it's faster even if it does a little more than what you need compared to the solution you could write.
The standard library improves (or may improve) over time. Your program may get faster just because an implementation was improved in a new Go release, without any effort on your part.
The solution is presented (which means it's ready and requires no time from you).
The standard library is well and widely known, so your code will be easier to understand by others and by you later on.
If you've already imported the package (or will in the near future), this means zero or minimal overhead as libraries are statically linked, so the function you need is already linked to your program (to the compiled executable binary).
2) If there is a solution provided by the standard library but it is a general solution to similar problems and/or offers more than what you need:
That means it's likely not the optimal solution for you, as it may use more memory and/or run more slowly than your own solution could.
You need to decide if you're willing to accept that small performance loss for the gains listed above. This also depends on how, and how often, you need to use it (e.g. if it's a one-time call, it shouldn't matter; if it's called very frequently in an endless loop, it should be examined carefully).
3) And at the other end: you should avoid using a solution provided by the standard library if it wasn't designed to solve your problem...
If it just happens that its "side-effect" solves your problem: Even if the current implementation would be acceptable, if it was designed for something else, future improvements to it could render your usage of it completely useless or could even break it.
Not to mention it would confuse other developers trying to read, improve or use your code (you included, after a certain amount of time).
As a side note: this question is exactly about the function you're trying to create: How to generate a random string of a fixed length in golang? I've presented multiple very efficient solutions there.
This is fairly subjective and not Go-specific, but I think you shouldn't reuse code just for the sake of reuse. The more code you reuse, the more dependencies you create between different parts of your app, and as a result it becomes more difficult to maintain and modify. Code that is easy to understand and modify is much more important, especially if you work in a team.
For your particular example I would do the following.
If a random color is generated only once in your package/application then using fmt.Sprintf("#%06x", rand.Intn(256*256*256)) is perfectly fine (as suggested by Dave C).
If random colors are generated in multiple places, I would create a function func randColor() string and call it. Note that you can then optimize the randColor implementation however you like without changing the rest of the code. For example, you could implement randColor using randString initially and then switch to a more efficient implementation later.

ANTLR's tree-grammar AST graphical view

I'm currently building a JavaScript compiler in ANTLR and Java.
I use ANTLR's tree-grammar for generating ASTs. (Still in doubt whether this is smarter than a heterogeneous approach with a manually defined abstract class for generating nodes, but that's another topic).
My problem is that when I have parsed some input, let's say var x = 5;, this is internally represented with VARDECL as root, x as left child and 5 as right child.
I now have the option to print this tree using the toStringTree() command, which outputs (VARDECL x 5). This representation gets quite hard to comprehend in larger programs, so I was wondering if there exists a third-party tool that takes this textual tree representation as input and can output a nice graphical model of the tree? (Or do I have to implement that as well?)
Regards Sune.
Check out this previous Q&A on how to create a graphical tree of your AST using Graphviz' DOT language.
Just in case you're writing your own JavaScript grammar, have a look at the list of grammars on the ANTLR wiki: there are many ECMA/JS grammars available that you can use.
Lastly, you may want to have a look at this previous Q&A where I posted an answer that shows how to evaluate a language (expressions, in this case) with a tree grammar using custom tree nodes. Of course, you'll have many more kinds of nodes because the language is more complex (assignments, functions, scopes, etc.), but you could get started with that example.
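If you only have the text printed by toStringTree(), a small script can turn it into DOT. The sketch below is in Python rather than Java and is not the code from the linked Q&A. It assumes the input is a well-formed (ROOT child ...) string whose token texts contain no spaces or parentheses, and the helper names parse and to_dot are made up for this example.

```python
import re


def parse(text):
    """Parse '(ROOT child ...)' into nested (label, [children]) tuples."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return (token, [])  # a leaf
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1  # skip the closing ")"
        return (label, children)

    return node()


def to_dot(tree):
    """Emit one DOT node per tree node and one edge per parent/child pair."""
    lines = ["digraph ast {"]
    counter = 0

    def walk(n):
        nonlocal counter
        label, children = n
        my_id = counter
        counter += 1
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  n{my_id} [label="{escaped}"];')
        for child in children:
            child_id = walk(child)
            lines.append(f"  n{my_id} -> n{child_id};")
        return my_id

    walk(tree)
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(parse("(VARDECL x 5)"))
print(dot)
```

Saving the output to a file such as ast.dot and running dot -Tpng ast.dot -o ast.png (with Graphviz installed) should then render the picture.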

Pseudocode interpreter?

Like lots of you guys on SO, I often write in several languages. And when it comes to planning stuff, (or even answering some SO questions), I actually think and write in some unspecified hybrid language. Although I used to be taught to do this using flow diagrams or UML-like diagrams, in retrospect, I find "my" pseudocode language has components of C, Python, Java, bash, Matlab, perl, Basic. I seem to unconsciously select the idiom best suited to expressing the concept/algorithm.
Common idioms might include Java-like braces for scope, pythonic list comprehensions or indentation, C++-like inheritance, C#-style lambdas, matlab-like slices and matrix operations.
I noticed that it's actually quite easy for people to recognise exactly what I'm trying to do, and quite easy for people to intelligently translate into other languages. Of course, that step involves considering the corner cases, and the moments where each language behaves idiosyncratically.
But in reality, most of these languages share a subset of keywords and library functions which generally behave identically - maths functions, type names, while/for/if etc. Clearly I'd have to exclude many 'odd' languages like lisp, APL derivatives, but...
So my questions are,
1. Does code already exist that recognises the programming language of a text file? (Surely this must be a less complicated task than eclipse's syntax trees or than google translate's language guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?
2. Is it theoretically possible to create a single interpreter or compiler that recognises what language idiom you're using at any moment and (maybe "intelligently") executes or translates to a runnable form. And flags the corner cases where my syntax is ambiguous with regards to behaviour. Immediate difficulties I see include: knowing when to switch between indentation-dependent and brace-dependent modes, recognising funny operators (like *pointer vs *kwargs) and knowing when to use list vs array-like representations.
3. Is there any language or interpreter in existence, that can manage this kind of flexible interpreting?
4. Have I missed an obvious obstacle to this being possible?
edit
Thanks all for your answers and ideas. I am planning to write a constraint-based heuristic translator that could, potentially, "solve" code for the intended meaning and translate into real python code. It will notice keywords from many common languages, and will use syntactic clues to disambiguate the human's intentions - like spacing, brackets, optional helper words like let or then, context of how variables are previously used etc, plus knowledge of common conventions (like capital names, i for iteration, and some simplistic limited understanding of naming of variables/methods e.g containing the word get, asynchronous, count, last, previous, my etc). In real pseudocode, variable naming is as informative as the operations themselves!
Using these clues it will create assumptions as to the implementation of each operation (like 0/1 based indexing, when should exceptions be caught or ignored, what variables ought to be const/global/local, where to start and end execution, and what bits should be in separate threads, notice when numerical units match / need converting). Each assumption will have a given certainty - and the program will list the assumptions on each statement, as it coaxes what you write into something executable!
For each assumption, you can 'clarify' your code if you don't like the initial interpretation. The libraries issue is very interesting. My translator, like some IDE's, will read all definitions available from all modules, use some statistics about which classes/methods are used most frequently and in what contexts, and just guess! (adding a note to the program to say why it guessed as such...) I guess it should attempt to execute everything, and warn you about what it doesn't like. It should allow anything, but let you know what the several alternative interpretations are, if you're being ambiguous.
It will certainly be some time before it can manage such unusual examples as @Albin Sunnanbo's ImportantCustomer example. But I'll let you know how I get on!
I think that is quite useless for everything but toy examples and strict mathematical algorithms. For everything else the language is not just the language. There are lots of standard libraries and whole environments around the languages. I think I write almost as many lines of library calls as I write "actual code".
In C# you have .NET Framework, in C++ you have STL, in Java you have some Java libraries, etc.
The differences between those libraries are too big to be just syntactic nuances.
<subjective>
There have been attempts at unifying the constructs of different languages into a "unified syntax". That was called a 4GL language and never really took off.
</subjective>
As a side note, I have seen a code example about a page long that was valid as C#, Java and JavaScript code. That can serve as an example of where it is impossible to determine the actual language used.
Edit:
Besides, the whole purpose of pseudocode is that it does not need to compile in any way. The reason you write pseudocode is to create a "sketch", however sloppy you like.
foreach c in ImportantCustomers{== OrderValue >=$1M}
SendMailInviteToSpecialEvent(c)
Now tell me what language it is and write an interpreter for that.
1. To detect what programming language is used: Detecting programming language from a snippet
2. I think it should be possible. The approach in 1. could be leveraged to do this, I think. I would try to do it iteratively: detect the syntax used in the first line/clause of code, "compile" it to intermediate form based on that detection, along with any important syntax (e.g. begin/end wrappers). Then the next line/clause etc. Basically write a parser that attempts to recognize each "chunk". Ambiguity could be flagged by the same algorithm.
3. I doubt that this has been done. It seems like learning to write e.g. Python-compatible pseudocode would carry a much lower cognitive load than trying to debug the cases where your interpreter fails.
4. a. I think the biggest problem is that most pseudocode is invalid in any language. For example, I might completely skip object initialization in a block of pseudocode because for a human reader it is almost always straightforward to infer. But for your case it might be completely invalid in the language syntax of choice, and it might be impossible to automatically determine e.g. the class of the object (it might not even exist). Etc.
b. I think the best you can hope for is an interpreter that "works" (subject to 4a) for your pseudocode only, no-one else's.
Note that I don't think that 4a and 4b are necessarily obstacles to it being possible. I just think it won't be useful for any practical purpose.
Recognizing what language a program is in is really not that big a deal. Recognizing the language of a snippet is more difficult, and recognizing snippets that aren't clearly delimited (what do you do if four lines are Python and the next one is C or Java?) is going to be really difficult.
Assuming you got the lines assigned to the right language, doing any sort of compilation would require specialized compilers for all languages that would cooperate. This is a tremendous job in itself.
Moreover, when you write pseudo-code you aren't worrying about the syntax. (If you are, you're doing it wrong.) You'll wind up with code that simply can't be compiled because it's incomplete or even contradictory.
And, assuming you overcame all these obstacles, how certain would you be that the pseudo-code was being interpreted the way you were thinking?
What you would have would be a new computer language, that you would have to write correct programs in. It would be a sprawling and ambiguous language, very difficult to work with properly. It would require great care in its use. It would be almost exactly what you don't want in pseudo-code. The value of pseudo-code is that you can quickly sketch out your algorithms, without worrying about the details. That would be completely lost.
If you want an easy-to-write language, learn one. Python is a good choice. Use pseudo-code for sketching out how processing is supposed to occur, not as a compilable language.
An interesting approach would be a "type-as-you-go" pseudocode interpreter. That is, you would set the language to be used up front, and then it would attempt to convert the pseudo code to real code, in real time, as you typed. An interactive facility could be used to clarify ambiguous stuff and allow corrections. Part of the mechanism could be a library of code which the converter tried to match. Over time, it could learn and adapt its translation based on the habits of a particular user.
People who program all the time will probably prefer to just use the language in most cases. However, I could see the above being a great boon to learners, "non-programmer programmers" such as scientists, and for use in brainstorming sessions with programmers of various languages and skill levels.
-Neil
Programs interpreting human input need to be given the option of saying "I don't know." The language PL/I is a famous example of a system designed to find a reasonable interpretation of anything resembling a computer program that could cause havoc when it guessed wrong: see http://horningtales.blogspot.com/2006/10/my-first-pli-program.html
Note that in the later language C++, when it resolves possible ambiguities it limits the scope of the type coercions it tries, and that it will flag an error if there is not a unique best interpretation.
I have a feeling that the answer to 2. is NO. All I need to prove it false is a code snippet that can be interpreted in more than one way by a competent programmer.
"Does code already exist that recognises the programming language of a text file?"
Yes, the Unix file command.
"(Surely this must be a less complicated task than eclipse's syntax trees or than google translate's language guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?"
As far as I can tell, SO has a one-size-fits-all syntax highlighter that tries to combine the keywords and comment syntax of every major language. Sometimes it gets it wrong:
def median(seq):
    """Returns the median of a list."""
    seq_sorted = sorted(seq)
    if len(seq) & 1:
        # For an odd-length list, return the middle item
        return seq_sorted[len(seq) // 2]
    else:
        # For an even-length list, return the mean of the 2 middle items
        return (seq_sorted[len(seq) // 2 - 1] + seq_sorted[len(seq) // 2]) / 2
Note that SO's highlighter assumes that // starts a C++-style comment, but in Python it's the integer division operator.
This is going to be a major problem if you try to combine multiple languages into one. What do you do if the same token has different meanings in different languages? Similar situations are:
Is ^ exponentiation like in BASIC, or bitwise XOR like in C?
Is || logical OR like in C, or string concatenation like in SQL?
What is 1 + "2"? Is the number converted to a string (giving "12"), or is the string converted to a number (giving 3)?
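One way to deal with such clashes is to refuse to guess. The sketch below, in Python, evaluates + under two conventions (string concatenation as in JavaScript, numeric addition as in Perl) and only returns an answer when both agree. The Ambiguous exception, the helper names and the choice of just two conventions are assumptions made for this illustration.

```python
class Ambiguous(Exception):
    """Raised instead of guessing when the conventions disagree."""


def to_number(value):
    # Assumption: strings always hold integer literals in this sketch.
    return int(value) if isinstance(value, str) else value


def plus_concat(a, b):
    # JavaScript-like: a string operand turns "+" into concatenation.
    if isinstance(a, str) or isinstance(b, str):
        return str(a) + str(b)
    return a + b


def plus_numeric(a, b):
    # Perl-like: "+" is always numeric, so strings become numbers.
    return to_number(a) + to_number(b)


CONVENTIONS = {"concatenate": plus_concat, "numeric": plus_numeric}


def plus(a, b):
    """Return a + b only if every convention gives the same answer."""
    results = {name: rule(a, b) for name, rule in CONVENTIONS.items()}
    if len(set(results.values())) > 1:
        raise Ambiguous(f"{a!r} + {b!r} could mean {results}")
    return next(iter(results.values()))


print(plus(1, 2))
try:
    plus(1, "2")
except Ambiguous as err:
    print("I don't know:", err)
```

A pseudocode interpreter built this way would stop and ask the author which reading was intended instead of silently picking one.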
"Is there any language or interpreter in existence, that can manage this kind of flexible interpreting?"
On another forum, I heard a story of a compiler (IIRC, for FORTRAN) that would compile any program regardless of syntax errors. If you had the line
= Y + Z
The compiler would recognize that a variable was missing and automatically convert the statement to X = Y + Z, regardless of whether you had an X in your program or not.
This programmer had a convention of starting comment blocks with a line of hyphens, like this:
C ----------------------------------------
But one day, they forgot the leading C, and the compiler choked trying to add dozens of variables between what it thought were subtraction operators.
"Flexible parsing" is not always a good thing.
To create a "pseudocode interpreter," it might be necessary to design a programming language that allows user-defined extensions to its syntax. There already are several programming languages with this feature, such as Coq, Seed7, Agda, and Lever. A particularly interesting example is the Inform programming language, since its syntax is essentially "structured English."
The Coq programming language allows "syntax extensions", so the language can be extended to parse new operators:
Notation "A /\ B" := (and A B).
Similarly, the Seed7 programming language can be extended to parse "pseudocode" using "structured syntax definitions." The while loop in Seed7 is defined in this way:
syntax expr: .while.().do.().end.while is -> 25;
Alternatively, it might be possible to "train" a statistical machine translation system to translate pseudocode into a real programming language, though this would require a large corpus of parallel texts.
