How to build syntax trees to represent code snippets if they cannot be compiled successfully?
Can I use PCFG to represent the non-compilable code?
I'm trying to perform Natural Language Processing (NLP) analysis on source code, and especially on Ruby files. In particular, I want to extract identifiers and comments, considering the structure of the code.
My first attempt was using off-the-shelf NLP libraries, such as Lucene or spaCy. However, I was not able to remove all the noise coming from keywords, literals, and the other typical elements of source code.
My second attempt is to obtain the AST of a particular piece of code and then extract some parts of it. There are tools and libraries for a number of languages, but I'm not able to find anything specific for parsing Ruby code. So far, my main option is to use ANTLR 4 and tailor a Ruby-like grammar (Corundum) so that it also works with OOP.
Is there a more straightforward path to what I'm looking for?
I have written an LR(1) compiler-compiler, and to me table generation is pretty straightforward. However, today I found myself wondering whether there is a general algorithm for generating a recursive descent parser. I know there are tools such as JavaCC that do this, but I'm more interested in the general steps of this generation. Thanks in advance.
I read the article over at http://parsingintro.sourceforge.net/ and decided to try to rewrite it as an exercise in Ruby. Two reasons made me do this: I wanted to learn more about how to code Ruby (my background is in Java, PHP, C and some Python), and I wanted to learn more about parsers and compilers.
I have all the code posted at https://github.com/parse/boatcaptain. The AST is being generated; unfortunately, the author of the article doesn't get into concepts such as code generation and optimization.
Can anyone point me in the right direction on how to turn this AST into "code"? This is the AST that is generated
I wrote a calculator in Java a few years ago; it uses much of the same terminology and many of the same techniques as this parser. In the calculator, though, I had methods for eval()-ing my "classes" and thereby getting output. Should I aim for something similar here? Source for calculator: https://github.com/parse/Uppsala-University-Courses/blob/master/ImpOOP-Calculator/src/Calculator.java
I would love feedback on my way of writing Ruby as well; I believe I still write Ruby like I would write Python, missing some nice advantages of Ruby.
Code Generation in its most basic form is simply traversing your intermediate form - the AST - and emitting corresponding instructions in your target language.
Firstly you'll need to choose a target language. What platform do you want your input file to run on? The main options open to you are:
A source-to-source translator
A compiler to native code
A compiler to bytecode (to be run just-in-time on a VM)
The choice of target language can determine the amount of work you'll have to put in to map between languages. Mapping object-oriented classes down to assembly could be tricky, for example. Mapping inherently procedural code to stack-based code could also prove a challenge.
Whichever language you choose, the problem will no doubt boil down to the following procedure: visit the nodes of your tree, and depending on their type, emit the corresponding instruction.
Say you come across the following node in your AST (as in the one you linked to):
      =
delta   /
    alpha beta
Seeing as it's an 'assignment' node, the Code Generator then knows it has to evaluate the RHS of the tree before sticking that value into the LHS, 'delta'. So we follow the RHS node down, and see it is a division operation. We then know we have to evaluate both the LHS and RHS of this node, before dividing them, and sticking the result in 'delta'.
So now we move down the LHS, see it's a variable, and we emit a 'load' instruction for 'alpha'. We go back up and then down the RHS, and likewise emit a 'load' for 'beta'. We then walk back up the tree (taking both alpha and beta with us), emit the divide instruction on both the operands, store that result, pass it up the tree to the assignment emitter, and let that store it in 'delta'.
So the resulting code for this snippet might be:
load alpha
load beta
tmp = div alpha beta
store delta tmp
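The walk described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the node classes (Var, BinOp, Assign), the CodeGen class, and the numbered temporaries (tmp1, tmp2, ...) are hypothetical names invented here, and the instruction syntax simply mirrors the listing above rather than any real target.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str            # one of "+", "-", "*", "/"
    left: "Node"
    right: "Node"


@dataclass
class Assign:
    target: str
    value: "Node"


Node = Union[Var, BinOp, Assign]

# Hypothetical mnemonics, chosen to match the listing above.
OPCODES = {"+": "add", "-": "sub", "*": "mul", "/": "div"}


class CodeGen:
    def __init__(self) -> None:
        self.code: List[str] = []
        self.temp_count = 0

    def new_temp(self) -> str:
        self.temp_count += 1
        return f"tmp{self.temp_count}"

    def emit(self, node: Node) -> str:
        """Emit instructions for node and return the name holding its value."""
        if isinstance(node, Var):
            self.code.append(f"load {node.name}")
            return node.name
        if isinstance(node, BinOp):
            # Evaluate both operands first, then combine them.
            left = self.emit(node.left)
            right = self.emit(node.right)
            tmp = self.new_temp()
            self.code.append(f"{tmp} = {OPCODES[node.op]} {left} {right}")
            return tmp
        if isinstance(node, Assign):
            # Evaluate the RHS before storing into the LHS.
            value = self.emit(node.value)
            self.code.append(f"store {node.target} {value}")
            return node.target
        raise TypeError(f"unknown node type: {type(node).__name__}")


def generate(node: Node) -> List[str]:
    gen = CodeGen()
    gen.emit(node)
    return gen.code


tree = Assign("delta", BinOp("/", Var("alpha"), Var("beta")))
program = generate(tree)
print("\n".join(program))
```

Each node type gets its own emitting branch, which is the "visit the nodes and emit depending on their type" procedure; a larger generator would typically split these branches into one method per node type.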
As for pre-existing Ruby Code Generator libraries, I'm not aware of any, sorry. I hope this answer wasn't too general or simplistic for you.
At which point does the compiler build the syntax tree? How does it form the tree, and how does it translate the tree while building the executable?
A compiler that builds a syntax tree does so during the parsing step, typically by generating a tree node for each grammar rule that matches the input stream.
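As a rough illustration of building the tree during parsing, here is a small hand-written recursive descent parser in Python. The grammar, the tuple-based node shape, and the names tokenize, Parser and parse are all assumptions made for this sketch; a real compiler front end would have a proper lexer, error recovery, and richer node types. Each grammar rule is one method, and a node is created whenever an operator rule matches (parentheses only group, so they leave no node behind).

```python
import re
from typing import List, Optional, Tuple, Union

Tree = Union[str, Tuple]


def tokenize(text: str) -> List[str]:
    # Simplistic lexer: numbers, identifiers, and a few one-character
    # operators. Any other character is silently skipped.
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/=()]", text)


class Parser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected: Optional[str] = None) -> str:
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    # assignment := IDENT '=' expr
    def assignment(self) -> Tree:
        target = self.take()
        self.take("=")
        return ("=", target, self.expr())

    # expr := term (('+' | '-') term)*
    def expr(self) -> Tree:
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    # term := factor (('*' | '/') factor)*
    def term(self) -> Tree:
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    # factor := IDENT | NUMBER | '(' expr ')'
    def factor(self) -> Tree:
        if self.peek() == "(":
            self.take("(")
            node = self.expr()
            self.take(")")
            return node
        token = self.take()
        if not (token[0].isalnum() or token[0] == "_"):
            raise SyntaxError(f"unexpected token {token!r}")
        return token


def parse(text: str) -> Tree:
    parser = Parser(tokenize(text))
    tree = parser.assignment()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return tree


print(parse("delta = alpha / beta"))
```

For the input shown, the result is the nested tuple ('=', 'delta', ('/', 'alpha', 'beta')), which is the kind of tree that later phases analyze and translate.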
Code generation requires considerable analysis of the tree to understand types, operations, opportunities for optimization, etc. Often this is hard to do well on the tree directly, so other intermediate representations are used (triples, static single assignment, ...). Often even these intermediate representations are inappropriate for machine code generation, so some kind of representation of machine instructions (such as RTL) might be constructed.
The point is that trees aren't the only representation the compiler uses to generate code.
It is well worth your trouble to read an introductory compiler textbook (Aho and Ullman, "Compilers") to get more details.
If I wanted to learn about pattern recognition in general what would be a good place to start (recommend a book)?
Also, does anybody have any experience/knowledge on how to go about applying these algorithms to find abstraction patterns in programs? (repeated code, chunks of code that do the same thing, but in slightly different ways, etc.)
Thanks
Edit: I don't mind mathematically intensive books. In fact, that would be a good thing.
If you are reasonably mathematically confident, then either of Chris Bishop's books, "Pattern Recognition and Machine Learning" or "Neural Networks for Pattern Recognition", is very good for learning about pattern recognition.
It helps if you have access to the parse tree generated during compilation. You can then look for pieces of the tree that are similar while ignoring nodes deeper than the level you are examining. For example, you can pick out nodes that multiply two sub-expressions, ignoring the contents of those sub-expressions. The same logic applies to a collection of nodes: to find a multiplication of two sub-expressions that are themselves additions of further sub-expressions, first look for multiplies, then check whether the two nodes underneath each multiply are additions, ignoring anything deeper.
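The multiply-of-two-additions search can be tried out with Python's standard ast module, which exposes the parse tree of Python source. This is a minimal sketch: the helper names is_binop and find_product_of_sums and the sample snippet are made up for illustration, and the same idea carries over to any language whose parse tree you can get hold of.

```python
import ast
from typing import List


def is_binop(node: ast.AST, op_type: type) -> bool:
    """True if node is a binary operation using the given operator class."""
    return isinstance(node, ast.BinOp) and isinstance(node.op, op_type)


def find_product_of_sums(source: str) -> List[int]:
    """Return line numbers of expressions shaped like (x + y) * (z + w).

    Only the top two levels of each candidate are inspected; whatever
    sits below the two additions is ignored.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (is_binop(node, ast.Mult)
                and is_binop(node.left, ast.Add)
                and is_binop(node.right, ast.Add)):
            hits.append(node.lineno)
    return sorted(hits)


# The sample is only parsed, never executed, so undefined names are fine.
SAMPLE = """\
area = (w + pad) * (h + pad)
total = price * (1 + tax)
volume = (a + f(x)) * (b + c[0] ** 2)
"""

print(find_product_of_sums(SAMPLE))
```

Lines 1 and 3 of the sample match even though their operands look nothing alike, because the check stops at the two addition nodes; line 2 is skipped because its left operand is a plain name.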
I'd suggest looking at the code of some open source project (e.g. FindBugs or SIM) that does the kind of thing you're talking about.
If you're working in one of the supported languages, IntelliJ IDEA has a really smart structural search and replace that would fit your problem.
Other interesting projects are PMD and Eclipse.
Eclipse uses ASTs (abstract syntax trees) for all source code in any project. Tools can then register for certain types of ASTs (like Java source) and get a preprocessed view where they can add additional information (like links to documentation, error markers, etc).
Another project you can look into is Duplo - it's an open-source/GPL project, so you can pore over their approach by grabbing the code from SourceForge.
Clone Detective is specific to .NET and Visual Studio, but it finds duplicate code in your project. I've found that it does report some false positives, but it could be a good place to start.
One kind of pattern is code that has been cloned by copy and paste methods. See CloneDR for a tool that automatically finds such code in spite of variations in layout and even changes in the body of the clone, by comparing abstract syntax trees for the language in question.
CloneDR works with a variety of languages: C, C++, C#, Java, JavaScript, PHP, COBOL, Python, ... The website shows clone detection reports for a variety of programming languages.
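To give a feel for AST-based clone detection in general, here is a toy Python sketch. It is not CloneDR's algorithm; it is a deliberately crude stand-in that replaces every identifier and literal with a placeholder and then groups functions whose normalized trees are identical. The names Normalize and find_clones and the sample code are invented for this example.

```python
import ast
import copy
from collections import defaultdict
from typing import List


class Normalize(ast.NodeTransformer):
    """Blank out names and literals so only the tree shape remains."""

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        self.generic_visit(node)   # normalize arguments and body first
        node.name = "_"
        return node

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = "_"
        return node

    def visit_Name(self, node: ast.Name) -> ast.AST:
        node.id = "_"
        return node

    def visit_Constant(self, node: ast.Constant) -> ast.AST:
        node.value = 0             # crude: all literals are treated alike
        return node


def find_clones(source: str) -> List[List[str]]:
    """Group function names whose normalized ASTs are identical."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # Work on a copy so the original names stay available.
            shape = ast.dump(Normalize().visit(copy.deepcopy(node)))
            groups[shape].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


SAMPLE = """\
def area(w, h):
    pad = 2
    return (w + pad) * (h + pad)

def size(x, y):
    margin = 5
    return (x + margin) * (y + margin)

def perimeter(w, h):
    return 2 * (w + h)
"""

print(find_clones(SAMPLE))
```

Because layout, comments, and naming never reach the AST dump, area and size are reported as clones while perimeter is not. The placeholder trick is lossy (it cannot tell a + b from a + a) and only catches exact structural matches, so near-miss clones with edited bodies need a similarity measure over subtrees rather than exact equality.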