Any tools can randomly generate the source code according to a language grammar? - random

A C program source code can be parsed according to the C grammar(described in CFG) and eventually turned into many ASTs. I am considering if such tool exists: it can do the reverse thing by firstly randomly generating many ASTs, which include tokens that don't have the concrete string values, just the types of the tokens, according to the CFG, then generating the concrete tokens according to the tokens' definitions in the regular expression.
I can imagine the first step looks like an iterative non-terminals replacement, which is randomly and can be limited by certain number of iteration times. The second step is just generating randomly strings according to regular expressions.
Is there any tool that can do this?

The "Data Generation Language" DGL does this, with the added ability to weight the probabilities of productions in the grammar being output.
In general, a recursive descent parser can be quite directly rewritten into a set of recursive procedures to generate, instead of parse / recognise, the language.

Given a context-free grammar of a language, it is possible to generate a random string that matches the grammar.
For example, the nearley parser generator includes an implementation of an "unparser" that can generate strings from a grammar.
The same task can be accomplished using definite clause grammars in Prolog. An example of a sentence generator using definite clause grammars is given here.

If you have a model of the grammar in a normalized form (all rules like this):
LHS = RHS1 RHS2 ... RHSn ;
and language prettyprinter (e.g., AST to text conversion tool), you can build one of these pretty easily.
Simply start with the goal symbol as a unit tree.
Repeat until no nonterminals are left:
Pick a nonterminal N in the tree;
Expand by adding children for the right hand side of any rule
whose left-hand side matches the nonterminal N
For terminals that carry values (e.g., variable names, numbers, strings, ...) you'll have to generate random content.
A complication with the above algorithm is that it doesn't clearly terminate. What you actually want to do is pick some limit on the size of your tree, and run the algorithm until the all nonterminals are gone or you exceed the limit. In the latter case, backtrack, undo the last replacement, and try something else. This gets you a bounded depth-first search for an AST of your determined size.
Then prettyprint the result. Its the prettyprinter part that is hard to get right.
[You can build all this stuff yourself including the prettyprinter, but it is a fair amount of work. I build tools that include all this machinery directly in a language-parameterized way; see my bio].
A nasty problem even with well formed ASTs is that they may be nonsensical; you might produce a declaration of an integer X, and assign a string literal value to it, for a language that doesn't allow that. You can probably eliminate some simple problems, but language semantics can be incredibly complex, consider C++ as an example. Ensuring that you end up with a semantically meaningful program is extremely hard; in essence, you have to parse the resulting text, and perform name and type resolution/checking on it. For C++, you need a complete C++ front end.

the problem with random generation is that for many CFGs, the expected length of the output string is infinite (there is an easy computation of the expected length using generating functions corresponding to the non-terminal symbols and equations corresponding to the rules of the grammar); you have to control the relative probabilities of the productions in certain ways to guarantee convergence; for example, sometimes, weighting each production rule for a non-terminal symbol inversely to the length of its RHS suffices
there is lot more on this subject in:
Noam Chomsky, Marcel-Paul Sch\"{u}tzenberger, ``The Algebraic Theory of Context-Free Languages'', pp.\ 118-161 in P. Braffort and D. Hirschberg (eds.), Computer Programming and Formal Systems, North-Holland (1963)
(see Wikipedia entry on Chomsky–Schützenberger enumeration theorem)

Related

An algorithm for compiler designing?

Recently I am thinking about an algorithm constructed by myself. I call it Replacment Compiling.
It works as follows:
Define a language as well as its operators' precedence, such as
(1) store <value> as <id>, replace with: var <id> = <value>, precedence: 1
(2) add <num> to <num>, replace with: <num> + <num>, precedence: 2
Accept a line of input, such as store add 1 to 2 as a;
Tokenize it: <kw,store><kw,add><num,1><kw,to><2><kw,as><id,a><EOF>;
Then scan through all the tokens until reach the end-of-file, find the operation with highest precedence, and "pack" the operation:
<kw,store>(<kw,add><num,1><kw,to><2>)<kw,as><id,a><EOF>
Replace the "sub-statement", the expression in parenthesis, with the defined replacement:
<kw,store>(1 + 2)<kw,as><id,a><EOF>
Repeat until no more statements left:
(<kw,store>(1 + 2)<kw,as><id,a>)<EOF>
(var a = (1 + 2))
Then evaluate the code with the built-in function, eval().
eval("var a = (1 + 2)")
Then my question is: would this algorithm work, and what are the limitations? Is this algorithm works better on simple languages?
This won't work as-is, because there's no way of deciding the precedence of operations and keywords, but you have essentially defined parsing (and thrown in an interpretation step at the end). This looks pretty close to operator-precedence parsing, but I could be wrong in the details of your vision. The real keys to what makes a parsing algorithm are the direction/precedence it reads the code, whether the decisions are made top-down (figure out what kind of statement and apply the rules) or bottom-up (assemble small pieces into larger components until the types of statements are apparent), and whether the grammar is encoded as code or data for a generic parser. (I'm probably overlooking something, but this should give you a starting point to make sense out of further reading.)
More typically, code is generally parsed using an LR technique (LL if it's top-down) that's driven from a state machine with look-ahead and next-step information, but you'll also find the occasional recursive descent. Since they're all doing very similar things (only implemented differently), your rough algorithm could probably be refined to look a lot like any of them.
For most people learning about parsing, recursive-descent is the way to go, since everything is in the code instead of building what amounts to an interpreter for the state machine definition. But most parser generators build an LL or LR compiler.
And I'm obviously over-simplifying the field, since you can see at the bottom of the Wikipedia pages that there's a smattering of related systems that partly revolve around the kind of grammar you have available. But for most languages, those are the big-three algorithms.
What you've defined is a rewriting system: https://en.wikipedia.org/wiki/Rewriting
You can make a compiler like that, but it's hard work and runs slowly, and if you do a really good job of optimizing it then you'll get conventional table-driven parser. It would be better in the end to learn about those first and just start there.
If you really don't want to use a parser generating tool, then the easiest way to write a parser for a simple language by hand is usually recursive descent: https://en.wikipedia.org/wiki/Recursive_descent_parser

Read Non-Binary Expression with custom functions

I want read in mathematical functions and interpret them.
So far I worked with binary expressions and used the Infix to Prefix Method to read my string (e.g. 4*3+1).
However, meanwhile I want to read in also more complex expressions that are not translatable into a binary tree.
Some Examples:
max(x_1,x_2,x_3,x_4,x_5) + max(y_1,y_2)
round(interpolate(x_1,x,y),2)
customfunction(x,y,z) + 4
I have some problems to find a way to translate the given string into a non-binary tree. How would be a good way to do this, are there some known methods?
Since I need to support some own custom functions I can not use any existing library.
I don't expect any code, I'm interested in the theory doing this.
Normally you'll need to define a grammar (a very simple one in this case) that defines what are the rules for parsing your text, and then generate a parser that is based on this grammar through one of the various libraries that exist. For a grammar so simple Antlr is surely overblown. There is a serie of articles here about writing a "Writing a Recursive Descent Parser" or if you search for peg (a family of grammar parsers) on nuget you'll find plenty of implementations.
Note that this branch of computer science is quite vast... You can start from here lexers vs parsers for the theoretical part.
Look at your custom non-binary functions the same way as you look at the binary expressions.
E.g. 2+3*4 translates to +(*(3,4),2) // + and * here are just function names
You can mix in a custom function:
E.g. a^3 + Max(a,b,c)*2 translates to: +(^(a,3), *(2, Max(a,b,c))
In your interpreter define what +(), ^(), Max(), your_custom_function() mean and what parameters (i.e. child nodes to in the tree) to expect. The tree will not be binary, but that does not really change how you create it and traverse it.

Stem comparsion algorithm

I'm writing a program that makes word declension for Polish language. In this language stems can vary in some cases (because of palatalization or mobile/fleeting e and other effects).
For example, we have word "karzeł" and it is basic dictionary form of word. It's stem is also 'karzeł'. But genitive form of this word is "karła" and stem is "karł". We can see here that 'e' dissapeared and 'rz' changes to 'r'.
Another example:
'uzda' -> stem 'uzd'
'uździe' -> stem 'uździ'
Alternation: 'zd' -> 'ździ'
I'd like to store in dictionary only basic form of stem ('karzeł' and 'uzd') and when I'll put in my program stem 'karł' or 'uździ' it will find proper basic stems. Alternations takes place only at the end of stem and contains maximum 4 letters of it.
Is there any algorithms that could do that? Levensthein distance treats all letters equally so if I type word 'barzeł' then the distance to stem 'karzeł' will be less than to stem 'karł'.
I thought also about neural networks but I'm not sure how to encode words (give each stem variation different id?).
Another idea is to write algorith which makes something like reversed alternation and creates set of possible stems and try to find them in dictionary.
I would like to highlight that I only want store basic form of stem and everything else makes on the fly.
First of all, I remember seeing a number of projects on Polish morphology around. So I would look at them first, before starting one of your own.
Regarding Levenshtein, as Pierre correctly noted in the comment, the distance function can be customized. And it should be. Let me put it this way: think of Levenshtein not as an algorithm of and in itself, but as a solution to a specific error model. First he suggests a model which says that when you are typing a word every letter can be either dropped or replaced by another one due to some random process (fingers not pressing the right keys). Then, his algorithm is just a generator of maximum likelihood solutions under this model. The more errors you allow, the smaller is the probability of this sequence of errors actually happening, the bigger is the score.
You (implicitly) state a very different hypothesis, though. That Polish stems may have certain flexibility at the end (some linguistic process that you do not fully understand within this framework). Then, when you strip your suffix (or something that looks like one), there are three options:
1) there is a chance that what you have here is just a different form of a stem you have stored in your dictionary, or
2) it is a completely different stem, or
3) you've stripped your suffix improperly and what you have is not stem at all.
You can heuristically estimate these probabilities by looking at how many letters in the beginning of the supposed stem match some dictionary entries, for example (how to find these entries is a related but different question). And then you can pick the guess that is the most plausible according to your metric/heuristic.
Now, note that you can use any algorithm to find the candidates in the dictionary. Including the Levenshtein algorithm - as long as you are reasonably sure that the right ones will be picked up. But obviously you are better off writing your own dictionary search algorithm that follows your own metric or emulates it. For example, by giving the biggest/prohibitive cost to the change of letters in the beginning of the word and reducing it as you go towards the end.

Left-recursive Grammar Identification

Often we would like to refactor a context-free grammar to remove left-recursion. There are numerous algorithms to implement such a transformation; for example here or here.
Such algorithms will restructure a grammar regardless of the presence of left-recursion. This has negative side-effects, such as producing different parse trees from the original grammar, possibly with different associativity. Ideally a grammar would only be transformed if it was absolutely necessary.
Is there an algorithm or tool to identify the presence of left recursion within a grammar? Ideally this might also classify subsets of production rules which contain left recursion.
There is a standard algorithm for identifying nullable non-terminals, which runs in time linear in the size of the grammar (see below). Once you've done that, you can construct the relation A potentially-starts-with B over all non-terminals A, B. (In fact, it's more normal to construct that relationship over all grammatical symbols, since it is also used to construct FIRST sets, but in this case we only need the projection onto non-terminals.)
Having done that, left-recursive non-terminals are all A such that A potentially-starts-with+ A, where potentially-starts-with+ is:
potentially-starts-with ∘ potentially-starts-with*
You can use any transitive closure algorithm to compute that relation.
For reference, to detect nullable non-terminals.
Remove all useless symbols.
Attach a pointer to every production, initially at the first position.
Put all the productions into a workqueue.
While possible, find a production to which one of the following applies:
If the left-hand-side of the production has been marked as an ε-non-terminal, discard the production.
If the token immediately to the right of the pointer is a terminal, discard the production.
If there is no token immediately to the right of the pointer (i.e., the pointer is at the end) mark the left-hand-side of the production as an ε-non-terminal and discard the production.
If the token immediately to the right of the pointer is a non-terminal which has been marked as an ε-non-terminal, advance the pointer one token to the right and return the production to the workqueue.
Once it is no longer possible to select a production from the work queue, all ε-non-terminals have been identified.
Just for fun, a trivial modification of the above algorithm can be used to do step 1. I'll leave it as an exercise (it's also an exercise in the dragon book). Also left as an exercise is the way to make sure the above algorithm executes in linear time.

What does syntax directed translation mean?

Can anyone, in simple terms, explain what does "Syntax Directed Translation" mean? I started to read the topic from Dragon Book but couldn't understand. The Wiki article didn't help either.
In simplest terms, 'Syntax Directed Translation' means driving the entire compilation (translation) process with the syntax recognizer (the parser).
Conceptually, the process of compiling a program (translating it from source code to machine code) starts with a parser that produces a parse tree, and then transforms that parse tree through a sequence of tree or graph transformations, each of which is largely independent, resulting in a final simplified tree or graph that is traversed to produce machine code.
This view, while nice in theory, has a drawback that if you try to implement it directly, enough memory to hold at least two copies of the entire tree or graph is needed. Back when the Dragon Book was written (and when a lot of this theory was hashed out), computer memories were measured in kilobytes, and 64K was a lot. So compiling large programs could be tricky.
With Syntax Directed Translation, you organize all of the graph transformations around the order in which the parser recognizes the parse tree. Instead of producing a complete parse tree, your parser builds little bits of it, and then feeds those bits to the subsequent passes of the compiler, ultimately producing a small piece of machine code, before continuing the parsing process to build the next piece of parse tree. Since only small amounts of the parse tree (or the subsequent graphs) exist at any time, much less memory is required. Since the syntax recognizer is the master sequencer controlling all of this (deciding the order in which things happen), this is called Syntax Directed Translation.
Since this is such an effective way of keeping down memory use, people even redesigned languages to make it easier to do -- the ideal being to have a "Single Pass" compiler that could in fact do the entire process from parsing to machine code generation in a single pass.
Nowadays, memory is not at such a premium, so there's less pressure to force everything into a single pass. Instead you generally use Syntax Direct Translation just for the front end, parsing the syntax, doing typechecking and other semantic checks, and a few simple transformations all from the parser and producing some internal form (three address code, trees, or dags of some kind) and then having separate optimization and back end passes that are independent (and so not syntax directed). Even in this case you might claim that these later passes are at least partly syntax directed, as the compiler may be organized to operate on large pieces of the input (such as entire functions or modules), pushing through all the passes before continuing with the next piece of input.
Tools like yacc are designed around the idea of Syntax Directed Translation -- the tool produces a syntax recognizer that directly runs fragments of code ('actions' in the tool parlance) as productions (fragments of the parse tree) are recognized, without ever creating an actual 'tree'. These actions can directly invoke what are logically later passes in the compiler, and then return to continue parsing. The imperative main loop that drives all of this is the parser's token reading state machine.
Actually No. Historically before the Dragon Book there were syntax directed compilers. Attending ACM SEGPlan meeting in the late 1960's I learned of several types of directed translation. Tree directed and graph directed translation were also discussed. I think these got muddled together in the Dragon Book though I have never owned the Dragon Book. My favorite book was Programming Systems and Languages by Saul Rosen. It is a collection of papers on compilers, operating systems and computer systems. I'll try to explain the early syntax directed compiler parser programming languages. The later ones producing trees were combined with tree directed code generating languages.
Early syntax directed compilers, translated source directly to stack machine code. The Borrows B5000 ALGOL compiler is an example.
A*(B+C) -> A,B,C,ADD,MPY
Schorre's META II domain specific parser programming language, compiler compiler, developed in the 1960s is an example of a syntax directed compiler. You can find the original META II paper in the ACM archive. META II avoids left recursion using $ postfix zero or more sequence operator and ( ) grouping.
EXPR = TERM $('+' TERM .OUT 'ADD'|'-' TERM .OUT 'SUB');
Later Schorre based metalanguage compilers translated to trees using stack based tree transformation operators :<node name> and !<number>.
EXPR = TERM $(('+':ADD|'-':SUB) TERM!2);
Except for TREEMETA that used [<number>] instead of !<number>. The above EXPR formula is basically the same as the META II EXPR except we have factored operators + and - recognition creating corresponding nodes and pushing the node onto the node stack. Then on recognizing the right TERM the tree constructor !2 creates a tree popping the top 2 parse stack <TERM>s and top node from the node stack to form a tree:
ADD or SUB
/ \ / \
TERM TERM TERM TERM
Tokens were recognized by supplied recognizers .ID .NUMBER and .STRING. Later replaced by token ".." and character class ":" formula in CWIC:
id .. let $(leter|dgt|+'_');
Tree directed compiler languages were combined with the syntax directed compilers to generate code. The CWIC compiler compiler developed at Systems Development Corporation included a LISP 2 based tree directed generator language. A short paper in CWIC can be found in the ACM archives.
In the parser programming languages you are programming a type of recursive decent parser. When you get to CWIC all the problems that today are attributed to recursive decent parsers were eliminated. There is no left recursion problem as the $ zero or more construct and programed tree construction eliminated the need of left recursion. You control the tree construction. A loop construct is used to produces a left handed tree and tail recursion a right handed tree. Though parsing formulas may generate no tree at all:
program = $declarations;
In the above the $ zero or more loop operator preceding declarations specifies that declarations is to be repeatably called as long as it returns success. The input source code being compiled is made up of any positive number of declarations. The declarations formula would then define the types of declarations. You might need external linkages declarations, data declarations, function or procedure code declarations.
declarations = linkage_decl | data_decl | code_decl;
The types of declarations each being a separate formula. The syntax language controls when semantic processing and code generation occurs. The program and declarations formulas above do not produce trees. They are simply controlling when and what language structure are parsed. These are neither LL oe LR parser sears. The provide unlimited (limited only by available memory) programed backtracking. They provide programed look ahead and peak ahead tests.
As a last example the following example including token and character class formula illustrates producing both left and right handed trees. Specifically exponentiation using tail recursion.
assign = id '=' expr ';' :ASSIGN!2 arith_gen[*1];
expr = term $(('+':ADD | '-':SUB) term !2);
term = factor $(('*':MPY | '//' :REM | '/':DIV) factor!2);
factor = ( id ('(' +[ arg $(',' arg ]+ ')' :CALL!2 | .EMPTY)
| number
| '(' expr ')'
) ('^' factor:EXP!2 | .EMPTY);
bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';
upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';
alpha: upr|lwr;
alphanum: alpha|dgt;
number .. dgt $dgt MAKENUM[];
id .. alpha $(alphanum|+'_');

Resources