I'm trying to create a parser using Treetop that is somewhat recursive. An expression can be a number but it also can be an addition of expressions, so I wrote this:
grammar Language
rule expression
"(" _ expression _ ")" / addition / integer
end
rule addition
expression _ "+" _ expression
/
expression _ "-" _ expression
end
rule integer
'-'? _ [0-9]+
end
# space
rule _
' '*
end
end
That just doesn't work. Any time I'm trying to parse anything I get an exception "SystemStackError: stack level too deep" (stack overflow! yay!). Any ideas why? what's the correct way to specify such a recursive definition with Treetop?
You grammar is left-recursive: i.e. a expression can immediately be an addition which in its turn can be an expression etc. causing the parser to go in an infinite loop.
Try something like this instead (untested!):
grammar Language
rule expression
addition
end
rule addition
multiplication (_ [+-] _ multiplication)*
end
rule multiplication
unary (_ [*/%] _ unary)*
end
rule unary
"-"? _ atom
end
rule atom
number / "(" _ expression _ ")"
end
rule number
float / integer
end
rule float
[0-9]+ "." [0-9]+
end
rule integer
[0-9]+
end
rule _
' '*
end
end
Related
I am been using treetop for a while. I wrote rules following
http://thingsaaronmade.com/blog/a-quick-intro-to-writing-a-parser-using-treetop.html
I can parse my whole input string but i none of the other to_array function gets triggered other than the initial one.
Then I found https://whitequark.org/blog/2011/09/08/treetop-typical-errors/ which talk about AST squashing and I figured out that my rule is doing the same.
The first rule I have is
rule bodies
blankLine* interesting:(body+) blankLine* <Bodies>
end
and everything is getting gobbled up by body.
Can someone suggest me what can I do to fix this?
Edit
Adding code snippet:
grammar Sexp
rule bodies
blankLine* interesting:(body+) blankLine* <Bodies>
end
rule body
commentPortString (ifdef_blocks / interface)+ (blankLine / end_of_file) <Body>
end
rule interface
space? (intf / intfWithSize) space? newLine <Interface>
end
rule commentPortString
space? '//' space portString space? <CommentPortString>
end
rule portString
'Port' space? '.' newLine <PortString>
end
rule expression
space? '(' body ')' space? <Expression>
end
rule intf
(input / output) space wire:wireName space? ';' <Intf>
end
rule intfWithSize
(input / output) space? width:ifWidth space? wire:wireName space? ';' <IntfWithSize>
end
rule input
'input' <InputDir>
end
rule output
'output' <OutputDir>
end
rule ifdef_blocks
ifdef_line (interface / ifdef_block)* endif_line <IfdefBlocks>
end
rule ifdef_block
ifdef_line interface* endif_line <IfdefBlocks>
end
rule ifdef_line
space? (ifdef / ifndef) space+ allCaps space? newLine <IfdefLine>
end
rule endif_line
space? (endif) space? newLine <EndifLine>
end
rule ifdef
'`ifdef' <Ifdef>
end
rule ifndef
'`ifndef' <Ifndef>
end
rule endif
'`endif' <Endif>
end
rule ifWidth
'[' space? msb:digits space? ':' space? lsb:digits ']' <IfWidth>
end
rule digits
[0-9]+ <Digits>
end
rule integer
('+' / '-')? [0-9]+ <IntegerLiteral>
end
rule float
('+' / '-')? [0-9]+ (('.' [0-9]+) / ('e' [0-9]+)) <FloatLiteral>
end
rule string
'"' ('\"' / !'"' .)* '"' <StringLiteral>
end
rule identifier
[a-zA-Z\=\*] [a-zA-Z0-9_\=\*]* <Identifier>
end
rule allCaps
[A-Z] [A-Z0-9_]*
end
rule wireName
[a-zA-Z] [a-zA-Z0-9_]* <WireName>
end
rule non_space
!space .
end
rule space
[^\S\n]+
end
rule non_space
!space .
end
rule blankLine
space* newLine
end
rule not_newLine
!newLine .
end
rule newLine
[\n\r]
end
rule end_of_file
!.
end
end
Test string
// Port.
input CLK;
// Port.
input REFCLK;
// Port.
input [ 41:0] mem_power_ctrl;
output data;
EDIT: Adding more details
The test code is checked into:
https://github.com/justrajdeep/treetop_ruby_issue.
As you would see in my node_extensions.rb all the nodes except the Bodies raise an exception in the method to_array. But none of the exceptions trigger.
You call to_array on tree, which is a Bodies. That is the only thing you ever call to_array on, so no other to_array method will be called.
If you want to_array to be called on child nodes of the Bodies node, Bodies#to_array needs to call to_array on those child nodes. So if you want to call it on the Body nodes you labelled interesting, you should iterate over interesting and call .to_array on each element.
Try breaking (body+) into a new rule like this:
rule bodies
blankLine* interesting:interesting blankLine* <Bodies>
end
rule interesting
body+ <Interesting>
end
Otherwise, it would be helpful to see the SyntaxNode classes.
Good morning everyone,
I'm currently trying to describe some basic Ruby grammar but I'm now stuck with parse space?
I can handle x = 1 + 1,
but can't parser x=1+1,
how can I parser space?
I have tried add enough space after every terminal.
but it can't parse,give a nil.....
How can I fix it?
Thank you very much, have a nice day.
grammar Test
rule main
s assign
end
rule assign
name:[a-z]+ s '=' s expression s
{
def to_ast
Assign.new(name.text_value.to_sym, expression.to_ast)
end
}
end
rule expression
add
end
rule add
left:brackets s '+' s right:add s
{
def to_ast
Add.new(left.to_ast, right.to_ast)
end
}
/
minus
end
rule minus
left:brackets s '-' s right:minus s
{
def to_ast
Minus.new(left.to_ast, right.to_ast)
end
}
/
brackets
end
rule brackets
'(' s expression ')' s
{
def to_ast
expression.to_ast
end
}
/
term
end
rule term
number / variable
end
rule number
[0-9]+ s
{
def to_ast
Number.new(text_value.to_i)
end
}
end
rule variable
[a-z]+ s
{
def to_ast
Variable.new(text_value.to_sym)
end
}
end
rule newline
s "\n"+ s
end
rule s
[ \t]*
end
end
this code works
problem Solved!!!!
It's not enough to define the space rule, you have to use it anywhere there might be space. Because this occurs often, I usually use a shorter rule name S for mandatory space, and the lowercase version s for optional space.
Then, as a principle, I skip optional space first in my top rule, and again after every terminal that can be followed by space. Terminals here are strings, character sets, etc. So at the start of assign, and before the {} block on variable, boolean, number, and also after your '=', '-' and '+' literals, add a call to the rule s to skip any spaces.
This policy works well for me. It's a good idea to have a test case which has minimum space, and another case that has maximum space (in all possible places).
I'm trying to process what will eventually be Boolean logic using a grammar in Treetop-like Citrus for Ruby. I'm getting a recursion issue, but I'm unclear on exactly why. Here's the text I'm trying to process (it should have a newline at the very end):
COMMANDWORD # MYCOMMENT
Here's my Citrus grammar (intended to deal with more advanced stuff):
grammar Grammar
rule commandset
command+
end
rule command
identifier command_detail* comment_to_eol* "\n"
end
rule command_detail
assign_expr | expr
end
rule assign_expr
identifier ":=" expr
end
rule expr
# Stack overflow
or_expr | gtor_expr
# No problem!
# or_expr
end
rule or_expr
# Temporarily match everything except a comment...
[^#]+
# What I think will be correct in the future...
# gtor_expr "OR" expr
end
rule gtor_expr
and_expr | gtand_expr
end
rule and_expr
gtand_expr "AND" gtor_expr
end
rule gtand_expr
not_expr | gtnot_expr
end
rule not_expr
"NOT" gtnot_expr | gtand_expr
end
rule gtnot_expr
parens_expr | identifier
end
rule parens_expr
"(" expr ")"
end
rule identifier
ws* [a-zA-Z0-9]+ ws*
end
rule ws
[ ]
end
rule comment_to_eol
"#" [^\n]*
end
end
The important things are in the rule expr and the rule or_expr. I've altered or_expr so it matches everything except a comment. If I stick with the current expr rule I get a stack overflow. But if I switch it so it doesn't have the choice between or_expr and gtor_expr it works fine.
My understanding of the "choice" is that it will try to evaluate them in order. If the first choice fails, then it will try the second. In this case, the first choice is obviously capable of succeeding, so why do I get a stack overflow if I include a second choice that should never be taken?
the loop may result because gtand_expr -> not_expr, and not_expr -> gtand_expr.
I think you can replace not_expr's rule with
not_expr -> "NOT" not_expr | gtnot_expr
And you should try simpler rules using regexp operators:
expr -> orexpr
orexpr -> andexpr ("OR" andexpr)*
andexpr -> notexpr ("AND" notexpr)*
notexpr -> "NOT"? atomicexpr
atomicexpr -> id | "(" expr ")"
I am having trouble avoiding left-recursion in this simple expression parser I'm working on. Essentially, I want to parse the equation 'f x y' into two expressions 'f x' and '(f x) y' (with implicit parentheses). How can I do this while avoiding left-recursion and backtracking? Does there have to be an intermediate step?
#!/usr/bin/env ruby
require 'rubygems'
require 'treetop'
Treetop.load_from_string DATA.read
parser = ExpressionParser.new
p parser.parse('f x y').value
__END__
grammar Expression
rule equation
expression (w+ expression)*
end
rule expression
expression w+ atom
end
rule atom
var / '(' w* expression w* ')'
end
rule var
[a-z]
end
rule w
[\s\n\t\r]
end
end
You haven't given enough information about your desired result. In particular, do you expect "f(a b) y" to parse as "(f(a(b))) y"? I assume you do... which means that a function not followed by an open parenthesis has arity one.
So you want to say:
rule equation
expression w* var / expression w* parenthesised_list
end
rule parenthesised_list
'(' w* ( expression w* )+ ')'
end
If on the other hand you have external (to the grammar) knowledge of the arity of f, and you want to iterate "expression" exactly that many times - as happens in parsing TeX for example - then you will need to use a semantic predicate &{|s| ...} inside the iterated expression list). Beware that the argument passed to the block of a sempred is not a SyntaxNode (which cannot yet be constructed because this sequence sub-rule has not yet succeeded) but the accumulated array of nodes so far in the sequence. The truthiness of the block return value dictates the parse result and can stop the iteration.
Another tool you might consider using is lookahead (!stuff_I_dont_expect_to_follow or &stuff_that_must_follow).
You can also ask such questions in http://groups.google.com/group/treetop-dev
I have this working pair of rules in Treetop that the perfectionist in me believes should be one and only one rule, or maybe something more beautiful at least:
rule _
crap
/
" "*
end
rule crap
" "* "\\x0D\\x0A"* " "*
end
I'm parsing some expressions that every now and then ended up with "\x0D\x0A". Yeah, not "\r\n" but "\x0D\x0A". Something was double escaped at some point. Long story.
That rule works, but it's ugly and it bothers me. I tried this:
rule _
" "* "\\x0D\\x0A"* " "*
/
" "*
end
which caused
SyntaxError: (eval):1276:in `load_from_string': compile error
(eval):1161: class/module name must be CONSTANT
from /.../gems/treetop-1.4.9/lib/treetop/compiler/grammar_compiler.rb:42:in `load_from_string'
from /.../gems/treetop-1.4.9/lib/treetop/compiler/grammar_compiler.rb:35:in `load'
from /.../gems/treetop-1.4.9/lib/treetop/compiler/grammar_compiler.rb:32:in `open'
from /.../gems/treetop-1.4.9/lib/treetop/compiler/grammar_compiler.rb:32:in `load'
Ideally I would like to actually write something like:
rule _
(" " | "\\x0D\\x0A")*
end
but that doesn't work, and while we are at it, I also discovered that you can't have only one * per rule:
rule _
" "*
/
"\n"*
end
that will match " ", but never \n.
I see you're using three different OR chars: /, | and \ (of which only the first means OR).
This works fine:
grammar Language
rule crap
(" " / "\\x0D\\x0A")* {
def value
text_value
end
}
end
end
#!/usr/bin/env ruby
require 'rubygems'
require 'treetop'
require 'polyglot'
require 'language'
parser = LanguageParser.new
value = parser.parse(' \\x0D\\x0A \\x0D\\x0A ').value
print '>' + value + '<'
prints:
> \x0D\x0A \x0D\x0A <
You said "I also discovered that you can't have only one * per rule" (you mean: you CAN have), "that will match " ", but never \n".
Of course; the rule succeeds when it matches zero space characters. You could just use a + instead:
rule _
" "+
/
"\n"*
end
You could also parenthesise the space characters if you want to match any number of space-or-newline characters:
rule _
(" " / "\n")*
end
Your error "class/module name must be CONSTANT" is because the rule name is used as the prefix of a module name to contain any methods attached to your rule. A module name may not begin with an underscore, so you can't use methods in a rule whose name begins with an underscore.