matching tag pairs in Treetop grammar - ruby

I don't want a repeat of the Cthulhu answer, but I want to match up pairs of opening and closing HTML tags using Treetop. Using this grammar, I can match opening tags and closing tags, but now I want a rule to tie them both together. I've tried the following, but using this makes my parser go on forever (infinite loop):
rule html_tag_pair
html_open_tag (!html_close_tag (html_tag_pair / '' / text / newline /
whitespace))+ html_close_tag <HTMLTagPair>
end
I was trying to base this off of the recursive parentheses example and the negative lookahead example on the Treetop Github page. The other rules I've referenced are as follows:
rule newline
[\n\r] {
def content
:newline
end
}
end
rule tab
"\t" {
def content
:tab
end
}
end
rule whitespace
(newline / tab / [\s]) {
def content
:whitespace
end
}
end
rule text
[^<]+ {
def content
[:text, text_value]
end
}
end
rule html_open_tag
"<" html_tag_name attribute_list ">" <HTMLOpenTag>
end
rule html_empty_tag
"<" html_tag_name attribute_list whitespace* "/>" <HTMLEmptyTag>
end
rule html_close_tag
"</" html_tag_name ">" <HTMLCloseTag>
end
rule html_tag_name
[A-Za-z0-9]+ {
def content
text_value
end
}
end
rule attribute_list
attribute* {
def content
elements.inject({}){ |hash, e| hash.merge(e.content) }
end
}
end
rule attribute
whitespace+ html_tag_name "=" quoted_value {
def content
{elements[1].content => elements[3].content}
end
}
end
rule quoted_value
('"' [^"]* '"' / "'" [^']* "'") {
def content
elements[1].text_value
end
}
end
I know I'll need to allow for matching single opening or closing tags, but if a pair of HTML tags exist, I'd like to get them together as a pair. It seemed cleanest to do this by matching them with my grammar, but perhaps there's a better way?

Here is a really simple grammar that uses a semantic predicate to match the closing tag to the starting tag.
grammar SimpleXML
rule document
(text / tag)*
end
rule text
[^<]+
end
rule tag
"<" [^>]+ ">" (text / tag)* "</" [^>]+ &{|seq| seq[1].text_value == seq[5].text_value } ">"
end
end

You can only do this by using either a separate rule for each HTML tag pair or a semantic predicate (sempred). That is, you save the opening tag in one sempred, then accept a closing tag in another sempred only if it is the same tag. This is much harder to do in Treetop than it should be, because there's no convenient place to save the context and you can't peek up the parser stack, but it is possible.
BTW, the same problem occurs in parsing MIME boundaries (and in Markdown). I haven't checked Mikel's implementation in ActionMailer (probably he uses a nested Mime parser for that), but it is possible in Treetop.
In http://github.com/cjheath/activefacts/blob/master/lib/activefacts/cql/parser.rb I save context in a fake input stream - you can see what methods it has to support - because "input" is available on all SyntaxNodes. I have a different kind of reason for using sempreds there, but some of the techniques are applicable.
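The "save the opening tag, then accept only the same closing tag" idea can be sketched outside Treetop as a small hand-written recursive-descent parser. This is only an illustration in plain Ruby using the standard library's StringScanner, not Treetop code: the names parse_nodes and parse_html_pairs are made up for this sketch, and attributes, empty tags and unpaired tags are deliberately not handled.

```ruby
require 'strscan'

# Hypothetical helper (not part of Treetop): parses a run of text and nested
# tags into nested arrays, and raises when a closing tag does not match.
def parse_nodes(scanner, open_name = nil)
  nodes = []
  until scanner.eos?
    if scanner.scan(%r{</([A-Za-z0-9]+)>})
      # Accept a closing tag only if it names the tag saved on the way in.
      raise ArgumentError, "unexpected </#{scanner[1]}>" unless scanner[1] == open_name
      return nodes
    elsif scanner.scan(/<([A-Za-z0-9]+)>/)
      name = scanner[1] # save the opening tag name before recursing
      nodes << [:pair, name, parse_nodes(scanner, name)]
    elsif (text = scanner.scan(/[^<]+/))
      nodes << [:text, text]
    else
      raise ArgumentError, "stray '<' at offset #{scanner.pos}"
    end
  end
  raise ArgumentError, "missing </#{open_name}>" if open_name
  nodes
end

def parse_html_pairs(source)
  parse_nodes(StringScanner.new(source))
end

p parse_html_pairs('<b>bold <i>both</i></b> tail')
# => [[:pair, "b", [[:text, "bold "], [:pair, "i", [[:text, "both"]]]]], [:text, " tail"]]
```

Here the call stack plays the role of the saved context: each recursive call remembers which tag it is waiting for, which is exactly the information a Treetop sempred has no convenient place to keep.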

Related

Treetop parser : how to handle spaces?

Good morning everyone,
I'm currently trying to describe some basic Ruby grammar, but I'm stuck on how to parse spaces.
I can handle x = 1 + 1,
but I can't parse x=1+1.
How can I parse spaces?
I have tried adding optional space after every terminal,
but it doesn't parse; I just get nil.
How can I fix it?
Thank you very much, have a nice day.
grammar Test
rule main
s assign
end
rule assign
name:[a-z]+ s '=' s expression s
{
def to_ast
Assign.new(name.text_value.to_sym, expression.to_ast)
end
}
end
rule expression
add
end
rule add
left:brackets s '+' s right:add s
{
def to_ast
Add.new(left.to_ast, right.to_ast)
end
}
/
minus
end
rule minus
left:brackets s '-' s right:minus s
{
def to_ast
Minus.new(left.to_ast, right.to_ast)
end
}
/
brackets
end
rule brackets
'(' s expression ')' s
{
def to_ast
expression.to_ast
end
}
/
term
end
rule term
number / variable
end
rule number
[0-9]+ s
{
def to_ast
Number.new(text_value.to_i)
end
}
end
rule variable
[a-z]+ s
{
def to_ast
Variable.new(text_value.to_sym)
end
}
end
rule newline
s "\n"+ s
end
rule s
[ \t]*
end
end
This code works.
Problem solved!
It's not enough to define the space rule, you have to use it anywhere there might be space. Because this occurs often, I usually use a shorter rule name S for mandatory space, and the lowercase version s for optional space.
Then, as a principle, I skip optional space first in my top rule, and again after every terminal that can be followed by space. Terminals here are strings, character sets, etc. So at the start of assign, and before the {} block on variable, boolean, number, and also after your '=', '-' and '+' literals, add a call to the rule s to skip any spaces.
This policy works well for me. It's a good idea to have a test case which has minimum space, and another case that has maximum space (in all possible places).
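As a rough illustration of that policy outside Treetop, here is a tiny hand-written parser in plain Ruby (StringScanner from the standard library). The class TinyAssignParser and its methods are invented for this sketch and are not the asker's grammar; the point is only that space is skipped once at the top and again after every terminal, and that a minimum-space and a maximum-space input give the same result.

```ruby
require 'strscan'

# Hypothetical mini-parser (not Treetop) for inputs such as "x = 1 + 1".
class TinyAssignParser
  def initialize(source)
    @s = StringScanner.new(source)
  end

  def parse
    skip_space # optional space first, in the top rule
    name = token(/[a-z]+/) or return nil
    token(/=/) or return nil
    value = expression or return nil
    @s.eos? ? [:assign, name.to_sym, value] : nil
  end

  private

  # Optional space: the counterpart of the grammar's `s` rule.
  def skip_space
    @s.skip(/[ \t]*/)
  end

  # Match one terminal, then skip any space that follows it.
  def token(pattern)
    text = @s.scan(pattern)
    skip_space if text
    text
  end

  def expression
    left = term or return nil
    while (op = token(/[+-]/))
      right = term or return nil
      left = [op.to_sym, left, right]
    end
    left
  end

  def term
    if (digits = token(/[0-9]+/))
      [:number, digits.to_i]
    elsif (name = token(/[a-z]+/))
      [:variable, name.to_sym]
    end
  end
end

minimum = TinyAssignParser.new('x=1+1').parse
maximum = TinyAssignParser.new("  x \t=  1  +\t1  ").parse
p minimum            # => [:assign, :x, [:+, [:number, 1], [:number, 1]]]
p minimum == maximum # => true
```

Because the space handling lives in one place (token), no individual rule has to remember it, which is the same effect the answer gets by calling s after every terminal.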

How do I break up a string around "{tags}"?

I am writing a function which can have two potential forms of input:
This is {a {string}}
This {is} a {string}
I call the sub-strings wrapped in curly-brackets "tags". I could potentially have any number of tags in a string, and they could be nested arbitrarily deep.
I've tried writing a regular expression to grab the tags, which of course fails on the nested tags, grabbing {a {string}, missing the second curly bracket. I can see it as a recursive problem, but after staring at the wrong answer too long I feel like I'm blind to seeing something really obvious.
What can I do to separate out the potential tags into parts so that they can be processed and replaced?
The More Complicated Version
def parseTags( oBody, szText )
if szText.match(/\{(.*)\}/)
szText.scan(/\{(.*)\}/) do |outers|
outers.each do |blah|
if blah.match(/(.*)\}(.*)\{(.*)/)
blah.scan(/(.*)\}(.*)\{(.*)/) do |inners|
inners.each do |tags|
szText = szText.sub("\{#{tags}\}", parseTags( oBody, tags ))
end
end
else
szText = szText.sub("\{#{blah}\}", parseTags( oBody, blah ))
end
end
end
end
if szText.match(/(\w+)\.(\w+)(?:\.([A-Za-z0-9.\[\]": ]*))/)
func = $1+"_"+$2
begin
szSub = self.send func, oBody, $3
rescue Exception=>e
szSub = "{Error: Function #{$1}_#{$2} not found}"
$stdout.puts "DynamicIO Error Encountered: #{e}"
end
szText = szText.sub("#{$1}.#{$2}#{$3!=nil ? "."+$3 : ""}", szSub)
end
return szText
end
This was the result of tinkering too long. It's not clean, but it did work for a case similar to "1" - {help.divider.red.sys.["{pc.login}"]} is replaced with ---------------[ Duwnel ]---------------. However, {pc.attr.str.dotmode} {ansi.col.red}|{ansi.col.reset} {pc.attr.pre.dotmode} {ansi.col.red}|{ansi.col.reset} {pc.attr.int.dotmode} implodes brilliantly, with random streaks of red and swatches of missing text.
To explain, anything marked {ansi.col.red} marks an ansi red code, reset escapes the color block, and {pc.attr.XXX.dotmode} displays a number between 1 and 10 in "o"s.
As others have noted, this is a perfect case for a parsing engine. Regular expressions don't tend to handle nested pairs well.
Treetop is an awesome PEG parser that you might be interested in taking a look at. The main idea is that you define everything that you want to parse (including whitespace) inside rules. The rules allow you to recursively parse things like bracket pairs.
Here's an example grammar for creating arrays of strings from nested bracket pairs. Usually grammars are defined in a separate file, but for simplicity I included the grammar at the end and loaded it with Ruby's DATA constant.
require 'treetop'
Treetop.load_from_string DATA.read
parser = BracketParser.new
p parser.parse('This is {a {string}}').value
#=> ["This is ", ["a ", ["string"]]]
p parser.parse('This {is} a {string}').value
#=> ["This ", ["is"], " a ", ["string"]]
__END__
grammar Bracket
rule string
(brackets / not_brackets)+
{
def value
elements.map{|e| e.value }
end
}
end
rule brackets
'{' string '}'
{
def value
elements[1].value
end
}
end
rule not_brackets
[^{}]+
{
def value
text_value
end
}
end
end
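For comparison, the same recursion can be sketched without a parser generator. The following is a plain-Ruby illustration using StringScanner; parse_brackets is a made-up helper name, and it is only meant to reproduce the two results shown above, not to be a drop-in replacement for the grammar.

```ruby
require 'strscan'

# Hypothetical helper: returns nested arrays shaped like the grammar's `value`.
def parse_brackets(scanner, depth = 0)
  parts = []
  until scanner.eos?
    if (text = scanner.scan(/[^{}]+/))
      parts << text
    elsif scanner.scan(/\{/)
      parts << parse_brackets(scanner, depth + 1)
    else # the next character must be '}'
      raise ArgumentError, "unmatched '}' at offset #{scanner.pos}" if depth.zero?
      scanner.getch
      return parts
    end
  end
  raise ArgumentError, "missing '}'" unless depth.zero?
  parts
end

p parse_brackets(StringScanner.new('This is {a {string}}'))
# => ["This is ", ["a ", ["string"]]]
p parse_brackets(StringScanner.new('This {is} a {string}'))
# => ["This ", ["is"], " a ", ["string"]]
```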
Instead of fitting ever more complex regular expressions to this problem, I would recommend looking into one of Ruby's grammar-based parsing engines. Most of them let you design recursive and nested grammars.
parslet might be a good place to start for your problem. The erb-alike example, although it does not demonstrate nesting, might be closest to your needs: https://github.com/kschiess/parslet/blob/master/example/erb.rb

Treetop infinite loop when parsing Latex document

I'm trying to write a parser with Treetop to parse some LaTeX commands into HTML markup. With the following grammar, the generated code spins forever. I've built the source code with tt and stepped through it, but that doesn't really elucidate what the underlying issue is (it just spins in _nt_paragraph).
Test input: "\emph{hey} and some more text."
grammar Latex
rule document
(paragraph)* {
def content
[:document, elements.map { |e| e.content }]
end
}
end
# Example: These aren't the \emph{droids you're looking for} \n\n.
rule paragraph
( text / tag )* eop {
def content
[:paragraph, elements.map { |e| e.content } ]
end
}
end
rule text
( !( tag_start / eop) . )* {
def content
[:text, text_value ]
end
}
end
# Example: \tag{inner_text}
rule tag
"\\emph{" inner_text '}' {
def content
[:tag, inner_text.content]
end
}
end
# Example: \emph{inner_text}
rule inner_text
( !'}' . )* {
def content
[:inner_text, text_value]
end
}
end
# End of paragraph.
rule eop
newline 2.. {
def content
[:newline, text_value]
end
}
end
rule newline
"\n"
end
# You know, what starts a tag
rule tag_start
"\\"
end
end
For anyone curious, Clifford over at the treetop dev google group figured this out.
The problem was with paragraph and text.
Text is 0 or more characters, and there can be 0 or more texts in a paragraph, so the parser could match an unbounded number of zero-length texts before the first \n, which made it spin forever. The fix was to change text to:
( !( tag_start / eop) . )+
So that it must have at least one character to match.
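The effect can be reproduced in a toy model. This is not Treetop's generated code, just a plain-Ruby sketch under simplifying assumptions: a "parser" is a lambda that returns the new position or nil, text is reduced to "characters that are not a backslash", and the names TEXT_STAR, TEXT_PLUS, repeat and MAX_STEPS are invented here. The step cap exists only so the demonstration terminates.

```ruby
MAX_STEPS = 1_000

# Like `text` written with `*`: always succeeds, possibly consuming nothing.
TEXT_STAR = ->(input, pos) { pos + input[pos..-1][/\A[^\\]*/].length }

# Like `text` written with `+`: fails unless at least one character matches.
TEXT_PLUS = lambda do |input, pos|
  matched = input[pos..-1][/\A[^\\]+/]
  matched && pos + matched.length
end

# Naive repetition, as in `( text / tag )*`: apply the parser until it fails.
def repeat(parser, input, pos)
  steps = 0
  while (next_pos = parser.call(input, pos))
    steps += 1
    return [:gave_up, pos, steps] if steps >= MAX_STEPS
    pos = next_pos
  end
  [:done, pos, steps]
end

input = "hey\\emph" # the characters: hey\emph
p repeat(TEXT_STAR, input, 0) # => [:gave_up, 3, 1000]
p repeat(TEXT_PLUS, input, 0) # => [:done, 3, 1]
```

With the starred version the loop keeps "succeeding" at offset 3 without moving; with the plus version the second attempt fails and the repetition ends, so the surrounding rule can go on to try tag.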

Treetop infinite recursion with negative rule

I have the following treetop grammar:
grammar TestGrammar
rule body
text / expression
end
rule text
not_delimiter*
end
rule expression
delimiter text delimiter
end
rule delimiter
'$'
end
rule not_delimiter
!delimiter
end
end
When I try to parse an expression, e.g. 'hello world $test$', the script goes into an infinite loop.
The problem seems to come from the not_delimiter rule, because when I remove it the expression gets parsed.
What is the problem with this grammar?
Thanks in advance.
The problem seems to be where you are attempting to match:
rule text
not_delimiter*
end
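Since * also matches nothing, and not_delimiter is only the negative lookahead !delimiter (which consumes no input), text can keep succeeding without ever advancing; I think that is what causes the infinite loop. What you presumably meant to match is [^$]*.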
Also, you need to match multiple bodies at the starting rule, otherwise it will return nil, since you will only ever match either a text rule or an expression rule but not both.
rule bodies
body+
end
This will parse:
require 'treetop'
Treetop.load_from_string DATA.read
parser = TestGrammarParser.new
p parser.parse "hello world $test$"
__END__
grammar TestGrammar
rule bodies
body+
end
rule body
expression / text
end
rule expression
delimiter text delimiter
end
rule text
not_delimiter+
end
rule not_delimiter
[^$]
end
rule delimiter
'$'
end
end
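To see what the corrected rules accept, here is a hand-written plain-Ruby counterpart using StringScanner. It is a sketch, not what Treetop generates, and parse_bodies is a hypothetical name. Every branch of the loop consumes at least one character, which is why it always terminates.

```ruby
require 'strscan'

# Hypothetical helper mirroring the bodies / body / expression / text rules.
def parse_bodies(source)
  scanner = StringScanner.new(source)
  bodies = []
  until scanner.eos?
    if scanner.scan(/\$([^$]+)\$/)       # expression: delimiter text delimiter
      bodies << [:expression, scanner[1]]
    elsif (text = scanner.scan(/[^$]+/)) # text: one or more non-delimiters
      bodies << [:text, text]
    else
      return nil                         # e.g. an unpaired '$'
    end
  end
  bodies.empty? ? nil : bodies
end

p parse_bodies('hello world $test$')
# => [[:text, "hello world "], [:expression, "test"]]
```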

Treetop grammar issues using regular expressions

I have a simple grammar setup like so:
grammar Test
rule line
(adjective / not_adjective)* {
def content
elements.map{|e| e.content }
end
}
end
rule adjective
("good" / "bad" / "excellent") {
def content
[:adjective, text_value]
end
}
end
rule not_adjective
!adjective {
def content
[:not_adjective, text_value]
end
}
end
end
Let's say my input is "this is a good ball. let's use it". This gives an error, which I'm not mentioning right now because I want to understand the theory about why it's wrong first.
So, how do I create the rule not_adjective so that it matches anything that is not matched by the rule adjective? In general, how do I write a rule (specifically in Treetop) that doesn't match another named rule?
Treetop is a parser generator that generates parsers out of a special class of grammars called Parsing Expression Grammars or PEG.
The operational interpretation of !expression is that it succeeds if expression fails and fails if expression succeeds but it consumes NO input.
To match anything that rule expression does not match use the dot operator (that matches anything) in conjunction with the negation operator to avoid certain "words":
( !expression . )* ie. "match anything BUT expression"
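Ruby's own regular expressions have the same two ingredients, which makes for a quick way to try the idiom: (?!...) is a negative lookahead that consumes nothing, and . consumes one character. The sketch below uses StringScanner from the standard library; split_at_adjective and the constant names are invented for the illustration, and this is an analogy to the PEG operators rather than Treetop code.

```ruby
require 'strscan'

ADJECTIVE = /good|bad|excellent/
# The regex analogue of ( !adjective . )*
NOT_ADJECTIVE_RUN = /(?:(?!good|bad|excellent).)*/m

# Hypothetical helper: returns [text before the adjective, the adjective, the rest].
def split_at_adjective(source)
  scanner = StringScanner.new(source)
  before = scanner.scan(NOT_ADJECTIVE_RUN) # stops right before an adjective
  [before, scanner.scan(ADJECTIVE), scanner.rest]
end

p split_at_adjective('this is a good ball')
# => ["this is a ", "good", " ball"]
```

The lookahead is checked before each character is consumed, so the run ends at the first position where an adjective could start, and the adjective itself is left for the next rule.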
The previous answer is incorrect for the OP's question, since it will match any sequence of individual characters up to any adjective. So if you see the string xyzgood, it'll match xyz and a following rule will match the "good" part as an adjective. Likewise, the adjective rule of the OP will match the first three characters of "badge" as the adjective "bad", which isn't what they want.
Instead, the adjective rule should look something like this:
rule adjective
a:("good" / "bad" / "excellent") ![a-z] {
def content
[:adjective, a.text_value]
end
}
end
and the not_adjective rule like this:
rule not_adjective
!adjective w:([a-z]+) {
def content
[:not_adjective, w.text_value]
end
}
end
Include handling for upper case, hyphenation, apostrophes, etc., as necessary. You'll also need whitespace handling, of course.
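The effect of the trailing ![a-z] can be tried out with Ruby's regex lookahead, which behaves the same way. This is a plain-Ruby sketch with StringScanner, assuming lower-case input only; classify_words and the constants are made-up names, and anything that is not a lower-case letter is simply skipped.

```ruby
require 'strscan'

WORD_ADJECTIVE = /(?:good|bad|excellent)(?![a-z])/ # adjective, then ![a-z]
WORD = /[a-z]+/

# Hypothetical helper: tags each lower-case word as adjective or not.
def classify_words(line)
  scanner = StringScanner.new(line)
  result = []
  until scanner.eos?
    if (adjective = scanner.scan(WORD_ADJECTIVE))
      result << [:adjective, adjective]
    elsif (word = scanner.scan(WORD))
      result << [:not_adjective, word]
    else
      scanner.getch # skip spaces and punctuation (not modelled here)
    end
  end
  result
end

p classify_words('a bad badge is good')
# => [[:not_adjective, "a"], [:adjective, "bad"], [:not_adjective, "badge"],
#     [:not_adjective, "is"], [:adjective, "good"]]
```

Trying the adjective first and falling back to a whole word is what keeps "badge" and "xyzgood" from being split in the middle.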

Resources