Treetop infinite loop when parsing Latex document - ruby

I'm trying to write a parser with Treetop to parse some LaTeX commands into HTML markup. With the following grammar, the generated code goes into a dead spin. I've built the source code with tt and stepped through it, but that doesn't really elucidate the underlying issue (it just spins in _nt_paragraph).
Test input: "\emph{hey} and some more text."
grammar Latex
rule document
(paragraph)* {
def content
[:document, elements.map { |e| e.content }]
end
}
end
# Example: There aren't the \emph{droids you're looking for} \n\n.
rule paragraph
( text / tag )* eop {
def content
[:paragraph, elements.map { |e| e.content } ]
end
}
end
rule text
( !( tag_start / eop) . )* {
def content
[:text, text_value ]
end
}
end
# Example: \tag{inner_text}
rule tag
"\\emph{" inner_text '}' {
def content
[:tag, inner_text.content]
end
}
end
# Example: \emph{inner_text}
rule inner_text
( !'}' . )* {
def content
[:inner_text, text_value]
end
}
end
# End of paragraph.
rule eop
newline 2.. {
def content
[:newline, text_value]
end
}
end
rule newline
"\n"
end
# You know, what starts a tag
rule tag_start
"\\"
end
end

For anyone curious, Clifford over at the treetop dev google group figured this out.
The problem was with paragraph and text.
Text is 0 or more characters, and there can be 0 or more texts in a paragraph, so text could succeed without consuming any input. The paragraph loop then kept matching zero-length texts at the same position before the first \n, causing the parser to dead spin. The fix was to adjust text to be:
( !( tag_start / eop) . )+
So that it must have at least one character to match.
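The same failure mode can be reproduced outside Treetop. The sketch below is only an analogy in plain Ruby, not Treetop's generated code: StringScanner stands in for the parser, the two regexes roughly approximate the text rule with * and with +, and repeat_count with its limit parameter are hypothetical names, added so the non-advancing loop stops instead of hanging.

```ruby
require 'strscan'

# Count how many times `pattern` matches back to back from the start of
# `input`. The `limit` guard (an assumption for this demo) stops a loop
# that succeeds without ever advancing the scan position.
def repeat_count(input, pattern, limit: 1000)
  scanner = StringScanner.new(input)
  count = 0
  count += 1 while count < limit && scanner.scan(pattern)
  count
end

input = "\\emph{hey} and some more text."

# Like text with `*`: succeeds on the empty string, so it "matches" forever.
p repeat_count(input, /[^\\\n]*/)  # => 1000 (only the guard stopped it)

# Like text with `+`: must consume a character, so it fails at the backslash.
p repeat_count(input, /[^\\\n]+/)  # => 0
```

With the + version the repetition ends as soon as no character can be consumed, which is what lets paragraph move on to tag or eop.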

Related

Treetop parser : how to handle spaces?

Good morning everyone,
I'm currently trying to describe some basic Ruby grammar, but I'm stuck on how to parse spaces.
I can handle x = 1 + 1,
but can't parse x=1+1.
How can I parse spaces?
I have tried adding optional space after every terminal,
but it still can't parse and returns nil.
How can I fix it?
Thank you very much, have a nice day.
grammar Test
rule main
s assign
end
rule assign
name:[a-z]+ s '=' s expression s
{
def to_ast
Assign.new(name.text_value.to_sym, expression.to_ast)
end
}
end
rule expression
add
end
rule add
left:brackets s '+' s right:add s
{
def to_ast
Add.new(left.to_ast, right.to_ast)
end
}
/
minus
end
rule minus
left:brackets s '-' s right:minus s
{
def to_ast
Minus.new(left.to_ast, right.to_ast)
end
}
/
brackets
end
rule brackets
'(' s expression ')' s
{
def to_ast
expression.to_ast
end
}
/
term
end
rule term
number / variable
end
rule number
[0-9]+ s
{
def to_ast
Number.new(text_value.to_i)
end
}
end
rule variable
[a-z]+ s
{
def to_ast
Variable.new(text_value.to_sym)
end
}
end
rule newline
s "\n"+ s
end
rule s
[ \t]*
end
end
This code works.
Problem solved!
It's not enough to define the space rule, you have to use it anywhere there might be space. Because this occurs often, I usually use a shorter rule name S for mandatory space, and the lowercase version s for optional space.
Then, as a principle, I skip optional space first in my top rule, and again after every terminal that can be followed by space. Terminals here are strings, character sets, etc. So at the start of assign, and before the {} block on variable, boolean, number, and also after your '=', '-' and '+' literals, add a call to the rule s to skip any spaces.
This policy works well for me. It's a good idea to have a test case which has minimum space, and another case that has maximum space (in all possible places).
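To make the policy concrete without depending on the Treetop gem, here is a hand-written analogue in plain Ruby. It is a sketch under assumptions: TinyAssignParser and its methods are hypothetical names, it covers only assignment, + and -, numbers and variables, and it builds left-associative arrays rather than the Assign/Add/Minus nodes from the question. What it does share with the advice above is the shape: skip optional space once at the top, then again after every terminal.

```ruby
require 'strscan'

# Hypothetical hand-rolled parser illustrating the whitespace policy.
class TinyAssignParser
  def initialize(source)
    @scanner = StringScanner.new(source)
  end

  # assign := s name '=' expression, and the whole input must be consumed
  def parse
    skip_space                                  # optional space first, in the top rule
    name = terminal(/[a-z]+/) or return nil
    terminal(/=/) or return nil
    expr = expression or return nil
    @scanner.eos? ? [:assign, name.to_sym, expr] : nil
  end

  private

  def skip_space
    @scanner.skip(/[ \t]*/)
  end

  # Match one terminal, then skip any space that follows it.
  def terminal(pattern)
    text = @scanner.scan(pattern)
    skip_space if text
    text
  end

  # expression := term (('+' / '-') term)*
  def expression
    left = term or return nil
    while (op = terminal(/[+-]/))
      right = term or return nil
      left = [op.to_sym, left, right]
    end
    left
  end

  def term
    if (digits = terminal(/[0-9]+/))
      digits.to_i
    elsif (word = terminal(/[a-z]+/))
      word.to_sym
    end
  end
end

# One case with minimum space and one with maximum space, as suggested above.
p TinyAssignParser.new("x=1+1").parse
p TinyAssignParser.new("  x  =  1  +  1  ").parse
```

Both calls should produce the same tree, which is the point of keeping a minimum-space and a maximum-space test case side by side.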

Insert text before the end of a file

I am trying to write a script that will insert a text before the last end tag within a Ruby file. For example, I want to insert the following:
def hello
puts "hello!"
end
within the following file, just before the end of the class:
class ApplicationController < ActionController::Base
# Prevent CSRF attacks by raising an exception.
# For APIs, you may want to use :null_session instead.
protect_from_forgery with: :exception
helper_method :authenticated?, :current_user
def current_user?
session[:current_user]
end
end
The result should look like this:
class ApplicationController < ActionController::Base
# Prevent CSRF attacks by raising an exception.
# For APIs, you may want to use :null_session instead.
protect_from_forgery with: :exception
helper_method :authenticated?, :current_user
def current_user?
session[:current_user]
end
def hello
puts "hello!"
end
end
I have tried to find a regex that would match the last occurrence of end and replace it with the block I want to add, but all the regexes I have tried match only the first end. I tried these:
end(?=[^end]*$)
end(?!.*end)
(.*)(end)(.*)
To replace the string, I do the following (maybe the EOL characters are screwing up the matching?):
file_to_override = File.read("app/controllers/application_controller.rb")
file_to_override = file_to_override.sub(/end(?=[^end]*$)/, "#{new_string}\nend")
EDIT: I also tried the solution provided in How to replace the last occurrence of a substring in ruby? but, strangely, it replaces all occurrences of end.
What am I doing wrong? Thanks!
The approach explained in the post is working here, too. You just need to re-organize capturing groups and use the /m modifier that forces . to match newline symbols, too.
new_string = <<EOS
def hello
puts "Hello!"
end
EOS
file_to_override = <<EOS
class ApplicationController < ActionController::Base
# Prevent CSRF attacks by raising an exception.
# For APIs, you may want to use :null_session instead.
protect_from_forgery with: :exception
helper_method :authenticated?, :current_user
def current_user?
session[:current_user]
end
end
EOS
file_to_override=file_to_override.gsub(/(.*)(\nend\b.*)/m, "\\1\n#{new_string}\\2")
puts file_to_override
See IDEONE demo
The /(.*)(\nend\b.*)/m pattern will match and capture into Group 1 all the text up to the last whole word (due to the \n before and \b after) end preceded with a line feed, and will place the line feed, "end" and whatever remains into Group 2. In the replacement, we back-reference the captured substrings with backreferences \1 and \2 and also insert the string we need to insert.
If there are no other words after the last end, you could even use a /(.*)(\nend\s*\z)/m regex.
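As a minimal sketch of the \z variant just mentioned, the pattern can be wrapped in a small helper. insert_before_last_end is a hypothetical name, and the block form of sub is used here only to avoid escaping backslashes in the replacement string. It assumes the final end starts at column 0 and that nothing but whitespace follows it.

```ruby
# Insert `snippet` just before the last top-level `end` in `source`.
# Returns an unchanged copy when no such `end` is found.
def insert_before_last_end(source, snippet)
  source.sub(/(.*)(\nend\s*\z)/m) { "#{$1}\n#{snippet.chomp}#{$2}" }
end

source = "class A\n  def a\n  end\nend\n"
snippet = "  def hello\n    puts \"hello!\"\n  end\n"
puts insert_before_last_end(source, snippet)
```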
Suppose you read the file into the string text:
text = <<_
class A
def a
'hi'
end
end
_
and wish to insert the string to_enter:
to_enter = <<_
def hello
puts "hello!"
end
_
before the last end. You could write
r = /
.* # match any number of any character (greedily)
\K # discard everything matched so far
(?=\n\s*end\b) # match end-of-line, indenting spaces, and "end" followed
# by a word break in a positive lookahead
/mx # multi-line and extended/free-spacing regex definition modes
puts text.sub(r, to_enter)
(prints)
class A
def a
'hi'
end
def hello
puts "hello!"
end
end
Note that sub is replacing an empty string with to_enter.
Edit: Answer from Wiktor is exactly what I was looking for. Leaving the following too because it works as well.
Finally, I gave up on replacing using a regex. Instead, I use the position of the last end:
positions = file_to_override.enum_for(:scan, /end/).map { Regexp.last_match.begin(0) }
Then, before writing the file, I add what I need within the string at last position - 1:
new_string = <<EOS
def hello
puts "Hello!"
end
EOS
file_to_override[positions.last - 1] = "\n#{new_string}\n"
File.open("app/controllers/application_controller.rb", 'w') {|file| file.write(file_to_override)}
This works but it doesn't look like idiomatic Ruby to me.
You can also find and replace the last occurrence of "end" (note that this will also match the end in # Hello my friend, but see below) like this:
# Our basics: In this text ...
original_content = "# myfile.rb\n"\
"module MyApp\n"\
" class MyFile\n"\
" def myfunc\n"\
" end\n"\
" end\n"\
"end\n"
# ...we want to inject this:
substitute = "# this will come to a final end!\n"\
"end\n"
# Now find the last end ...
idx = original_content.rindex("end") # => index of last "end"(69)
# ... and substitute it
original_content[idx..idx+3] = substitute # idx..idx+3 covers "end" plus the trailing "\n"
This solution is somewhat more old-school (dealing with indexes in strings felt much cooler some years ago) and in this form more "vulnerable", but it saves you from having to sit down and digest the regexps. Don't get me wrong: regular expressions are a tool of incredible power, and the minutes spent learning them are worth it.
That said, you can use all the regular expressions from the other answers also with rindex (e.g. rindex(/ *end/)).
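For instance, combining rindex with an anchored regex sidesteps the "friend" false positive. This is a sketch with assumptions: insert_at_last_end is a hypothetical helper, and /^end\b/ assumes the closing end of the outermost block sits at the start of a line.

```ruby
# Insert `snippet` immediately before the last line that starts with `end`.
# Returns `source` itself when there is no such line.
def insert_at_last_end(source, snippet)
  idx = source.rindex(/^end\b/)
  return source unless idx

  source.dup.insert(idx, snippet)
end

source = "class A\n  # Hello my friend\nend\n"
puts insert_at_last_end(source, "  def hello\n    puts \"hello!\"\n  end\n")
```

Because ^ only matches at the start of a line, the end inside "friend" and any indented end are ignored.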

Rule's order does matter in TreeTop?

I am just starting to use Treetop for parsing work. The following is the snippet that puzzles me:
grammar Fortran
rule integer
[1-9] [0-9]*
end
rule id
[a-zA-Z] [a-zA-Z0-9]*
end
end
parser = FortranParser.new
ast = parser.parse('1')
The result ast is:
[SyntaxNode offset=0, "1", SyntaxNode offset=1, ""]
But when I place rule id above rule integer, the result is nil. So what is the problem? Thanks in advance!
I think I just figured out what was wrong! There should be a top rule that includes the other rules, and it must be placed first:
grammar Fortran
rule statement
( id / integer )* {
def content
elements.map { |e| e.content }
end
}
end
rule id
[a-zA-Z] [a-zA-Z0-9]* {
def content
[:id, text_value]
end
}
end
rule integer
[1-9] [0-9]* {
def content
[:integer, text_value]
end
}
end
end
parser = FortranParser.new
ast = parser.parse('1')
Then the result is
[[:integer, "1"]]

Ruby Regex not matching

I'm writing a short class to extract email addresses from documents. Here is my code so far:
# Class to scrape documents for email addresses
class EmailScraper
EmailRegex = /\A[\w+\-.]+#[a-z\d\-.]+\.[a-z]+\z/i
def EmailScraper.scrape(doc)
email_addresses = []
File.open(doc) do |file|
while line = file.gets
temp = line.scan(EmailRegex)
temp.each do |email_address|
puts email_address
emails_addresses << email_address
end
end
end
return email_addresses
end
end
if EmailScraper.scrape("email_tests.txt").empty?
puts "Empty array"
else
puts EmailScraper.scrape("email_tests.txt")
end
My "email_tests.txt" file looks like so:
example#live.com
another_example90#hotmail.com
example3#diginet.ie
When I run this script, all I get is the "Empty array" printout. However, when I fire up irb and type in the regex above, strings of email addresses match it, and the String.scan function returns an array of all the email addresses in each string. Why is this working in irb and not in my script?
Several things (some already mentioned and expanded upon below):
\z matches to the end of the string, which with IO#gets will typically include a \n character. \Z (upper case 'z') matches the end of the string unless the string ends with a \n, in which case it matches just before.
the typo emails_addresses (the array is declared as email_addresses)
using \A and \Z is fine while the entire line is or is not an email address. You say you're seeking to extract addresses from documents, however, so I'd consider using \b at each end to extract emails delimited by word boundaries.
you could use File.foreach()... rather than the clumsy-looking File.open...while...gets thing
I'm not convinced by the Regex - there's a substantial body of work already around:
There's a smarter one here: http://www.regular-expressions.info/email.html (clicking on that odd little in-line icon takes you to a piece-by-piece explanation). It's worth reading the discussion, which points out several potential pitfalls.
Even more mind-bogglingly complex ones may be found here.
class EmailScraper
EmailRegex = /\A[\w+\-.]+#[a-z\d\-.]+\.[a-z]+\Z/i # changed \z to \Z
def EmailScraper.scrape(doc)
email_addresses = []
File.foreach(doc) do |line| # less code, same effect
temp = line.scan(EmailRegex)
temp.each do |email_address|
email_addresses << email_address
end
end
email_addresses # "return" isn't needed
end
end
result = EmailScraper.scrape("email_tests.txt") # store it so we don't print them twice if successful
if result.empty?
puts "Empty array"
else
puts result
end
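The word-boundary suggestion from the list above could look like the sketch below. It keeps the question's regex body unchanged (including the # separator used throughout this thread) and only swaps \A and \Z for \b. EMAIL_IN_TEXT and extract_addresses are hypothetical names, and this is no more a complete address validator than the original pattern.

```ruby
# Same character classes as the question's regex, but delimited by word
# boundaries so addresses embedded in running text can be found.
EMAIL_IN_TEXT = /\b[\w+\-.]+#[a-z\d\-.]+\.[a-z]+\b/i

# Return every address-like substring found in `text`.
def extract_addresses(text)
  text.scan(EMAIL_IN_TEXT)
end

line = "Contact example#live.com or another_example90#hotmail.com today.\n"
p extract_addresses(line)
# => ["example#live.com", "another_example90#hotmail.com"]
```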
Looks like you're putting the results into emails_addresses, but are returning email_addresses. This would mean that you're always returning the empty array you defined for email_addresses, making the "Empty array" response correct.
You have a typo, try with:
class EmailScraper
EmailRegex = /\A[\w+\-.]+#[a-z\d\-.]+\.[a-z]+\z/i
def EmailScraper.scrape(doc)
email_addresses = []
File.open(doc) do |file|
while line = file.gets
temp = line.scan(EmailRegex)
temp.each do |email_address|
puts email_address
email_addresses << email_address
end
end
end
return email_addresses
end
end
if EmailScraper.scrape("email_tests.txt").empty?
puts "Empty array"
else
puts EmailScraper.scrape("email_tests.txt")
end
You used \z at the end. Try \Z instead: according to http://www.regular-expressions.info/ruby.html, the uppercase Z also matches just before a line break that ends the string.
Otherwise, try ^ and $ (which match at the start and the end of a line); this worked for me on Regexr.
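A quick way to see the difference between the three end anchors is to test them against a line with and without its trailing newline. end_anchor_matches is a hypothetical helper written just for this comparison.

```ruby
# Report which end anchors let "com" match at the end of `line`.
def end_anchor_matches(line)
  {
    'z' => line.match?(/com\z/),  # very end of the string only
    'Z' => line.match?(/com\Z/),  # end of string, or just before a final "\n"
    '$' => line.match?(/com$/)    # end of any line
  }
end

p end_anchor_matches("example#live.com\n")  # as returned by IO#gets
p end_anchor_matches("example#live.com")    # as typically typed in irb
```

Only \z rejects the line that still carries its newline, which matches the behaviour described in the question.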
When you read the file, the end of line is making the regex fail. In irb, there probably is no end of line. If that is the case, chomp the lines first.
regex=/\A[\w+\-.]+#[a-z\d\-.]+\.[a-z]+\z/i
line_from_irb = "example#live.com"
line_from_file = line_from_irb + "\n"
p line_from_irb.scan(regex) # => ["example#live.com"]
p line_from_file.scan(regex) # => []

matching tag pairs in Treetop grammar

I don't want a repeat of the Cthulhu answer, but I want to match up pairs of opening and closing HTML tags using Treetop. Using this grammar, I can match opening tags and closing tags, but now I want a rule to tie them both together. I've tried the following, but using this makes my parser go on forever (infinite loop):
rule html_tag_pair
html_open_tag (!html_close_tag (html_tag_pair / '' / text / newline /
whitespace))+ html_close_tag <HTMLTagPair>
end
I was trying to base this off of the recursive parentheses example and the negative lookahead example on the Treetop Github page. The other rules I've referenced are as follows:
rule newline
[\n\r] {
def content
:newline
end
}
end
rule tab
"\t" {
def content
:tab
end
}
end
rule whitespace
(newline / tab / [\s]) {
def content
:whitespace
end
}
end
rule text
[^<]+ {
def content
[:text, text_value]
end
}
end
rule html_open_tag
"<" html_tag_name attribute_list ">" <HTMLOpenTag>
end
rule html_empty_tag
"<" html_tag_name attribute_list whitespace* "/>" <HTMLEmptyTag>
end
rule html_close_tag
"</" html_tag_name ">" <HTMLCloseTag>
end
rule html_tag_name
[A-Za-z0-9]+ {
def content
text_value
end
}
end
rule attribute_list
attribute* {
def content
elements.inject({}){ |hash, e| hash.merge(e.content) }
end
}
end
rule attribute
whitespace+ html_tag_name "=" quoted_value {
def content
{elements[1].content => elements[3].content}
end
}
end
rule quoted_value
('"' [^"]* '"' / "'" [^']* "'") {
def content
elements[1].text_value
end
}
end
I know I'll need to allow for matching single opening or closing tags, but if a pair of HTML tags exist, I'd like to get them together as a pair. It seemed cleanest to do this by matching them with my grammar, but perhaps there's a better way?
Here is a really simple grammar that uses a semantic predicate to match the closing tag to the starting tag.
grammar SimpleXML
rule document
(text / tag)*
end
rule text
[^<]+
end
rule tag
"<" [^>]+ ">" (text / tag)* "</" [^>]+ &{|seq| seq[1].text_value == seq[5].text_value } ">"
end
end
You can only do this using either a separate rule for each HTML tag pair, or using a semantic predicate. That is, by saving the opening tag (in a sempred), then accepting (in another sempred) a closing tag only if it is the same tag. This is much harder to do in Treetop than it should be, because there's no convenient place to save the context and you can't peek up the parser stack, but it is possible.
BTW, the same problem occurs in parsing MIME boundaries (and in Markdown). I haven't checked Mikel's implementation in ActionMailer (probably he uses a nested Mime parser for that), but it is possible in Treetop.
In http://github.com/cjheath/activefacts/blob/master/lib/activefacts/cql/parser.rb I save context in a fake input stream - you can see what methods it has to support - because "input" is available on all SyntaxNodes. I have a different kind of reason for using sempreds there, but some of the techniques are applicable.
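The idea of saving the opening tag and accepting a closing tag only when it is the same one can be illustrated outside Treetop with an explicit stack. This is a plain-Ruby sketch, not a Treetop semantic predicate: balanced_tags? is a hypothetical helper, and its regex is a deliberate simplification that treats self-closing tags such as <br/> as opening tags.

```ruby
# Return true when every opening tag in `source` is closed by a tag of the
# same name, in properly nested order.
def balanced_tags?(source)
  stack = []
  source.scan(%r{<(/?)([A-Za-z0-9]+)[^>]*>}) do |slash, name|
    if slash.empty?
      stack.push(name)          # remember the opening tag
    elsif stack.pop != name     # a closing tag must match the most recent open
      return false
    end
  end
  stack.empty?
end

p balanced_tags?("<b>hi <i>x</i></b>")  # => true
p balanced_tags?("<b>hi</i>")           # => false
```

The stack plays the role of the saved context that the answer above describes as hard to keep inside a Treetop grammar.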
