Treetop parser: how to handle spaces? - ruby

Good morning everyone,
I'm currently trying to describe some basic Ruby grammar, but I'm stuck on how to parse spaces.
I can handle x = 1 + 1,
but I can't parse x=1+1.
How can I parse spaces?
I have tried adding a space rule after every terminal,
but the parse still fails and returns nil.
How can I fix it?
Thank you very much, have a nice day.
grammar Test
  rule main
    s assign
  end
  rule assign
    name:[a-z]+ s '=' s expression s
    {
      def to_ast
        Assign.new(name.text_value.to_sym, expression.to_ast)
      end
    }
  end
  rule expression
    add
  end
  rule add
    left:brackets s '+' s right:add s
    {
      def to_ast
        Add.new(left.to_ast, right.to_ast)
      end
    }
    /
    minus
  end
  rule minus
    left:brackets s '-' s right:minus s
    {
      def to_ast
        Minus.new(left.to_ast, right.to_ast)
      end
    }
    /
    brackets
  end
  rule brackets
    '(' s expression ')' s
    {
      def to_ast
        expression.to_ast
      end
    }
    /
    term
  end
  rule term
    number / variable
  end
  rule number
    [0-9]+ s
    {
      def to_ast
        Number.new(text_value.to_i)
      end
    }
  end
  rule variable
    [a-z]+ s
    {
      def to_ast
        Variable.new(text_value.to_sym)
      end
    }
  end
  rule newline
    s "\n"+ s
  end
  rule s
    [ \t]*
  end
end
This code works.
Problem solved!

It's not enough to define the space rule; you have to use it anywhere there might be space. Because this occurs often, I usually use a shorter rule name S for mandatory space, and the lowercase version s for optional space.
Then, as a principle, I skip optional space first in my top rule, and again after every terminal that can be followed by space. Terminals here are strings, character sets, etc. So at the start of assign, and before the {} block on variable, boolean, number, and also after your '=', '-' and '+' literals, add a call to the rule s to skip any spaces.
This policy works well for me. It's a good idea to have a test case which has minimum space, and another case that has maximum space (in all possible places).
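The policy above can be sketched outside Treetop with a small hand-written parser. This is only an illustration under assumptions: TinyAssignParser and its helper methods are hypothetical names, not part of Treetop or of the grammar in the question, and the parser covers just assignments with + and -. Optional space is skipped once at the top, and again after every terminal.

```ruby
require 'strscan'

# Hypothetical hand-written parser (not Treetop) that mirrors the policy:
# skip optional space once at the top, then after every terminal.
class TinyAssignParser
  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  # Returns a nested array such as [:assign, :x, [:+, 1, 1]], or nil on failure.
  def parse
    skip_space                                # top rule: leading space
    name = terminal(/[a-z]+/) or return nil
    terminal(/=/) or return nil
    expr = expression or return nil
    @scanner.eos? ? [:assign, name.to_sym, expr] : nil
  end

  private

  # Match one terminal, then skip any optional space that follows it.
  def terminal(pattern)
    text = @scanner.scan(pattern)
    skip_space if text
    text
  end

  def skip_space
    @scanner.skip(/[ \t]*/)
  end

  # Right-recursive, like the add and minus rules in the grammar.
  def expression
    left = term or return nil
    op = terminal(/[+-]/)
    return left unless op
    right = expression or return nil
    [op.to_sym, left, right]
  end

  def term
    if (digits = terminal(/[0-9]+/))
      digits.to_i
    elsif (name = terminal(/[a-z]+/))
      name.to_sym
    end
  end
end

# One test case with minimum space and one with maximum space.
tight = TinyAssignParser.new("x=1+1").parse
loose = TinyAssignParser.new("  x  =  1  +  1  ").parse
```

Both inputs should produce the same tree, which is the point of keeping a minimum-space and a maximum-space test case side by side.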

Related

Treetop infinite recursion with negative rule

I have the following treetop grammar:
grammar TestGrammar
  rule body
    text / expression
  end
  rule text
    not_delimiter*
  end
  rule expression
    delimiter text delimiter
  end
  rule delimiter
    '$'
  end
  rule not_delimiter
    !delimiter
  end
end
When I try to parse an expression, e.g. 'hello world $test$', the script goes into an infinite loop.
The problem seems to come from the not_delimiter rule, as the expression gets parsed when I remove it.
What is the problem with this grammar?
Thanks in advance.
The problem seems to be where you are attempting to match:
rule text
  not_delimiter*
end
Since not_delimiter is only the negative lookahead !delimiter, it consumes no input, so not_delimiter* can keep succeeding at the same position without ever advancing. I think this is what causes the infinite loop. Matching an actual character that is not a delimiter, [^$], avoids it.
Also, you need to match multiple bodies at the starting rule, otherwise the parse will return nil, since you will only ever match either a text rule or an expression rule but not both.
rule bodies
  body+
end
This will parse:
require 'treetop'
# The grammar after __END__ is read through DATA.
Treetop.load_from_string DATA.read
parser = TestGrammarParser.new
p parser.parse "hello world $test$"
__END__
grammar TestGrammar
  rule bodies
    body+
  end
  rule body
    expression / text
  end
  rule expression
    delimiter text delimiter
  end
  rule text
    not_delimiter+
  end
  rule not_delimiter
    [^$]
  end
  rule delimiter
    '$'
  end
end
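The difference between !delimiter and [^$] can be seen without Treetop. The following is a rough analogue using Ruby's StringScanner, with a regex negative lookahead standing in for the PEG ! operator (the variable names are only illustrative): the lookahead succeeds but leaves the position where it was, while the character class actually advances.

```ruby
require 'strscan'

scanner = StringScanner.new('hello')

# Negative lookahead, analogous to !delimiter: succeeds, consumes nothing.
lookahead_result = scanner.scan(/(?!\$)/)
pos_after_lookahead = scanner.pos   # still 0, so repeating it never advances

# Character class, analogous to [^$]: succeeds and consumes one character.
char_result = scanner.scan(/[^$]/)
pos_after_char = scanner.pos        # now 1

p [lookahead_result, pos_after_lookahead]  # prints ["", 0]
p [char_result, pos_after_char]            # prints ["h", 1]
```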

How do I obtain the (possibly nested) capture groups in a regular expression?

Given a regular expression:
/say (hullo|goodbye) to my lovely (.*)/
and a string:
"my $2 is happy that you said $1"
What is the best way to obtain a regular expression from the string that contains the capture groups in the regular expression? That is:
/my (.*) is happy that you said (hullo|goodbye)/
Clearly I could use regular expressions on a string representation of the original regular expression, but this would probably present difficulties with nested capture groups.
I'm using Ruby. My simple implementation so far goes along the lines of:
class Regexp
  def capture_groups
    self.to_s[1..-2].scan(/\(.*?\)/)
  end
end
regexp.capture_groups.each_with_index do |capture, idx|
  string.gsub!("$#{idx+1}", capture)
end
/^#{string}$/
I guess you need to create your own function that would do this:
create empty dictionaries groups and active_groups and initialize counter = 1
iterate over the characters in the string representation:
  if current character = '(' and previous character != \:
    add counter key to active_groups and increase counter
  add current character to all active_groups
  if current character = ')' and previous character != \:
    remove the last item (key, value) from active_groups and add it to groups
convert groups to an array if needed
You might also want to implement:
  ignore = True between unescaped '[' and ']'
  reset counter if current character = '|' and active_groups is empty (or decrease counter if active_groups is not empty)
UPDATES from comments:
  ignore non-capturing groups starting with '(?:'
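A minimal Ruby sketch of the scan described above, under some simplifying assumptions: capture_group_sources is a hypothetical helper name, it records start positions on a stack instead of appending characters to every active group, it treats any group opening with '(?' (including named groups) as non-capturing, and it does not implement the '|' counter reset.

```ruby
# Hypothetical helper: returns the source text of each capture group,
# ordered by the position of its opening parenthesis.
def capture_group_sources(source)
  groups = {}        # group number => source text of that group
  open_groups = []   # stack of [group number or nil, start index]
  counter = 0
  escaped = false
  in_class = false   # true between an unescaped '[' and ']'

  source.each_char.with_index do |char, index|
    if escaped
      escaped = false               # this character was escaped; skip it
    elsif char == '\\'
      escaped = true
    elsif in_class
      in_class = false if char == ']'
    elsif char == '['
      in_class = true
    elsif char == '('
      if source[index + 1] == '?'
        open_groups.push([nil, index])      # e.g. (?:...), not numbered
      else
        counter += 1
        open_groups.push([counter, index])
      end
    elsif char == ')'
      number, start = open_groups.pop
      groups[number] = source[start..index] if number
    end
  end

  groups.sort.map { |_number, text| text }
end

flat = capture_group_sources(/say (hullo|goodbye) to my lovely (.*)/.source)
nested = capture_group_sources('a((b)c)(?:d)')
p flat    # prints ["(hullo|goodbye)", "(.*)"]
p nested  # prints ["((b)c)", "(b)"]
```

Regexp#source is used here because it returns the pattern text without the surrounding delimiters or option flags.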
So once I realised that what I actually need is a regular expression parser, things started falling into place. I discovered this project:
https://github.com/dche/randall
which can generate strings that match a regular expression. It defines a regular expression grammar using http://treetop.rubyforge.org/. Unfortunately the grammar it defines is incomplete, though useful for many cases.
I also stumbled past https://github.com/mjijackson/citrus, which does a similar job to Treetop.
I then found this mind blowing gem:
https://github.com/ammar/regexp_parser
which defines a full regexp grammar and parses a regular expression into a walkable tree. I was then able to walk the tree and pick out the parts of the tree I wanted (the capture groups).
Unfortunately there was a minor bug, fixed in my fork: https://github.com/LaunchThing/regexp_parser.
Here's my patch to Regexp, that uses the fixed gem:
class Regexp
  def parse
    Regexp::Parser.parse(self.to_s, 'ruby/1.9')
  end
  def walk(e = self.parse, depth = 0, &block)
    block.call(e, depth)
    unless e.expressions.empty?
      e.each do |s|
        walk(s, depth+1, &block)
      end
    end
  end
  def capture_groups
    capture_groups = []
    walk do |e, depth|
      capture_groups << e.to_s if Regexp::Expression::Group::Capture === e
    end
    capture_groups
  end
end
I can then use this in my application to make replacements in my string - the final goal - along these lines:
from = /^\/search\/(.*)$/
to = '/buy/$1'
to_as_regexp = to.dup
# I should probably make this gsub tighter
from.capture_groups.each_with_index do |capture, idx|
  to_as_regexp.gsub!("$#{idx+1}", capture)
end
to_as_regexp = /^#{to_as_regexp}$/
# to_as_regexp = /^\/buy\/(.*)$/
I hope this helps someone else out.

Remove unmatched parentheses from a string

I want to remove "un-partnered" parentheses from a string.
I.e., all ('s should be removed unless they're followed by a ) somewhere in the string. Likewise, all )'s not preceded by a ( somewhere in the string should be removed.
Ideally the algorithm would take into account nesting as well.
E.g.:
"(a)".remove_unmatched_parens # => "(a)"
"a(".remove_unmatched_parens # => "a"
")a(".remove_unmatched_parens # => "a"
Instead of a regex, consider a push-down automaton, perhaps. (I'm not sure whether Ruby regular expressions can handle this; I believe Perl's can.)
A (very trivialized) process may be:
For each character in the input string:
If it is not a '(' or ')' then just append it to the output
If it is a '(' increase a seen_parens counter and add it
If it is a ')' and seen_parens is > 0, add it and decrease seen_parens. Otherwise skip it.
At the end of the process, if seen_parens is > 0 then remove that many parens, starting from the end. (This step can be merged into the above process with use of a stack or recursion.)
The entire process is O(n), albeit with relatively high overhead.
Happy coding.
The following uses oniguruma. Oniguruma is the regex engine built in if you are using ruby1.9. If you are using ruby1.8, see this: oniguruma.
Update
I had been lazy and just copied and pasted someone else's regex. It seemed to have a problem.
So now I have written my own. I believe it should work now.
class String
  NonParenChar = /[^\(\)]/
  def remove_unmatched_parens
    self[/
      (?:
        (?<balanced>
          \(
          (?:\g<balanced>|#{NonParenChar})*
          \)
        )
        |#{NonParenChar}
      )+
    /x]
  end
end
(?<name>regex1) names the (sub)regex regex1 as name, and makes it possible to call it.
\g<name> will be a subregex that represents regex1. Note here that \g<name> does not represent a particular string that matched regex1, but it represents regex1 itself. In fact, it is possible to embed \g<name> within (?<name>...).
Update 2
This is simpler.
class String
  def remove_unmatched_parens
    self[/
      (?<valid>
        \(\g<valid>*\)
        |[^()]
      )+
    /x]
  end
end
Build a simple LR parser:
tokenize, token, stack = false, "", []
")(a))(()(asdf)(".each_char do |c|
  case c
  when '('
    tokenize = true
    token = c
  when ')'
    if tokenize
      token << c
      stack << token
    end
    tokenize = false
  when /\w/
    token << c if tokenize
  end
end
result = stack.join
puts result
running yields:
wesbailey@feynman:~/code_katas> ruby test.rb
(a)()(asdf)
I don't agree with the folks modifying the String class, because you should never open a standard class. Regexes are pretty brittle for parsing and hard to support. I couldn't imagine coming back to the previous solutions 6 months from now and trying to remember what they were doing!
Here's my solution, based on @pst's algorithm:
require 'strscan'

class String
  def remove_unmatched_parens
    scanner = StringScanner.new(dup)
    output = ''
    paren_depth = 0
    while char = scanner.get_byte
      if char == "("
        paren_depth += 1
        output << char
      elsif char == ")"
        # Keep a ")" only when it closes a pending "(".
        if paren_depth > 0
          output << char
          paren_depth -= 1
        end
      else
        output << char
      end
    end
    # Drop the still-unclosed "(" characters, starting from the end.
    paren_depth.times{ output.reverse!.sub!('(', '').reverse! }
    output
  end
end
Algorithm:
Traverse through the given string.
While doing that, keep track of "(" positions in a stack.
If any ")" found, remove the top element from the stack.
If stack is empty, remove the ")" from the string.
In the end, we can have positions of unmatched braces, if any.
Java code:
Present at http://a2ajp.blogspot.in/2014/10/remove-unmatched-parenthesis-from-given.html
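For readers working in Ruby rather than Java, the stack-of-positions algorithm above might look roughly like this. It is a sketch, not a translation of the linked Java code; strip_unmatched_parens is a hypothetical name, and it is written as a plain method rather than a String patch.

```ruby
# Hypothetical helper: removes parentheses that have no partner.
def strip_unmatched_parens(text)
  open_positions = []   # indices of "(" still waiting for a ")"
  unmatched = {}        # index => true for every character to drop

  text.each_char.with_index do |char, index|
    case char
    when '('
      open_positions.push(index)
    when ')'
      if open_positions.empty?
        unmatched[index] = true   # ")" with nothing to close
      else
        open_positions.pop        # this ")" closes the latest "("
      end
    end
  end

  # Whatever is left on the stack is a "(" that was never closed.
  open_positions.each { |index| unmatched[index] = true }

  text.each_char.with_index.reject { |_char, index| unmatched[index] }
      .map(&:first).join
end

puts strip_unmatched_parens(")a(")               # prints a
puts strip_unmatched_parens(")(a))(()(asdf)(")   # prints (a)()(asdf)
```

Because positions are tracked instead of a bare counter, the unclosed "(" that gets removed is the one that actually lacks a partner, so "(a(b)" becomes "a(b)".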

Treetop grammar issues using regular expressions

I have a simple grammar setup like so:
grammar Test
  rule line
    (adjective / not_adjective)* {
      def content
        elements.map{|e| e.content }
      end
    }
  end
  rule adjective
    ("good" / "bad" / "excellent") {
      def content
        [:adjective, text_value]
      end
    }
  end
  rule not_adjective
    !adjective {
      def content
        [:not_adjective, text_value]
      end
    }
  end
end
Let's say my input is "this is a good ball. let's use it". This gives an error, which I'm not mentioning right now because I want to understand the theory about why it's wrong first.
So, how do I create rule not_adjective so that it matches anything that is not matched by rule adjective? In general, how do I write a rule (specifically in Treetop) that doesn't match another named rule?
Treetop is a parser generator that generates parsers out of a special class of grammars called Parsing Expression Grammars or PEG.
The operational interpretation of !expression is that it succeeds if expression fails and fails if expression succeeds, but in either case it consumes NO input.
To match anything that the rule expression does not match, use the dot operator (which matches any single character) in conjunction with the negation operator to avoid certain "words":
( !expression . )* i.e. "match anything BUT expression"
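The same idiom exists in Ruby's own regular expressions, where (?!...) plays the role of the PEG ! operator. The snippet below is only an analogue to illustrate the behaviour, not Treetop code, and the constant names are made up for the example.

```ruby
# Regex analogue of ( !adjective . )* : consume characters one at a time
# for as long as an adjective does not start at the current position.
ADJECTIVE_WORDS = /good|bad|excellent/
NOT_ADJECTIVE_RUN = /\A(?:(?!#{ADJECTIVE_WORDS}).)*/

prefix = "this is a good ball"[NOT_ADJECTIVE_RUN]
p prefix   # prints "this is a "
```

The lookahead itself consumes nothing; it is the dot after it that advances through the input.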
The previous answer is incorrect for the OP's question, since it will match any sequence of individual characters up to any adjective. So if you see the string xyzgood, it'll match xyz and a following rule will match the "good" part as an adjective. Likewise, the adjective rule of the OP will match the first three characters of "badge" as the adjective "bad", which isn't what they want.
Instead, the adjective rule should look something like this:
rule adjective
  a:("good" / "bad" / "excellent") ![a-z] {
    def content
      [:adjective, a.text_value]
    end
  }
end
and the not_adjective rule like this:
rule not_adjective
  !adjective w:([a-z]+) {
    def content
      [:not_adjective, w.text_value]
    end
  }
end
Include handling for upper case, hyphenation, apostrophes, etc., as necessary. You'll also need white-space handling, of course.
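To see the whole-word behaviour without loading Treetop, here is a rough plain-Ruby analogue built on StringScanner. It is a sketch under assumptions: classify_words and the two constants are hypothetical names, (?![a-z]) stands in for ![a-z], and anything that is not a lower-case letter is simply skipped.

```ruby
require 'strscan'

# An adjective only counts when no further letter follows it.
WHOLE_ADJECTIVE = /(?:good|bad|excellent)(?![a-z])/
PLAIN_WORD = /[a-z]+/

# Hypothetical helper: returns [kind, word] pairs for each word in the text.
def classify_words(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    if (adjective = scanner.scan(WHOLE_ADJECTIVE))
      tokens << [:adjective, adjective]
    elsif (word = scanner.scan(PLAIN_WORD))
      tokens << [:not_adjective, word]
    else
      scanner.getch   # skip spaces, punctuation, upper case, etc.
    end
  end
  tokens
end

p classify_words("a bad badge")
# prints [[:not_adjective, "a"], [:adjective, "bad"], [:not_adjective, "badge"]]
```

Words are consumed whole, so "xyzgood" comes out as a single non-adjective word rather than "xyz" followed by the adjective "good".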

Ruby parse string

I have a string
input = "maybe (this is | that was) some ((nice | ugly) (day |night) | (strange (weather | time)))"
What is the best method in Ruby to parse this string?
I mean the script should be able to build sentences like this:
maybe this is some ugly night
maybe that was some nice night
maybe that was some strange time
And so on, you get the point...
Should I read the string char by char and build a state machine with a stack to store the parenthesis values for later calculation, or is there a better approach?
Maybe a ready, out-of-the-box library for such a purpose?
Try Treetop. It is a Ruby-like DSL to describe grammars. Parsing the string you've given should be quite easy, and by using a real parser you'll easily be able to extend your grammar later.
An example grammar for the type of string that you want to parse (save as sentences.treetop):
grammar Sentences
  rule sentence
    # A sentence is a sequence of zero or more expressions.
    expression* <Sentence>
  end
  rule expression
    # An expression is either a literal or a parenthesised expression.
    parenthesised / literal
  end
  rule parenthesised
    # A parenthesised expression contains one or more sentences.
    "(" (multiple / sentence) ")" <Parenthesised>
  end
  rule multiple
    # Multiple sentences are delimited by a pipe.
    sentence "|" (multiple / sentence) <Multiple>
  end
  rule literal
    # A literal string consists of word characters (a-z) and/or spaces.
    # Expand the character class to allow other characters too.
    [a-zA-Z ]+ <Literal>
  end
end
The grammar above needs an accompanying file that defines the classes that allow us to access the node values (save as sentence_nodes.rb).
class Sentence < Treetop::Runtime::SyntaxNode
  def combine(a, b)
    return b if a.empty?
    a.inject([]) do |values, val_a|
      values + b.collect { |val_b| val_a + val_b }
    end
  end
  def values
    elements.inject([]) do |values, element|
      combine(values, element.values)
    end
  end
end
class Parenthesised < Treetop::Runtime::SyntaxNode
  def values
    elements[1].values
  end
end
class Multiple < Treetop::Runtime::SyntaxNode
  def values
    elements[0].values + elements[2].values
  end
end
class Literal < Treetop::Runtime::SyntaxNode
  def values
    [text_value]
  end
end
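The combine method is what turns alternatives into full sentences: it pairs every string collected so far with every alternative of the next element. Here it is copied out as a stand-alone method (outside the Sentence class, purely for illustration) so its behaviour can be tried without Treetop:

```ruby
# Stand-alone copy of Sentence#combine from the node classes above.
def combine(a, b)
  return b if a.empty?   # nothing collected yet: start with b's values
  a.inject([]) do |values, val_a|
    # Append every alternative in b to the prefix val_a.
    values + b.collect { |val_b| val_a + val_b }
  end
end

prefixes = combine([], ['maybe '])
choices = combine(prefixes, ['this is ', 'that was '])
p choices   # prints ["maybe this is ", "maybe that was "]
```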
The following example program shows that it is quite simple to parse the example sentence that you have given.
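require "rubygems"
require "treetop"
# require_relative finds sentence_nodes.rb next to this script
# (the current directory is not on the load path in Ruby 1.9.2 and later).
require_relative "sentence_nodes"
str = 'maybe (this is|that was) some' +
  ' ((nice|ugly) (day|night)|(strange (weather|time)))'
Treetop.load "sentences"
if sentence = SentencesParser.new.parse(str)
  puts sentence.values
else
  puts "Parse error"
end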
The output of this program is:
maybe this is some nice day
maybe this is some nice night
maybe this is some ugly day
maybe this is some ugly night
maybe this is some strange weather
maybe this is some strange time
maybe that was some nice day
maybe that was some nice night
maybe that was some ugly day
maybe that was some ugly night
maybe that was some strange weather
maybe that was some strange time
You can also access the syntax tree:
p sentence
The output is here.
There you have it: a scalable parsing solution that should come quite close to what you want to do in about 50 lines of code. Does that help?

Resources