Ruby parse string - ruby

I have a string
input = "maybe (this is | that was) some ((nice | ugly) (day |night) | (strange (weather | time)))"
What is the best way to parse this string in Ruby?
I mean the script should be able to build sentences like these:
maybe this is some ugly night
maybe that was some nice night
maybe that was some strange time
And so on, you got the point...
Should I read the string character by character and build a state machine with a stack to store the parenthesised values for later processing, or is there a better approach?
Maybe there is a ready-made library for this purpose?

Try Treetop. It is a Ruby-like DSL to describe grammars. Parsing the string you've given should be quite easy, and by using a real parser you'll easily be able to extend your grammar later.
An example grammar for the type of string that you want to parse (save as sentences.treetop):
grammar Sentences
rule sentence
# A sentence is a sequence of zero or more expressions.
expression* <Sentence>
end
rule expression
# An expression is either a literal or a parenthesised expression.
parenthesised / literal
end
rule parenthesised
# A parenthesised expression contains one or more sentences.
"(" (multiple / sentence) ")" <Parenthesised>
end
rule multiple
# Multiple sentences are delimited by a pipe.
sentence "|" (multiple / sentence) <Multiple>
end
rule literal
# A literal string consists of word characters (a-z) and/or spaces.
# Expand the character class to allow other characters too.
[a-zA-Z ]+ <Literal>
end
end
The grammar above needs an accompanying file that defines the classes that allow us to access the node values (save as sentence_nodes.rb).
class Sentence < Treetop::Runtime::SyntaxNode
def combine(a, b)
return b if a.empty?
a.inject([]) do |values, val_a|
values + b.collect { |val_b| val_a + val_b }
end
end
def values
elements.inject([]) do |values, element|
combine(values, element.values)
end
end
end
class Parenthesised < Treetop::Runtime::SyntaxNode
def values
elements[1].values
end
end
class Multiple < Treetop::Runtime::SyntaxNode
def values
elements[0].values + elements[2].values
end
end
class Literal < Treetop::Runtime::SyntaxNode
def values
[text_value]
end
end
The following example program shows that it is quite simple to parse the example sentence that you have given.
require "rubygems"
require "treetop"
require "./sentence_nodes" # explicit path: "." is not on the load path in Ruby 1.9.2+
str = 'maybe (this is|that was) some' +
' ((nice|ugly) (day|night)|(strange (weather|time)))'
Treetop.load "sentences"
if sentence = SentencesParser.new.parse(str)
puts sentence.values
else
puts "Parse error"
end
The output of this program is:
maybe this is some nice day
maybe this is some nice night
maybe this is some ugly day
maybe this is some ugly night
maybe this is some strange weather
maybe this is some strange time
maybe that was some nice day
maybe that was some nice night
maybe that was some ugly day
maybe that was some ugly night
maybe that was some strange weather
maybe that was some strange time
You can also access the syntax tree:
p sentence
The output is here.
There you have it: a scalable parsing solution that should come quite close to what you want to do in about 50 lines of code. Does that help?
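For comparison, here is a minimal sketch of the hand-rolled approach the question asks about, written as a small recursive-descent parser in plain Ruby. The class name TemplateExpander and its methods are hypothetical and not part of Treetop; the sketch assumes only "(", ")" and "|" are special and that runs of whitespace can be collapsed to single spaces.

```ruby
# Hypothetical helper (not part of Treetop): expands "a (b | c) d" templates.
class TemplateExpander
  def initialize(text)
    @chars = text.chars
    @pos = 0
  end

  # Returns every sentence the template can produce, whitespace normalised.
  def expand
    results = alternatives
    raise ArgumentError, "unexpected ) at #{@pos}" unless @pos == @chars.size
    results.map { |sentence| sentence.split.join(" ") }
  end

  private

  # alternatives := sequence ("|" sequence)*
  def alternatives
    results = sequence
    while @chars[@pos] == "|"
      @pos += 1
      results += sequence
    end
    results
  end

  # sequence := (literal character | "(" alternatives ")")*
  def sequence
    results = [""]
    until @pos >= @chars.size || ["|", ")"].include?(@chars[@pos])
      if @chars[@pos] == "("
        @pos += 1
        inner = alternatives
        raise ArgumentError, "missing ) at #{@pos}" unless @chars[@pos] == ")"
        @pos += 1
        # Combine every prefix built so far with every inner alternative.
        results = results.product(inner).map(&:join)
      else
        results = results.map { |prefix| prefix + @chars[@pos] }
        @pos += 1
      end
    end
    results
  end
end

puts TemplateExpander.new("maybe (this is | that was) some (day | night)").expand
```

The recursion plays the role of the explicit stack: each nested parenthesis is handled by a nested call to alternatives.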

Related

Ruby: Matching Delimiters

I want to write a regex pattern to look for a delimiter within a string such as this: '//;\n1;2' where the character ; serves as the delimiter for the given string. I'm trying to solve a common coding kata problem, so I don't necessarily want an answer in code, but an idea as to how I can accomplish this with regex. This delimiter is subject to change and could be any character. Any ideas? Here's what I've written so far in my production file.
class StringCalculator
def add(numbers)
return 0 if numbers.empty?
return nil if numbers.end_with?('\n')
numbers.tr!('\n', ',')
numbers.split(',').map(&:to_i).reduce(:+)
end
end
Spec:
context 'when passed different delimiters' do
it 'should support semicolons' do
expect(@calculator.add('//;\n1;2')).to eq(3)
end
it 'should support percents' do
expect(@calculator.add('//%\n1%5')).to eq(6)
end
end
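One possible idea, sketched rather than offered as the definitive kata solution: capture the single character after the leading // with a regex group, then split the rest of the string on that character. The names HEADER and split_header are hypothetical. The sketch assumes the header is exactly // plus one delimiter character, and that the line break may be either a real newline or the literal two characters \n (which is what the single-quoted strings in the spec contain).

```ruby
# Hypothetical helper names; the header is assumed to be "//<one char><newline>".
# (?:\\n|\n) accepts a literal backslash-n (single-quoted input) or a real newline.
HEADER = %r{\A//(?<delim>.)(?:\\n|\n)(?<body>.*)\z}m

# Returns [delimiter, numbers_part]; falls back to "," when there is no header.
def split_header(input)
  match = HEADER.match(input)
  return [",", input] unless match
  [match[:delim], match[:body]]
end

def add(numbers)
  delim, body = split_header(numbers)
  # Regexp.escape keeps delimiters such as "." or "*" from acting as regex syntax.
  body.split(/#{Regexp.escape(delim)}|\\n|\n/).map(&:to_i).reduce(0, :+)
end

puts add('//;\n1;2')
```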

How do I break up a string around "{tags}"?

I am writing a function which can have two potential forms of input:
This is {a {string}}
This {is} a {string}
I call the sub-strings wrapped in curly-brackets "tags". I could potentially have any number of tags in a string, and they could be nested arbitrarily deep.
I've tried writing a regular expression to grab the tags, which of course fails on the nested tags, grabbing {a {string}, missing the second curly bracket. I can see it as a recursive problem, but after staring at the wrong answer too long I feel like I'm blind to seeing something really obvious.
What can I do to separate out the potential tags into parts so that they can be processed and replaced?
The More Complicated Version
def parseTags( oBody, szText )
if szText.match(/\{(.*)\}/)
szText.scan(/\{(.*)\}/) do |outers|
outers.each do |blah|
if blah.match(/(.*)\}(.*)\{(.*)/)
blah.scan(/(.*)\}(.*)\{(.*)/) do |inners|
inners.each do |tags|
szText = szText.sub("\{#{tags}\}", parseTags( oBody, tags ))
end
end
else
szText = szText.sub("\{#{blah}\}", parseTags( oBody, blah ))
end
end
end
end
if szText.match(/(\w+)\.(\w+)(?:\.([A-Za-z0-9.\[\]": ]*))/)
func = $1+"_"+$2
begin
szSub = self.send func, oBody, $3
rescue Exception=>e
szSub = "{Error: Function #{$1}_#{$2} not found}"
$stdout.puts "DynamicIO Error Encountered: #{e}"
end
szText = szText.sub("#{$1}.#{$2}#{$3!=nil ? "."+$3 : ""}", szSub)
end
return szText
end
This was the result of tinkering too long. It's not clean, but it did work for a case similar to "1" - {help.divider.red.sys.["{pc.login}"]} is replaced with ---------------[ Duwnel ]---------------. However, {pc.attr.str.dotmode} {ansi.col.red}|{ansi.col.reset} {pc.attr.pre.dotmode} {ansi.col.red}|{ansi.col.reset} {pc.attr.int.dotmode} implodes brilliantly, with random streaks of red and swatches of missing text.
To explain, anything marked {ansi.col.red} marks an ansi red code, reset escapes the color block, and {pc.attr.XXX.dotmode} displays a number between 1 and 10 in "o"s.
As others have noted, this is a perfect case for a parsing engine. Regular expressions don't tend to handle nested pairs well.
Treetop is an awesome PEG parser that you might be interested in taking a look at. The main idea is that you define everything that you want to parse (including whitespace) inside rules. The rules allow you to recursively parse things like bracket pairs.
Here's an example grammar for creating arrays of strings from nested bracket pairs. Usually grammars are defined in a separate file, but for simplicity I included the grammar at the end and loaded it with Ruby's DATA constant.
require 'treetop'
Treetop.load_from_string DATA.read
parser = BracketParser.new
p parser.parse('This is {a {string}}').value
#=> ["This is ", ["a ", ["string"]]]
p parser.parse('This {is} a {string}').value
#=> ["This ", ["is"], " a ", ["string"]]
__END__
grammar Bracket
rule string
(brackets / not_brackets)+
{
def value
elements.map{|e| e.value }
end
}
end
rule brackets
'{' string '}'
{
def value
elements[1].value
end
}
end
rule not_brackets
[^{}]+
{
def value
text_value
end
}
end
end
Instead of fitting ever more complex regular expressions to this problem, I would recommend looking into one of Ruby's grammar-based parsing engines. Most of them let you design recursive and nested grammars.
parslet might be a good place to start for your problem. The erb-alike example, although it does not demonstrate nesting, might be closest to your needs: https://github.com/kschiess/parslet/blob/master/example/erb.rb
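If pulling in a parser gem is not an option, the same nested structure can be built by hand with StringScanner and an explicit stack. This is only a sketch under the assumption that { and } are the sole special characters and cannot be escaped; the method name parse_tags is hypothetical.

```ruby
require 'strscan'

# Hypothetical helper: returns nested arrays, one array per {...} pair.
def parse_tags(text)
  scanner = StringScanner.new(text)
  stack = [[]] # stack.last is the node currently being filled
  until scanner.eos?
    if scanner.scan(/\{/)
      stack.push([])
    elsif scanner.scan(/\}/)
      raise ArgumentError, "unexpected } at #{scanner.pos}" if stack.size == 1
      finished = stack.pop
      stack.last << finished
    else
      stack.last << scanner.scan(/[^{}]+/)
    end
  end
  raise ArgumentError, "unclosed {" unless stack.size == 1
  stack.first
end

p parse_tags('This is {a {string}}')
p parse_tags('This {is} a {string}')
```

Each inner array can then be processed and replaced from the inside out, which is the order nested tags need.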

Optimising ruby regexp -- lots of match groups

I'm working on a Ruby-based lexer. To improve performance, I joined all the tokens' regexps into one big regexp with named match groups. The resulting regexp looks like:
/\A(?<__anonymous_-1038694222803470993>(?-mix:\n+))|\A(?<__anonymous_-1394418499721420065>(?-mix:\/\/[\A\n]*))|\A(?<__anonymous_3077187815313752157>(?-mix:include\s+"[\A"]+"))|\A(?<LET>(?-mix:let\s))|\A(?<IN>(?-mix:in\s))|\A(?<CLASS>(?-mix:class\s))|\A(?<DEF>(?-mix:def\s))|\A(?<DEFM>(?-mix:defm\s))|\A(?<MULTICLASS>(?-mix:multiclass\s))|\A(?<FUNCNAME>(?-mix:![a-zA-Z_][a-zA-Z0-9_]*))|\A(?<ID>(?-mix:[a-zA-Z_][a-zA-Z0-9_]*))|\A(?<STRING>(?-mix:"[\A"]*"))|\A(?<NUMBER>(?-mix:[0-9]+))/
I'm matching it to my string producing a MatchData where exactly one token is parsed:
bigregex =~ "\n ... garbage"
puts $~.inspect
Which outputs
#<MatchData
"\n"
__anonymous_-1038694222803470993:"\n"
__anonymous_-1394418499721420065:nil
__anonymous_3077187815313752157:nil
LET:nil
IN:nil
CLASS:nil
DEF:nil
DEFM:nil
MULTICLASS:nil
FUNCNAME:nil
ID:nil
STRING:nil
NUMBER:nil>
So, the regex actually matched the "\n" part. Now I need to figure out which match group it belongs to (it's clearly visible from the #inspect output that it's __anonymous_-1038694222803470993, but I need to get it programmatically).
I could not find any option other than iterating over #names:
m.names.each do |n|
if m[n]
type = n.to_sym
resolved_type = (n.start_with?('__anonymous_') ? nil : type)
val = m[n]
break
end
end
which verifies that the match group did have a match.
The problem here is that it's slow: I spend about 10% of the time in this loop, and another 8% grabbing @input[@pos..-1] so that \A matches at the start of the string as expected (I do not discard input, I just shift @pos within it).
You can check the full code at GH repo.
Any ideas on how to make it at least a bit faster? Is there any option to figure the "successful" match group easier?
You can do this using the MatchData methods .captures() and .names():
matching_string = "\n ...garbage" # or whatever this really is in your code
@input = matching_string.match bigregex # bigregex = your regex
arr = @input.captures
arr.each_with_index do |value, index|
if not value.nil?
the_name_you_want = @input.names[index]
end
end
Or if you expect multiple successful values, you could do:
success_names_arr = []
success_names_arr.push(@input.names[index]) # within the above loop
Pretty similar to your original idea, but if you're looking for efficiency, the .captures() method should help with that.
I may have misunderstood this completely, but I'm assuming that all but one token is nil and that the non-nil one is the one you're after?
If so then, depending on the flavour of regex you're using, you could use a negative lookahead to check for a non-nil value
([^\n:]+:(?!nil)[^\n\>]+)
This will match the whole token ie NAME:value.
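A sketch that combines both concerns from the question, using a hypothetical miniature of the combined pattern (TOKENS, NAMES and next_token are illustrative names, not the asker's code). It assumes every capture group is named, so the index of the first non-nil capture lines up with Regexp#names. It also uses \G instead of \A together with the position argument of Regexp#match, so the input never has to be sliced. Whether this is measurably faster would need profiling.

```ruby
# Hypothetical miniature of the combined lexer pattern. \G anchors each
# alternative at the offset given to Regexp#match, so no substring is created.
TOKENS = /\G(?<NEWLINE>\n+)|\G(?<LET>let\s)|\G(?<ID>[a-zA-Z_][a-zA-Z0-9_]*)|\G(?<NUMBER>[0-9]+)/

# Computed once: group names as symbols, in the same order as MatchData#captures.
NAMES = TOKENS.names.map(&:to_sym).freeze

# Returns [token_type, token_text, next_position], or nil if nothing matches at pos.
def next_token(input, pos)
  m = TOKENS.match(input, pos)
  return nil unless m
  index = m.captures.index { |capture| !capture.nil? }
  [NAMES[index], m[0], m.end(0)]
end

p next_token("let x 42", 0)
```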

How do I obtain the (possibly nested) capture groups in a regular expression?

Given a regular expression:
/say (hullo|goodbye) to my lovely (.*)/
and a string:
"my $2 is happy that you said $1"
What is the best way to obtain a regular expression from the string that contains the capture groups in the regular expression? That is:
/my (.*) is happy that you said (hullo|goodbye)/
Clearly I could use regular expressions on a string representation of the original regular expression, but this would probably present difficulties with nested capture groups.
I'm using Ruby. My simple implementation so far goes along the lines of:
class Regexp
def capture_groups
self.to_s[1..-2].scan(/\(.*?\)/)
end
end
regexp.capture_groups.each_with_index do |capture, idx|
string.gsub!("$#{idx+1}", capture)
end
/^#{string}$/
I guess you need to write your own function that does this:
create empty dictionaries groups and active_groups and initialize counter = 1
iterate over the characters in the string representation:
if current character = '(' and previous character != '\':
add counter key to active_groups and increase counter
add current character to all active_groups
if current character = ')' and previous character != '\':
remove the last item (key, value) from active_groups and add it to groups
convert groups to an array if needed
You might also want to implement:
ignore = True between unescaped '[' and ']'
reset counter if current character = '|' and active_groups is empty (or decrease counter if active_groups is not empty)
UPDATES from comments:
ignore non-capturing groups starting with '(?:'
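The character scan described above might look roughly like this in Ruby. It is a simplified sketch, not a full regex parser: the method name capture_group_sources is hypothetical, every group starting with (? is treated as non-capturing (so named groups are skipped too), and the counter-reset rule for '|' is left out.

```ruby
# Hypothetical helper: returns the source text of each capturing group,
# ordered by the position of its opening parenthesis.
def capture_group_sources(source)
  groups = {}      # group number => group text
  open = []        # stack of [group number or nil, start index]
  counter = 0
  in_class = false # true between an unescaped "[" and "]"
  escaped = false  # true right after a backslash

  source.each_char.with_index do |char, i|
    if escaped
      escaped = false
    elsif char == "\\"
      escaped = true
    elsif in_class
      in_class = false if char == "]"
    elsif char == "["
      in_class = true
    elsif char == "("
      if source[i + 1] == "?"
        open.push([nil, i]) # assumption: every (?...) group is non-capturing
      else
        counter += 1
        open.push([counter, i])
      end
    elsif char == ")"
      number, start = open.pop
      groups[number] = source[start..i] if number
    end
  end

  groups.sort.map(&:last)
end

p capture_group_sources(/say (hullo|goodbye) to my lovely (.*)/.source)
```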
So once I realised that what I actually need is a regular expression parser, things started falling into place. I discovered this project:
https://github.com/dche/randall
which can generate strings that match a regular expression. It defines a regular expression grammar using http://treetop.rubyforge.org/. Unfortunately the grammar it defines is incomplete, though useful for many cases.
I also stumbled past https://github.com/mjijackson/citrus, which does a similar job to Treetop.
I then found this mind-blowing gem:
https://github.com/ammar/regexp_parser
which defines a full regexp grammar and parses a regular expression into a walkable tree. I was then able to walk the tree and pick out the parts of the tree I wanted (the capture groups).
Unfortunately there was a minor bug, fixed in my fork: https://github.com/LaunchThing/regexp_parser.
Here's my patch to Regexp, that uses the fixed gem:
class Regexp
def parse
Regexp::Parser.parse(self.to_s, 'ruby/1.9')
end
def walk(e = self.parse, depth = 0, &block)
block.call(e, depth)
unless e.expressions.empty?
e.each do |s|
walk(s, depth+1, &block)
end
end
end
def capture_groups
capture_groups = []
walk do |e, depth|
capture_groups << e.to_s if Regexp::Expression::Group::Capture === e
end
capture_groups
end
end
I can then use this in my application to make replacements in my string - the final goal - along these lines:
from = /^\/search\/(.*)$/
to = '/buy/$1'
to_as_regexp = to.dup
# I should probably make this gsub tighter
from.capture_groups.each_with_index do |capture, idx|
to_as_regexp.gsub!("$#{idx+1}", capture)
end
to_as_regexp = /^#{to_as_regexp}$/
# to_as_regexp = /^\/buy\/(.*)$/
I hope this helps someone else out.

Remove unmatched parentheses from a string

I want to remove "un-partnered" parentheses from a string.
I.e., all ('s should be removed unless they're followed by a ) somewhere in the string. Likewise, all )'s not preceded by a ( somewhere in the string should be removed.
Ideally the algorithm would take into account nesting as well.
E.g.:
"(a)".remove_unmatched_parents # => "(a)"
"a(".remove_unmatched_parents # => "a"
")a(".remove_unmatched_parents # => "a"
Instead of a regex, consider a pushdown automaton, perhaps. (I'm not sure whether Ruby regular expressions can handle this; I believe Perl's can.)
A (very trivialized) process may be:
For each character in the input string:
If it is not a '(' or ')' then just append it to the output
If it is a '(' increase a seen_parens counter and add it
If it is a ')' and seen_parens is > 0, add it and decrease seen_parens. Otherwise skip it.
At the end of the process, if seen_parens is > 0 then remove that many parens, starting from the end. (This step can be merged into the above process with use of a stack or recursion.)
The entire process is O(n), even if it has relatively high overhead.
Happy coding.
The following uses oniguruma. Oniguruma is the regex engine built in if you are using ruby1.9. If you are using ruby1.8, see this: oniguruma.
Update
I had been lazy and just copied and pasted someone else's regex, which seemed to have a problem.
So now I have written my own. I believe it should work now.
class String
NonParenChar = /[^\(\)]/
def remove_unmatched_parens
self[/
(?:
(?<balanced>
\(
(?:\g<balanced>|#{NonParenChar})*
\)
)
|#{NonParenChar}
)+
/x]
end
end
(?<name>regex1) names the (sub)regex regex1 as name, and makes it possible to call it.
\g<name> is a subregex call that represents regex1. Note that \g<name> does not represent the particular string that matched regex1; it represents regex1 itself. In fact, it is possible to embed \g<name> within (?<name>...), which makes the pattern recursive.
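As a small illustration of a group calling itself, the following sketch checks whether a whole string is one balanced parenthesised expression. The constant BALANCED and the method balanced? are illustrative names, not taken from the answer.

```ruby
# Illustrative pattern: the group named "group" calls itself via \g<group>,
# so arbitrarily deep nesting is matched.
BALANCED = /\A(?<group>\((?:[^()]|\g<group>)*\))\z/

# True only if the whole string is a single balanced (...) expression.
def balanced?(text)
  !(BALANCED =~ text).nil?
end

puts balanced?("(a(b)c)")
puts balanced?("(a(b)c")
```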
Update 2
This is simpler.
class String
def remove_unmatched_parens
self[/
(?<valid>
\(\g<valid>*\)
|[^()]
)+
/x]
end
end
Build a simple LR parser:
tokenize, token, stack = false, "", []
")(a))(()(asdf)(".each_char do |c|
case c
when '('
tokenize = true
token = c
when ')'
if tokenize
token << c
stack << token
end
tokenize = false
when /\w/
token << c if tokenize
end
end
result = stack.join
puts result
Running it yields:
wesbailey@feynman:~/code_katas> ruby test.rb
(a)()(asdf)
I don't agree with the folks modifying the String class, because you should never reopen a standard class. Regexes are pretty brittle for parsing and hard to maintain. I couldn't imagine coming back to the previous solutions six months from now and trying to remember what they were doing!
Here's my solution, based on @pst's algorithm:
require 'strscan' # StringScanner lives in the strscan standard library
class String
def remove_unmatched_parens
scanner = StringScanner.new(dup)
output = ''
paren_depth = 0
while char = scanner.get_byte
if char == "("
paren_depth += 1
output << char
elsif char == ")"
output << char and paren_depth -= 1 if paren_depth > 0
else
output << char
end
end
paren_depth.times{ output.reverse!.sub!('(', '').reverse! }
output
end
end
Algorithm:
Traverse through the given string.
While doing that, keep track of "(" positions in a stack.
If any ")" found, remove the top element from the stack.
If stack is empty, remove the ")" from the string.
In the end, we can have positions of unmatched braces, if any.
Java code is available at http://a2ajp.blogspot.in/2014/10/remove-unmatched-parenthesis-from-given.html
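The steps above can be sketched in Ruby as follows. This is a minimal version written as a standalone function rather than a String patch; it is not a translation of the linked Java code, and the variable names are my own.

```ruby
require 'set'

# Removes parentheses that have no partner, keeping all other characters.
def remove_unmatched_parens(text)
  open_positions = []   # indexes of "(" still waiting for a ")"
  drop = Set.new        # indexes of parentheses to delete

  text.each_char.with_index do |char, i|
    case char
    when "("
      open_positions.push(i)
    when ")"
      if open_positions.empty?
        drop << i            # nothing to close: unmatched ")"
      else
        open_positions.pop   # closes the most recent "("
      end
    end
  end

  drop.merge(open_positions) # whatever is still open is unmatched

  kept = []
  text.each_char.with_index { |char, i| kept << char unless drop.include?(i) }
  kept.join
end

puts remove_unmatched_parens(")(a))(()(asdf)(")
```

Tracking positions rather than a bare counter means the parentheses removed are exactly the ones that never found a partner.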
