Parse string with nested brackets - ruby

String to parse (without spaces):
"instrumentalist ( bass (upright , fretless , 5-string ) , guitar ( electric , acoustic ) , trumpet ), teacher , songwriter, producer"
I need to get this structure in Ruby
["instrumentalist",[["bass",["upright","fretless","5-string"]],["guitar",["electric","acoustic"]],["trumpet"]],["teacher"],["songwriter"],["producer"]]
Because of nested (,) and , String#partition couldn't help me. I don't really know is there a fancy RegEx that could extract such type of strings. Or do I have to go with a lexer?

A regex on its own isn't the right sort of thing for this type of problem, even though the basic process is simple: walk through your string looking for commas or brackets. When you find a comma add the previous read characters to the current nesting. When you find an open bracket then your nesting level goes up by 1, when you find a close bracket decrease it.
StringScanner is designed for this sort of stuff as it allows us to walk through the string while maintaining, some state, in this case a stack that mirrors your opening and closing brackets. Something like this does the job for me
require 'strscan'
def parse input
scanner = StringScanner.new input
stack = [[]]
while string = scanner.scan(/[^(),]+/)
case scanner.scan /[(),]+/
when '('
new_nesting = [string, []]
stack.last << new_nesting
stack << new_nesting[1]
when ')'
scanner.scan(/,/)
stack.last << string
stack.pop
else
stack.last << string
end
end
stack.last
end

Related

Longest word test from appacademy practice in ruby

I am trying to do this test and there are bunch of solutions online and here but I first want to figure out why my solution is wrong even though it seems that it puts right results when I enter certain strings :
Here is what they are asking :
Write a method that takes in a string. Return the longest word in the
string. You may assume that the string contains only letters and
spaces.
You may use the String split method to aid you in your quest.
Here is my solution where I thought I could turn string into array, sort it from max length descending and then just print first element in that new string like this :
def longest_word(sentence)
sentence = sentence.split
sentence.sort_by! { |longest| -longest.length }
return sentence[0]
end
That doesn't seem to work obviously since their test gives me all false..here is the test :
puts("\nTests for #longest_word")
puts("===============================================")
puts(
'longest_word("short longest") == "longest": ' +
(longest_word('short longest') == 'longest').to_s
)
puts(
'longest_word("one") == "one": ' +
(longest_word('one') == 'one').to_s
)
puts(
'longest_word("abc def abcde") == "abcde": ' +
(longest_word('abc def abcde') == 'abcde').to_s
)
puts("===============================================")
So the question is why? And can I just fix my code or the idea is all wrong and I need to do it completely different?
str = "Which word in this string is longest?"
r = /[[:alpha:]]+/
str.scan(r).max_by(&:length)
#=> "longest"
This regular expression reads, "match one or more characters". The outer brackets constitute a character class, meaning one of the characters within the brackets must be matched.
To deal with words that are hyphenated or contain single quotes, the following is an imperfect modification1:
str = "Who said that chicken is finger-licken' good?"
r = /[[[:alpha:]]'-]+/
str.scan(r).max_by(&:length)
#=> "finger-licken'"
This regular expression reads, "match one or more characters that are a letter, apostrophe or hyphen". The outer brackets constitute a character class, meaning one of the characters within the brackets must be matched.
1 I've successfully used "finger-licken'" in scrabble.
I'd write it something like:
str = "Write a method that takes in a string"
str.split.sort_by(&:length).last # => "string"

How do I break up a string around "{tags}"?

I am writing a function which can have two potential forms of input:
This is {a {string}}
This {is} a {string}
I call the sub-strings wrapped in curly-brackets "tags". I could potentially have any number of tags in a string, and they could be nested arbitrarily deep.
I've tried writing a regular expression to grab the tags, which of course fails on the nested tags, grabbing {a {string}, missing the second curly bracket. I can see it as a recursive problem, but after staring at the wrong answer too long I feel like I'm blind to seeing something really obvious.
What can I do to separate out the potential tags into parts so that they can be processed and replaced?
The More Complicated Version
def parseTags( oBody, szText )
if szText.match(/\{(.*)\}/)
szText.scan(/\{(.*)\}/) do |outers|
outers.each do |blah|
if blah.match(/(.*)\}(.*)\{(.*)/)
blah.scan(/(.*)\}(.*)\{(.*)/) do |inners|
inners.each do |tags|
szText = szText.sub("\{#{tags}\}", parseTags( oBody, tags ))
end
end
else
szText = szText.sub("\{#{blah}\}", parseTags( oBody, blah ))
end
end
end
end
if szText.match(/(\w+)\.(\w+)(?:\.([A-Za-z0-9.\[\]": ]*))/)
func = $1+"_"+$2
begin
szSub = self.send func, oBody, $3
rescue Exception=>e
szSub = "{Error: Function #{$1}_#{$2} not found}"
$stdout.puts "DynamicIO Error Encountered: #{e}"
end
szText = szText.sub("#{$1}.#{$2}#{$3!=nil ? "."+$3 : ""}", szSub)
end
return szText
end
This was the result of tinkering too long. It's not clean, but it did work for a case similar to "1" - {help.divider.red.sys.["{pc.login}"]} is replaced with ---------------[ Duwnel ]---------------. However, {pc.attr.str.dotmode} {ansi.col.red}|{ansi.col.reset} {pc.attr.pre.dotmode} {ansi.col.red}|{ansi.col.reset} {pc.attr.int.dotmode} implodes brilliantly, with random streaks of red and swatches of missing text.
To explain, anything marked {ansi.col.red} marks an ansi red code, reset escapes the color block, and {pc.attr.XXX.dotmode} displays a number between 1 and 10 in "o"s.
As others have noted, this is a perfect case for a parsing engine. Regular expressions don't tend to handle nested pairs well.
Treetop is an awesome PEG parser that you might be interested in taking a look at. The main idea is that you define everything that you want to parse (including whitespace) inside rules. The rules allow you to recursively parse things like bracket pairs.
Here's an example grammar for creating arrays of strings from nested bracket pairs. Usually grammars are defined in a separate file, but for simplicity I included the grammar at the end and loaded it with Ruby's DATA constant.
require 'treetop'
Treetop.load_from_string DATA.read
parser = BracketParser.new
p parser.parse('This is {a {string}}').value
#=> ["This is ", ["a ", ["string"]]]
p parser.parse('This {is} a {string}').value
#=> ["This ", ["is"], " a ", ["string"]]
__END__
grammar Bracket
rule string
(brackets / not_brackets)+
{
def value
elements.map{|e| e.value }
end
}
end
rule brackets
'{' string '}'
{
def value
elements[1].value
end
}
end
rule not_brackets
[^{}]+
{
def value
text_value
end
}
end
end
I would recommend instead of fitting more complex regular expressions to this problem, that you look into one of Ruby's grammar-based parsing engines. It is possible to design recursive and nested grammars in most of these.
parslet might be a good place to start for your problem. The erb-alike example, although it does not demonstrate nesting, might be closest to your needs: https://github.com/kschiess/parslet/blob/master/example/erb.rb

How do I obtain the (possibly nested) capture groups in a regular expression?

Given a regular expression:
/say (hullo|goodbye) to my lovely (.*)/
and a string:
"my $2 is happy that you said $1"
What is the best way to obtain a regular expression from the string that contains the capture groups in the regular expression? That is:
/my (.*) is happy that you said (hullo|goodbye)/
Clearly I could use regular expressions on a string representation of the original regular expression, but this would probably present difficulties with nested capture groups.
I'm using Ruby. My simple implementation so far goes along the lines of:
class Regexp
def capture_groups
self.to_s[1..-2].scan(/\(.*?\)/)
end
end
regexp.capture_groups.each_with_index do |capture, idx|
string.gsub!("$#{idx+1}", capture)
end
/^#{string}$/
i guess you need to create your own function that would do this:
create empty dictionaries groups and active_groups and initialize counter = 1
iterate over the characters in the string representation:
if current character = '(' and previous charaster != \:
add counter key to active_groups and increase counter
add current character to all active_groups
if current character = ')' and previous charaster != \:
remove the last item (key, value) from active_groups and add it to groups
convert groups to an array if needed
You might also want to implement:
ignore = True between unescaped '[' and ']'
reset counter if current character = '|' and active_groups is empty (or decrease counter if active_group is not empty)
UPDATES from comments:
ingore non-capturing groups starting with '(?:'
So once I realised that what I actually need is a regular expression parser, things started falling into place. I discovered this project:
https://github.com/dche/randall
which can generate strings that match a regular expression. It defines a regular expression grammar using http://treetop.rubyforge.org/. Unfortunately the grammar it defines is incomplete, though useful for many cases.
I also stumbled past https://github.com/mjijackson/citrus, which does a similar job to Treetop.
I then found this mind blowing gem:
https://github.com/ammar/regexp_parser
which defines a full regexp grammar and parses a regular expression into a walkable tree. I was then able to walk the tree and pick out the parts of the tree I wanted (the capture groups).
Unfortunately there was a minor bug, fixed in my fork: https://github.com/LaunchThing/regexp_parser.
Here's my patch to Regexp, that uses the fixed gem:
class Regexp
def parse
Regexp::Parser.parse(self.to_s, 'ruby/1.9')
end
def walk(e = self.parse, depth = 0, &block)
block.call(e, depth)
unless e.expressions.empty?
e.each do |s|
walk(s, depth+1, &block)
end
end
end
def capture_groups
capture_groups = []
walk do |e, depth|
capture_groups << e.to_s if Regexp::Expression::Group::Capture === e
end
capture_groups
end
end
I can then use this in my application to make replacements in my string - the final goal - along these lines:
from = /^\/search\/(.*)$/
to = '/buy/$1'
to_as_regexp = to.dup
# I should probably make this gsub tighter
from.capture_groups.each_with_index do |capture, idx|
to_as_regexp.gsub!("$#{idx+1}", capture)
end
to_as_regexp = /^#{to_as_regexp}$/
# to_as_regexp = /^\/buy\/(.*)$/
I hope this helps someone else out.

Remove unmatched parentheses from a string

I want to remove "un-partnered" parentheses from a string.
I.e., all ('s should be removed unless they're followed by a ) somewhere in the string. Likewise, all )'s not preceded by a ( somewhere in the string should be removed.
Ideally the algorithm would take into account nesting as well.
E.g.:
"(a)".remove_unmatched_parents # => "(a)"
"a(".remove_unmatched_parents # => "a"
")a(".remove_unmatched_parents # => "a"
Instead of a regex, consider a push-down automata, perhaps. (I'm not sure if Ruby regular expressions can handle this, I believe Perl's can).
A (very trivialized) process may be:
For each character in the input string:
If it is not a '(' or ')' then just append it to the output
If it is a '(' increase a seen_parens counter and add it
If it is a ')' and seen_parens is > 0, add it and decrease seen_parens. Otherwise skip it.
At the end of the process, if seen_parens is > 0 then remove that many parens, starting from the end. (This step can be merged into the above process with use of a stack or recursion.)
The entire process is O(n), even if a relatively high overhead
Happy coding.
The following uses oniguruma. Oniguruma is the regex engine built in if you are using ruby1.9. If you are using ruby1.8, see this: oniguruma.
Update
I had been so lazy to just copy and paste someone else's regex. It seemed to have problem.
So now, I wrote my own. I believe it should work now.
class String
NonParenChar = /[^\(\)]/
def remove_unmatched_parens
self[/
(?:
(?<balanced>
\(
(?:\g<balanced>|#{NonParenChar})*
\)
)
|#{NonParenChar}
)+
/x]
end
end
(?<name>regex1) names the (sub)regex regex1 as name, and makes it possible to be called.
?g<name> will be a subregex that represents regex1. Note here that ?g<name> does not represent a particular string that matched regex1, but it represents regex1 itself. In fact, it is possible to embed ?g<name> within (?<name>...).
Update 2
This is simpler.
class String
def remove_unmatched_parens
self[/
(?<valid>
\(\g<valid>*\)
|[^()]
)+
/x]
end
end
Build a simple LR parser:
tokenize, token, stack = false, "", []
")(a))(()(asdf)(".each_char do |c|
case c
when '('
tokenize = true
token = c
when ')'
if tokenize
token << c
stack << token
end
tokenize = false
when /\w/
token << c if tokenize
end
end
result = stack.join
puts result
running yields:
wesbailey#feynman:~/code_katas> ruby test.rb
(a)()(asdf)
I don't agree with the folks modifying the String class because you should never open a standard class. Regexs are pretty brittle for parser and hard to support. I couldn't imagine coming back to the previous solutions 6 months for now and trying to remember what they were doing!
Here's my solution, based on #pst's algorithm:
class String
def remove_unmatched_parens
scanner = StringScanner.new(dup)
output = ''
paren_depth = 0
while char = scanner.get_byte
if char == "("
paren_depth += 1
output << char
elsif char == ")"
output << char and paren_depth -= 1 if paren_depth > 0
else
output << char
end
end
paren_depth.times{ output.reverse!.sub!('(', '').reverse! }
output
end
end
Algorithm:
Traverse through the given string.
While doing that, keep track of "(" positions in a stack.
If any ")" found, remove the top element from the stack.
If stack is empty, remove the ")" from the string.
In the end, we can have positions of unmatched braces, if any.
Java code:
Present # http://a2ajp.blogspot.in/2014/10/remove-unmatched-parenthesis-from-given.html

Ruby parse string

I have a string
input = "maybe (this is | that was) some ((nice | ugly) (day |night) | (strange (weather | time)))"
How is the best method in Ruby to parse this string ?
I mean the script should be able to build sententes like this :
maybe this is some ugly night
maybe that was some nice night
maybe this was some strange time
And so on, you got the point...
Should I read the string char by char and bulid a state machine with a stack to store the parenthesis values for later calculation, or is there a better approach ?
Maybe a ready, out of the box library for such purpose ?
Try Treetop. It is a Ruby-like DSL to describe grammars. Parsing the string you've given should be quite easy, and by using a real parser you'll easily be able to extend your grammar later.
An example grammar for the type of string that you want to parse (save as sentences.treetop):
grammar Sentences
rule sentence
# A sentence is a combination of one or more expressions.
expression* <Sentence>
end
rule expression
# An expression is either a literal or a parenthesised expression.
parenthesised / literal
end
rule parenthesised
# A parenthesised expression contains one or more sentences.
"(" (multiple / sentence) ")" <Parenthesised>
end
rule multiple
# Multiple sentences are delimited by a pipe.
sentence "|" (multiple / sentence) <Multiple>
end
rule literal
# A literal string contains of word characters (a-z) and/or spaces.
# Expand the character class to allow other characters too.
[a-zA-Z ]+ <Literal>
end
end
The grammar above needs an accompanying file that defines the classes that allow us to access the node values (save as sentence_nodes.rb).
class Sentence < Treetop::Runtime::SyntaxNode
def combine(a, b)
return b if a.empty?
a.inject([]) do |values, val_a|
values + b.collect { |val_b| val_a + val_b }
end
end
def values
elements.inject([]) do |values, element|
combine(values, element.values)
end
end
end
class Parenthesised < Treetop::Runtime::SyntaxNode
def values
elements[1].values
end
end
class Multiple < Treetop::Runtime::SyntaxNode
def values
elements[0].values + elements[2].values
end
end
class Literal < Treetop::Runtime::SyntaxNode
def values
[text_value]
end
end
The following example program shows that it is quite simple to parse the example sentence that you have given.
require "rubygems"
require "treetop"
require "sentence_nodes"
str = 'maybe (this is|that was) some' +
' ((nice|ugly) (day|night)|(strange (weather|time)))'
Treetop.load "sentences"
if sentence = SentencesParser.new.parse(str)
puts sentence.values
else
puts "Parse error"
end
The output of this program is:
maybe this is some nice day
maybe this is some nice night
maybe this is some ugly day
maybe this is some ugly night
maybe this is some strange weather
maybe this is some strange time
maybe that was some nice day
maybe that was some nice night
maybe that was some ugly day
maybe that was some ugly night
maybe that was some strange weather
maybe that was some strange time
You can also access the syntax tree:
p sentence
The output is here.
There you have it: a scalable parsing solution that should come quite close to what you want to do in about 50 lines of code. Does that help?

Resources