I want to remove "un-partnered" parentheses from a string.
I.e., all ('s should be removed unless they're followed by a ) somewhere in the string. Likewise, all )'s not preceded by a ( somewhere in the string should be removed.
Ideally the algorithm would take into account nesting as well.
E.g.:
"(a)".remove_unmatched_parens # => "(a)"
"a(".remove_unmatched_parens # => "a"
")a(".remove_unmatched_parens # => "a"
Instead of a regex, consider a pushdown automaton, perhaps. (I'm not sure whether Ruby regular expressions can handle this; I believe Perl's can.)
A (very simplified) process may be:
For each character in the input string:
  If it is not a '(' or ')', just append it to the output.
  If it is a '(', increase a seen_parens counter and append it.
  If it is a ')' and seen_parens is > 0, append it and decrease seen_parens. Otherwise skip it.
At the end of the process, if seen_parens is > 0, remove that many '(' characters, starting from the end. (This step can be merged into the above process with the use of a stack or recursion.)
The entire process is O(n), even if it has relatively high overhead.
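The stack variant mentioned in the last step could be sketched roughly as follows. This is only an illustration: the method name strip_unmatched_parens and the variable names are made up for this example, and instead of removing parens "from the end" it drops exactly those '(' characters that never found a partner.

```ruby
# Hypothetical helper: one pass over the string, remembering where each
# still-open '(' was written so it can be dropped if it is never closed.
def strip_unmatched_parens(str)
  kept = []            # characters that will (probably) be kept
  open_positions = []  # indexes into `kept` of '(' still waiting for a ')'

  str.each_char do |char|
    case char
    when '('
      open_positions << kept.length
      kept << char
    when ')'
      next if open_positions.empty? # unmatched ')': skip it
      open_positions.pop
      kept << char
    else
      kept << char
    end
  end

  # Whatever is left on the stack is an unmatched '('.
  open_positions.each { |index| kept[index] = nil }
  kept.compact.join
end
```

With this sketch, "(a(b)" becomes "a(b)", whereas removing the last '(' would give "(ab)"; both results are balanced.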
Happy coding.
The following uses Oniguruma, the regex engine built into Ruby 1.9. If you are using Ruby 1.8, see this: oniguruma.
Update
I was lazy and just copied and pasted someone else's regex, which turned out to have a problem.
So I have now written my own, and I believe it works.
class String
  NonParenChar = /[^\(\)]/
  def remove_unmatched_parens
    self[/
      (?:
        (?<balanced>
          \(
          (?:\g<balanced>|#{NonParenChar})*
          \)
        )
        |#{NonParenChar}
      )+
    /x]
  end
end
(?<name>regex1) names the (sub)regex regex1 as name, and makes it possible to call it.
\g<name> is a subregex that represents regex1. Note that \g<name> does not represent the particular string that matched regex1; it represents regex1 itself. In fact, it is possible to embed \g<name> within (?<name>...), which makes the pattern recursive.
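To make the difference between calling a group and referring back to its match concrete, here is a small sketch. The group names digit and group and the sample strings are invented for this illustration.

```ruby
# \g<digit> re-runs the sub-pattern \d, so any digit is accepted after the dash.
call_pattern = /\A(?<digit>\d)-\g<digit>\z/

# \k<digit> is a backreference: it needs the very same text that was captured.
backref_pattern = /\A(?<digit>\d)-\k<digit>\z/

# Calling a group from inside itself gives a recursive pattern, here one that
# accepts a single, fully balanced parenthesised group.
nested_pattern = /\A(?<group>\((?:[^()]|\g<group>)*\))\z/

puts call_pattern.match?("1-2")        # the call accepts a different digit
puts backref_pattern.match?("1-2")     # the backreference does not
puts nested_pattern.match?("(a(b)c)")  # balanced nesting is accepted
```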
Update 2
This is simpler.
class String
  def remove_unmatched_parens
    self[/
      (?<valid>
        \(\g<valid>*\)
        |[^()]
      )+
    /x]
  end
end
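One thing worth checking with the pattern above: String#[] returns only the first match, so anything after an unmatched paren is dropped along with it ("a(b" gives "a"). A possible variant, sketched here with made-up names (VALID_RUN, first_balanced_run, join_balanced_runs) and written as plain methods rather than a String patch, collects every match and joins them:

```ruby
# Same pattern as in the answer above, stored in a constant for reuse.
VALID_RUN = /
  (?<valid>
    \(\g<valid>*\)
    |[^()]
  )+
/x

# Behaviour of the answer's approach: only the first balanced run is returned.
def first_balanced_run(str)
  str[VALID_RUN]
end

# Hypothetical variant: gsub without a block yields every match in turn,
# so joining them keeps the text on both sides of an unmatched paren.
def join_balanced_runs(str)
  str.gsub(VALID_RUN).to_a.join
end

puts first_balanced_run("a(b")   # text after the stray '(' is lost
puts join_balanced_runs("a(b")   # both sides are kept
```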
Build a simple LR parser:
tokenize, token, stack = false, "", []
")(a))(()(asdf)(".each_char do |c|
  case c
  when '('
    tokenize = true
    token = c
  when ')'
    if tokenize
      token << c
      stack << token
    end
    tokenize = false
  when /\w/
    token << c if tokenize
  end
end
result = stack.join
puts result
Running it yields:
wesbailey@feynman:~/code_katas> ruby test.rb
(a)()(asdf)
I don't agree with the folks modifying the String class, because you should never reopen a standard class. Regexes are pretty brittle for parsing and hard to maintain. I couldn't imagine coming back to the previous solutions six months from now and trying to remember what they were doing!
Here's my solution, based on @pst's algorithm:
require 'strscan'

class String
  def remove_unmatched_parens
    scanner = StringScanner.new(dup)
    output = ''
    paren_depth = 0
    while char = scanner.getch
      if char == "("
        paren_depth += 1
        output << char
      elsif char == ")"
        # Keep a ')' only if an earlier '(' is still waiting for it.
        if paren_depth > 0
          output << char
          paren_depth -= 1
        end
      else
        output << char
      end
    end
    # Drop the leftover '(' characters, starting from the end.
    paren_depth.times { output.reverse!.sub!('(', '').reverse! }
    output
  end
end
Algorithm:
Traverse the given string.
While doing so, keep track of the positions of "(" characters in a stack.
When a ")" is found, remove the top element from the stack.
If the stack is empty at that point, remove the ")" from the string.
At the end, any positions left in the stack are those of unmatched "(" characters, which can then be removed as well.
Java code:
Present @ http://a2ajp.blogspot.in/2014/10/remove-unmatched-parenthesis-from-given.html
Using the file oliver.txt, write a method called count_paragraphs that counts the number of paragraphs in the text.
In oliver.txt the paragraph delimiter consists of two or more consecutive newline characters, like this: \n\n, \n\n\n, or even \n\n\n\n.
Your method should return either the number of paragraphs or nil.
I have this code but it doesn't work:
def count_paragraphs(some_file)
  file_content = open(some_file).read()
  count = 0
  file_content_split = file_content.split('')
  file_content_split.each_index do |index|
    count += 1 if file_content_split[index] == "\n" && file_content_split[index + 1] == "\n"
  end
  return count
end
# test code
p count_paragraphs("oliver.txt")
It's much easier to either count it directly:
file_content.split(/\n\n+/).count
or count the separators and add one:
file_content.scan(/\n\n+/).count + 1
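The two one-liners agree on well-formed text but can differ at the edges, which a small sketch makes visible. The method names and sample strings below are invented for this illustration; they are not from the exercise.

```ruby
# Count the pieces between separators.
def paragraphs_by_split(text)
  text.split(/\n\n+/).count
end

# Count the separators and add one.
def paragraphs_by_scan(text)
  text.scan(/\n\n+/).count + 1
end

plain    = "One\n\nTwo\n\n\nThree"
trailing = "One\n\nTwo\n\n"   # text ends with a separator
leading  = "\n\nOne\n\nTwo"   # text starts with a separator

[plain, trailing, leading].each do |text|
  puts "#{text.inspect}: split=#{paragraphs_by_split(text)} scan=#{paragraphs_by_scan(text)}"
end
```

String#split drops trailing empty strings but keeps a leading one, so a file that begins or ends with blank lines may need a strip first, depending on what should count as a paragraph.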
To determine the number of paragraphs there is no need to construct an array and determine its size. One can instead operate on the string directly by creating an enumerator and counting the number of elements it will generate (after some cleaning of the file contents). This can be done with an unconventional (but highly useful) form of the method String#gsub.
Code
def count_paragraphs(fname)
  (File.read(fname).gsub(/ +$/,'') << "\n\n").gsub(/\S\n{2,}/).count
end
Examples
First let us construct a text file.
str = <<BITTER_END
 

Now is the time
for all good
Rubiest to take
a break.
 
 
Oh, happy
day.

One for all,
all for one.

 
Amen!
BITTER_END
# " \n\nNow is the time\nfor all good\nRubiest to take\na break.\n \n \nOh, happy\nday.\n\nOne for all,\nall for one.\n\n \nAmen!\n"
Note the embedded spaces.
FNAME = 'temp'
File.write(FNAME, str)
#=> 128
Now test the method with this file.
count_paragraphs(FNAME)
#=> 4
One more:
count_paragraphs('oliver.txt')
#=> 61
Explanation
The first step is to deal with ill-formed text by removing spaces immediately preceding newlines:
File.read(fname).gsub(/ +$/,'')
#=> "\n\nNow is the time\nfor all good\nRubiest to take\na break.\n\n\nOh, happy\nday.\n\nOne for all,\nall for one.\n\n\nAmen!\n"
Next, two newlines are appended so we can identify all paragraphs, including the last, as containing a non-whitespace character followed by two or more newlines.¹
Note that files containing only spaces and newlines are found to contain zero paragraphs.
If the file is known to contain no ill-formed text, the operative line of the method can be simplified to:
(File.read(fname) << "\n\n").gsub(/\S\n{2,}/).count
See Enumerable#count and IO.read. (As File.superclass #=> IO, read is also available as a class method of File, and it seems to be more commonly invoked on that class than on IO.)
Note that String#gsub without a block returns an enumerator (to which Enumerable#count is applied).
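A tiny sketch of that enumerator form, run on an in-memory string rather than a file. The sample text and variable names are made up for illustration.

```ruby
text = "First paragraph.\n\nSecond one,\nstill second.\n\n\nThird."

# No block and no replacement argument: gsub returns an Enumerator that
# yields each match instead of building a new string.
matches = (text + "\n\n").gsub(/\S\n{2,}/)

p matches.class   # an Enumerator, not a String
p matches.to_a    # the matched pieces, one per paragraph end
p matches.count   # the paragraph count
```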
Aside: I believe this form of gsub would be more widely used if it merely had a separate name, such as pattern_match. Calling it gsub seems a misnomer, as it has nothing to do with "substitution", "global" or otherwise.
¹ I revised my original answer to deal with ill-formed text, and in doing so borrowed @Kimmo's idea of requiring matches to include a non-whitespace character.
How about a loop that memoizes the previous character and a state of being in or outside of a paragraph?
def count_paragraphs(some_file)
  paragraphs = 0
  in_paragraph = false
  previous_char = ""
  File.open(some_file).each_char do |char|
    if !in_paragraph && char != "\n"
      paragraphs += 1
      in_paragraph = true
    elsif in_paragraph && char == "\n" && previous_char == "\n"
      in_paragraph = false
    end
    previous_char = char
  end
  paragraphs
rescue
  nil
end
This solution does not build any temporary arrays of the full content so you could parse a huge file without it being read into memory. Also, there are no regular expressions.
The rescue was added because of the requirement "Your method should return either the number of paragraphs or nil", which does not clearly define when nil should be returned. In this case nil is returned if any exception happens, for example if the file isn't found or can't be read; that raises an exception, which is caught by the rescue.
You don't need an explicit return in Ruby. The return value of the last statement will be used as the method's return value.
I am trying to solve this exercise. There are a bunch of solutions online and here, but I first want to figure out why my solution is wrong, even though it seems to give the right results when I enter certain strings.
Here is what they are asking:
Write a method that takes in a string. Return the longest word in the
string. You may assume that the string contains only letters and
spaces.
You may use the String split method to aid you in your quest.
Here is my solution. I thought I could turn the string into an array, sort it by length in descending order, and then just return the first element of that array, like this:
def longest_word(sentence)
  sentence = sentence.split
  sentence.sort_by! { |longest| -longest.length }
  return sentence[0]
end
That obviously doesn't seem to work, since their test gives me all false. Here is the test:
puts("\nTests for #longest_word")
puts("===============================================")
puts(
  'longest_word("short longest") == "longest": ' +
  (longest_word('short longest') == 'longest').to_s
)
puts(
  'longest_word("one") == "one": ' +
  (longest_word('one') == 'one').to_s
)
puts(
  'longest_word("abc def abcde") == "abcde": ' +
  (longest_word('abc def abcde') == 'abcde').to_s
)
puts("===============================================")
So the question is: why? Can I just fix my code, or is the whole idea wrong, so that I need to do it completely differently?
str = "Which word in this string is longest?"
r = /[[:alpha:]]+/
str.scan(r).max_by(&:length)
#=> "longest"
This regular expression reads, "match one or more letters". The outer brackets constitute a character class, meaning one of the characters within the brackets must be matched; [:alpha:] is the POSIX bracket expression for letters.
To deal with words that are hyphenated or contain single quotes, the following is an imperfect modification¹:
str = "Who said that chicken is finger-licken' good?"
r = /[[[:alpha:]]'-]+/
str.scan(r).max_by(&:length)
#=> "finger-licken'"
This regular expression reads, "match one or more characters that are a letter, apostrophe or hyphen". The outer brackets constitute a character class, meaning one of the characters within the brackets must be matched.
¹ I've successfully used "finger-licken'" in Scrabble.
I'd write it something like:
str = "Write a method that takes in a string"
str.split.sort_by(&:length).last # => "string"
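For the exercise as stated (letters and spaces only), String#split can be combined with the max_by call from the earlier answer. The following is a minimal sketch of that combination, checked against the three cases from the question's test script:

```ruby
# Split on whitespace and pick the word with the greatest length.
# Returns nil for an empty string, since there are no words to compare.
def longest_word(sentence)
  sentence.split.max_by(&:length)
end

puts longest_word('short longest')
puts longest_word('one')
puts longest_word('abc def abcde')
```

Unlike sorting, max_by does not need to order the whole array, so it makes a single pass over the words.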
String to parse (without spaces):
"instrumentalist ( bass (upright , fretless , 5-string ) , guitar ( electric , acoustic ) , trumpet ), teacher , songwriter, producer"
I need to get this structure in Ruby
["instrumentalist",[["bass",["upright","fretless","5-string"]],["guitar",["electric","acoustic"]],["trumpet"]],["teacher"],["songwriter"],["producer"]]
Because of the nested ( and ) and the commas, String#partition couldn't help me. I don't know whether there is a fancy regex that could extract this type of string. Or do I have to go with a lexer?
A regex on its own isn't the right sort of tool for this type of problem, even though the basic process is simple: walk through your string looking for commas or brackets. When you find a comma, add the previously read characters to the current nesting. When you find an open bracket, your nesting level goes up by 1; when you find a close bracket, it goes down by 1.
StringScanner is designed for this sort of stuff, as it allows us to walk through the string while maintaining some state, in this case a stack that mirrors your opening and closing brackets. Something like this does the job for me:
require 'strscan'
def parse(input)
  scanner = StringScanner.new(input)
  stack = [[]]
  while string = scanner.scan(/[^(),]+/)
    # Read one delimiter at a time so that ")," is not swallowed as a single token.
    case scanner.scan(/[(),]/)
    when '('
      new_nesting = [string, []]
      stack.last << new_nesting
      stack << new_nesting[1]
    when ')'
      scanner.scan(/,/)
      stack.last << string
      stack.pop
    else
      stack.last << string
    end
  end
  stack.last
end
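The walk described above (names are collected, an open bracket goes one level deeper, a close bracket comes back up) can also be sketched so that several closing brackets in a row are handled. This is an illustrative variant, not the method above: parse_nested is a made-up name, spaces are simply deleted, and the output shape (a name followed by an array of its children) differs from the structure asked for in the question.

```ruby
require 'strscan'

# Hypothetical sketch: returns nested arrays where each "(...)" group becomes
# an array placed right after the name that precedes it.
def parse_nested(input)
  scanner = StringScanner.new(input.delete(' '))
  root = []
  stack = [root]

  until scanner.eos?
    if (name = scanner.scan(/[^(),]+/))
      stack.last << name
    elsif scanner.scan(/\(/)
      children = []
      stack.last << children
      stack << children          # descend one level
    elsif scanner.scan(/\)/)
      stack.pop                  # climb back up one level
      raise ArgumentError, 'unbalanced ")"' if stack.empty?
    else
      scanner.scan(/,/)          # a comma only separates siblings
    end
  end
  root
end

p parse_nested("a(b(c,d),e),f")
```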
I'm working on a Ruby-based lexer. To improve performance, I joined all the tokens' regexps into one big regexp with named match groups. The resulting regexp looks like:
/\A(?<__anonymous_-1038694222803470993>(?-mix:\n+))|\A(?<__anonymous_-1394418499721420065>(?-mix:\/\/[\A\n]*))|\A(?<__anonymous_3077187815313752157>(?-mix:include\s+"[\A"]+"))|\A(?<LET>(?-mix:let\s))|\A(?<IN>(?-mix:in\s))|\A(?<CLASS>(?-mix:class\s))|\A(?<DEF>(?-mix:def\s))|\A(?<DEFM>(?-mix:defm\s))|\A(?<MULTICLASS>(?-mix:multiclass\s))|\A(?<FUNCNAME>(?-mix:![a-zA-Z_][a-zA-Z0-9_]*))|\A(?<ID>(?-mix:[a-zA-Z_][a-zA-Z0-9_]*))|\A(?<STRING>(?-mix:"[\A"]*"))|\A(?<NUMBER>(?-mix:[0-9]+))/
I'm matching it to my string producing a MatchData where exactly one token is parsed:
bigregex =~ "\n ... garbage"
puts $~.inspect
Which outputs
#<MatchData
"\n"
__anonymous_-1038694222803470993:"\n"
__anonymous_-1394418499721420065:nil
__anonymous_3077187815313752157:nil
LET:nil
IN:nil
CLASS:nil
DEF:nil
DEFM:nil
MULTICLASS:nil
FUNCNAME:nil
ID:nil
STRING:nil
NUMBER:nil>
So, the regex actually matched the "\n" part. Now I need to figure out which match group it belongs to (it's clearly visible from the #inspect output that it's __anonymous_-1038694222803470993, but I need to get it programmatically).
I could not find any option other than iterating over #names:
m.names.each do |n|
  if m[n]
    type = n.to_sym
    resolved_type = (n.start_with?('__anonymous_') ? nil : type)
    val = m[n]
    break
  end
end
which verifies that the match group did have a match.
The problem here is that it's slow: I spend about 10% of the time in the loop, and another 8% grabbing @input[@pos..-1] to make sure that \A works as expected to match the start of the string (I do not discard input, I just shift @pos in it).
You can check the full code at the GH repo.
Any ideas on how to make it at least a bit faster? Is there an easier way to figure out the "successful" match group?
You can do this using the MatchData methods .captures() and .names():
matching_string = "\n ...garbage" # or whatever this really is in your code
@input = matching_string.match bigregex # bigregex = your regex
arr = @input.captures
the_name_you_want = nil
arr.each_with_index do |value, index|
  if not value.nil?
    the_name_you_want = @input.names[index]
  end
end
Or, if you expect multiple successful values, you could do:
success_names_arr = []
success_names_arr.push(@input.names[index]) # within the above loop
This is pretty similar to your original idea, but if you're looking for efficiency, the .captures() method should help with that.
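The captures/names idea above can be condensed with Array#index, which stops at the first non-nil capture. The sketch below uses a much smaller, made-up token regexp (TOKEN_RE) and a hypothetical helper name (matched_group); whether it is actually faster than looping over #names would need benchmarking on the real lexer.

```ruby
# Hypothetical, much smaller token regexp in the same style as the one above.
TOKEN_RE = /\A(?<NEWLINE>\n+)|\A(?<LET>let\s)|\A(?<ID>[a-zA-Z_][a-zA-Z0-9_]*)|\A(?<NUMBER>[0-9]+)/

# Returns [group_name, matched_text] for the first named group that took part
# in the match, or nil when the regexp did not match at all.
def matched_group(regexp, input)
  match = regexp.match(input)
  return nil unless match

  captures = match.captures
  index = captures.index { |capture| !capture.nil? }
  index && [match.names[index].to_sym, captures[index]]
end

p matched_group(TOKEN_RE, "\n ... garbage")
p matched_group(TOKEN_RE, "let x = 1")
```

This relies on the captures and names arrays lining up, which holds here because every group is named and every name is used only once.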
I may have misunderstood this completely, but I'm assuming that all but one of the tokens are nil and the remaining one is the one you're after?
If so, then, depending on the flavour of regex you're using, you could use a negative lookahead to check for a non-nil value:
([^\n:]+:(?!nil)[^\n\>]+)
This will match the whole token ie NAME:value.
Given a regular expression:
/say (hullo|goodbye) to my lovely (.*)/
and a string:
"my $2 is happy that you said $1"
What is the best way to obtain a regular expression from the string that contains the capture groups in the regular expression? That is:
/my (.*) is happy that you said (hullo|goodbye)/
Clearly I could use regular expressions on a string representation of the original regular expression, but this would probably present difficulties with nested capture groups.
I'm using Ruby. My simple implementation so far goes along the lines of:
class Regexp
  def capture_groups
    self.to_s[1..-2].scan(/\(.*?\)/)
  end
end
regexp.capture_groups.each_with_index do |capture, idx|
  string.gsub!("$#{idx+1}", capture)
end
/^#{string}$/
I guess you need to create your own function that would do this:
create empty dictionaries groups and active_groups and initialize counter = 1
iterate over the characters in the string representation:
  if current character = '(' and previous character != '\':
    add counter key to active_groups and increase counter
  add current character to all active_groups
  if current character = ')' and previous character != '\':
    remove the last item (key, value) from active_groups and add it to groups
convert groups to an array if needed
You might also want to implement:
  ignore = True between unescaped '[' and ']'
  reset counter if current character = '|' and active_groups is empty (or decrease counter if active_groups is not empty)
UPDATES from comments:
  ignore non-capturing groups starting with '(?:'
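In Ruby, the steps above might look roughly like the sketch below. It is a simplification under stated assumptions: the function name capture_groups is reused only for illustration, every group starting with '(?' (including named groups) is treated as non-capturing, nested character classes are not handled, and the optional '|' counter handling is left out.

```ruby
# Hypothetical sketch: returns the source text of each plain capture group,
# in group-number order, by walking the regexp source character by character.
def capture_groups(regexp)
  source = regexp.source
  groups = {}      # group number => source text of the group
  active = []      # stack of [group number or nil, start index]
  counter = 0
  in_class = false # true between an unescaped '[' and ']'
  escaped = false  # true right after a backslash

  source.each_char.with_index do |char, i|
    if escaped
      escaped = false
    elsif char == '\\'
      escaped = true
    elsif in_class
      in_class = false if char == ']'
    elsif char == '['
      in_class = true
    elsif char == '('
      if source[i + 1] == '?'
        active << [nil, i]       # assumption: '(?...' groups are skipped
      else
        counter += 1
        active << [counter, i]
      end
    elsif char == ')'
      number, start = active.pop
      groups[number] = source[start..i] if number
    end
  end

  groups.sort.map(&:last)
end

p capture_groups(/say (hullo|goodbye) to my lovely (.*)/)
```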
So once I realised that what I actually need is a regular expression parser, things started falling into place. I discovered this project:
https://github.com/dche/randall
which can generate strings that match a regular expression. It defines a regular expression grammar using http://treetop.rubyforge.org/. Unfortunately the grammar it defines is incomplete, though useful for many cases.
I also stumbled across https://github.com/mjijackson/citrus, which does a similar job to Treetop.
I then found this mind-blowing gem:
https://github.com/ammar/regexp_parser
which defines a full regexp grammar and parses a regular expression into a walkable tree. I was then able to walk the tree and pick out the parts of the tree I wanted (the capture groups).
Unfortunately there was a minor bug, fixed in my fork: https://github.com/LaunchThing/regexp_parser.
Here's my patch to Regexp, that uses the fixed gem:
class Regexp
  def parse
    Regexp::Parser.parse(self.to_s, 'ruby/1.9')
  end
  def walk(e = self.parse, depth = 0, &block)
    block.call(e, depth)
    unless e.expressions.empty?
      e.each do |s|
        walk(s, depth+1, &block)
      end
    end
  end
  def capture_groups
    capture_groups = []
    walk do |e, depth|
      capture_groups << e.to_s if Regexp::Expression::Group::Capture === e
    end
    capture_groups
  end
end
I can then use this in my application to make replacements in my string - the final goal - along these lines:
from = /^\/search\/(.*)$/
to = '/buy/$1'
to_as_regexp = to.dup
# I should probably make this gsub tighter
from.capture_groups.each_with_index do |capture, idx|
  to_as_regexp.gsub!("$#{idx+1}", capture)
end
to_as_regexp = /^#{to_as_regexp}$/
# to_as_regexp = /^\/buy\/(.*)$/
I hope this helps someone else out.
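On the comment in the code above about making the gsub tighter: a plain gsub!("$1", ...) would also rewrite the start of "$10", and literal characters in the template are not escaped. One possible way around both, sketched here with a made-up helper name (template_to_regexp) and with the capture group sources passed in as a plain array instead of coming from the gem:

```ruby
# Hypothetical helper: builds an anchored regexp from a template such as
# '/buy/$1', replacing each $N with the N-th capture group source and
# escaping everything else so it is matched literally.
def template_to_regexp(template, groups)
  body = template.split(/(\$\d+)/).map do |part|
    if (placeholder = part.match(/\A\$(\d+)\z/))
      groups.fetch(placeholder[1].to_i - 1)  # raises IndexError if $N has no group
    else
      Regexp.escape(part)
    end
  end.join
  /^#{body}$/
end

regexp = template_to_regexp('/buy/$1', ['(.*)'])
p regexp.match('/buy/shoes')[1]
```

Splitting on a pattern with a capture group keeps the $N placeholders in the resulting array, which is what lets each one be handled as a whole.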