I want to write a regex pattern to look for a delimiter within a string such as this: '//;\n1;2' where the character ; serves as the delimiter for the given string. I'm trying to solve a common coding kata problem, so I don't necessarily want an answer in code, but an idea as to how I can accomplish this with regex. This delimiter is subject to change and could be any character. Any ideas? Here's what I've written so far in my production file.
class StringCalculator
  def add(numbers)
    return 0 if numbers.empty?
    return nil if numbers.end_with?('\n')
    numbers.tr!('\n', ',')
    numbers.split(',').map(&:to_i).reduce(:+)
  end
end
Spec:
context 'when passed different delimiters' do
  it 'should support semicolons' do
    expect(@calculator.add('//;\n1;2')).to eq(3)
  end
  it 'should support percents' do
    expect(@calculator.add('//%\n1%5')).to eq(6)
  end
end
# A name is valid is if satisfies all of the following:
# - contains at least a first name and last name, separated by spaces
# - each part of the name should be capitalized
def is_valid_name(str)
word_Split = str.split(" ")
if word_Split.length >= 2
word_Split.each do |x|
if x == x.capitalize
else
return false
end
end
return true
end
return false
end
puts is_valid_name("Kush Patel") # => true
puts is_valid_name("Daniel") # => false
puts is_valid_name("Robert Downey Jr") # => true
puts is_valid_name("ROBERT DOWNEY JR") # => false
In the code above, I understand the placement of the first if/else statement. What I am having trouble understanding is: do the 2 ends under the first return false close the .each loop and the method?
What do the return true and return false outside the loop even do? I'm trying to understand how to read this code.
Since I am still new to coding, I have been writing if/else statements as: if this, do that. else, do this. or use an elsif in between.
I appreciate any help in understanding how this reads. I took a look at C++ if/else statements and they were a little easier to read, since they use { } to separate the branches. Thanks for your patience and understanding.
The first thing I would do is take the code and fix the indentation and add newlines where helpful to increase readability. Next I've added inline comments to explain a bit of what is going on.
def is_valid_name(str)
  word_Split = str.split(" ")

  # at least two words?
  if word_Split.length >= 2
    word_Split.each do |x|
      # This if block is empty and control falls through to [1]
      # once all the words in the loop pass the check. This if/else
      # could be replaced by `unless`
      if x == x.capitalize
      else
        # Immediately exit the function and return false since
        # a single word in the loop was not capitalized
        return false
      end
    end

    # [1] This handles the case where
    # each word had a proper first letter capitalized
    return true
  end

  # There was zero or one word, return false.
  # Technically in ruby, the word `return` is optional
  # for the last line in the method
  return false
end
In the code above, I understand the placement of the first if/else statement. What I am having trouble understanding is: do the 2 ends under the first return false close the .each loop and the method?
No. The two ends immediately after the first return false close the if/else statement and then the loop. The two ends that follow close the outer if statement and then the method.
What do the return true and return false outside the loop even do? I'm trying to understand how to read this code.
The return true handles the case where all the words (two or more) are properly capitalized. The return false handles the case where there is only zero or one word after attempting the split.
As a former C/C++ dev, I recommend you look at Ruby conditionals as a good point to continue reading. Specifically learn about if, else, elsif, and unless as well as trailing conditionals.
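As a rough illustration of those constructs, here is a sketch of the same check written with `unless` and trailing conditionals. The method name valid_name_sketch? is made up for this example and is not part of the original code.

```ruby
# Hypothetical rewrite of is_valid_name using `unless` and trailing conditionals.
def valid_name_sketch?(str)
  words = str.split(" ")

  # Trailing conditional: leave early when there are fewer than two words.
  return false if words.length < 2

  words.each do |word|
    # `unless` replaces the empty if/else branch of the original.
    return false unless word == word.capitalize
  end

  true # implicit return: every word passed the check
end

puts valid_name_sketch?("Kush Patel")       # => true
puts valid_name_sketch?("Daniel")           # => false
puts valid_name_sketch?("ROBERT DOWNEY JR") # => false
```

Each `return false` reads as a guard: the method only reaches the final `true` when none of the guards fired.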
I think Matthew already gave great answers to your questions.
Learning a new language isn't easy especially because you do not only have to understand the different syntax and the different concepts but also need to get familiar with common idioms and common coding styles.
The Ruby community, for example, uses underscored variable names, prefers method names that end with a question mark when the method returns a boolean, and, because of implicit returns, avoids explicit return statements in many cases.
That said, I agree that the code example in your question is hard to understand. Mostly for the reason that it doesn't follow common Ruby idioms and best practices.
Just as an example I would like to show a refactored version that behaves exactly the same and that IMHO looks more Rubyish:
def valid_name?(name)
  parts = name.split
  parts.size >= 2 &&
    parts.all? { |part| part == part.capitalize }
end
valid_name?("Kush Patel") # => true
valid_name?("Daniel") # => false
valid_name?("Robert Downey Jr") # => true
valid_name?("ROBERT DOWNEY JR") # => false
You can use a regular expression for that.
def valid_name?(name)
  name.match?(/\A(?:\p{Upper}\p{Lower}+)(?: +\p{Upper}\p{Lower}+)+\z/)
end
valid_name?("Kush Patel") #=> true
valid_name?("Cher") #=> false
valid_name?("Robert Downey Jr") #=> true
valid_name?("ROBERT DOWNEY JR") #=> false
See String#match?.
The regular expression can be made self-documenting by writing it in free-spacing mode.
/\A              # match beginning of string
(?:              # begin a non-capture group
  \p{Upper}      # match an upper-case Unicode letter
  \p{Lower}+     # match one or more lower-case Unicode letters
)                # end non-capture group
(?:              # begin a non-capture group
  \ +            # match one or more spaces
  \p{Upper}      # match an upper-case Unicode letter
  \p{Lower}+     # match one or more lower-case Unicode letters
)+               # end non-capture group and execute it one or more times
\z               # match end of string
/x               # employ free-spacing regex definition mode
\p{Upper} (\p{Lower}) can alternatively be written \p{Lu} (\p{Ll}) or [[:upper:]] ([[:lower:]]). If ASCII is satisfactory one could use [A-Z] ([a-z]). See Regexp.
Note that I used ..(?: +\p{Upper}\p{Lower}+).. within the method definition and ..(?:\ +\p{Upper}\p{Lower}+).. when expressing the regular expression in free-spacing mode (the space character is escaped in the latter). That's because free-spacing mode removes all spaces before the expression is parsed. To preserve spaces they must be escaped, placed within a character class ([ ]) or replaced with \p{Zs}. I would advise against writing \s, [[:space:]] and \p{Space} as they also match newlines (\n) which can be problematic in some cases.
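The whitespace behaviour described above can be checked with a few tiny patterns. This is only a sketch; the variable names bare, escaped and classed are made up for the illustration.

```ruby
# In free-spacing mode (/x) a bare space in the pattern is ignored, so it must
# be escaped or put in a character class to match a literal space.
bare    = /\AA B\z/x    # the space is dropped; behaves like /\AAB\z/
escaped = /\AA\ B\z/x   # an escaped space matches a literal space
classed = /\AA[ ]B\z/x  # a character class also preserves the space

p bare.match?("A B")     # => false
p bare.match?("AB")      # => true
p escaped.match?("A B")  # => true
p classed.match?("A B")  # => true

# \s also accepts a newline, which is why it is not a drop-in replacement.
p(/\AA\sB\z/.match?("A\nB"))  # => true
p(/\AA B\z/.match?("A\nB"))   # => false
```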
I am writing a function which can have two potential forms of input:
This is {a {string}}
This {is} a {string}
I call the sub-strings wrapped in curly-brackets "tags". I could potentially have any number of tags in a string, and they could be nested arbitrarily deep.
I've tried writing a regular expression to grab the tags, which of course fails on the nested tags, grabbing {a {string}, missing the second curly bracket. I can see it as a recursive problem, but after staring at the wrong answer too long I feel like I'm blind to seeing something really obvious.
What can I do to separate out the potential tags into parts so that they can be processed and replaced?
The More Complicated Version
def parseTags( oBody, szText )
  if szText.match(/\{(.*)\}/)
    szText.scan(/\{(.*)\}/) do |outers|
      outers.each do |blah|
        if blah.match(/(.*)\}(.*)\{(.*)/)
          blah.scan(/(.*)\}(.*)\{(.*)/) do |inners|
            inners.each do |tags|
              szText = szText.sub("\{#{tags}\}", parseTags( oBody, tags ))
            end
          end
        else
          szText = szText.sub("\{#{blah}\}", parseTags( oBody, blah ))
        end
      end
    end
  end
  if szText.match(/(\w+)\.(\w+)(?:\.([A-Za-z0-9.\[\]": ]*))/)
    func = $1+"_"+$2
    begin
      szSub = self.send func, oBody, $3
    rescue Exception=>e
      szSub = "{Error: Function #{$1}_#{$2} not found}"
      $stdout.puts "DynamicIO Error Encountered: #{e}"
    end
    szText = szText.sub("#{$1}.#{$2}#{$3!=nil ? "."+$3 : ""}", szSub)
  end
  return szText
end
This was the result of tinkering too long. It's not clean, but it did work for a case similar to "1" - {help.divider.red.sys.["{pc.login}"]} is replaced with ---------------[ Duwnel ]---------------. However, {pc.attr.str.dotmode} {ansi.col.red}|{ansi.col.reset} {pc.attr.pre.dotmode} {ansi.col.red}|{ansi.col.reset} {pc.attr.int.dotmode} implodes brilliantly, with random streaks of red and swatches of missing text.
To explain, anything marked {ansi.col.red} marks an ansi red code, reset escapes the color block, and {pc.attr.XXX.dotmode} displays a number between 1 and 10 in "o"s.
As others have noted, this is a perfect case for a parsing engine. Regular expressions don't tend to handle nested pairs well.
Treetop is an awesome PEG parser that you might be interested in taking a look at. The main idea is that you define everything that you want to parse (including whitespace) inside rules. The rules allow you to recursively parse things like bracket pairs.
Here's an example grammar for creating arrays of strings from nested bracket pairs. Usually grammars are defined in a separate file, but for simplicity I included the grammar at the end and loaded it with Ruby's DATA constant.
require 'treetop'
Treetop.load_from_string DATA.read

parser = BracketParser.new

p parser.parse('This is {a {string}}').value
#=> ["This is ", ["a ", ["string"]]]
p parser.parse('This {is} a {string}').value
#=> ["This ", ["is"], " a ", ["string"]]

__END__
grammar Bracket
  rule string
    (brackets / not_brackets)+
    {
      def value
        elements.map{|e| e.value }
      end
    }
  end
  rule brackets
    '{' string '}'
    {
      def value
        elements[1].value
      end
    }
  end
  rule not_brackets
    [^{}]+
    {
      def value
        text_value
      end
    }
  end
end
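For comparison, the same nested arrays can be produced without a parser gem by a small hand-written recursive parser. This is a different technique from the Treetop grammar above, shown only as a sketch: it uses StringScanner from the standard library, and the helper name parse_brackets is made up for the example.

```ruby
require "strscan"

# Hypothetical helper: turns "This is {a {string}}" into nested arrays,
# mirroring the structure the grammar above produces.
def parse_brackets(scanner, depth = 0)
  result = []
  until scanner.eos?
    if (text = scanner.scan(/[^{}]+/))
      result << text                                # plain text outside a tag
    elsif scanner.scan(/\{/)
      result << parse_brackets(scanner, depth + 1)  # recurse into a nested tag
    elsif scanner.scan(/\}/)
      raise ArgumentError, "unmatched '}'" if depth.zero?
      return result                                 # close the current tag
    end
  end
  raise ArgumentError, "unmatched '{'" unless depth.zero?
  result
end

p parse_brackets(StringScanner.new("This is {a {string}}"))
# => ["This is ", ["a ", ["string"]]]
p parse_brackets(StringScanner.new("This {is} a {string}"))
# => ["This ", ["is"], " a ", ["string"]]
```

The recursion depth plays the role of the stack: each "{" starts a new array and each "}" hands the finished array back to its parent.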
Instead of fitting ever more complex regular expressions to this problem, I would recommend looking into one of Ruby's grammar-based parsing engines. Most of them let you design recursive and nested grammars.
parslet might be a good place to start for your problem. The erb-alike example, although it does not demonstrate nesting, might be closest to your needs: https://github.com/kschiess/parslet/blob/master/example/erb.rb
Given a regular expression:
/say (hullo|goodbye) to my lovely (.*)/
and a string:
"my $2 is happy that you said $1"
What is the best way to obtain a regular expression from the string that contains the capture groups in the regular expression? That is:
/my (.*) is happy that you said (hullo|goodbye)/
Clearly I could use regular expressions on a string representation of the original regular expression, but this would probably present difficulties with nested capture groups.
I'm using Ruby. My simple implementation so far goes along the lines of:
class Regexp
  def capture_groups
    self.to_s[1..-2].scan(/\(.*?\)/)
  end
end

regexp.capture_groups.each_with_index do |capture, idx|
  string.gsub!("$#{idx+1}", capture)
end

/^#{string}$/
I guess you need to write your own function to do this:
create empty dictionaries groups and active_groups and initialize counter = 1
iterate over the characters in the string representation:
    if the current character is '(' and the previous character is not '\':
        add counter as a key to active_groups and increase counter
    add the current character to all active_groups
    if the current character is ')' and the previous character is not '\':
        remove the last item (key, value) from active_groups and add it to groups
convert groups to an array if needed
You might also want to implement:
    ignore = true between an unescaped '[' and ']'
    reset counter if the current character is '|' and active_groups is empty (or decrease counter if active_groups is not empty)
UPDATES from comments:
    ignore non-capturing groups starting with '(?:'
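The steps above could be sketched in Ruby roughly as follows. The helper name capture_group_sources is made up for this example. The sketch treats every group that starts with "(?" as non-capturing (so named groups are skipped too) and leaves out the optional counter reset on '|'.

```ruby
# Hypothetical helper following the steps above: returns the source text of
# each capture group, ordered by the position of its opening parenthesis.
def capture_group_sources(source)
  groups   = {}    # group number => finished text
  active   = []    # stack of [group number or nil, text collected so far]
  counter  = 0
  escaped  = false # true when the previous character was an unescaped '\'
  in_class = false # true between an unescaped '[' and ']'

  source.each_char.with_index do |char, i|
    opening = !escaped && !in_class && char == "("
    closing = !escaped && !in_class && char == ")"

    if opening
      # Assumption of this sketch: anything starting with "(?" does not capture.
      capturing = source[i + 1] != "?"
      active.push([capturing ? (counter += 1) : nil, +""])
    end

    # Every open group, capturing or not, collects the current character.
    active.each { |_, text| text << char }

    if closing && !active.empty?
      number, text = active.pop
      groups[number] = text if number
    end

    if escaped
      escaped = false
    elsif char == "\\"
      escaped = true
    elsif char == "["
      in_class = true
    elsif char == "]"
      in_class = false
    end
  end

  groups.sort.map { |_, text| text }
end

p capture_group_sources(/say (hullo|goodbye) to my lovely (.*)/.source)
# => ["(hullo|goodbye)", "(.*)"]
```

Using Regexp#source rather than to_s avoids the "(?-mix:...)" wrapper, so no slicing of the string is needed before scanning.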
So once I realised that what I actually need is a regular expression parser, things started falling into place. I discovered this project:
https://github.com/dche/randall
which can generate strings that match a regular expression. It defines a regular expression grammar using http://treetop.rubyforge.org/. Unfortunately the grammar it defines is incomplete, though useful for many cases.
I also stumbled past https://github.com/mjijackson/citrus, which does a similar job to Treetop.
I then found this mind blowing gem:
https://github.com/ammar/regexp_parser
which defines a full regexp grammar and parses a regular expression into a walkable tree. I was then able to walk the tree and pick out the parts of the tree I wanted (the capture groups).
Unfortunately there was a minor bug, fixed in my fork: https://github.com/LaunchThing/regexp_parser.
Here's my patch to Regexp, that uses the fixed gem:
class Regexp
  def parse
    Regexp::Parser.parse(self.to_s, 'ruby/1.9')
  end

  def walk(e = self.parse, depth = 0, &block)
    block.call(e, depth)
    unless e.expressions.empty?
      e.each do |s|
        walk(s, depth+1, &block)
      end
    end
  end

  def capture_groups
    capture_groups = []
    walk do |e, depth|
      capture_groups << e.to_s if Regexp::Expression::Group::Capture === e
    end
    capture_groups
  end
end
I can then use this in my application to make replacements in my string - the final goal - along these lines:
from = /^\/search\/(.*)$/
to = '/buy/$1'
to_as_regexp = to.dup
# I should probably make this gsub tighter
from.capture_groups.each_with_index do |capture, idx|
  to_as_regexp.gsub!("$#{idx+1}", capture)
end
to_as_regexp = /^#{to_as_regexp}$/
# to_as_regexp = /^\/buy\/(.*)$/
I hope this helps someone else out.
I'm working with mails, and names and subjects sometimes come q-encoded, like this:
=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?=
Is there a way to decode them in Ruby? It seems TMail should take care of it, but it's not doing it.
I use this to parse email subjects:
You could try the following:
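require "base64"
require "iconv" # Iconv left the standard library in Ruby 2.0; on newer Rubies
                # decoded.force_encoding(m[1]).encode("utf-8") does the same job

str = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="
if m = /=\?([A-Za-z0-9\-]+)\?(B|Q)\?([!->@-~]+)\?=/i.match(str)
  case m[2].upcase # the /i flag also lets "b" and "q" through
  when "B" # Base64 encoded
    decoded = Base64.decode64(m[3])
  when "Q" # Q encoded
    decoded = m[3].unpack("M").first.gsub('_',' ')
  else
    p "Could not find keyword!!!"
  end
  Iconv.conv('utf-8',m[1],decoded) # to convert to utf-8
end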
Ruby includes a method of decoding Quoted-Printable strings:
puts "Pablo_Fern=C3=A1ndez".unpack "M"
# => Pablo_Fernández
But this doesn't seem to work on your entire string (including the =?UTF-8?Q? part at the beginning). Maybe you can work it out from there, though.
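One way to get from there to the whole string, sketched under the assumption that the input is a single Q-encoded word: strip the =?charset?Q? ... ?= wrapper with a regular expression, then decode only the payload. The variable names are made up for the example.

```ruby
str = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="

# Capture the charset and the encoded payload, leaving the wrapper behind.
if (m = str.match(/\A=\?([^?]+)\?Q\?([^?]+)\?=\z/i))
  charset, payload = m[1], m[2]
  # In Q-encoding "_" stands for a space, so turn it into =20 before decoding.
  bytes   = payload.gsub("_", "=20").unpack1("M")
  # unpack1 returns a binary string; label it with the declared charset.
  decoded = bytes.force_encoding(charset).encode("UTF-8")
  puts decoded  # => J. Pablo Fernández
end
```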
This is a pretty old question but TMail::Unquoter (or its new incarnation Mail::Encodings) does the job as well.
TMail::Unquoter.unquote_and_convert_to(str, 'utf-8' )
or
Mail::Encodings.unquote_and_convert_to( str, 'utf-8' )
Decoding on a line-per-line basis:
line.unpack("M")
Convert STDIN or file provided input of encoded strings into a decoded output:
if ARGV[0]
  lines = File.read(ARGV[0]).lines
else
  lines = STDIN.each_line.to_a
end
puts lines.map { |c| c.unpack("M") }.join
This might help anyone wanting to test an email. delivery.html_part is normally encoded, but can be decoded to a straight HTML body using .decoded.
test "email test" do
  UserMailer.confirm_email(user).deliver_now
  assert_equal 1, ActionMailer::Base.deliveries.size
  delivery = ActionMailer::Base.deliveries.last
  assert_equal "Please confirm your email", delivery.subject
  assert delivery.html_part.decoded =~ /Click the link below to confirm your email/ # DECODING HERE
end
The most efficient and up-to-date solution seems to be the value_decode method of the Mail gem.
> Mail::Encodings.value_decode("=?UTF-8?Q?Greg_of_Google?=")
=> "Greg of Google"
https://www.rubydoc.info/github/mikel/mail/Mail/Encodings#value_decode-class_method
Below is Ruby code you can cut and paste, if inclined. It will run its tests if executed directly with ruby (ruby ./copy-pasted.rb). As done in the code, I use this module as a refinement to the String core class.
A few remarks on the solution:
Other solutions perform .gsub('_', ' ') on the unpacked string. However, I do not believe this is correct, and it can result in an incorrect decoding depending on the charset. RFC2047 Section 4.2 (2) indicates "_ always represents hexadecimal 20", so it seems correct to first substitute =20 for _ and then rely on the unpack result. (This also makes the implementation more elegant.) This is also discussed in an answer to a related question.
To be more instructive, I have written the regular expression in free-spacing mode to allow comments (I find this generally helpful for complex regular expressions). If you adjust the regular expression, take note that free-spacing mode changes the matching of white-space, which must then be escaped or written as a character class (as in the code). I've also added the regular expression on regex101, so you can read an explanation of the named capture groups, lazy quantifiers, etc. and experiment yourself.
The regular expression will absorb spaces (but not TAB or newline) between multiple Q-encoded phrases in a single string, as shown in the test with multiple adjacent phrases. This is because RFC2047 Section 5 (1) indicates that multiple Q-encoded phrases must be separated from each other by linear white-space. Depending on your use-case, absorbing the white-space may not be desired.
The named capture for code permits unexpected encoding codes (other than [bBqQ]) so that a match will occur and the code can raise an error. This helps me to detect unexpected values when processing text. Change the named capture for code to [bBqQ] if you do not want this behaviour. (There will then be no match and the original string will be returned.)
It makes use of the global Regexp.last_match as a convenience in the gsub block. You may need to take care if using this in multi-threaded code; I have not given this any consideration.
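To make the first remark concrete, here is a small sketch of how the two orders of substitution differ. The sample string is made up: "=5F" is an encoded literal underscore, while the bare "_" is an encoded space.

```ruby
encoded = "a=5Fb_c"

# Substitute before decoding: only the bare "_" becomes a space.
substitute_first = encoded.gsub("_", "=20").unpack1("M")

# Substitute after decoding: the decoded literal underscore is lost as well.
substitute_last = encoded.unpack1("M").gsub("_", " ")

p substitute_first  # => "a_b c"
p substitute_last   # => "a b c"
```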
Additional references and reading:
https://en.wikipedia.org/wiki/Quoted-printable
https://en.wikipedia.org/wiki/MIME#Encoded-Word
require "minitest/autorun"
module QuotedPrintableDecode
  class UnhandledCodeError < StandardError
    def initialize(code)
      super("Unhandled quoted printable code: '#{code}'.")
    end
  end

  @@qp_text_regex = %r{
    =\?                  # Opening literal: `=?`
    (?<charset>[^\?]+)   # Character set, e.g. "Windows-1252" in `=?Windows-1252?`
    \?                   # Literal: `?`
    (?<code>[a-zA-Z])    # Encoding, e.g. "Q" in `?Q?` (`B`ase64); [BbQq] expected, others raise
    \?                   # Literal: `?`
    (?<text>[^\?]+?)     # Encoded text, lazy (non-greedy) matched, e.g. "Foo_bar" in `?Foo_bar?`
    \?=                  # Closing literal: `?=`
    (?:[ ]+(?==\?))?     # Optional separating linear whitespace if another Q-encode follows
  }x                     # Free-spacing mode to allow above comments, also changes whitespace match

  refine String do
    def decode_q_p(to: "UTF-8")
      self.gsub(@@qp_text_regex) do
        code, from, text = Regexp.last_match.values_at(:code, :charset, :text)
        q_p_charset_to_charset(code, text, from, to)
      end
    end

    private

    def q_p_charset_to_charset(code, text, from, to)
      case code
      when "q", "Q"
        text.gsub("_", "=20").unpack("M")
      when "b", "B"
        text.unpack("m")
      else
        raise UnhandledCodeError.new(code)
      end.first.encode(to, from)
    end
  end
end
class TestQPDecode < Minitest::Test
  using QuotedPrintableDecode

  def test_decode_single_utf_8_phrase
    encoded = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="
    assert_equal encoded.decode_q_p, "J. Pablo Fernández"
  end

  def test_decoding_preserves_space_between_unencoded_phrase
    encoded = "=?utf-8?Q?Alfred_Sanford?= <me@example.com>"
    assert_equal encoded.decode_q_p, "Alfred Sanford <me@example.com>"
  end

  def test_decoding_multiple_adjacent_phrases_absorbs_separating_whitespace
    encoded = "=?Windows-1252?Q?Foo_-_D?= =?Windows-1252?Q?ocument_World=9617=96520;_Recor?= =?Windows-1252?Q?d_People_to_C?= =?Windows-1252?Q?anada's_History?="
    assert_equal encoded.decode_q_p, "Foo - Document World–17–520; Record People to Canada's History"
  end

  def test_decoding_string_without_encoded_phrases_preserves_original
    encoded = "Contains no QP phrases"
    assert_equal encoded.decode_q_p, encoded
  end

  def test_unhandled_code_raises
    klass = QuotedPrintableDecode::UnhandledCodeError
    message = "Unhandled quoted printable code: 'Z'."
    encoded = "=?utf-8?Z?Unhandled code Z?="
    raised_error = assert_raises(klass) { encoded.decode_q_p }
    assert_equal message, raised_error.message
  end
end
I have a string
input = "maybe (this is | that was) some ((nice | ugly) (day |night) | (strange (weather | time)))"
What is the best way in Ruby to parse this string?
I mean the script should be able to build sentences like this:
maybe this is some ugly night
maybe that was some nice night
maybe this was some strange time
And so on, you got the point...
Should I read the string char by char and build a state machine with a stack to store the parenthesis values for later calculation, or is there a better approach?
Maybe a ready-made, out-of-the-box library for this purpose?
Try Treetop. It is a Ruby-like DSL to describe grammars. Parsing the string you've given should be quite easy, and by using a real parser you'll easily be able to extend your grammar later.
An example grammar for the type of string that you want to parse (save as sentences.treetop):
grammar Sentences
  rule sentence
    # A sentence is a combination of one or more expressions.
    expression* <Sentence>
  end

  rule expression
    # An expression is either a literal or a parenthesised expression.
    parenthesised / literal
  end

  rule parenthesised
    # A parenthesised expression contains one or more sentences.
    "(" (multiple / sentence) ")" <Parenthesised>
  end

  rule multiple
    # Multiple sentences are delimited by a pipe.
    sentence "|" (multiple / sentence) <Multiple>
  end

  rule literal
    # A literal string consists of word characters (a-z) and/or spaces.
    # Expand the character class to allow other characters too.
    [a-zA-Z ]+ <Literal>
  end
end
The grammar above needs an accompanying file that defines the classes that allow us to access the node values (save as sentence_nodes.rb).
class Sentence < Treetop::Runtime::SyntaxNode
  def combine(a, b)
    return b if a.empty?
    a.inject([]) do |values, val_a|
      values + b.collect { |val_b| val_a + val_b }
    end
  end

  def values
    elements.inject([]) do |values, element|
      combine(values, element.values)
    end
  end
end

class Parenthesised < Treetop::Runtime::SyntaxNode
  def values
    elements[1].values
  end
end

class Multiple < Treetop::Runtime::SyntaxNode
  def values
    elements[0].values + elements[2].values
  end
end

class Literal < Treetop::Runtime::SyntaxNode
  def values
    [text_value]
  end
end
The following example program shows that it is quite simple to parse the example sentence that you have given.
require "rubygems"
require "treetop"
require "sentence_nodes"
str = 'maybe (this is|that was) some' +
  ' ((nice|ugly) (day|night)|(strange (weather|time)))'

Treetop.load "sentences"
if sentence = SentencesParser.new.parse(str)
  puts sentence.values
else
  puts "Parse error"
end
The output of this program is:
maybe this is some nice day
maybe this is some nice night
maybe this is some ugly day
maybe this is some ugly night
maybe this is some strange weather
maybe this is some strange time
maybe that was some nice day
maybe that was some nice night
maybe that was some ugly day
maybe that was some ugly night
maybe that was some strange weather
maybe that was some strange time
You can also access the syntax tree:
p sentence
The output is here.
There you have it: a scalable parsing solution that should come quite close to what you want to do in about 50 lines of code. Does that help?
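For comparison, the approach the question hints at (walking the string and keeping a stack) can also be sketched without a gem, as a small recursive-descent expander in which the call stack does the bookkeeping. This is a different technique from the Treetop grammar above. The method names are made up for the example, it uses StringScanner from the standard library, and it does no error handling beyond a missing ')'.

```ruby
require "strscan"

# Hypothetical hand-rolled expander; names are illustrative only.
# sequence := (literal | "(" alternatives ")")*
def expand_sequence(scanner)
  results = [""]
  until scanner.eos? || scanner.check(/[|)]/)
    choices =
      if scanner.scan(/\(/)
        alternatives = expand_alternatives(scanner)
        scanner.scan(/\)/) or raise ArgumentError, "missing ')'"
        alternatives
      else
        [scanner.scan(/[^()|]+/)]
      end
    # Cartesian product: append every choice to every partial sentence.
    results = results.product(choices).map(&:join)
  end
  results
end

# alternatives := sequence ("|" sequence)*
def expand_alternatives(scanner)
  results = expand_sequence(scanner)
  results += expand_sequence(scanner) while scanner.scan(/\|/)
  results
end

input = "maybe (this is|that was) some ((nice|ugly) (day|night)|(strange (weather|time)))"
sentences = expand_sequence(StringScanner.new(input))
puts sentences
```

For the input string used in the example program above, this prints the same twelve sentences in the same order.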