Strange Ruby behavior when replacing with \'X string - ruby

I am generating some RTF strings and need the \'bd code. I am having problem with the sub and gsub commands.
puts 'abc'.sub('a',"\\'B") => "bcBbc"
The statement replaces the target with 'B' without the \', and copies the remainder of the string to the front. I've tried a lot of variations and it seems that the problem is the \' itself.
I have ways of getting around this, but I'd like to know whether I'm doing something fundamentally wrong or whether this is a quirk with Ruby.
Thanks

From the Ruby documentation:
Similarly, \&, \', \`, and \+ correspond to special variables, $&, $', $`, and $+, respectively.
And here, the documentation goes on to say:
$~ is equivalent to ::last_match;
$& contains the complete matched text;
$` contains string before match;
$' contains string after match;
$1, $2 and so on contain text matching first, second, etc capture group;
$+ contains last capture group.
Example:
m = /s(\w{2}).*(c)/.match('haystack') #=> #<MatchData "stac" 1:"ta" 2:"c">
$~ #=> #<MatchData "stac" 1:"ta" 2:"c">
Regexp.last_match #=> #<MatchData "stac" 1:"ta" 2:"c">
$& #=> "stac"
# same as m[0]
$` #=> "hay"
# same as m.pre_match
$' #=> "k"
# same as m.post_match
$1 #=> "ta"
# same as m[1]
$2 #=> "c"
# same as m[2]
$3 #=> nil
# no third group in pattern
$+ #=> "c"
# same as m[-1]
So, \' in a substitution replacement string has a special meaning. It means "The portion of the original string after the match" - which, in this case, is "bc".
So rather than getting \'Bbc, you get bcBbc
Therefore unfortunately, in this odd scenario, you'd need to double-escape the backslashes:
puts 'abc'.sub('a',"\\\\'B") => "\'Bbc"

Related

How to use Regexp.union to match a character at the beginning of my string

I'm using Ruby 2.4. I want to match an optional "a" or "b" character, followed by an arbitrary amount of white space, and then one or more numbers, but my regex's are failing to match any of these:
2.4.0 :017 > MY_TOKENS = ["a", "b"]
=> ["a", "b"]
2.4.0 :018 > str = "40"
=> "40"
2.4.0 :019 > str =~ Regexp.new("^[#{Regexp.union(MY_TOKENS)}]?[[:space:]]*\d+[^a-z^0-9]*$")
=> nil
2.4.0 :020 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z^0-9]*$")
=> nil
2.4.0 :021 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+$")
=> nil
I'm stumped as to what I'm doing wrong.
If they are single characters, just use MY_TOKENS.join inside the character class:
MY_TOKENS = ["a", "b"]
str = "40"
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/
# /^[ab]?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ first_regex
# 0
You can also integrate the Regexp.union, it might lead to some unexpected bugs though, because the flags of the outer regexp won't apply to the inner one :
second_regex = /^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?-mix:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ second_regex
# 0
The above regex looks a lot like what you did, but using // instead of Regexp.new prevents you from having to escape the backslashes.
You could use Regexp#source to avoid this behaviour :
third_regex = /^(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ third_regex
# 0
or simply build your regex union :
fourth_regex = /^(?:#{MY_TOKENS.join('|')})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ fourth_regex
# 0
The 3 last examples should work fine if MY_TOKENS has words instead of just characters.
first_regex, third_regex and fourth_regex should all work fine with /i flag.
As an example :
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/i
"A 40" =~ first_regex
# 0
I believe you want to match a string that may contain any of the alternatives you defined in the MY_TOKENS, then 0+ whitespaces and then 1 or more digits up to the end of the string.
Then you need to use
Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+\\z").match?(s)
or
/\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+\z/.match?(s)
When you use a Regexp.new, you should rememeber to double escape backslashes to define a literal backslash (e.g. "\d" is a digit matching pattern). In a regex literal notation, you may use a single backslash (/\d/).
Do not forget to match the start of a string with \A and end of string with \z anchors.
Note that [...] creates a character class that matches any char that is defined inside it: [ab] matches an a or b, [program] will match one char, either p, r, o, g, r, a or m. If you have multicharacter sequences in the MY_TOKENS, you need to remove [...] from the pattern.
To make the regex case insensitive, pass a case insensitive modifier to the pattern and make sure you use .source property of the Regex.union created regex to remove flags (thanks, Eric):
Regexp.new("(?i)\\A#{Regexp.union(MY_TOKENS).source}?[[:space:]]*\\d+\\z")
or
/\A#{Regexp.union(MY_TOKENS).source}?[[:space:]]*\d+\z/i
The regex created is /(?i-mx:\Aa|b?[[:space:]]*\d+\z)/ where (?i-mx) means the case insensitive mode is on and multiline (dot matches line breaks and verbose modes are off).

Finding the first duplicate character in the string Ruby

I am trying to call the first duplicate character in my string in Ruby.
I have defined an input string using gets.
How do I call the first duplicate character in the string?
This is my code so far.
string = "#{gets}"
print string
How do I call a character from this string?
Edit 1:
This is the code I have now where my output is coming out to me No duplicates 26 times. I think my if statement is wrongly written.
string "abcade"
puts string
for i in ('a'..'z')
if string =~ /(.)\1/
puts string.chars.group_by{|c| c}.find{|el| el[1].size >1}[0]
else
puts "no duplicates"
end
end
My second puts statement works but with the for and if loops, it returns no duplicates 26 times whatever the string is.
The following returns the index of the first duplicate character:
the_string =~ /(.)\1/
Example:
'1234556' =~ /(.)\1/
=> 4
To get the duplicate character itself, use $1:
$1
=> "5"
Example usage in an if statement:
if my_string =~ /(.)\1/
# found duplicate; potentially do something with $1
else
# there is no match
end
s.chars.map { |c| [c, s.count(c)] }.drop_while{|i| i[1] <= 1}.first[0]
With the refined form from Cary Swoveland :
s.each_char.find { |c| s.count(c) > 1 }
Below method might be useful to find the first word in a string
def firstRepeatedWord(string)
h_data = Hash.new(0)
string.split(" ").each{|x| h_data[x] +=1}
h_data.key(h_data.values.max)
end
I believe the question can be interpreted in either of two ways (neither involving the first pair of adjacent characters that are the same) and offer solutions to each.
Find the first character in the string that is preceded by the same character
I don't believe we can use a regex for this (but would love to be proved wrong). I would use the method suggested in a comment by #DaveNewton:
require 'set'
def first_repeat_char(str)
str.each_char.with_object(Set.new) { |c,s| return c unless s.add?(c) }
nil
end
first_repeat_char("abcdebf") #=> b
first_repeat_char("abcdcbe") #=> c
first_repeat_char("abcdefg") #=> nil
Find the first character in the string that appears more than once
r = /
(.) # match any character in capture group #1
.* # match any character zero of more times
? # do the preceding lazily
\K # forget everything matched so far
\1 # match the contents of capture group 1
/x
"abcdebf"[r] #=> b
"abccdeb"[r] #=> b
"abcdefg"[r] #=> nil
This regex is fine, but produces the warning, "regular expression has redundant nested repeat operator '*'". You can disregard the warning or suppress it by doing something clunky, like:
r = /([^#{0.chr}]).*?\K\1/
where ([^#{0.chr}]) means "match any character other than 0.chr in capture group 1".
Note that a positive lookbehind cannot be used here, as they cannot contain variable-length matches (i.e., .*).
You could probably make your string an array and use detect. This should return the first char where the count is > 1.
string.split("").detect {|x| string.count(x) > 1}
I'll use positive lookahead with String#[] method :
"abcccddde"[/(.)(?=\1)/] #=> c
As a variant:
str = "abcdeff"
p str.chars.group_by{|c| c}.find{|el| el[1].size > 1}[0]
prints "f"

Regex string with grouping?

I see in the documentation I'm able to do:
/\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67" #=> 0
puts dollars #=> prints 3
I was wondering if this would be possible:
string = "\$(\?<dlr>\d+)\.(\?<cts>\d+)"
/#{Regexp.escape(string)}/ =~ "$3.67"
I get:
`<main>': undefined local variable or method `dlr' for main:Object (NameError)
There are a few mistakes in your approach. First of all, let's look at your string:
string = "\$(\?<dlr>\d+)\.(\?<cts>\d+)"
You escape the dollar sign with "\$", but that is the same as just writing "$", consider:
"\$" == "$"
#=> true
To actually end up with the string "backslash followed by dollar" you would need to write "\\$". The same thing applies to the decimal character classes, you would have to write "\\d" to end up with the correct string.
The question marks on the other hand are actually part of the regex syntax, so you do not want to escape these at all. I recommend using single quotes for your original string, because that makes the input much easier:
string = '\$(?<dlr>\d+)\.(?<cts>\d+)'
#=> "\\$(?<dlr>\\d+)\\.(?<cts>\\d+)"
The next issue is with Regexp.escape. Take a look at what regular expression it produces with the above string:
string = '\$(?<dlr>\d+)\.(?<cts>\d+)'
Regexp.escape(string)
#=> "\\\\\\$\\(\\?<dlr>\\\\d\\+\\)\\\\\\.\\(\\?<cts>\\\\d\\+\\)"
That's one level too much escaping. Regexp.escape can be used when you want to match the literal characters that are contained in the string. For example, the escaped regex above will match the source string itself:
/#{Regexp.escape(string)}/ =~ string
#=> 0 # matches at offset 0
Instead, you can use Regexp.new to treat the source as an actual regular expression.
The last issue is then how you access the match result. Obviously, you are getting a NoMethodError. You might think that the match result is stored in local variables called dlr and cts, but that is not the case. You have two options to access the match data:
Use Regexp.match, it will return a MatchData object as result
Use regexp =~ string and then access the last match data with the global variable $~
I prefer the former, because it is easier to read. The full code would then look like this:
string = '\$(?<dlr>\d+)\.(?<cts>\d+)'
regexp = Regexp.new(string)
result = regexp.match("$3.67")
#=> #<MatchData "$3.67" dlr:"3" cts:"67">
result[:dlr]
#=> "3"
result[:cts]
#=> "67"

Eval a string without string interpolation

AKA How do I find an unescaped character sequence with regex?
Given an environment set up with:
#secret = "OH NO!"
$secret = "OH NO!"
##secret = "OH NO!"
and given string read in from a file that looks like this:
some_str = '"\"#{:NOT&&:very}\" bad. \u262E\n##secret \\#$secret \\\\###secret"'
I want to evaluate this as a Ruby string, but without interpolation. Thus, the result should be:
puts safe_eval(some_str)
#=> "#{:NOT&&:very}" bad. ☮
#=> ##secret #$secret \###secret
By contrast, the eval-only solution produces
puts eval(some_str)
#=> "very" bad. ☮
#=> OH NO! #$secret \OH NO!
At first I tried:
def safe_eval(str)
eval str.gsub(/#(?=[{#$])/,'\\#')
end
but this fails in the malicious middle case above, producing:
#=> "#{:NOT&&:very}" bad. ☮
#=> ##secret \OH NO! \###secret
You can do this via regex by ensuring that there are an even number of backslashes before the character you want to escape:
def safe_eval(str)
eval str.gsub( /([^\\](?:\\\\)*)#(?=[{#$])/, '\1\#' )
end
…which says:
Find a character that is not a backslash [^\\]
followed by two backslashes (?:\\\\)
repeated zero or more times *
followed by a literal # character
and ensure that after that you can see either a {, #, or $ character.
and replace that with
the non-backslash-maybe-followed-by-even-number-of-backslashes
and then a backslash and then a #
How about not using eval at all? As per this comment in chat, all that's necessary are escaping quotes, newlines, and unicode characters. Here's my solution:
ESCAPE_TABLE = {
/\\n/ => "\n",
/\\"/ => "\"",
}
def expand_escapes(str)
str = str.dup
ESCAPE_TABLE.each {|k, v| str.gsub!(k, v)}
#Deal with Unicode
str.gsub!(/\\u([0-9A-Z]{4})/) {|m| [m[2..5].hex].pack("U") }
str
end
When called on your string the result is (in your variable environment):
"\"\"\#{:NOT&&:very}\" bad. ☮\n\##secret \\\#$secret \\\\\###secret\""
Although I would have preferred not to have to treat unicode specially, it is the only way to do it without eval.

What is between { }?

There is a piece of code:
def test_sub_is_like_find_and_replace
assert_equal "one t-three", "one two-three".sub(/(t\w*)/) { $1[0, 1] }
end
I found it really hard to understand what is between { } braces. Could anyone explain it please?
The {...} is a block. Ruby will pass the matched value to the block, and substitute the return value of the block back into the string. The String#sub documentation explains this more fully:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
Edit: Per Michael's comment, if you're confused about $1[0, 1], this is just taking the first capture ($1) and taking a substring of it (the first character, specifically). $1 is a global variable set to the contents of the first capture after a regex (in true Perl fashion), and since it's a string, the #[] operator is used to take a substring of it starting at index 0, with a length of 1.
The sub method either takes two arguments, first being the text to replace replace and the second being the replacement, or one argument being the text to replace and a block defining how to handle the replacement.
The block method is useful if you can't define your replacement as a simple string.
For example:
"foo".sub(/(\w)/) { $1.upcase }
# => "Foo"
"foo".sub(/(\w+)/) { $1.upcase }
# => "FOO"
The gsub method works the same way, but applies more than once:
"foo".gsub(/(\w)/) { $1.upcase }
# => "FOO"
In all cases, $1 refers to the contents captured by the brackets (\w).
Your code, illustrated
r = "one two-three".sub(/(t\w*)/) do
$1 # => "two"
$1[0, 1] # => "t"
end
r # => "one t-three"
sub is taking in a regular expression in it. The $1 is a reserved global variable that contains the match for the regular expression.
The brackets represent a block of code used that will substitute the match with the string returned by the block. In this case
puts $1
#=> "two"
puts $1[0, 1]
#=> "t"

Resources