How gsub works in Ruby with metacharacters in the input

I need help in understanding how the following works.
"middl'-.*$%ddlemiddlemiddlemiddlemiddlemiddlemiExcess".gsub(/[^a-zA-Z'-.]/, '')
# => "middl'-.*ddlemiddlemiddlemiddlemiddlemiddlemiExcess"
"middl'-.*$%ddlemiddlemiddlemiddlemiddlemiddlemiExcess".gsub(/[^a-zA-Z.'-]/, '')
# => "middl'-.ddlemiddlemiddlemiddlemiddlemiddlemiExcess"
When I give /[^a-zA-Z'-.]/, then the star is not removed, but in the second example, the star is removed. Why?
I want the result after gsub to contain only letters (a-zA-Z), periods (.), hyphens (-), and apostrophes ('). Why does merely changing the position of the period inside the regular expression change the output?

In /[^a-zA-Z'-.]/ the hyphen is treated as a range delimiter, exactly as in A-Z before it. The range is:
▶ ("'"..'.').to_a
#⇒ ["'", "(", ")", "*", "+", ",", "-", "."] # note asterisk
In /[^a-zA-Z.'-]/ the hyphen is the last symbol in the class and hence it is treated as a literal hyphen.
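As a quick check of the explanation above, here is a small sketch that runs three spellings of the class against a shortened version of the question's input. The constant names are made up for illustration.

```ruby
INPUT = "middl'-.*$%ddle"

# Hyphen between ' and . forms a range covering ' ( ) * + , - .
RANGE_CLASS = /[^a-zA-Z'-.]/
# Hyphen in last position is a literal hyphen
HYPHEN_LAST = /[^a-zA-Z.'-]/
# An escaped hyphen is literal in any position
HYPHEN_ESCAPED = /[^a-zA-Z'\-.]/

puts INPUT.gsub(RANGE_CLASS, '')     # => middl'-.*ddle (asterisk survives)
puts INPUT.gsub(HYPHEN_LAST, '')     # => middl'-.ddle
puts INPUT.gsub(HYPHEN_ESCAPED, '')  # => middl'-.ddle
```

Escaping the hyphen is probably the more robust choice, since the class keeps working if someone later reorders its members.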

Related

How to remove strings that end with a particular character in Ruby

Based on "How to Delete Strings that Start with Certain Characters in Ruby", I know that the way to remove a string that starts with the character "#" is:
email = email.gsub( /(?:\s|^)#.*/ , "") #removes strings that start with "#"
I want to also remove strings that end in ".". Inspired by "Difference between \A \z and ^ $ in Ruby regular expressions" I came up with:
email = email.gsub( /(?:\s|$).*\./ , "")
Basically I swapped the caret for the dollar sign and reversed the order of the part after the closing parenthesis (making sure to escape the period). However, it is not doing the trick.
An example I'd like to match and remove is:
"a8&23q2aas."
You were so close.
email = email.gsub( /.*\.\s*$/ , "")
The difference is in the order of the tokens: a regex describes the text from left to right, so the end-of-line anchor has to come last, not first. Here, you are trying to find a period (\.) which is followed only by optional whitespace (\s*) and then the end of the line ($). I would read the regex above as "Any characters of any length followed by a period, followed by any amount of whitespace, followed by the end of the line."
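One thing worth checking with that pattern: because of the leading .*, it removes everything on the line up to the period, not just the last word. The sketch below shows this, along with a word-scoped variant. WORD_PATTERN is my own assumption about what "remove strings that end in a period" might mean, not something from the answer.

```ruby
# Pattern from the answer: the whole line up to a trailing period
LINE_PATTERN = /.*\.\s*$/
# Hypothetical alternative: only the whitespace-delimited word ending in a period
WORD_PATTERN = /\S*\.(?=\s|$)/

puts "a8&23q2aas.".gsub(LINE_PATTERN, "").inspect          # => ""
puts "keep me a8&23q2aas.".gsub(LINE_PATTERN, "").inspect  # => ""
puts "keep me a8&23q2aas.".gsub(WORD_PATTERN, "").inspect  # => "keep me "
```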
As commenters pointed out, though, there's a simpler way: String#end_with?.
I'd use:
words = %w[#a day in the life.]
# => ["#a", "day", "in", "the", "life."]
words.reject { |w| w.start_with?('#') || w.end_with?('.') }
# => ["day", "in", "the"]
Using a regex is overkill for this if you're only concerned with the starting or ending character, and, in fact, regular expressions will slow your code in comparison with using the built-in methods.
I would really like to stick to using gsub....
gsub is the wrong way to remove an element from an array. It could be used to turn the string into an empty string, but that won't remove that element from the array.
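If the input is one string rather than an array, the same reject idea can be applied by splitting on whitespace first. This is only a sketch: strip_marked_words is a hypothetical helper name, and note that split followed by join collapses runs of whitespace into single spaces.

```ruby
# Drop whitespace-delimited words that start with '#' or end with '.'
def strip_marked_words(text)
  text.split
      .reject { |word| word.start_with?('#') || word.end_with?('.') }
      .join(' ')
end

puts strip_marked_words("#a day in the life.")  # => day in the
```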
def replace_suffix(str, suffix)
  # Drop the suffix only when the string actually ends with it.
  str.end_with?(suffix) ? str[0, str.length - suffix.length] : str
end
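For comparison, assuming Ruby 2.5 or later, the built-in String#delete_suffix appears to do the same job as the helper above. The sketch repeats the helper so it runs on its own.

```ruby
# Helper from the answer, repeated here so the example is self-contained
def replace_suffix(str, suffix)
  str.end_with?(suffix) ? str[0, str.length - suffix.length] : str
end

puts replace_suffix("life.", ".")   # => life
puts "life.".delete_suffix(".")     # => life
puts "life".delete_suffix(".")      # => life (no suffix, string returned unchanged)
```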

What does the regular expression [\w-] mean?

I checked the documentation, and cannot find what [\w-] means. Can anyone tell me what [\w-] means in Ruby?
The square brackets [] denote a character class. A character class will match any of the things inside it.
\w is a special class called "word characters". It is shorthand for [a-zA-Z0-9_], so it will match:
a-z (all lowercase letters)
A-Z (all uppercase letters)
0-9 (all digits)
_ (an underscore)
The class you are asking about, [\w-], is a class consisting of \w and -. So it will match the above list, plus hyphens (-).
Exactly as written, [\w-], this regex would match a single character, as long as it's in the above list, or is a dash.
If you were to add a quantifier to the end, e.g. [\w-]* or [\w-]+, then it would match any of these strings:
fooBar9
foo-Bar9
foo-Bar-9
-foo-Bar---9abc__34ab12d
And it would partially match these:
foo,Bar9 # match 'foo' - the ',' stops the match
-foo-Bar---9*bc__34ab12d # match '-foo-Bar---9', the '*' stops the match
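The full and partial matches listed above can be reproduced in Ruby roughly like this. The method names are invented for the example, and the \A and \z anchors are added only to test for a whole-string match.

```ruby
CLASS_RUN = /[\w-]+/

# First run of word characters and hyphens, or nil if there is none
def first_run(str)
  str[CLASS_RUN]
end

# True when the entire string consists of word characters and hyphens
def whole_match?(str)
  /\A[\w-]+\z/.match?(str)
end

puts whole_match?("foo-Bar-9")              # => true
puts whole_match?("foo,Bar9")               # => false
puts first_run("foo,Bar9")                  # => foo
puts first_run("-foo-Bar---9*bc__34ab12d")  # => -foo-Bar---9
```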
\w Any word character (letter, number, underscore)
Here is what I think it is doing : Go to Rubular and try it as follow:
regex_1 /\w-/
String : f-oo
regex_1 will only match f- and will stop right after the -, ignoring the rest of the string (oo).
Whereas :
regex_2 /[\w-]/
string : f-oo
regex_2 will match every character of the string, including the special char -, so the whole of f-oo is highlighted.
.. Also , tested the case of a string being like f-1oo , and the second regex stopped the match at f- Hence, - is followed by a \d
==========
I believe the whole point of [] is to continue matching before and after the -. Here are some variations I tried from irb.
irb(main):004:0> "blah-blah".scan(/\w-/)
=> ["h-"]
irb(main):005:0> "blah-blah".scan(/[\w-]/)
=> ["b", "l", "a", "h", "-", "b", "l", "a", "h"]
irb(main):006:0> "blah-blah".scan(/\w-\w/)
=> ["h-b"]
irb(main):007:0> "blah-blah".scan(/\w-\w*$/)
=> ["h-blah"]
irb(main):008:0> "blah-blah".scan(/\w*-\w*$/)
=> ["blah-blah"]

ruby - weird duplication with backtick in gsub [duplicate]

s = "#main= 'quotes'"
s.gsub "'", "\\'" # => "#main= quotes'quotes"
This seems to be wrong, I expect to get "#main= \\'quotes\\'"
when I don't use escape char, then it works as expected.
s.gsub "'", "*" # => "#main= *quotes*"
So there must be something to do with escaping.
Using ruby 1.9.2p290
I need to replace single quotes with back-slash and a quote.
Even more inconsistencies:
"\\'".length # => 2
"\\*".length # => 2
# As expected
"'".gsub("'", "\\*").length # => 2
"'a'".gsub("'", "\\*") # => "\\*a\\*" (length==5)
# WTF next:
"'".gsub("'", "\\'").length # => 0
# Doubling the content?
"'a'".gsub("'", "\\'") # => "a'a" (length==3)
What is going on here?
You're getting tripped up by the specialness of \' inside a regular expression replacement string:
\0, \1, \2, ... \9, \&, \`, \', \+
Substitutes the value matched by the nth grouped subexpression, or by the entire match, pre- or postmatch, or the highest group.
So when you say "\\'", the double \\ becomes just a single backslash and the result is \' but that means "The string to the right of the last successful match." If you want to replace single quotes with escaped single quotes, you need to escape more to get past the specialness of \':
s.gsub("'", "\\\\'")
Or avoid the toothpicks and use the block form:
s.gsub("'") { |m| '\\' + m }
You would run into similar issues if you were trying to escape backticks, a plus sign, or even a single digit.
The overall lesson here is to prefer the block form of gsub for anything but the most trivial of substitutions.
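To make the special sequences concrete, the sketch below puts the block form and the four-backslash string form side by side, then shows what \`, \' and \0 expand to in a replacement string. The method names are hypothetical.

```ruby
# Block form: the block's return value is used verbatim, no backslash processing
def escape_quotes_block(str)
  str.gsub("'") { |m| "\\" + m }
end

# String form: \\\\ in source is two backslashes in the string, one in the output
def escape_quotes_string(str)
  str.gsub("'", "\\\\'")
end

s = "#main= 'quotes'"
puts escape_quotes_block(s)                              # => #main= \'quotes\'
puts escape_quotes_block(s) == escape_quotes_string(s)   # => true

puts "a-b".gsub("-", "\\`")     # => aab  (\` is the text before the match)
puts "a-b".gsub("-", "\\'")     # => abb  (\' is the text after the match)
puts "a-b".gsub("-", "\\0\\0")  # => a--b (\0 is the match itself)
```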
s = "#main = 'quotes'"
s.gsub "'", "\\\\'"
A single \ is written as \\ in a string literal, so if you want a double backslash in the string you have to type four of them.
You need to escape the \ as well:
s.gsub "'", "\\\\'"
Outputs
"#main= \\'quotes\\'"
A good explanation found on an outside forum:
The key point to understand IMHO is that a backslash is special in
replacement strings. So, whenever one wants to have a literal
backslash in a replacement string one needs to escape it and hence
have [two] backslashes. Coincidentally a backslash is also special in a
string (even in a single quoted string). So you need two levels of
escaping, makes 2 * 2 = 4 backslashes on the screen for one literal
replacement backslash.
source
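The 2 * 2 = 4 arithmetic from the quote can be checked by counting characters at each level. A minimal sketch, with an illustrative constant name:

```ruby
# Four backslashes on screen, two characters in the actual string
REPLACEMENT = "\\\\"
puts REPLACEMENT.length                   # => 2

# gsub then reads those two characters as one literal backslash
puts "a".gsub("a", REPLACEMENT).length    # => 1

# Single quotes do not help: \\ still collapses to one backslash there
puts '\\\\' == REPLACEMENT                # => true
```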

Why do these two regexes return different results in Ruby depending on the position of the underscore?

I have the following:
[11] pry(main)> "ab BN123-4.56".scan(/BN([0-9_\.-]+)/)
=> [["123-4.56"]]
[12] pry(main)> "ab BN123-4.56".scan(/BN([0-9\.-_]+)/)
=> [["123"]]
I am unsure why the second one, with the underscore at the end, behaves differently from the first. How is it being interpreted by the regex parser to make it different?
It's because you have the hyphen (-) placed in the middle of the character class without being escaped.
Within a character class [], you can place a hyphen (-) as the first or last character. If you place the hyphen anywhere else you need to escape it (\-) in order for it to be matched literally.
"ab BN123-4.56".scan(/BN([0-9_\.-]+)/) # => [["123-4.56"]]
"ab BN123-4.56".scan(/BN([0-9\.\-_]+)/) # => [["123-4.56"]]
Note: You don't really need to escape the dot (.) either, so you could rewrite this as:
"ab BN123-4.56".scan(/BN([0-9_.-]+)/) # => [["123-4.56"]]
Or even the following, if you choose to place the hyphen in the middle of the character class:
"ab BN123-4.56".scan(/BN([0-9.\-_]+)/) # => [["123-4.56"]]
The hyphen is messing things up, not the underscore.
- is a special character inside a character class, indicating a range. One way to escape it is to put it at the beginning or the end of the class: [...-].
So [_.-] checks for a character, either _ or . or -.
And [.-_] checks for a character in the range "from . to _".
Illustration
BN([0-9.\-_]+) does what you expect and selects 123-4.56 from ab BN123-4.56.
The hyphen inside of square brackets [] indicates a range. To use a literal hyphen, escape it with a \ as you would other special characters.
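A short sketch of why the range "from . to _" behaves so differently: going by character codes, it covers digits and uppercase letters but happens to exclude the hyphen itself. The constant names are illustrative only.

```ruby
RANGE_CLASS   = /[.-_]/   # range from "." (46) to "_" (95)
LITERAL_CLASS = /[_.-]/   # exactly three characters: _ . -

puts [".".ord, "_".ord, "-".ord].inspect  # => [46, 95, 45]

puts RANGE_CLASS.match?("-")    # => false ("-" is 45, just below the range)
puts RANGE_CLASS.match?("A")    # => true  (uppercase letters fall inside it)
puts LITERAL_CLASS.match?("-")  # => true
puts LITERAL_CLASS.match?("A")  # => false

puts "ab BN123-4.56".scan(/BN([0-9.\-_]+)/).inspect  # => [["123-4.56"]]
```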

a regex for numbers, letters, and -_. but the -_. can't be last characters of the captured fragment

How would I create a regex that captures as follows:
"bs BN12.3.-".scan(regex) # => [["12.3"]]
where the trailing periods, underscores, and hyphens are not included in the capture but the internal ones are? I tried the following:
"bs BN12.3.-".scan(/BN([a-zA-Z0-9\-_\.]+)/) # => [["12.3.-"]]
If you make your existing group non-greedy by ending it with ? as in ([a-zA-Z0-9\-_\.]+?) and follow it with an expression that matches the other characters [-._]* before terminating with $, you should get what you need:
"bs BN12.3.-".scan(/BN([a-zA-Z0-9\-_.]+?)[-._]*$/)
=> [["12.3"]]
# Different input strings...
"bs BN12.3".scan(/BN([a-zA-Z0-9\-_.]+?)[-._]*$/)
=> [["12.3"]]
"bs BN12.".scan(/BN([a-zA-Z0-9\-_.]+?)[-._]*$/)
=> [["12"]]
"bs BN12.3-4.5______".scan(/BN([a-zA-Z0-9\-_.]+?)[-._]*$/)
=> [["12.3-4.5"]]
(Note: most of the punctuation characters don't require escaping inside the [] character class. The hyphen does in its current position, but wouldn't if moved to the end of the [].)
Addendum: To prevent any non-alphabetic, non-digit character at the end, the final character class can be [^A-Za-z0-9]*
"bs BN12.3-4.5______".scan(/BN([a-zA-Z0-9\-_.]+?)[^A-Za-z0-9]*$/)
=> [["12.3-4.5"]]
Another option would be to require one of [a-zA-Z0-9] as the last character of the match and change the quantifier of [a-zA-Z0-9\-_\.] from + to * (any number of times):
(?<=BN)[-a-zA-Z0-9_.]*[a-zA-Z0-9]
Additionally, I used a lookbehind on BN to start the match, which avoids the need for a capturing group.
To make it shorter, might use some shorthands:
(?<=BN)[-\w.]*[^\W_]
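Tried in Ruby, the lookbehind pattern and the earlier $-anchored pattern agree on the question's input but differ once more text follows the token. This is a sketch for comparison; the constant names are mine.

```ruby
ANCHORED   = /BN([a-zA-Z0-9\-_.]+?)[-._]*$/
LOOKBEHIND = /(?<=BN)[-\w.]*[^\W_]/

# With a capture group scan returns nested arrays; without one, flat strings
puts "bs BN12.3.-".scan(ANCHORED).inspect               # => [["12.3"]]
puts "bs BN12.3.-".scan(LOOKBEHIND).inspect             # => ["12.3"]
puts "bs BN12.3-4.5______".scan(LOOKBEHIND).inspect     # => ["12.3-4.5"]

# The anchored version needs the token to end the line
puts "bs BN12.3.- more".scan(ANCHORED).inspect          # => []
puts "bs BN12.3.- more".scan(LOOKBEHIND).inspect        # => ["12.3"]
```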
Wrap the part you don't want to capture in a non-capturing group: instead of writing something like ([-\.\_]*), add the non-capturing marker ?: to get (?:[-\.\_]*).
