Why is Regexp.escape not working in my Ruby expression?

I'm using Ruby 2.4. I have some strings that contain characters that have special meaning in regular expressions. So to eliminate any possibility of those characters being interpreted as regexp characters, I use Regexp.escape to escape them. However, I still seem unable to make the regular expression below work ...
2.4.0 :005 > tokens = ["a", "b?", "c"]
=> ["a", "b?", "c"]
2.4.0 :006 > line = "1\ta\tb?\tc\t3"
=> "1\ta\tb?\tc\t3"
2.4.0 :009 > /#{Regexp.escape(tokens.join(" ")).gsub(" ", "\\s+")}/.match(line)
=> nil
How can I properly escape the characters before substituting the space with a "\s+" expression, which I do want interpreted as a regexp character?

When Regexp.escape(tokens.join(" ")).gsub(" ", "\\s+") is executed, tokens.join(" ") yields a b? c, then the string is escaped -> a\ b\?\ c (Regexp.escape also escapes the spaces), and then the gsub replaces each space, resulting in a\\s+b\?\\s+c. Meanwhile, line is 1 a b? c 3 with tabs between the items. So each \\ now matches a literal backslash and the s+ after it matches the letter s; together they no longer form the regex metacharacter \s that matches whitespace.
You need to escape the tokens, and join with \s+, or join with space and later replace the space with \s+:
/#{tokens.map { |n| Regexp.escape(n) }.join("\\s+")}/.match(line)
OR
/#{tokens.map { |n| Regexp.escape(n) }.join(" ").gsub(" ", "\\s+")}/.match(line)
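To make the difference concrete, here is a minimal sketch that builds both pattern strings from the question's data and prints them before matching. The variable names broken_source and fixed_source are only for illustration.

```ruby
tokens = ["a", "b?", "c"]
line = "1\ta\tb?\tc\t3"

# Escaping the joined string also escapes the spaces, so the later gsub
# leaves a stray backslash in front of every \s+.
broken_source = Regexp.escape(tokens.join(" ")).gsub(" ", "\\s+")
broken = Regexp.new(broken_source)

# Escaping each token first keeps the \s+ separators intact.
fixed_source = tokens.map { |t| Regexp.escape(t) }.join("\\s+")
fixed = Regexp.new(fixed_source)

puts broken_source      # a\\s+b\?\\s+c
puts fixed_source       # a\s+b\?\s+c
p broken.match(line)    # nil
p fixed.match(line)[0]  # "a\tb?\tc"
```

Printing the source string before building the regex is a cheap way to spot a doubled backslash like this one.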

Related

Select a string in regex with ruby

I have to clean a string passed as a parameter, removing all lowercase letters and all special characters except:
+
|
^
space
=>
<=>
So I have this string passed as a parameter:
aA azee + B => C=
and I need to clean it to get this result:
A + B => C
I do
string.gsub(/[^[:upper:][+|^ ]]/, "")
output: "A + B C"
I don't know how to select the => (and <=>) strings with a regex in Ruby.
I know that if I change my regex to string.gsub(/[^[:upper:][+|^ =>]]/, ""), the last = in the string will be kept too.
You can try an alternative approach: matching everything you want to keep then joining the result.
You can use this regex to match everything you want to keep:
[A-Z\d+| ^]|<?=>
As you can see, this just uses | and [] to create a list of the strings that you want to keep: uppercase letters, digits, +, |, space, ^, => and <=>.
Example:
"aA azee + B => C=".scan(/[A-Z\d+| ^]|<?=>/).join()
Output:
"A  + B => C"
Note that there are 2 consecutive spaces between "A" and "+". If you don't want that you can call String#squeeze.
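A short sketch of the scan-and-join approach with the squeeze step included, using the question's input. The names keep, kept and cleaned are illustrative.

```ruby
input = "aA azee + B => C="

# Keep uppercase letters, digits, +, |, space, ^, and the operators => and <=>.
keep = /[A-Z\d+| ^]|<?=>/

kept = input.scan(keep).join
cleaned = kept.squeeze(" ")  # collapse runs of spaces into one

p kept     # "A  + B => C"
p cleaned  # "A + B => C"
```

Passing " " to squeeze limits the collapsing to spaces, so repeated characters such as "AA" are left alone.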
See regex in use here
(<?=>)|[^[:upper:]+|^ ]
(<?=>) Captures <=> or => into capture group 1
[^[:upper:]+|^ ] Matches any character that is not an uppercase letter (same as [A-Z]) or +, |, ^ or a space
See code in use here
p "aA azee + B => C=".gsub(/(<?=>)|[^[:upper:]+|^ ]/, '\1')
Result: "A  + B => C" (the two spaces between A and + remain)
r = /[a-z\s[:punct:]&&[^+ |^]]/
"The cat, 'Boots', had 9+5=4 ^lIVEs^ leF|t.".gsub(r,'')
#=> "T B 9+54 ^IVE^ F|"
The regular expression reads, "Match lowercase letters, whitespace and punctuation that are not the characters '+', ' ', '|' and '^'." && within a character class is the set intersection operator. Here it intersects the set of characters that match a-z\s[:punct:] with those that match [^+ |^]. (Note that this includes whitespace characters other than spaces.) For more information search for "character classes also support the && operator" in the Regexp documentation.
I have not included '=>' and '<=>' as those, unlike '+', ' ', '|' and '^', are multi-character strings and therefore require a different approach than simply removing certain characters.
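For readers new to the && operator, here is a smaller sketch of character class intersection on made-up inputs. It avoids [:punct:] on purpose, since which symbols that class covers has varied between Ruby versions.

```ruby
# Lowercase letters that are not vowels, i.e. lowercase consonants.
consonants = /[a-z&&[^aeiou]]/
p "hello world".gsub(consonants, "")  # "eo o"

# Lowercase letters and whitespace, except the plain space character.
lower_or_non_space_ws = /[a-z\s&&[^ ]]/
p "aB c\tD".gsub(lower_or_non_space_ws, "")  # "B D"
```

In both patterns the part after && acts as a filter: a character is removed only if it is in the first set and also passes the negated class.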

How to use Regexp.union to match a character at the beginning of my string

I'm using Ruby 2.4. I want to match an optional "a" or "b" character, followed by an arbitrary amount of white space, and then one or more digits, but my regexes are failing to match any of these:
2.4.0 :017 > MY_TOKENS = ["a", "b"]
=> ["a", "b"]
2.4.0 :018 > str = "40"
=> "40"
2.4.0 :019 > str =~ Regexp.new("^[#{Regexp.union(MY_TOKENS)}]?[[:space:]]*\d+[^a-z^0-9]*$")
=> nil
2.4.0 :020 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z^0-9]*$")
=> nil
2.4.0 :021 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+$")
=> nil
I'm stumped as to what I'm doing wrong.
If they are single characters, just use MY_TOKENS.join inside the character class:
MY_TOKENS = ["a", "b"]
str = "40"
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/
# /^[ab]?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ first_regex
# 0
You can also integrate the Regexp.union, it might lead to some unexpected bugs though, because the flags of the outer regexp won't apply to the inner one :
second_regex = /^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?-mix:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ second_regex
# 0
The above regex looks a lot like what you did, but using // instead of Regexp.new prevents you from having to escape the backslashes.
You could use Regexp#source to avoid this behaviour :
third_regex = /^(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ third_regex
# 0
or simply build your regex union :
fourth_regex = /^(?:#{MY_TOKENS.join('|')})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ fourth_regex
# 0
The 3 last examples should work fine if MY_TOKENS has words instead of just characters.
first_regex, third_regex and fourth_regex should all work fine with /i flag.
As an example :
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/i
"A 40" =~ first_regex
# 0
I believe you want to match a string that may contain any of the alternatives you defined in the MY_TOKENS, then 0+ whitespaces and then 1 or more digits up to the end of the string.
Then you need to use
Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+\\z").match?(s)
or
/\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+\z/.match?(s)
When you use Regexp.new with a string, remember to double the backslashes so that the regex receives a literal backslash (e.g. "\\d" is a digit matching pattern). In regex literal notation, a single backslash is enough (/\d/).
Do not forget to match the start of a string with \A and end of string with \z anchors.
Note that [...] creates a character class that matches any char that is defined inside it: [ab] matches an a or b, [program] will match one char, either p, r, o, g, r, a or m. If you have multicharacter sequences in the MY_TOKENS, you need to remove [...] from the pattern.
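The backslash point is easy to check directly. The sketch below uses a local variable tokens in place of MY_TOKENS and compares what each way of building the pattern actually produces.

```ruby
tokens = ["a", "b"]

# In a double-quoted string "\d" is just the letter d, so the digit class is lost.
careless = Regexp.new("\d+")
p careless.source  # "d+"

# Doubling the backslashes in the string gives the same pattern as the literal.
from_string  = Regexp.new("\\A#{Regexp.union(tokens)}?[[:space:]]*\\d+\\z")
from_literal = /\A#{Regexp.union(tokens)}?[[:space:]]*\d+\z/
p from_string.source == from_literal.source  # true

p from_literal.match?("40")    # true
p from_literal.match?("a 40")  # true
p from_literal.match?("40x")   # false
```

This also explains the nil results in the question: with single backslashes inside Regexp.new("..."), the pattern was looking for the letter d rather than digits.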
To make the regex case insensitive, pass a case insensitive modifier to the pattern and make sure you use the .source method of the regex created by Regexp.union to remove its flags (thanks, Eric). Since .source returns the bare alternation a|b, wrap it in a non-capturing group so that ? applies to the whole alternation:
Regexp.new("(?i)\\A(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\\d+\\z")
or
/\A(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+\z/i
Calling to_s on the second regex gives (?i-mx:\A(?:a|b)?[[:space:]]*\d+\z), where i-mx means that case insensitive mode is on and that the multiline (dot matches line breaks) and verbose modes are off.
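Because .source returns the bare alternation a|b, it matters whether that text is wrapped in a group before ? is applied. The sketch below compares the two forms; the variable names and the "apple" input are made up for illustration.

```ruby
tokens = ["a", "b"]
source = Regexp.union(tokens).source  # "a|b"

# Without a group the top-level | splits the whole pattern in two:
# \Aa   OR   b?[[:space:]]*\d+\z
ungrouped = /\A#{source}?[[:space:]]*\d+\z/i

# With a non-capturing group the ? applies to the whole alternation.
grouped = /\A(?:#{source})?[[:space:]]*\d+\z/i

p ungrouped.match?("apple")  # true, via the \Aa branch
p grouped.match?("apple")    # false
p grouped.match?("A 40")     # true
p grouped.match?("40")       # true
```

The ungrouped form still matches "40", which is why the problem is easy to miss when testing only the expected inputs.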

How to split a string without getting an empty string inserted in the array

I'm having trouble splitting a character from a string using a regular expression, assuming there is a match.
I want to split off either an "m" or an "f" character from the first part of a string assuming the next character is one or more numbers followed by optional space characters, followed by a string from an array I have.
I tried:
2.4.0 :006 > MY_SEPARATOR_TOKENS = ["-", " to "]
=> ["-", " to "]
2.4.0 :008 > str = "M14-19"
=> "M14-19"
2.4.0 :011 > str.split(/^(m|f)\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)}/i)
=> ["", "M", "19"]
Notice the extraneous "" element at the beginning of my array and also notice that the last expression is just "19" whereas I would want everything else in the string ("14-19").
How do I adjust my regular expression so that only the parts of the expression that get split end up in the array?
I find match to be a bit more elegant when extracting parts of a string with a regular expression in Ruby:
string = "M14-19"
string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14-19"]
# the named groups can also be read from the MatchData by symbol
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)
[extract_string[:m], extract_string[:digits]]
=> ["M", "14-19"]
string = 'M14 to 14'
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14 to 14"]
TOKENS = ["-", " to "]
r = /
  (?<=\A[mMfF])              # match the beginning of the string and then one
                             # of the 4 characters in a positive lookbehind
  (?=                        # begin positive lookahead
    \d+                      # match one or more digits
    [[:space:]]*             # match zero or more spaces
    (?:#{TOKENS.join('|')})  # match one of the tokens
  )                          # close the positive lookahead
/x                           # free-spacing regex definition mode
(?:#{TOKENS.join('|')}) is replaced by (?:-| to ).
This can of course be written in the usual way.
r = /(?<=\A[mMfF])(?=\d+[[:space:]]*(?:#{TOKENS.join('|')}))/
When splitting on r you are splitting between two characters (between a positive lookbehind and a positive lookahead) so no characters are consumed.
"M14-19".split r
#=> ["M", "14-19"]
"M14 to 19".split r
#=> ["M", "14 to 19"]
"M14 To 19".split r
#=> ["M14 To 19"]
If it is desired that ["M", "14 To 19"] be returned in the last example, change [mMfF] to [mf] and /x to /xi.
You have a bug brewing in your code. Don't get in the habit of doing this:
#{Regexp.union(MY_SEPARATOR_TOKENS)}
You're setting yourself up with a very hard to debug problem.
Here's what's happening:
regex = Regexp.union(%w(a b)) # => /a|b/
/#{regex}/ # => /(?-mix:a|b)/
/#{regex.source}/ # => /a|b/
/(?-mix:a|b)/ is an embedded sub-pattern with its set of the regex flags m, i and x which are independent of the surrounding pattern's settings.
Consider this situation:
'CAT'[/#{regex}/i] # => nil
We'd expect the regular expression's i flag to make this match because it ignores case, but the sub-expression still allows only lowercase, causing the match to fail.
Using the bare (a|b) or adding source succeeds because the inner expression gets the main expression's i:
'CAT'[/(a|b)/i] # => "A"
'CAT'[/#{regex.source}/i] # => "A"
See "How to embed regular expressions in other regular expressions in Ruby" for additional discussion of this.
The empty element will always be there if you get a match, because the captured part appears at the beginning of the string and the string between the start of the string and the match is added to the resulting array, be it an empty or non-empty string. Either shift/drop it once you get a match, or just remove all empty array elements with .reject { |c| c.empty? } (see How do I remove blank elements from an array?).
Then, 14- is eaten up (consumed) by the \d+[[:space:]]... pattern part - put it into a (?=...) lookahead that will just check for the pattern match, but won't consume the characters.
Use something like
MY_SEPARATOR_TOKENS = ["-", " to "]
s = "M14-19"
p s.split(/^(m|f)(?=\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)})/i).drop(1)
#=> ["M", "14-19"]
See Ruby demo
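Both effects described above, the leading empty string and the consumed digits, can be seen side by side in this sketch. To keep it short, the separator tokens are written inline as (?:-| to ) rather than built from the array, which is a simplification of the question's setup.

```ruby
s = "M14-19"

# The digits and separator are part of the match, so split throws them away.
consuming = /^(m|f)\d+[[:space:]]*(?:-| to )/i

# The lookahead only checks that they follow; nothing after the letter is consumed.
lookahead = /^(m|f)(?=\d+[[:space:]]*(?:-| to ))/i

p s.split(consuming)                     # ["", "M", "19"]
p s.split(lookahead)                     # ["", "M", "14-19"]
p s.split(lookahead).drop(1)             # ["M", "14-19"]
p s.split(lookahead).reject(&:empty?)    # ["M", "14-19"]
p "M14 to 19".split(lookahead).drop(1)   # ["M", "14 to 19"]
```

drop(1) is only safe when the pattern is known to match at the start of the string; reject(&:empty?) is the more defensive choice if some inputs will not match.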

How do I match a space or the end of a line in a regexp in Ruby?

I'm using Ruby 2.4. I'm trying to write a regular expression to match a string in which the first character is either an "a" or a "b" and the next character is a space or the end of the line. So I came up with
2.4.0 :006 > data = "B U"
=> "B U"
2.4.0 :007 > data =~ /^[ab](^[[:space:]]|$)/i
=> nil
But as you can see, my expression is not matching my string "B U" even though I thought I wrote it properly. How can I revise it to make it right?
I'm trying to write a regular expression to match a string in which the first character is either an "a" or a "b" and the next character is a space or the end of the line.
The regex in Ruby will look like
/^[ab](?:[[:space:]]|$)/i
See the regex demo.
Your ^[ab](^[[:space:]]|$) pattern matches the line start, then a or b, then either a whitespace character at the start of a line (^[[:space:]], which can never match right after a letter) or the line end ($). So, your regex will only match a line that consists of a single a or b (in either case).
Remember to replace ^ with \A and $ with \z if you need to match the whole string, not just a line.
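The difference between the line anchors and the string anchors only shows up with multi-line input, so here is a small sketch using a made-up two-line string alongside the question's data.

```ruby
line_anchored   = /^[ab](?:[[:space:]]|$)/i
string_anchored = /\A[ab](?:[[:space:]]|\z)/i

p("B U" =~ line_anchored)  # 0
p("b"   =~ line_anchored)  # 0
p("bU"  =~ line_anchored)  # nil

multi = "xyz\nb"           # the second line is just "b"
p(multi =~ line_anchored)    # 4, because ^ also matches after the newline
p(multi =~ string_anchored)  # nil, because \A only matches at index 0
```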

Why does my regexp not tell me if a string contains no numbers?

I'm using Ruby 2.4. How do I check if a string doesn't contain any numbers? I'm trying this
2.4.0 :002 > line = "abcdef"
=> "abcdef"
2.4.0 :007 > line =~ /^^[0-9]+$/
=> nil
I thought the "^" was the "not" character, but I'm not sure how it works because I know it is also the character that marks the start of a line. Anyway, help is appreciated.
^ negates a set of characters when at the beginning of the square-bracketed list:
[^abc] # not a, b, or c
so you just need to move it inside the brackets:
line =~ /^[^0-9]+$/
Note that you probably want \A and \z instead of ^ and $, since they match the starts and ends of entire strings instead of lines, and that \D is short for [^0-9].
line =~ /\A\D+\z/
You can also do a negative check for digits.
line !~ /\d/
You can use match? like this:
'abcdef'.match?(/\d/)
#=> false
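The checks above agree on most inputs but differ on the empty string, which this sketch makes visible. The helper name no_digits? is hypothetical.

```ruby
# Hypothetical helper: true when the string contains no digit at all.
def no_digits?(str)
  !str.match?(/\d/)
end

p("abcdef" =~ /\A\D+\z/)  # 0
p("abc1"   =~ /\A\D+\z/)  # nil

# The empty string has no digits, but \D+ needs at least one character.
p("" =~ /\A\D+\z/)        # nil
p("" !~ /\d/)             # true
p no_digits?("")          # true
p no_digits?("abc1")      # false
```

If an empty string should count as "contains no numbers", the negative check (or \D* instead of \D+) is the safer choice.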
