Regular Expressions metacharacters, \< and \>, no longer supported in latest Ruby version? - ruby

When I investigated this in irb, I found that the metacharacters \< and \> return nil when I expected a value. On the cheat sheet I'm using, these metacharacters are called "start-of-word" and "end-of-word" respectively. But don't they function the same as "word boundaries"?
That seems to hold true for the examples in "Mastering Regular Expressions" by Friedl.
irb(main):001:0> "this cat is fat" =~ /\bcat\b/
=> 5
irb(main):002:0> "this cat is fat" =~ /\<cat\>/
=> nil
irb(main):003:0> "cat" =~ /\<cat\>/
=> nil

That's entirely possible. As of Ruby 1.9, Ruby switched to Oniguruma for regular expression parsing. It's possible that prior to 1.9, \< and \> were valid.
However, in researching this, I found that they are listed as a specifically GNU addition to the regex language.
Playing with it in Rubular, which supports running a regex through several different Ruby implementations, I couldn't get \< or \> to work in any version. \b seems to be the more standard way of specifying a word boundary.

Ruby has always used \b for word boundaries, just like Perl and the other Perl-derived flavors (JavaScript, .NET, Python, etc.). It's only GNU tools like egrep that use \< and \>.
Is that the AddedBytes cheat sheet you're using? It contains several errors, that being one of them. Elsewhere it says < and > are metacharacters and you have to use \< and \> to match them literally; how's that for a Catch-22?
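If you do want separate start-of-word and end-of-word assertions in Ruby, one possible sketch is to build them from lookarounds. The names start_of_word and end_of_word below are made up for this illustration and are not part of Ruby, and the sketch assumes Ruby 1.9 or later, where lookbehind is available.

```ruby
text = "this cat is fat"

# \b matches at either edge of a word, so it covers both cases at once.
boundary_index = text =~ /\bcat\b/

# Hypothetical stand-ins for GNU's \< and \>, built from lookarounds.
start_of_word = /(?<!\w)(?=\w)/   # no word character before, one after
end_of_word   = /(?<=\w)(?!\w)/   # one word character before, none after
whole_word    = /#{start_of_word}cat#{end_of_word}/

emulated_index = text =~ whole_word
inside_word    = "concatenate" =~ whole_word

p boundary_index  # => 5
p emulated_index  # => 5
p inside_word     # => nil
```

For plain whole-word matching, \b alone is usually enough; the lookaround versions only matter when the two edges need to be told apart.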

Related

GSUB and Forward Slash usage in Ruby

I often see the gsub function being called with the pattern parameter enclosed in forward slashes. For example:
>> phrase = "*** and *** ran to the ###."
>> phrase.gsub(/\*\*\*/, "WOOF")
=> "WOOF and WOOF ran to the ###."
I thought maybe it had something to do with escaping asterisks, but using single quotes and double quotes works just as well:
>> phrase = "*** and *** ran to the ###."
>> phrase.gsub('***', "WOOF")
=> "WOOF and WOOF ran to the ###."
>> phrase.gsub("***", "WOOF")
=> "WOOF and WOOF ran to the ###."
Is it just convention to use forward slash? What am I missing?
Use forward slashes if you need to use regular expressions.
If you use a string argument with gsub, it will just do a plain character match.
In your example, you need backslashes to escape the asterisks when using a regular expression, because an asterisk has a special meaning in a regex (match the preceding element zero or more times). They are not necessary when using a string, because the characters are just matched exactly.
In your example, you probably don't need to use a regular expression, since it is a simple pattern. However, if you wanted to match *** only when it was at the beginning of a string (e.g. the first bunch in your example), then you would want to use a regex, for example:
phrase.gsub(/^\*{3}/, "WOOF")
For more information on regular expressions, see: http://www.regular-expressions.info/.
For more information on using regular expressions in Ruby, see: http://ruby-doc.org/core-2.2.0/Regexp.html.
To play with regular expressions as they work in Ruby, try: http://rubular.com/.
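As a small illustration of the difference, here is a sketch that reuses the phrase from the question. The variable names are only for this example.

```ruby
phrase = "*** and *** ran to the ###."

# String pattern: every character is matched literally.
literal = phrase.gsub("***", "WOOF")

# Regexp pattern: each asterisk must be escaped to be matched literally.
escaped = phrase.gsub(/\*\*\*/, "WOOF")

# Only a Regexp can express an anchor such as "start of line".
anchored = phrase.gsub(/^\*{3}/, "WOOF")

puts literal   # WOOF and WOOF ran to the ###.
puts escaped   # WOOF and WOOF ran to the ###.
puts anchored  # WOOF and *** ran to the ###.
```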
You are missing reading the documentation:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally, e.g. '\d' will match a backslash followed by ‘d’, instead of a digit.
http://ruby-doc.org/core-2.1.4/String.html#method-i-gsub
In other words, you can give a string or a regular expression. Regular expressions can be delimited several ways:
Regexps are created using the /.../ and %r{...} literals, and by the Regexp::new constructor.
http://ruby-doc.org/core-2.2.2/Regexp.html
The benefit of %r and of the alternate %r delimiters is you can usually find a delimiter that doesn't collide with characters in the pattern, which would force escaping them, as in your example.
* has to be escaped because it has special meaning in a regex, but in a string it does not.
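To make the delimiter point concrete, here is a rough sketch. The path string and variable names are invented for the example; Regexp.escape is the standard way to escape metacharacters in a string before turning it into a Regexp.

```ruby
path = "/usr/local/bin"

# With /.../ every forward slash in the pattern needs a backslash.
slashes = path.gsub(/\//, "::")

# With %r{...} the slash is no longer the delimiter, so it needs no escape.
braces = path.gsub(%r{/}, "::")

# Regexp.escape backslash-escapes the metacharacters in an arbitrary string.
escaped_source = Regexp.escape("***")
stars          = Regexp.new(escaped_source)
replaced       = "*** and ***".gsub(stars, "WOOF")

puts slashes         # ::usr::local::bin
puts braces          # ::usr::local::bin
puts escaped_source  # \*\*\*
puts replaced        # WOOF and WOOF
```

Note that changing the delimiter only helps with the delimiter character itself; metacharacters such as * still need escaping inside any regex literal.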

Regular expression "empty range in char class error"

I got a regex in my code, which is to match pattern of url and threw error:
/^(http|https):\/\/([\w-]+\.)+[\w-]+([\w- .\/?%&=]*)?$/
The error was "empty range in char class error". I found the cause of that is in ([\w- .\/?%&=]*)? part. Ruby seems to recognize - in \w- . as an operator for range instead of a literal -. After adding escape to the dash, the problem was solved.
But the original regular expression ran well on my co-workers' machines. We use the same version of osx, rails and ruby: Ruby version is ruby 1.9.3p194, rails is 3.1.6 and osx is 10.7.5. And after we deployed code to our Heroku server, everything worked fine too. Why did only my environment have error regarding this regex? What is the mechanism of Ruby regex interpreting?
I can replicate this error on Ruby 1.9.3p194 (2012-04-20 revision 35410) [i686-linux], installed on Ubuntu 12.04.1 LTS using rvm 1.13.4. However, this should not be a version-specific error. In fact, I'm surprised it worked on the other machines at all.
A simpler demonstration that fails just as well:
"abcd" =~ /[\w- ]/
This is because [\w- ] is interpreted as "a range from any word character up to a space", rather than as a character class containing a word character, a hyphen, or a space, which is what you intended.
Per Ruby's regular expression documentation:
Within a character class the hyphen (-) is a metacharacter denoting an inclusive range of characters. [abcd] is equivalent to [a-d]. A range can be followed by another range, so [abcdwxyz] is equivalent to [a-dw-z]. The order in which ranges or individual characters appear inside a character class is irrelevant.
As you saw, prepending a backslash escapes the hyphen, so it is read as a literal character rather than as a range operator, which removes the error. However, escaping the hyphen in the middle of a character class is not recommended, since it's easy to confuse the intended meaning of the hyphen in such cases. As m.buettner pointed out, always place hyphens either at the beginning or the end of a character class:
"abcd" =~ /[-\w ]/

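Here is a short sketch comparing the unambiguous spellings of that character class. The sample strings are made up, and the deliberately reversed range [z-a] is only there to show what the error message looks like in a case that fails on every Ruby version I'm aware of; it is not the regex from the question.

```ruby
hyphen_first   = /[-\w ]/    # hyphen at the start is literal
hyphen_last    = /[\w -]/    # hyphen at the end is literal
hyphen_escaped = /[\w\- ]/   # escaped hyphen works too, but is easier to misread

samples = ["a", "-", " ", "?"]
results = [hyphen_first, hyphen_last, hyphen_escaped].map do |pattern|
  samples.map { |s| !(s =~ pattern).nil? }
end

# A reversed range is rejected when the pattern is compiled.
error_message =
  begin
    Regexp.new("[z-a]")
    nil
  rescue RegexpError => e
    e.message
  end

p results        # each row is [true, true, true, false]
puts error_message
```
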
Ruby Regex Different from Rubular

I'm trying the same super simple regex on Rubular.com and on my Linux VM with Ruby 1.9.2, and I don't know why I'm getting different outputs:
VM:
my_str = "Madam Anita"
puts my_str[/\w/]
this Outputs: Madam
on Rubular it outputs: MadamAnita
Rubular:
http://www.rubular.com/r/qyQipItdes
I would love some help. I'm stuck here, and I won't be able to test my code for hw1.
No, it doesn't really. It matches all characters in "Madam" and "Anita", but not the space. The problem you are having is that my_str[/\w/] only returns a single match for the given regular expression, whereas Rubular highlights all possible matches.
If you need all occurrences, you could do this:
1.9.3p194 :002 > "Madam Anita".scan(/\w+/)
=> ["Madam", "Anita"]
Actually, \w matches a single character. The result in Rubular contains spaces between adjacent characters to tell you this (though I wish they'd also make the highlighting more obvious...). Compare with the output from matching \w+, which matches two strings (Madam and Anita).
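A quick sketch of the difference between indexing a string with a regex and scanning it, using the string from the question. The variable names are only for illustration.

```ruby
my_str = "Madam Anita"

first_char = my_str[/\w/]        # String#[] returns only the first match
first_word = my_str[/\w+/]
all_chars  = my_str.scan(/\w/)   # scan collects every match
all_words  = my_str.scan(/\w+/)

p first_char        # => "M"
p first_word        # => "Madam"
p all_chars.length  # => 10
p all_words         # => ["Madam", "Anita"]
```

Rubular behaves like scan here: it highlights every match, not just the first one.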

What are Ruby's numbered global variables

What do the values $1, $2, $', $` mean in Ruby?
They're captures from the most recent pattern match (just as in Perl; Ruby initially lifted a lot of syntax from Perl, although it's largely gotten over it by now :). $1, $2, etc. refer to parenthesized captures within a regex: given /a(.)b(.)c/, $1 will be the character between a and b and $2 the character between b and c. $` and $' mean the strings before and after the string that matched the entire regex (which is itself in $&), respectively.
There is actually some sense to these, if only historically; you can find it in perldoc perlvar, which generally does a good job of documenting the intended mnemonics and history of Perl variables, and mostly still applies to the globals in Ruby. The numbered captures are replacements for the capture backreference regex syntax (\1, \2, etc.); Perl switched from the latter to the former somewhere in the 3.x versions, because using the backreference syntax outside of the regex complicated parsing too much. (By the time Perl 5 rolled around, the parser had been sufficiently rewritten that the syntax was again available, and promptly reused for references/"pointers". Ruby opted for using a name-quote : instead, which is closer to the Lisp and Smalltalk style; since Ruby started out as a Perl-alike with Smalltalk-style OO, this made more sense linguistically.) The same applies to $&, which in historical regex syntax is simply & (but you can't use that outside the replacement part of a substitution, so it became a variable $& instead). $` and $' are both "cutesy": "back-quote" and "forward-quote" from the matched string.
The non-numbered ones are listed here:
https://www.zenspider.com/ruby/quickref.html#pre-defined-variables
$1, $2 ... $N refer to matches in a regex capturing group.
So:
"ab:cd" =~ /([a-z]+):([a-z]+)/
Would yield
$1 = "ab"
$2 = "cd"
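To see all of these at once, here is a small sketch. The input string is invented so that there is some text on both sides of the match, and the MatchData calls at the end are shown as an equivalent that avoids the globals.

```ruby
if ">> ab:cd <<" =~ /([a-z]+):([a-z]+)/
  first  = $1   # first parenthesized capture
  second = $2   # second parenthesized capture
  whole  = $&   # text matched by the entire regex
  before = $`   # text before the match
  after  = $'   # text after the match
end

p [first, second, whole, before, after]
# => ["ab", "cd", "ab:cd", ">> ", " <<"]

# The same information from a MatchData object.
m = ">> ab:cd <<".match(/([a-z]+):([a-z]+)/)
p [m[1], m[2], m[0], m.pre_match, m.post_match]
```

The globals are reset by every successful or failed match, so copying them into local variables right away, or using MatchData, tends to be less fragile.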

Ruby: hexadecimal in regular expressions

I need to match an md5 checksum in a regular expression in a Ruby (actually Rails) program. I found out somewhere that I can match hexadecimal strings with \h sequence, but I can't find the link anymore.
I'm using that sequence and my code is working in Ruby 1.9.2. I can make it work even under plain IRB (so it's not a Rails extension).
ruby-1.9.2-p180 :007 > "123abcdf" =~ /^\h+$/; $~
=> #<MatchData "123abcdf">
ruby-1.9.2-p180 :008 > "123abcdfg" =~ /^\h+$/; $~
=> nil
However, my IDE marks that expression as wrong, and I can't find any reference that cites that sequence.
Is the \h sequence legal in Ruby Regex under any environment/version or should I trust my ide and replace it with something like [abcdef\d]?
Yes it is. Check the official doc for the complete documentation for regex in Ruby.
Note that \h will match uppercase letters too, so it's actually equivalent to [a-fA-F\d]
According to this \h is part of oniguruma, which I believe is standard in ruby 1.9.
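For the MD5 case specifically, a minimal sketch might look like the following. It assumes a checksum is exactly 32 hex digits and uses \A and \z rather than ^ and $, since ^ and $ match at line breaks in Ruby; the variable names and the sample input "hello" are only for illustration.

```ruby
require "digest/md5"

checksum = Digest::MD5.hexdigest("hello")  # 32 lowercase hex digits

hex_shorthand = /\A\h{32}\z/               # \h is a hex digit, either case
hex_explicit  = /\A[0-9a-fA-F]{32}\z/      # portable spelling of the same thing

checks = {
  lowercase: !(checksum =~ hex_shorthand).nil?,
  uppercase: !(checksum.upcase =~ hex_shorthand).nil?,
  explicit:  !(checksum =~ hex_explicit).nil?,
  non_hex:   !(("g" * 32) =~ hex_shorthand).nil?,
  too_short: !("123abc" =~ hex_shorthand).nil?
}

p checks
# => {:lowercase=>true, :uppercase=>true, :explicit=>true, :non_hex=>false, :too_short=>false}
# (newer Rubies print the same hash as {lowercase: true, ...})
```

If the IDE keeps flagging \h, the explicit character class is a safe fallback that behaves the same way.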
