Regex error in Ruby 1.8.7 but not 2.0? - ruby

In Ruby 1.8.7 the following regex warning: nested repeat operator + and * was replaced with '*'.
^(\w+\.\w+)\|(\w+\.\w+)\n+*$
It does work in Ruby 2.0 though?
http://rubular.com/r/nRUSP5LNZA

A nested operator works, but is warned because it is useless. \n+* means:
Zero or more repeatition of
One or more repeatition of
\n
which is equivalent to a more simple expression \n*, which means:
Zero or more repeatition of
\n
There is no reason to use \n+*. Ruby regex engine was replaced in Ruby 1.9 and in Ruby 2.0, and if there are any differences, then it is simply that the newer engine does not check for warnings as the older one did.

Related

[[:punct:]] matches differently in irb and rails test [duplicate]

This question already has an answer here:
Regex "punct" character class matches different characters depending on Ruby version
(1 answer)
Closed 5 years ago.
[[:punct:]] doesn't match any punctuation when it's called by a rails model test. Using the following code
test "punctuation matched by [[:punct:]]" do
punct_match = /\A[[:punct:]]+\Z/.match('[\]\[!"#$%&\'()*+,./:;<=>?#\^_`{|}~-]')
puts punct_match
puts punct_match.class
end
this outputs a non-printable character and NilClass.
However, if I execute the same statement
punct_match = /\A[[:punct:]]+\Z/.match('[\]\[!"#$%&\'()*+,./:;<=>?#\^_`{|}~-]')
in irb matches correctly and outputs
[\]\[!"#$%&'()*+,./:;<=>?#\^_`{|}~-]
=> nil
What am I missing?
Ruby version => 2.2.4,
Rails version => 4.2.6
The behaviour of /[[:punct:]]/ changed slightly in ruby version 2.4.0.
This bug was raised in the ruby issues, which links back to this (much older) issue in Onigmo - the regexp engine used since Ruby 2.0+.
The short answer is, these characters were not matched by /[[:punct:]]/ in ruby versions <2.4.0, and are now matched:
$+<=>^`|~
You must be running irb in a newer ruby version than this rails application, which is why it matches there.
On a separate note, you should alter your code slightly to:
/\A[[:punct:]]+\z/.match('[]!"#$%&\'()*+,./:;<=>?#^_`{|}~-]')
Use \z, not \Z. There is a slight difference: \Z will also match a new line at the end of the string.
You have unnecessary back-slashes in the string, such as '\^'
You have repeated a [ character: '[\]\['

Match unicode text with Ruby 1.8.7

I have a regex that is used for matching unicode string and works pretty cool with all versions of Ruby newer than 1.8.7:
/[\p{L}\p{Space}]+/u
How it can be achieved with Ruby 1.8.7?
Unicode properties were added in Ruby with version 1.9, so in older versions you have to use Posix classes like [:space:] or [:alpha:]
See POSIX Bracket Expressions for more details.

Regular expression "empty range in char class error"

I got a regex in my code, which is to match pattern of url and threw error:
/^(http|https):\/\/([\w-]+\.)+[\w-]+([\w- .\/?%&=]*)?$/
The error was "empty range in char class error". I found the cause of that is in ([\w- .\/?%&=]*)? part. Ruby seems to recognize - in \w- . as an operator for range instead of a literal -. After adding escape to the dash, the problem was solved.
But the original regular expression ran well on my co-workers' machines. We use the same version of osx, rails and ruby: Ruby version is ruby 1.9.3p194, rails is 3.1.6 and osx is 10.7.5. And after we deployed code to our Heroku server, everything worked fine too. Why did only my environment have error regarding this regex? What is the mechanism of Ruby regex interpreting?
I can replicate this error on Ruby 1.9.3p194 (2012-04-20 revision 35410) [i686-linux], installed on Ubuntu 12.04.1 LTS using rvm 1.13.4. However, this should not be a version-specific error. In fact, I'm surprised it worked on the other machines at all.
A a simpler demonstration that fails just as well:
"abcd" =~ /[\w- ]/
This is because [\w- ] is interpreted as "a range beginning with any word character up to space (or blank)", rather than a character class containing a word, a hyphen, or a space, which is what you had intended.
Per Ruby's regular expression documentation:
Within a character class the hyphen (-) is a metacharacter denoting an inclusive range of characters. [abcd] is equivalent to [a-d]. A range can be followed by another range, so [abcdwxyz] is equivalent to [a-dw-z]. The order in which ranges or individual characters appear inside a character class is irrelevant.
As you saw, prepending a backslash escaped the hyphen, thus changing the nature of the regexp from a range to a character class, removing the error. However, escaping the hyphen in the middle of character class is not recommended, since it's easy to confuse the intended meaning of the hyphen in such cases. As m.buettner pointed out, always place hyphens either at the beginning or the end of a character class:
"abcd" =~ /[-\w ]/

Supporting Ruby 1.9's hash syntax in Ruby 1.8

I'm writing a Ruby gem using the {key: 'value'} syntax for hashes throughout my code. My tests all pass in 1.9.x, but I (understandably) get syntax error, unexpected ':', expecting ')' in 1.8.7.
Is there a best practice for supporting the 1.8.x? Do I need to rewrite the code using our old friend =>, or is there a better strategy?
I think you're out of luck, if you want to support 1.8 then you have to use =>. As usual, I will mention that you must use => in certain cases in 1.9:
If the key is not a symbol. Remember that any object (symbols, strings, classes, floats, ...) can be a key in a Ruby Hash.
If you need a symbol that you'd quote: :'this.that'.
If you use MongoDB for pretty much anything you'll be using things like :$set => hash but $set: hash is a syntax error.
Back to our regularly scheduled programming.
Why do I say that you're out of luck? The Hash literal syntaxes (both of them) are hard-wired in the parser and I don't think you're going to have much luck patching the parser from your gem. Ruby 1.8.7's parse.y has this to say:
assoc : arg_value tASSOC arg_value
{
$$ = list_append(NEW_LIST($1), $3);
}
;
and tASSOC is => so hash literals are hard-wired to use =>. 1.9.3's says this:
assoc : arg_value tASSOC arg_value
{
/*%%%*/
$$ = list_append(NEW_LIST($1), $3);
/*%
$$ = dispatch2(assoc_new, $1, $3);
%*/
}
| tLABEL arg_value
{
/*%%%*/
$$ = list_append(NEW_LIST(NEW_LIT(ID2SYM($1))), $2);
/*%
$$ = dispatch2(assoc_new, $1, $2);
%*/
}
;
We have the fat-arrow syntax again (arg_value tASSOC arg_value) and the JavaScript style (tLABEL arg_value); AFAIK, tLABEL is also the source of the restrictions on what sorts of symbols (no :$set, no :'this.that', ...) can be used with the JavaScript-style syntax. The current trunk parse.y matches 1.9.3 for Hash literals.
So the Hash literal syntax is hard-wired into the parser and you're stuck with fat arrows if you want to support 1.8.
Ruby 1.8.7 does not support the new hash syntax.
If you desperately need hash syntax on the non-YARV c-based implementation of Ruby, there is a totally unsupported 1.8 head branch, so you can do
rvm install ruby-head --branch ruby_1_8 ; rvm ruby-head
ruby -v
ruby 1.8.8dev (2011-05-25) [i386-darwin10.7.0]
but upgrading to 1.9 is the way to go.

Why does "?b" mean 'b' in Ruby?

"foo"[0] = ?b # "boo"
I was looking at the above example and trying to figure out:
How "?b" implies the character 'b'?
Why is it necessary? - Couldn't I just write this:
"foo"[0] = 'b' # "boo"
Ed Swangren: ? returns the character code of a
character.
Not in Ruby 1.9. As of 1.9, ?a returns 'a'. See here: Son of 10 things to be aware of in Ruby 1.9!
telemachus ~ $ ~/bin/ruby -v
ruby 1.9.1p0 (2009-01-30 revision 21907) [i686-linux]
telemachus ~ $ ~/bin/ruby -e 'char = ?a; puts char'
a
telemachus ~ $ /usr/bin/ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
telemachus ~ $ /usr/bin/ruby -e 'char = ?a; puts char'
97
Edit: A very full description of changes in Ruby 1.9.
Another edit: note that you can now use 'a'.ord if you want the string to number conversion you get in 1.8 via ?a.
The change is related to Ruby 1.9's UTF-8 updates.
The Ruby 1.8 version of ? only worked with single-byte characters. In 1.9, they updated everything to work with multi-byte characters. The trouble is, it's not clear what integer should return from ?€.
They solved it by changing what it returns. In 1.9, all of the following are single-element strings and are equivalent:
?€
'€'
"€"
"\u20AC"
?\u20AC
They should have dropped the notation, IMO, rather than (somewhat randomly) changing the behavior. It's not even officially deprecated, though.
? returns the character code of a character. Here is a relevant post on this.
In some languages (Pascal, Python), chars don't exist: they're just length-1 strings.
In other languages (C, Lisp), chars exist and have distinct syntax, like 'x' or #\x.
Ruby has mostly been on the side of "chars don't exist", but at times has seemed to not be entirely sure of this choice. If you do want chars as a data type, Ruby already assigns meaning to '' and "", so ?x seems about as reasonable as any other option for char literals.
To me, it's simply a matter of saying what you mean. You could just as well say foo[0]=98, but you're using an integer when you really mean a character. Using a string when you mean a character looks equally strange to me: the set of operations they support is almost completely different. One is a sequence of the other. You wouldn't make Math.sqrt take a list of numbers, and just happen to only look at the first one. You wouldn't omit "integer" from a language just because you already support "list of integer".
(Actually, Lisp 1.0 did just that -- Church numerals for everything! -- but performance was abysmal, so this was one of the huge advances of Lisp 1.5 that made it usable as a real language, back in 1962.)

Resources