Ruby 1.9 unicode escapes in Regexp - ruby

I just upgraded an old project to Ruby 1.9.3. I'm having a bunch of trouble with unicode strings. It boils down to:
p = "\\username"; "Any String".match(/#{p}/)
That works in 1.8, and returns nil as expected. However, in 1.9 it throws:
ArgumentError: invalid Unicode escape
I'm trying to match '\u' in a string. I thought the two backslashes would keep it from being read as a Unicode escape.
What am I missing here?
Edit: Single quotes don't work either:
1.9.3p429 :002 > p = '\\username'; "Any String".match(/#{p}/)
ArgumentError: invalid Unicode escape
from (irb):2

When you write /#{p}/, the contents of p are compiled as a regular expression. The string literal "\\username" holds the characters \username, so the Regexp compilation fails, because \u followed by "sername" IS an invalid Unicode escape sequence:
>> Regexp.new "\\username"
RegexpError: invalid Unicode escape: /\username/
I.e. doing /#{p}/ is equal to writing /\username/.
Therefore you have to escape p so that any regular expression metacharacters in it are matched literally:
"Any String".match(/#{Regexp.escape(p)}/)
Or just:
"Any String".match(Regexp.escape(p))

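As a minimal sketch of the behaviour described above (the variable names are only illustrative, and pattern is used instead of p to avoid shadowing Kernel#p), this compares compiling the raw text with compiling the escaped text:

```ruby
# The pattern text from the question: a backslash followed by "username".
pattern = '\\username'

# Compiling it directly fails, because the regex engine reads \u as the
# start of a Unicode escape.
begin
  Regexp.new(pattern)
  compile_error = nil
rescue RegexpError => e
  compile_error = e
end

# Regexp.escape doubles the backslash so the text is matched literally.
escaped = Regexp.escape(pattern)
literal = /#{escaped}/

no_match = "Any String".match(literal)
hit = 'C:\username'.match(literal)

puts compile_error.class
puts escaped
p no_match
puts hit[0]
```

If all you need is a plain substring test, "Any String".include?(pattern) avoids regular expressions entirely.
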
Related

Ruby string prepend '\' character

Why does Ruby prepend a '\' character when I run the code below? It happens only with '#$'.
It happens with all Ruby versions.
puts '#$' => '\#$'
or
'#$' => '\#$'
or
'mypassord#$123' => 'mypassord\#$123'
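Please share your experience here. Is it a Ruby problem or something else?
No, it is not a Ruby problem. It is your problem. The backslash appears only in the inspected (double-quoted) representation of the string; the string itself is unchanged. Since #$foo can be interpreted as interpolation of the global variable $foo inside double quotes, it is necessary to escape the # character there. That is why there is a backslash.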
To be more precise, there is no possibility of interpolation with the string "#$" ($ is an invalid global variable) or "#$123" ($123 is an invalid global variable), but it makes the inspection algorithm or the interpolation algorithm complicated to check the sequence after #$, so I guess that is why # is escaped even in such cases.
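A small sketch to check this for yourself, using the password string from the question: the backslash shows up in the output of inspect but is not part of the string's contents.

```ruby
password = 'mypassord#$123'

puts password                  # the raw characters
puts password.inspect          # the escaped, double-quoted representation
puts password.length           # counts the real characters only
puts password.include?('\\')   # is there an actual backslash in the string?
```

In irb the value shown after => is the inspect form, which is why the backslash seems to appear there.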

Ruby + Imap: searching a subject with special characters doesn't work

When using Ruby + IMAP and trying to search a subject with special chars:
imap.uid_search(['SUBJECT', subject, 'NOT', 'SEEN'])
where subject is "Olá", it will fail with:
Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
from /Users/fernando/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/net/imap.rb:1266:in `==='
Specifying the second parameter of uid_search, which is the charset, also doesn't work.
Subjects without special characters work fine. Is there a way to make this work?
This replicates the problem (with the same regexp that net/imap uses):
# encoding: ascii-8bit
a = /[\x80-\xff\r\n]/n
a =~ "olá".force_encoding('utf-8') # incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string) (Encoding::CompatibilityError)
Two possibilities:
Add # encoding: ascii-8bit to the top of your script
Force the string's encoding over to ascii-8bit:
imap.uid_search(['SUBJECT', subject.force_encoding('ascii-8bit'), 'NOT', 'SEEN'])
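The second option can be tried without an IMAP server. This is only a sketch: it reuses the regexp quoted above, assumes a UTF-8 source file, and works on a copy of the string so the original keeps its UTF-8 tag.

```ruby
# Same character class as in the answer above; the /n flag makes the
# regexp binary (ASCII-8BIT).
binary_re = /[\x80-\xff\r\n]/n

subject = "Olá" # UTF-8 string with one non-ASCII character

# Matching the UTF-8 string directly raises the error from the question.
begin
  binary_re =~ subject
  error = nil
rescue Encoding::CompatibilityError => e
  error = e
end

# Retagging a copy as binary lets the byte-oriented regexp run.
binary_subject = subject.dup.force_encoding(Encoding::ASCII_8BIT)
index = binary_re =~ binary_subject

puts error.class
puts index
```

force_encoding only changes the encoding tag and leaves the bytes untouched, so the server still receives the same UTF-8 bytes.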

Regular expression "empty range in char class error"

I have a regex in my code that matches URL patterns, and it threw an error:
/^(http|https):\/\/([\w-]+\.)+[\w-]+([\w- .\/?%&=]*)?$/
The error was "empty range in char class error". I found that the cause is the ([\w- .\/?%&=]*)? part. Ruby seems to treat the - in \w- . as a range operator instead of a literal -. After escaping the dash, the problem was solved.
But the original regular expression ran well on my co-workers' machines. We use the same versions of OS X, Rails and Ruby: Ruby is 1.9.3p194, Rails is 3.1.6 and OS X is 10.7.5. And after we deployed the code to our Heroku server, everything worked fine too. Why did only my environment raise an error for this regex? How does Ruby interpret the hyphen here?
I can replicate this error on Ruby 1.9.3p194 (2012-04-20 revision 35410) [i686-linux], installed on Ubuntu 12.04.1 LTS using rvm 1.13.4. However, this should not be a version-specific error. In fact, I'm surprised it worked on the other machines at all.
A simpler demonstration that fails just as well:
"abcd" =~ /[\w- ]/
This is because [\w- ] is interpreted as "a range beginning with any word character up to space (or blank)", rather than a character class containing a word character, a hyphen, or a space, which is what you had intended.
Per Ruby's regular expression documentation:
Within a character class the hyphen (-) is a metacharacter denoting an inclusive range of characters. [abcd] is equivalent to [a-d]. A range can be followed by another range, so [abcdwxyz] is equivalent to [a-dw-z]. The order in which ranges or individual characters appear inside a character class is irrelevant.
As you saw, prepending a backslash escaped the hyphen, thus changing the nature of the regexp from a range to a character class, removing the error. However, escaping the hyphen in the middle of character class is not recommended, since it's easy to confuse the intended meaning of the hyphen in such cases. As m.buettner pointed out, always place hyphens either at the beginning or the end of a character class:
"abcd" =~ /[-\w ]/

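Applied to the pattern from the question, that advice might look like the sketch below. URL_PATTERN and the sample strings are hypothetical names and data; only the position of the hyphen inside each character class has changed.

```ruby
# Hyphen first in every character class, so it can only be read as a
# literal hyphen and never as a range operator.
URL_PATTERN = /^(http|https):\/\/([-\w]+\.)+[-\w]+([-\w .\/?%&=]*)?$/

good = "http://my-site.example.com/a-b?x=1"
bad  = "ftp://example.com"

puts URL_PATTERN.match?(good)
puts URL_PATTERN.match?(bad)
```
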
Ruby: hexadecimal in regular expressions

I need to match an md5 checksum in a regular expression in a Ruby (actually Rails) program. I found out somewhere that I can match hexadecimal strings with \h sequence, but I can't find the link anymore.
I'm using that sequence and my code works in Ruby 1.9.2. It works even under plain IRB (so it's not a Rails extension).
ruby-1.9.2-p180 :007 > "123abcdf" =~ /^\h+$/; $~
=> #<MatchData "123abcdf">
ruby-1.9.2-p180 :008 > "123abcdfg" =~ /^\h+$/; $~
=> nil
However, my IDE marks that expression as wrong and I can't find any reference that mentions that sequence.
Is the \h sequence legal in Ruby regexes under any environment/version, or should I trust my IDE and replace it with something like [abcdef\d]?
Yes, it is. Check the official doc for the complete documentation for regex in Ruby.
Note that \h will match uppercase letters too, so it's actually equivalent to [a-fA-F\d].
According to this \h is part of oniguruma, which I believe is standard in ruby 1.9.
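For the MD5 case specifically, a sketch could look like this. MD5_HEX and MD5_HEX_PORTABLE are hypothetical names, and the assumption is that a checksum is always written as exactly 32 hexadecimal digits.

```ruby
# Exactly 32 hexadecimal digits, anchored to the whole string.
MD5_HEX = /\A\h{32}\z/
# Equivalent spelling without \h, for tools that do not recognise it.
MD5_HEX_PORTABLE = /\A[0-9a-fA-F]{32}\z/

lower = "d41d8cd98f00b204e9800998ecf8427e" # MD5 of the empty string
upper = lower.upcase
wrong = "g" * 32

[lower, upper, wrong].each do |s|
  puts "#{s}: #{MD5_HEX.match?(s)} / #{MD5_HEX_PORTABLE.match?(s)}"
end
```

Both patterns should agree on every input, so switching to the explicit class is a safe fallback if the IDE warning is a problem.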

Ruby: how to check if an UTF-8 string contains only letters and numbers?

I have a UTF-8 string, which might be in any language.
How do I check that it does not contain any non-alphanumeric characters?
I could not find such method in UnicodeUtils Ruby gem.
Examples:
ėččę91 - valid
$120D - invalid
You can use the POSIX notation for alpha-numerics:
#!/usr/bin/env ruby -w
# encoding: UTF-8
puts RUBY_VERSION
valid = "ėččę91"
invalid = "$120D"
puts valid[/[[:alnum:]]+/]
puts invalid[/[^[:alnum:]]+/]
Which outputs:
1.9.2
ėččę91
$
In Ruby regexes, \p{L} matches any letter (in any script) and \p{N} matches any number,
so if s represents your string:
s.match /^[\p{L}\p{N}]+$/
This returns nil when the string contains anything other than letters and numbers.
The pattern for one alphanumeric code point is
/[\p{Alphabetic}\p{Number}]/
From there it’s easy to extrapolate something like this to test whether the string contains a non-alphanumeric character:
/[^\p{Alphabetic}\p{Number}]/
or this to test whether every character is alphanumeric:
/^[\p{Alphabetic}\p{Number}]+$/
or sometimes this, depending on whether ^ and $ should be allowed to match at line breaks inside the string:
/\A[\p{Alphabetic}\p{Number}]+\z/
Pick the one that best suits your needs.
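To make the difference between the last two patterns concrete, here is a sketch using the examples from the question. alnum_only? is a hypothetical helper name, and the code assumes a UTF-8 source file.

```ruby
# True only if every character in the whole string is a letter or a number.
def alnum_only?(str)
  str.match?(/\A[\p{Alphabetic}\p{Number}]+\z/)
end

valid = "ėččę91"
invalid = "$120D"
multiline = "abc\n$"

puts alnum_only?(valid)
puts alnum_only?(invalid)
puts alnum_only?(multiline)

# With ^ and $ the first line alone satisfies the pattern, so the
# multi-line string slips through.
puts multiline.match?(/^[\p{Alphabetic}\p{Number}]+$/)
```

For validating a whole string, the \A and \z anchors are usually the safer choice for that reason.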

Resources