Regex slightly different in Ruby 2? - ruby

I just ported a small gem from Ruby 1.9.3 to the spiffy new Ruby 2.0.0. The only change I had to make was in a regular expression.
Under 1.9.3, the following regex would match any string containing characters other than digits, number-related punctuation, and whitespace (including non-breaking space).
/[^[[:space:]]\d\-,\.]/
Under 2.0.0, I had to move the Posix space class away from the start of the negation class.
/[^\d\-,\.[[:space:]]]/
I haven't found this change mentioned in the patch notes I've reviewed. Is it documented anywhere?

The regular expression engine has been changed to Onigmo (based on Oniguruma) and this might be causing issues.
As far as I can tell, you're declaring the regular expression incorrectly. The second set of brackets is not required:
/[^[:space:]\d\-,\.]/
The [:space:] declaration is only valid inside a set, so it appears as [[:space:]] when used on its own. In your case the set already has several other members, so the extra pair of brackets is unnecessary.
I'm not sure why \s would not have sufficed in this case.
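A minimal sketch of the flat form, assuming the goal is to reject strings that contain anything besides digits, number punctuation, and whitespace. The constant and method names are made up for illustration and do not come from the gem in the question. One possible reason \s alone would not suffice: it covers only ASCII whitespace, so it misses the non-breaking space (U+00A0).

```ruby
# Hypothetical names; the pattern is the flat form suggested above.
NON_NUMERIC = /[^[:space:]\d\-,.]/

# True when the text holds only digits, '-', ',', '.', and whitespace.
def numeric_only?(text)
  text !~ NON_NUMERIC
end

numeric_only?("1,234.56")    # plain ASCII number
numeric_only?("1\u00A0234")  # the non-breaking space is covered by [:space:]
numeric_only?("12 apples")   # letters make the check fail
```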

Related

Weird thing in regex

While practicing on rubular.com, I was trying to write a regular expression that checks whether a word starts with a non-consonant. My approach is to cover the cases where the word begins with a non-letter, starts with a number or underscore, or is the empty string.
I found some strange behaviour:
My regex /^[aeiou_0-9\W]|^$/i matches the consonants k and s! I don't understand why.
Any ideas?
A link to example -> http://rubular.com/r/0zt0VPmcwr
This is very funny because you have stumbled across a bug specifically for just the letters k and s when using \W with /i (it's like a perfect storm).
Here is the link that explains the bug: https://bugs.ruby-lang.org/issues/4044
Perhaps this was patched in a later version of ruby, but if you don't feel like going through the hassle of going to a new version of ruby, then you can just explicitly make an inverted character class of all the consonants:
/^[^bcdfghjklmnpqrstvwxyz]|^$/i
Here is the rubular link: http://rubular.com/r/URgsWP3suQ
Edit:
Something else I noticed: your regex (and the one I provided above) matches only the first letter of each word, whereas the regex below matches the whole word. I don't know if this makes a difference for you, but I felt it was worth pointing out. Compare the highlighting in the rubular link above with the one below: the link above highlights only the first letter of each word, while the link below highlights whole words:
^[^bcdfghjklmnpqrstvwxyz].*|^$
http://rubular.com/r/IVJ03uOK4h
It is a bug in Ruby regex in some versions. Select version 1.8.7 in the dropdown and you will see your regex works properly.
Edit. Check the docs at http://ruby-doc.org/core-2.1.5/Regexp.html. More specifically, in the metacharacters section:
/\W/ - A non-word character ([^a-zA-Z0-9_]). Please take a look at Bug #4044 if using /\W/ with the /i modifier.
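The workaround can be sketched as follows, assuming words are checked one at a time. The constant and method names are hypothetical; the pattern is the explicit consonant list from the answer, which avoids combining \W with /i.

```ruby
# Explicit negated consonant list instead of \W together with /i.
NON_CONSONANT_START = /^[^bcdfghjklmnpqrstvwxyz]|^$/i

# True when the word is empty or its first character is not a consonant.
def starts_with_non_consonant?(word)
  !(word =~ NON_CONSONANT_START).nil?
end

starts_with_non_consonant?("apple")  # vowel
starts_with_non_consonant?("_id")    # underscore
starts_with_non_consonant?("")       # empty string
starts_with_non_consonant?("kite")   # consonant that tripped the \W version
```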

Nested POSIX regular expression character class in Ruby?

How do I nest a POSIX-style character class inside another character class?
I'm trying to replace the matching of space or dash:
/[\s-]/
with
/[[[:space:]]-]/
And that isn't working. I'm using Ruby 1.9.3 and the official doc has no examples of nesting. I need the POSIX style because I'm working with UTF-8 and my examples are dumbed down from the actual expressions.
Thanks for any help!
Your third set of [] is not needed.
The [:space:] declaration is only valid inside of a set so you will see it appear as [[:space:]] if it is used by itself. In this case, you have more characters so the following will work.
[[:space:]-]
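For illustration, here is a small sketch that uses this class to split on spaces or dashes, including the non-breaking space that \s would miss. The constant and method names are hypothetical.

```ruby
# POSIX bracket and a literal dash in one character class.
SPACE_OR_DASH = /[[:space:]-]+/

# Split text on any run of Unicode whitespace and/or dashes.
def split_on_space_or_dash(text)
  text.split(SPACE_OR_DASH)
end

split_on_space_or_dash("foo\u00A0bar-baz qux")
```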

How to match characters from all languages, except the special characters in ruby

I have a display name field which I have to validate using Ruby regex. We have to match all language characters like French, Arabic, Chinese, German, Spanish in addition to English language characters except special characters like *()!##$%^&.... I am stuck on how to match those non-Latin characters.
There are two possibilities:
Create a regex with a negated character class containing every symbol you don't want to match:
if name =~ /\A[^*!#%\^]+\z/ # list every unwanted symbol in the class; a match means the name contains none of them
This solution may not be feasible, since there is a massive amount of symbols you'd have to insert, even if you were just to include the most common ones.
Use Oniguruma (see also: Oniguruma for Ruby main). It supports Unicode and its properties, in which case all letters can be matched using:
if name =~ /[\p{L}\p{M}]/
You can see what these are all about here: Unicode Regular Expressions
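Starting from Ruby 1.9, the String and Regexp classes are Unicode-aware. Be careful with the word-character shorthand \w, though: in current Ruby versions it matches only ASCII word characters, so use the POSIX bracket [[:word:]] for text in other scripts:
"可口可樂!?!".gsub(/[[:word:]]/, 'Ha')
#=> "HaHaHaHa!?!"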
In Ruby 1.9.1 and later (maybe earlier), one can use \p{L} to match letters in all languages (without the Oniguruma gem described in a previous answer).
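A minimal sketch of a display-name check built on \p{L} and \p{M}, assuming a valid name consists only of letters, combining marks, and spaces. The constant and method names are hypothetical, and allowing spaces is an assumption rather than something stated in the question.

```ruby
# Letters (\p{L}), combining marks (\p{M}) and spaces only.
# The /u flag pins the pattern to UTF-8.
DISPLAY_NAME = /\A[\p{L}\p{M} ]+\z/u

# True when every character of the name is a letter, mark, or space.
def valid_display_name?(name)
  !(name =~ DISPLAY_NAME).nil?
end

valid_display_name?("Jos\u00E9 Mart\u00EDn")     # accented Latin letters
valid_display_name?("\u53EF\u53E3\u53EF\u6A02")  # Chinese characters
valid_display_name?("Bob!")                      # punctuation is rejected
```

The \A and \z anchors matter here: without them the pattern would accept any string that merely contains one letter.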

Convert non-breaking spaces to spaces in Ruby

I have cases where user-entered data from an html textarea or input is sometimes sent with \u00a0 (non-breaking spaces) instead of spaces when encoded as utf-8 json.
I believe that to be a bug in Firefox, as I know that the user isn't intentionally putting in non-breaking spaces instead of spaces.
There are also two bugs in Ruby, one of which can be used to combat the other.
For whatever reason \s doesn't match \u00a0.
However, [^[:print:]] (which definitely should not match) and \xC2\xA0 both will match, but I consider those to be less-than-ideal ways to deal with the issue.
Are there other recommendations for getting around this issue?
Use /\u00a0/ to match non-breaking spaces. For instance s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces.
Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
See also: Ruby Regexp documentation
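A short sketch of both suggestions. The method names are hypothetical, chosen only for this example.

```ruby
# Replace only non-breaking spaces (U+00A0) with ordinary spaces.
def nbsp_to_space(text)
  text.gsub(/\u00A0/, ' ')
end

# Collapse every run of Unicode whitespace into one space and trim the ends.
def collapse_whitespace(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

nbsp_to_space("a\u00A0b")
collapse_whitespace("\u00A0 a\u00A0\tb \u00A0")
```

The first helper leaves tabs and newlines alone, which may be preferable for textarea input where line breaks are meaningful.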
If you cannot use \s for Unicode whitespace, that’s a bug in the Ruby regex implementation, because according to UTS#18 “Unicode Regular Expressions” Annex C on Compatibility Properties, \s is absolutely required to match any Unicode whitespace code point.
There is no wiggle-room allowed since the two columns detailing the Standard Recommendation and the POSIX Compatibility are the same for the \s case. You cannot document your way around this: you are out of compliance with The Unicode Standard, in particular, with UTS#18’s RL1.2a, if you do not do this.
If you do not meet RL1.2a, you do not meet the Level 1 requirements, which are the most basic and elementary functionality needed to use regular expressions on Unicode. Without that, you are pretty much lost. This is why standards exist. My recollection is that Ruby also fails to meet several other Level 1 requirements. You may therefore wish to use a programming language that meets at least Level 1 if you actually need to handle Unicode with regular expressions.
Note that you cannot use a Unicode General Category property like \p{Zs} to stand for \p{Whitespace}. That’s because the Whitespace property is a derived property, not a general category. There are also control characters included in it, not just separators.
Actual functioning IRB code examples that answer the question, with latest Rubies (May 2012)
Ruby 1.9
require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]"
doc = '<html><body> &nbsp; </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text
s.each_codepoint {|c| print c, ' ' } #=> 32 160 32
s.strip.each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\s+/,'').each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\u00A0/,'').strip.empty? #=> true
Ruby 1.8
require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.8.7 (2012-02-08 patchlevel 358) [x86_64-linux]"
doc = '<html><body> &nbsp; </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text # " \302\240 "
s.gsub(/\s+/,'') # "\302\240"
s.gsub(/\302\240/,'').strip.empty? #=> true
For whatever reason \s doesn't match \u00a0.
I think the "whatever reason" is that it is not supposed to. Only the POSIX and \p construct character classes are Unicode aware. The character-class abbreviations are not:
Sequence  As [...]        Meaning
\d        [0-9]           ASCII decimal digit character
\D        [^0-9]          Any character except a digit
\h        [0-9a-fA-F]     Hexadecimal digit character
\H        [^0-9a-fA-F]    Any character except a hex digit
\s        [ \t\r\n\f]     ASCII whitespace character
\S        [^ \t\r\n\f]    Any character except whitespace
\w        [A-Za-z0-9_]    ASCII word character
\W        [^A-Za-z0-9_]   Any character except a word character
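The difference between the abbreviations and the POSIX brackets can be checked with a few non-ASCII characters. This is only a sketch; matches? is a hypothetical helper defined here for the comparison.

```ruby
# Hypothetical helper: true when the pattern matches anywhere in the text.
def matches?(text, pattern)
  !(text =~ pattern).nil?
end

persian_two = "\u06F2"  # EXTENDED ARABIC-INDIC DIGIT TWO
e_acute     = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

matches?(persian_two, /\d/)           # abbreviation: ASCII digits only
matches?(persian_two, /[[:digit:]]/)  # POSIX bracket: Unicode-aware
matches?(e_acute, /\w/)               # abbreviation: ASCII word characters only
matches?(e_acute, /[[:alpha:]]/)      # POSIX bracket: Unicode-aware
```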
For the old versions of ruby (1.8.x), the fixes are the ones described in the question.
This is fixed in the newer versions of ruby 1.9+.
While not related to Ruby (and not directly to this question), the core of the problem might be that Alt+Space on Macs produces a non-breaking space.
This can cause all kinds of weird behaviour (especially in the terminal).
For those who are interested in more details, I wrote "Why chaining commands with pipes in Mac OS X does not always work" about this topic some time ago.

Converting regex statement for sentence extraction to Ruby

I found this regex statement at http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation for sentence boundary disambiguation, but I am not able to use it in a Ruby split statement. I'm not too good with regex, so maybe I am missing something? This is the statement:
((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
and this is what I tried in Ruby, but no go:
text.split("((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])")
This should work in Ruby 1.9, or in Ruby 1.8 if you compiled it with the Oniguruma regex engine (which is standard in Ruby 1.9):
result = text.split(/(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/)
The difference is that your code passes a literal string to split(), while this code passes a literal regex.
It won't work using the classic Ruby regex engine (which is standard in Ruby 1.8) because it doesn't support lookbehind.
I also modified the regular expression. I replaced (\s|\r\n) with \s+. My regex also splits sentences that have multiple spaces between them (typing two spaces after a sentence is common in many cultures) and/or multiple line breaks between them (delimiting paragraphs).
When working with Unicode text, a further improvement would be to replace a-z with \p{Ll}\p{Lo}, A-Z with \p{Lu}\p{Lt}\p{Lo}, and 0-9 with \p{N} in the various character classes in your regex. The character class with punctuation symbols can be expanded similarly. That'll need a bit more research because there's no Unicode property for end-of-sentence punctuation.
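A runnable sketch of the ASCII version, assuming Ruby 1.9 or later for lookbehind support. The constant and method names are hypothetical. The pattern is the one from the answer, written with a non-capturing group (?:...), because String#split adds the text of any capturing group to its result, which here would insert empty strings between the sentences.

```ruby
# Split point: whitespace preceded by sentence-ending punctuation (optionally
# followed by a closing quote) and followed by an uppercase letter.
SENTENCE_BOUNDARY = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

# Return the sentences of the text as an array of strings.
def split_sentences(text)
  text.split(SENTENCE_BOUNDARY)
end

split_sentences('It works.  Does it? Yes! ok.')
split_sentences('He said "Go." Then he left.')
```

No split happens before "ok." in the first call, since the lookahead requires an uppercase letter after the whitespace.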
