Scanning for Unicode Numbers in a string with \d - ruby

According to the Oniguruma documentation, the \d character type matches:
decimal digit char
Unicode: General_Category -- Decimal_Number
However, scanning for \d in a string with all the Decimal_Number characters results in only latin 0-9 digits being matched:
#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')
puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…
p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
Am I misreading the documentation? Why doesn't \d match other Unicode numerals, and/or is there a way to make it do so?

Noted by Brian Candler on ruby-talk:
\w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
\d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.
The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:
\w word character
Not Unicode: alphanumeric, "_" and multibyte char.
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.
This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.
p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)
Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:
[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."
Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc

Try the Unicode property class \p{N} instead. It matches every Unicode number character (use \p{Nd} if you only want decimal digits). No idea why \d isn't working.

\d matches only ASCII digits by default. You can turn on Unicode matching in a regex manually with the (counter-intuitive) (?u) syntax:
"𝟛".match(/(?u)\d/) # => #<MatchData "𝟛">
Alternatively, you can use "posix" or "unicode property" style in your regex, which don't require you to manually turn on Unicode matching:
/[[:digit:]]/ # posix style
/\p{Nd}/ # unicode property/category style
You can find more detailed information about how to do advanced matching for Unicode characters in Ruby in this blog post:
https://idiosyncratic-ruby.com/30-regex-with-class.html
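To see the options from the answers above side by side, here is a minimal sketch. The sample string and variable names are made up for illustration; the digits are written as \u escapes so the result does not depend on the file's encoding.

```ruby
# Sample: ASCII "1" and "2", ARABIC-INDIC DIGIT THREE (U+0663),
# DEVANAGARI DIGIT FOUR (U+096A) and ROMAN NUMERAL EIGHT (U+2167, category Nl).
sample = "12\u0663\u096A\u2167"

ascii_only   = sample.scan(/\d/)           # plain \d: ASCII digits only
unicode_d    = sample.scan(/(?u)\d/)       # \d with Unicode matching switched on
posix_digit  = sample.scan(/[[:digit:]]/)  # POSIX bracket style
decimal_prop = sample.scan(/\p{Nd}/)       # Decimal_Number property
any_number   = sample.scan(/\p{N}/)        # any Number category, including Nl

puts ascii_only.length    # 2
puts unicode_d.length     # 4
puts posix_digit.length   # 4
puts decimal_prop.length  # 4
puts any_number.length    # 5
```

Note that \p{N} is broader than the other three Unicode-aware forms: it also picks up the Roman numeral, which is a number but not a decimal digit.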

Related

Difference in Unicode behaviour between `\w` vs `[[:alpha:]]` in Ruby

(For this question, ignore the number and underscore matching of \w, which is irrelevant to the discussion here.)
According to the Oniguruma docs, both the shorthand character classes like \w and POSIX classes like [:alpha:] behave similarly with regard to Unicode: they have simple ASCII behaviour for "Not Unicode Case" (I assume that means the string's encoding is not a Unicode one), and a different behaviour that uses Unicode properties for "Unicode Case".
From that documentation, it sounds as though whenever one of them uses Unicode properties, the other will too. However, in practice they differ: the POSIX classes use Unicode properties automatically, whereas the \w-type classes have to be explicitly marked with ?u to use Unicode-property-based matching:
$ ruby -e 'print("~café.".encoding)'
UTF-8
$ ruby -e 'print(/[[:alpha:]]+/.match("~café."))'
café
$ ruby -e 'print(/\w+/.match("~café."))'
caf
$ ruby -e 'print(/(?u)\w+/.match("~café."))'
café
$ ruby -v
ruby 2.3.6p384
Is this a bug, or is my interpretation of the docs wrong? (And what exactly does ?u do, could someone link to where it is documented?)
Since version 2.0, Ruby uses Onigmo, an Oniguruma fork that supports more features implemented in Perl 5.10.
If you compare the doc you linked (Oniguruma) with Onigmo's doc you can see a difference between the \w descriptions:
Oniguruma:
\w word character
Not Unicode:
alphanumeric, "_" and multibyte char.
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
Onigmo:
\w word character
Not Unicode:
alphanumeric and "_".
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
It depends on ONIG_OPTION_ASCII_RANGE option that non-ASCII char includes or not.
As you can see, the "and multibyte char." wording is gone; it doesn't make sense (at least to me) and was probably a typo. Either way, it's very unclear.
The u modifier switches the shorthand character classes from "Not Unicode" (default) to "Unicode".
That's why you obtain only caf without it and café with it when you try to match it using the character class \w.
On the other hand, the character class [[:alpha:]] seems to be extended to Unicode characters by default, since it matches "café" without the u modifier. The beginning of an explanation can be found in the doc:
It depends on ONIG_OPTION_ASCII_RANGE option and
ONIG_OPTION_POSIX_BRACKET_ALL_RANGE option that POSIX brackets
match non-ASCII char or not.
But you can force it to ASCII using the (?a) modifier.
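The four combinations discussed above can be checked in one go. This is only a sketch with illustrative variable names; the input is the question's "~café." written with a \u escape.

```ruby
word = "~caf\u00E9."   # "~café."

default_w   = word[/\w+/]               # \w is ASCII-only by default
unicode_w   = word[/(?u)\w+/]           # (?u) switches \w to Unicode
posix_alpha = word[/[[:alpha:]]+/]      # POSIX bracket is Unicode-aware by default
ascii_alpha = word[/(?a)[[:alpha:]]+/]  # (?a) restricts the POSIX bracket to ASCII

puts default_w    # caf
puts unicode_w    # café
puts posix_alpha  # café
puts ascii_alpha  # caf
```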

Ruby regex ‘backslash R’ aka ‘\R’ pattern

I am pretty sure I have seen “\R was introduced in Ruby 2 to match newlines regardless of where they came from: unix \n, macos \r or windows \r\n” somewhere. If so, Ruby 2 should treat \R like %r{\r\n|\r|\n}.
This works fine:
▶ "a\nb".match /\R/
#⇒ #<MatchData "\n">
▶ "a\rb".match /\R/
#⇒ #<MatchData "\r">
▶ "a\r\nb".match /\R/
#⇒ #<MatchData "\r\n">
even when line endings are combined:
▶ "a\r\n\nb".match /\R{2}/
#⇒ #<MatchData "\r\n\n">
unless one tries to negate \R:
▶ "a\nb".match /[^\R]+/
#⇒ #<MatchData "a\nb">
Negating \n works fine though:
▶ "a\nb".match /[^\n]+/
#⇒ #<MatchData "a">
Unfortunately, \R is enormously hard to google. Neither Regexp rdoc nor Regular Expressions have a mention of it.
Would any regex guru drop an explanation here, so that it was at least easily googled?
Thanks in advance.
This is from the author: https://github.com/k-takata/Onigmo/blob/master/doc/RE#L101. It says
\R Linebreak
Unicode:
(?>\x0D\x0A|[\x0A-\x0D\x{85}\x{2028}\x{2029}])
Not Unicode:
(?>\x0D\x0A|[\x0A-\x0D])
What seems relevant to your question is that \R is not a character class but a list of alternatives. Since the sequence it matches is not necessarily a single character, I guess it could not be made into a character class. That probably interacts in a peculiar way with negation, which is intended to be used only with characters and character classes.
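Since \R cannot be negated inside brackets, two possible workarounds are to list the line-break characters explicitly, or to wrap \R in a negative lookahead. The following is a sketch only; the sample text and variable names are invented, and the explicit class covers just \r and \n rather than everything \R matches.

```ruby
text = "a\r\nb\nc\rd"

# Explicit negated class: a run of characters that are neither \r nor \n.
explicit = text[/[^\r\n]+/]

# Negative lookahead: consume one character at a time, as long as no
# line break (as defined by \R) starts at the current position.
lookahead = text[/(?:(?!\R).)+/m]

# Used outside brackets, \R treats "\r\n" as a single break.
lines = text.split(/\R/)

puts explicit         # a
puts lookahead        # a
puts lines.inspect    # ["a", "b", "c", "d"]
```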

Ruby 1.9 unicode escapes in Regexp

I just upgraded an old project to Ruby 1.9.3. I'm having a bunch of trouble with unicode strings. It boils down to:
p = "\\username"; "Any String".match(/#{p}/)
That works in 1.8, and returns nil as expected. However, in 1.9 it throws:
ArgumentError: invalid Unicode escape
I'm trying to match '\u' in a string. I thought the two backslashes would keep it from being read as a Unicode escape.
What am I missing here?
Edit: Single quotes don't work either:
1.9.3p429 :002 > p = '\\username'; "Any String".match(/#{p}/)
ArgumentError: invalid Unicode escape
from (irb):2
When you write /#{p}/, the contents of p are interpreted as a regular expression. Since p is equal to \username, compiling the Regexp fails (since it IS an invalid Unicode escape sequence):
>> Regexp.new "\\username"
RegexpError: invalid Unicode escape: /\username/
I.e. doing /#{p}/ is equal to writing /\username/.
Therefore you have to escape p so that its contents are matched literally rather than interpreted as regex syntax:
"Any String".match(/#{Regexp.escape(p)}/)
Or just:
"Any String".match(Regexp.escape(p))

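As a sketch of the difference escaping makes (the variable names and the second test string are made up for illustration):

```ruby
pattern_text = "\\username"   # a backslash followed by "username"

# Compiling the raw text fails, because \u starts a Unicode escape.
error = begin
  Regexp.new(pattern_text)
  nil
rescue RegexpError => e
  e
end

# Regexp.escape doubles the backslash, so the text is matched literally.
escaped = Regexp.escape(pattern_text)
literal = Regexp.new(escaped)

no_match = "Any String".match(literal)        # nil, as in Ruby 1.8
found    = 'C:\username\docs'.match(literal)  # matches the literal text

puts error.class       # RegexpError
puts no_match.inspect  # nil
puts found[0]          # \username
```
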
Ruby regex not ignoring whitespace

I'm matching S01E01 in Ruby
with
/S\d+?E\d+?/ix
which works; however, S01 E01 does not match. I thought /x should ignore whitespace?
/x ignores whitespace inside your regex, not in the text you're matching.
You're looking for
/S\d+?\s*E\d+?/i
The x option ignores whitespace in the regex itself, which lets you format a regex for readability without changing its meaning. You could write:
irb(main):008:0> r = /
irb(main):009:0/ S
irb(main):010:0/ \d+?
irb(main):011:0/ E
irb(main):012:0/ \d+?
irb(main):013:0/ /ix
to get a regex with the same meaning as your example.
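For a version that also accepts the space, a sketch could look like this. It swaps the question's lazy \d+? for greedy \d+ (so the full episode number is captured) and adds capture groups; both changes are assumptions made for the example, not part of the original pattern.

```ruby
episode = /
  S (\d+)   # season number
  \s*       # optional whitespace in the text being matched
  E (\d+)   # episode number
/ix

compact = episode.match("S01E01")
spaced  = episode.match("s01 e01")

puts compact[0]  # S01E01
puts spaced[0]   # s01 e01
puts spaced[2]   # 01
```

The spaces and comments inside the literal are ignored because of /x; the space in "s01 e01" is matched only because of the explicit \s*.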

Ruby: how to check if an UTF-8 string contains only letters and numbers?

I have a UTF-8 string, which might be in any language.
How do I check that it does not contain any non-alphanumeric characters?
I could not find such a method in the UnicodeUtils Ruby gem.
Examples:
ėččę91 - valid
$120D - invalid
You can use the POSIX notation for alpha-numerics:
#!/usr/bin/env ruby -w
# encoding: UTF-8
puts RUBY_VERSION
valid = "ėččę91"
invalid = "$120D"
puts valid[/[[:alnum:]]+/]
puts invalid[/[^[:alnum:]]+/]
Which outputs:
1.9.2
ėččę91
$
In a Ruby regex, \p{L} means any letter (in any script) and \p{N} means any number,
so if s represents your string:
s.match /^[\p{L}\p{N}]+$/
This matches only when s (assuming it is a single line) consists entirely of letters and numbers.
The pattern for one alphanumeric code point is
/[\p{Alphabetic}\p{Number}]/
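From there it's easy to extrapolate something like this to test whether the string has any non-alphanumeric character:
/[^\p{Alphabetic}\p{Number}]/
or this to test that every character on a line is alphanumeric:
/^[\p{Alphabetic}\p{Number}]+$/
or this, when the whole string (not just one line) must be alphanumeric: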
/\A[\p{Alphabetic}\p{Number}]+\z/
Pick the one that best suits your needs.
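Wrapped up as a predicate, the check might look like the sketch below. The method name alphanumeric? is hypothetical, the pattern uses the \p{L}\p{N} form from the earlier answer, and String#match? assumes Ruby 2.4 or later. The last two lines show why \A and \z are safer than ^ and $ when the input may contain newlines.

```ruby
# Hypothetical helper: true when every character is a letter or a number.
def alphanumeric?(text)
  text.match?(/\A[\p{L}\p{N}]+\z/)
end

valid   = "\u0117\u010D\u010D\u011991"  # "ėččę91", written with escapes
invalid = "$120D"
multi   = "abc\n$120D"                  # first line is fine, second is not

puts alphanumeric?(valid)     # true
puts alphanumeric?(invalid)   # false

line_anchored   = multi.match?(/^[\p{L}\p{N}]+$/)  # ^ and $ work per line
string_anchored = alphanumeric?(multi)             # \A and \z span the whole string

puts line_anchored     # true
puts string_anchored   # false
```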

Resources