\w in Ruby Regular Expression matches Chinese characters - ruby

I use the code below:
puts "matched" if "中国" =~ /\w+/
It prints "matched", which surprised me: "中国" is two Chinese characters and contains none of 0-9, a-z, A-Z or _, so why does it output "matched"?
Could somebody give me some clues?

I'm not sure of the exact flavor of regex that Ruby uses, but this isn't just a Ruby aberration, as .NET works this way as well. MSDN says this about it:
\w
Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
So it's not the case that \w necessarily just means [a-zA-Z_0-9]: it (and other operators) behaves differently on Unicode strings than on ASCII ones.
This still makes it different from . though, as \w wouldn't match punctuation characters (sort of; see the \p{Lo} list below), spaces, newlines and various other non-word symbols.
As for what exactly \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc} does match, you can see on a Unicode reference list:
\p{Ll} Lowercase Unicode letter
\p{Lu} Uppercase Unicode letter
\p{Lt} Titlecase Unicode letter
\p{Lo} Other Unicode letter
\p{Nd} Decimal number
\p{Pc} Connector punctuation

Oniguruma, which is the regex engine in Ruby 1.9+, defines \w as:
[\w] word character
Not Unicode:
* alphanumeric, "_" and multibyte char.
Unicode:
* General_Category -- (Letter|Mark|Number|Connector_Punctuation)
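In 1.9+, Ruby strings carry their encoding, so the regex engine knows when it is matching Unicode text. Note, however, that current Ruby versions document \w itself as ASCII-only ([a-zA-Z0-9_]); the Unicode-aware equivalents are [[:word:]], \p{Word} and, for letters only, \p{L}.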
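A quick way to see how a particular Ruby build treats these classes is to run a few matches against the string from the question. This is only a minimal sketch: the helper name word_class_report is made up for illustration, and the result for \w depends on the Ruby version, while [[:word:]] and \p{L} follow the Unicode categories listed above.

```ruby
# Compare \w with Unicode-aware alternatives on a given string.
# `word_class_report` is a hypothetical helper, not part of Ruby.
def word_class_report(str)
  {
    '\w'         => !!(str =~ /\w+/),
    '[[:word:]]' => !!(str =~ /[[:word:]]+/),
    '\p{L}'      => !!(str =~ /\p{L}+/),
    '\p{Lo}'     => !!(str =~ /\p{Lo}+/)
  }
end

word_class_report("中国").each do |pattern, matched|
  puts "#{pattern}: #{matched}"
end
```

Chinese characters fall into the "other letter" category \p{Lo}, which is why any engine that defines \w through Unicode categories treats them as word characters.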

Related

Regular expression for letters, spaces and hyphens

Looking for the regex to allow letters (either case), spaces and dashes for validation in ruby. Can't quite crack it.
As a starting point I'm using:
validates :name, format: { with: /\A[a-zA-Z]+(?: [a-zA-Z]+)?\z/, allow_blank: true}
Many thanks!
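If you need to support all Unicode letters, and to make sure that hyphens and spaces only appear between letters with no consecutive spaces or hyphens (while allowing any number of them overall), use one of the following patterns. They differ only in the separator allowed: a hyphen or plain space, a hyphen or any \s whitespace, or a hyphen, a Unicode space separator (\p{Zs}) or a tab: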
/\A\p{L}+(?:[- ]\p{L}+)*\z/
/\A\p{L}+(?:[-\s]\p{L}+)*\z/
/\A\p{L}+(?:[-\p{Zs}\t]\p{L}+)*\z/
In short,
\A - matches start of string
\p{L}+ - one or more letters
(?:[-\s]\p{L}+)* - a non-capturing group that matches zero or more occurrences of
[-\s] - a - or whitespace
\p{L}+ - one or more Unicode letters
\z - the end of string.
In the comments, you mention that /\A[-A-Z\s]+\z/i works for you, but it also matches whitespace-only strings, or strings that are a mix of hyphens and whitespace, because it means "start of string, one or more ASCII letters, whitespace characters or hyphens, and then the end of string". This can be used to restrict input to specific characters, but it does not validate much.
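The difference between the two patterns is easy to see by running both against a few inputs. The sketch below is illustrative only: the constant names and the sample strings are made up, and it uses the \s variant of the strict pattern.

```ruby
# Strict pattern from the answer: runs of letters separated by a single
# hyphen or whitespace character.
STRICT_NAME = /\A\p{L}+(?:[-\s]\p{L}+)*\z/
# Permissive pattern from the comments: any mix of ASCII letters,
# hyphens and whitespace.
LOOSE_NAME = /\A[-A-Z\s]+\z/i

# Sample inputs, invented for illustration.
samples = ["Anne-Marie", "Čeh Žan", "Mary  Ann", " - ", "-Bob"]

samples.each do |name|
  puts format("%-14s strict=%-5s loose=%s",
              name.inspect, STRICT_NAME.match?(name), LOOSE_NAME.match?(name))
end
```

The strict pattern accepts the first two names only, while the permissive one rejects the non-ASCII name but accepts " - " and "-Bob".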
This regex will allow letters, spaces and hyphens: /\A[A-Za-z\s\-]+\z/

Regex for capital letters not matching accented characters

I am new to ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it fails to match headings that contain characters such as Č, Š, Ž (Slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How can I match UTF-8 characters?
You are right: when you define the ASCII range A-Z, the match is made literally only for those characters. This has to do with the history of characters on computers: more and more characters have been added over time, and they are not always laid out in an encoding in ways that are easy to use.
You could make a larger character class that matches the Slovenian characters you need by listing them.
But there is a shortcut. The Unicode data already records which characters are uppercase, so you can write a shorter match for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further. For instance, it would not match the heading "I AM A HEADING", because the pattern insists that each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
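The three patterns discussed so far can be compared line by line. This is a rough sketch under a few assumptions: the constant names, the headings helper and the sample text (including the Slovenian heading) are invented for illustration, and each line is tested on its own so that \s cannot run across a line break.

```ruby
ASCII_HEADING   = /^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$/
UNICODE_HEADING = /^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$/
SIMPLE_HEADING  = /^[[:upper:]\s]+$/

# Return the lines of `text` that match `pattern` (hypothetical helper).
def headings(text, pattern)
  text.lines.map(&:chomp).select { |line| line.match?(pattern) }
end

text = <<~TEXT
  HEADING
  Some text which is always non capitalized.
  ČRNA ŽOGA
  I AM A HEADING
TEXT

p headings(text, ASCII_HEADING)   # only the pure ASCII heading
p headings(text, UNICODE_HEADING) # also the heading with Č and Ž
p headings(text, SIMPLE_HEADING)  # also the heading with one-letter words
```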
You can use the Unicode uppercase letter property:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})\b

Ruby regexp handling of nbsp

In ruby 1.9.3 the regex engine doesn't treat nbsp's (\u00A0) as a space (\s). This is often a bummer for me.
So my question is, will this change in 2.0? If not, is there any way to monkey patch a solution?
Use Unicode properties (you need to declare a matching source code encoding for this to work):
# encoding: utf-8
if subject =~ /\p{Z}/
  # subject contains a Unicode space or other separator
end
or use POSIX character classes:
if subject =~ /[[:space:]]/
According to the docs, \s will only match [ \t\r\n\f] now and in the future.
In Ruby, I recommend using the Unicode character class of "Space separators" \p{Zs}:
/\p{Zs}/u =~ "\xC2\xA0"
/\p{Zs}/u =~ "\u00A0"
/\p{Zs}/u =~ HTMLEntities.new.decode('&nbsp;')
See the Ruby-documentation for more Unicode character properties.
Note: Make sure that your input string is valid UTF-8. There are non-breaking spaces in other encodings too, e.g. "\xA0" in ISO-8859-1 (Latin1). More info on the "non-breaking space".
FYI: In most RegExp flavors and programming languages that support Unicode, the character class \s usually includes all characters from the Unicode "separator" property \p{Z} (as mentioned by Tim Pietzcker); however, Java and Ruby are popular exceptions here, and \s only matches [ \t\r\n\f].
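The behaviour described in these answers can be checked directly, and a small helper avoids any need for monkey patching. This is a sketch only: strip_all_space is a hypothetical name, not a Ruby or Rails method.

```ruby
NBSP = "\u00A0" # non-breaking space

checks = {
  '\s'          => /\s/,
  '[[:space:]]' => /[[:space:]]/,
  '\p{Zs}'      => /\p{Zs}/,
  '\p{Z}'       => /\p{Z}/
}
checks.each do |label, re|
  puts "#{label} matches NBSP: #{NBSP.match?(re)}"
end

# Hypothetical helper: strip ASCII and Unicode whitespace from both ends.
def strip_all_space(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

p strip_all_space("\u00A0 padded\u00A0") # => "padded"
```

Note that \p{Z} covers separators only; a tab or newline is a control character, so [[:space:]] is the closer replacement for \s when both kinds of whitespace matter.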

How to match characters from all languages, except the special characters in ruby

I have a display name field which I have to validate using Ruby regex. We have to match all language characters like French, Arabic, Chinese, German, Spanish in addition to English language characters except special characters like *()!##$%^&.... I am stuck on how to match those non-Latin characters.
There are two possibilities:
Create a regex with a negated character class containing every symbol you don't want to match:
if name =~ /\A[^*!#%\^]+\z/ # list every forbidden symbol; a match means the name is valid
This solution may not be feasible, since there is a massive amount of symbols you'd have to insert, even if you were just to include the most common ones.
Use Oniguruma (see also: Oniguruma for Ruby main). It supports Unicode and its properties, in which case all letters can be matched using:
if name =~ /[\p{L}\p{M}]/
You can see what these are all about here: Unicode Regular Expressions
Starting from Ruby 1.9, the String and Regexp classes are Unicode-aware. Note that current Ruby versions document \w on its own as ASCII-only, so use the Unicode-aware word class [[:word:]] (or \p{Word}) instead:
"可口可樂!?!".gsub(/[[:word:]]/, 'Ha')
#=> "HaHaHaHa!?!"
In Ruby 1.9.1 and later (maybe earlier), one can use \p{L} to match letters in all languages (without the oniguruma gem as described in a previous answer).
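Putting the property-based approach together, a display-name check might look like the sketch below. The constant and method names are hypothetical, and allowing single spaces between words is an assumption that the question does not state.

```ruby
# Letters and combining marks from any script, with single spaces
# allowed between words (the space rule is an assumption).
DISPLAY_NAME = /\A[\p{L}\p{M}]+(?: [\p{L}\p{M}]+)*\z/

# Hypothetical helper wrapping the pattern.
def valid_display_name?(name)
  DISPLAY_NAME.match?(name)
end

["François", "محمد", "王小明", "Jürgen Müller", "bad*name", "what?!"].each do |name|
  puts "#{name}: #{valid_display_name?(name)}"
end
```

Because the pattern whitelists letters and marks instead of blacklisting symbols, characters such as * or ! are rejected without having to be listed.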

Regex to validate strings having only characters (without special characters but with accented characters), blank spaces and numbers

I am using Ruby on Rails 3.0.9 and I would like to validate a string that can contain only characters (case insensitive characters), blank spaces and numbers.
More:
special characters are not allowed (eg: !"£$%&/()=?^) except - and _;
accented characters are allowed (eg: à, è, é, ò, ...);
The regex that I know from this question is ^[a-zA-Z\d\s]*$, but it does not handle the special characters and accented characters described above.
So, how I should improve the regex?
I wrote the ^(?:[^\W_]|\s)*$ answer in the question you referred to (which actually would have been different if I'd known you wanted to allow _ and -). Not being a Ruby guy myself, I didn't realize that Ruby defaults to not using Unicode for regex matching.
Sorry for my lack of Ruby experience. What you want to do is use the u flag. That switches to Unicode (UTF-8), so accented characters are caught. Here's the pattern you want:
^[\w\s-]*$
And here it is in action at Rubular. This should do the trick, I think.
The u flag works on my original answer as well, though that one isn't meant to allow _ or - characters.
Something like ^[\w\s\-]*$ should validate characters, blank spaces, minus, and underscore.
To validate only against disallowed characters (in this case |, <, >, " and &), use:
^[^|<>\"&]*$
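For reference, here is how both approaches could be written for a current Ruby, where \w is documented as ASCII-only regardless of the u flag. This is a sketch under assumptions: the constant and method names are hypothetical, explicit Unicode properties stand in for \w, and \A and \z replace ^ and $ because the latter are line anchors in Ruby.

```ruby
# Whitelist: letters, combining marks, decimal digits, whitespace, _ and -.
NAME_FORMAT = /\A[\p{L}\p{M}\p{Nd}\s_\-]*\z/
# Blacklist: anything except the five forbidden characters.
NO_FORBIDDEN = /\A[^|<>"&]*\z/

# Hypothetical helpers wrapping the two patterns.
def valid_name?(value)
  NAME_FORMAT.match?(value)
end

def free_of_forbidden?(value)
  NO_FORBIDDEN.match?(value)
end

["àèéò 123", "foo_bar-baz", "a£b", "a!b", "a<b"].each do |value|
  puts "#{value.inspect}: whitelist=#{valid_name?(value)} blacklist=#{free_of_forbidden?(value)}"
end
```

The whitelist rejects "a£b" and "a!b", while the blacklist accepts both and only rejects "a<b", which shows how much looser the second approach is.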

Resources