How to match characters from all languages, except the special characters in ruby - ruby

I have a display name field which I have to validate using Ruby regex. We have to match all language characters like French, Arabic, Chinese, German, Spanish in addition to English language characters except special characters like *()!##$%^&.... I am stuck on how to match those non-Latin characters.

There are two possibilities:
Create a regex with a negated character class containing every symbol you don't want to match:
puts "valid" if name =~ /\A[^*!#%\^]+\z/ # list every unwanted symbol in the class; a match means the name is clean
This solution may not be feasible, since there is a massive amount of symbols you'd have to insert, even if you were just to include the most common ones.
Use Oniguruma (see also: Oniguruma for Ruby), which supports Unicode and its character properties. All letters and combining marks can then be matched using:
puts "valid" if name =~ /\A[\p{L}\p{M}]+\z/
You can see what these are all about here: Unicode Regular Expressions

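Starting from Ruby 1.9, the String and Regexp classes are Unicode-aware. Note, however, that since Ruby 1.9.2 the shorthand \w matches only ASCII word characters ([a-zA-Z0-9_]). To match letters in any script, use the Unicode property \p{L} (or the POSIX bracket [[:alpha:]]):
"可口可樂!?!".gsub(/\p{L}/, 'Ha')
#=> "HaHaHaHa!?!"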

In Ruby > 1.9.1 (maybe earlier) you can use \p{L} to match letters in all languages, without the oniguruma gem described in a previous answer.
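As a minimal sketch of how the \p{L} approach could be applied to the display-name validation from the question: the names DISPLAY_NAME and valid_display_name?, and the rule that words are separated by single spaces, are assumptions made for illustration and do not come from the answers above.

```ruby
# Hypothetical validator; the constant and method names are illustrative only.
# Assumption: a display name is one or more words made of letters (\p{L}) and
# combining marks (\p{M}), separated by single ASCII spaces.
DISPLAY_NAME = /\A[\p{L}\p{M}]+(?: [\p{L}\p{M}]+)*\z/

def valid_display_name?(name)
  !(name =~ DISPLAY_NAME).nil?
end

valid_display_name?("Jürgen Müller") # letters from any script are accepted
valid_display_name?("name*!")        # symbols such as * and ! are rejected
```

Anchoring with \A and \z matters here: without the anchors a single letter anywhere in the string would be enough to make the match succeed.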

Related

Custom regular expression i18n

I'm using Rails 3.2.
I'm localizing my site in Romanian. In regular expressions, the regexp interval [a-z] should contain, in order, the following letters:
a, ă, â, b, c etc.
Is there a way to tell my application that [a-z] should be the list above, based on my locale?
Also, there is an issue with capitalizing - "â".upcase doesn't result in "Â".
Or, maybe these features are not implemented yet in Rails?
This is not a Rails issue: [a-z] is not required to include non-ASCII characters. In Ruby, [a-z] is a regex range matching the consecutive ASCII letters a through z.
In Ruby, String#upcase is not required to be locale-dependent (and before Ruby 2.4 it only converted ASCII letters). Instead, you can try using the UnicodeUtils gem like so:
% gem install unicode_utils
#encoding: UTF-8
require 'unicode_utils'
p UnicodeUtils.upcase('ă', :ro)
# => "Ă"
Specifying locale when converting string case makes more sense, because for example:
UnicodeUtils.upcase('i', :en) # is not equal to
UnicodeUtils.upcase('i', :tr)
I think the [a-z] range is based on character code values, so Romanian characters will not be taken into consideration. If you want to match any Latin character, use the character property of Onigmo:
"ă" =~ /\p{Latin}/
# => 0
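For readers on Ruby 2.4 or later, the built-in case methods also map non-ASCII letters, so the gem may not be needed for simple cases. The sketch below assumes that Ruby version; the helper names are hypothetical. It contrasts the ASCII range with the script property and shows the built-in :turkic option.

```ruby
# Assumes Ruby 2.4+, where String#upcase performs Unicode case mapping.

# Hypothetical helper: true if the string consists only of ASCII lowercase letters.
def ascii_lowercase?(str)
  !(str =~ /\A[a-z]+\z/).nil?
end

# Hypothetical helper: true if the string consists only of Latin-script characters.
def latin_script?(str)
  !(str =~ /\A\p{Latin}+\z/).nil?
end

ascii_lowercase?("ă")  # false: ă lies outside the ASCII range a-z
latin_script?("ăâb")   # true: \p{Latin} covers the Romanian letters
"â".upcase             # "Â" on Ruby 2.4+
"i".upcase(:turkic)    # dotted capital İ (U+0130), the Turkish mapping
```

This does not make [a-z] locale-aware; it only shows that a script property is the closer fit when the goal is "any Latin letter".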

Ruby regexp handling of nbsp

In ruby 1.9.3 the regex engine doesn't treat nbsp's (\u00A0) as a space (\s). This is often a bummer for me.
So my question is, will this change in 2.0? If not, is there any way to monkey patch a solution?
Use Unicode properties (you need to declare a matching source code encoding for this to work):
# encoding=utf-8
if subject =~ /\p{Z}/
  # subject contains whitespace or other separators
end
or use POSIX character classes:
if subject =~ /[[:space:]]/
  # subject contains whitespace, including non-ASCII whitespace
end
According to the docs, \s will only match [ \t\r\n\f] now and in the future.
In Ruby, I recommend using the Unicode character class of "Space separators" \p{Zs}:
/\p{Zs}/u =~ "\xC2\xA0"
/\p{Zs}/u =~ "\u00A0"
/\p{Zs}/u =~ HTMLEntities.new.decode('&nbsp;')
See the Ruby-documentation for more Unicode character properties.
Note: Make sure that your input string is valid UTF-8. There are non-breaking spaces in other encodings too, e.g. "\xA0" in ISO-8859-1 (Latin1). More info on the "non-breaking space".
FYI: In most regex flavors and programming languages that support Unicode, the character class \s usually includes all characters from the Unicode "separator" property \p{Z} (as mentioned by Tim Pietzcker). However, Java and Ruby are popular exceptions here, and \s only matches [ \t\r\n\f].
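The differences between these constructs can be checked directly. This is a small sketch, assuming a UTF-8 string; NO_BREAK_SPACE and space_match? are hypothetical names introduced only for the comparison.

```ruby
# Compare how each construct treats a no-break space (U+00A0) and a tab.
NO_BREAK_SPACE = "\u00A0"

# Hypothetical helper: true if the regexp matches anywhere in the string.
def space_match?(regexp, str)
  !(regexp =~ str).nil?
end

space_match?(/\s/, NO_BREAK_SPACE)          # false: \s is ASCII-only
space_match?(/[[:space:]]/, NO_BREAK_SPACE) # true
space_match?(/\p{Zs}/, NO_BREAK_SPACE)      # true: U+00A0 is a space separator
space_match?(/\p{Zs}/, "\t")                # false: a tab is a control character
space_match?(/[[:space:]]/, "\t")           # true
```

If both ASCII whitespace and non-breaking spaces must match, [[:space:]] is the simpler choice; \p{Zs} alone misses tabs and newlines.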

\w in Ruby Regular Expression matches Chinese characters

I use the code below:
puts "matched" if "中国" =~ /\w+/
It prints "matched", which surprised me: "中国" is two Chinese characters and contains none of 0-9, a-z, A-Z or _, so why does it match?
Could somebody give me some clues?
I'm not sure of the exact flavor of regex that Ruby uses, but this isn't just a Ruby aberration, as .NET works this way as well. MSDN says this about it:
\w
Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
So it's not the case that \w necessarily just means [a-zA-Z_0-9]: it (and other operators) behaves differently on Unicode strings than on ASCII ones.
This still makes it different from . though, as \w wouldn't match punctuation characters (sort of; see the \p{Lo} entry in the list below), spaces, new lines and various other non-word symbols.
As for what exactly \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc} does match, you can see on a Unicode reference list:
\p{Ll} Lowercase Unicode letter
\p{Lu} Uppercase Unicode letter
\p{Lt} Titlecase Unicode letter
\p{Lo} Other Unicode letter
\p{Nd} Decimal, number
\p{Pc} "Punctuation, connector"
Oniguruma, which is the regex engine in Ruby 1.9+, defines \w as:
[\w] word character
Not Unicode:
* alphanumeric, "_" and multibyte char.
Unicode:
* General_Category -- (Letter|Mark|Number|Connector_Punctuation)
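In early 1.9 releases, Ruby knows if the string has Unicode characters and automatically switches to Unicode mode for pattern matching. Later releases (1.9.2 onward, as far as I know) treat \w as ASCII-only regardless of the string; use [[:word:]] or \p{Word} there to get the Unicode definition.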
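To see which definition of "word character" a given Ruby applies, the constructs can be compared on the string from the question. This is a sketch whose comments assume a current Ruby (2.x or later), where \w is ASCII-only; first_match is a hypothetical helper.

```ruby
# Hypothetical helper: returns the first matched substring, or nil.
def first_match(regexp, str)
  m = regexp.match(str)
  m && m[0]
end

first_match(/\w+/, "中国")         # nil: \w is ASCII-only on current Rubies
first_match(/[[:word:]]+/, "中国") # "中国": the POSIX bracket is Unicode-aware
first_match(/\p{Word}+/, "中国")   # "中国": Letter, Mark, Number, Connector_Punctuation
first_match(/\w+/, "abc中国")      # "abc": only the ASCII part matches
```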

Convert non-breaking spaces to spaces in Ruby

I have cases where user-entered data from an html textarea or input is sometimes sent with \u00a0 (non-breaking spaces) instead of spaces when encoded as utf-8 json.
I believe that to be a bug in Firefox, as I know that the user isn't intentionally putting in non-breaking spaces instead of spaces.
There are also two bugs in Ruby, one of which can be used to combat the other.
For whatever reason \s doesn't match \u00a0.
However, [^[:print:]] (which definitely should not match) and \xC2\xA0 both will match, but I consider those to be less-than-ideal ways to deal with the issue.
Are there other recommendations for getting around this issue?
Use /\u00a0/ to match non-breaking spaces. For instance s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces.
Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
See also: Ruby Regexp documentation
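Both suggestions can be wrapped in small helpers. The method names below are hypothetical and the sketch assumes UTF-8 input.

```ruby
# Hypothetical helpers; the names are illustrative only.

# Replace only no-break spaces (U+00A0) with regular spaces.
def nbsp_to_space(str)
  str.gsub(/\u00A0/, ' ')
end

# Collapse every run of whitespace (ASCII or Unicode) into one space, then trim.
def squish_whitespace(str)
  str.gsub(/[[:space:]]+/, ' ').strip
end

nbsp_to_space("a\u00A0b")                         # "a b"
squish_whitespace(" \u00A0 foo\u00A0\u00A0bar\n") # "foo bar"
```

The second helper calls strip only after the substitution, because String#strip on its own leaves a leading or trailing U+00A0 in place.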
If you cannot use \s for Unicode whitespace, that’s a bug in the Ruby regex implementation, because according to UTS#18 “Unicode Regular Expressions”, Annex C on Compatibility Properties, \s is absolutely required to match any Unicode whitespace code point.
There is no wiggle-room allowed since the two columns detailing the Standard Recommendation and the POSIX Compatibility are the same for the \s case. You cannot document your way around this: you are out of compliance with The Unicode Standard, in particular, with UTS#18’s RL1.2a, if you do not do this.
If you do not meet RL1.2a, you do not meet the Level 1 requirements, which are the most basic and elementary functionality needed to use regular expressions on Unicode. Without that, you are pretty much lost. This is why standards exist. My recollection is that Ruby also fails to meet several other Level 1 requirements. You may therefore wish to use a programming language that meets at least Level 1 if you actually need to handle Unicode with regular expressions.
Note that you cannot use a Unicode General Category property like \p{Zs} to stand for \p{Whitespace}. That’s because the Whitespace property is a derived property, not a general category. There are also control characters included in it, not just separators.
Actual functioning IRB code examples that answer the question, with latest Rubies (May 2012)
Ruby 1.9
require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]"
doc = '<html><body> &nbsp; </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text
s.each_codepoint {|c| print c, ' ' } #=> 32 160 32
s.strip.each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\s+/,'').each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\u00A0/,'').strip.empty? # => true
Ruby 1.8
require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.8.7 (2012-02-08 patchlevel 358) [x86_64-linux]"
doc = '<html><body> &nbsp; </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text # " \302\240 "
s.gsub(/\s+/,'') # "\302\240"
s.gsub(/\302\240/,'').strip.empty? # => true
For whatever reason \s doesn't match \u00a0.
I think the "whatever reason" is that it is not supposed to. Only the POSIX and \p construct character classes are Unicode-aware. The character-class abbreviations are not:
Sequence  As [...]         Meaning
\d        [0-9]            ASCII decimal digit character
\D        [^0-9]           Any character except a digit
\h        [0-9a-fA-F]      Hexadecimal digit character
\H        [^0-9a-fA-F]     Any character except a hex digit
\s        [ \t\r\n\f]      ASCII whitespace character
\S        [^ \t\r\n\f]     Any character except whitespace
\w        [A-Za-z0-9_]     ASCII word character
\W        [^A-Za-z0-9_]    Any character except a word character
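The same split between abbreviations and Unicode-aware classes shows up for digits. The sketch below uses the Arabic-Indic digit three (U+0663) as a test character; digit_match? is a hypothetical helper.

```ruby
# Hypothetical helper: true if the regexp matches anywhere in the string.
def digit_match?(regexp, str)
  !(regexp =~ str).nil?
end

arabic_indic_three = "\u0663" # ٣

digit_match?(/\d/, arabic_indic_three)          # false: \d means [0-9]
digit_match?(/[[:digit:]]/, arabic_indic_three) # true: the POSIX bracket is Unicode-aware
digit_match?(/\p{Nd}/, arabic_indic_three)      # true: decimal number category
digit_match?(/\d/, "3")                         # true
```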
For the old versions of ruby (1.8.x), the fixes are the ones described in the question.
This is fixed in the newer versions of ruby 1.9+.
While not related to Ruby (and not directly to this question), the core of the problem might be that Alt+Space on Macs produces a non-breaking space.
This can cause all kinds of weird behaviour (especially in the terminal).
For those who are interested in more details, I wrote "Why chaining commands with pipes in Mac OS X does not always work" about this topic some time ago.

Converting regex statement for sentence extraction to Ruby

I found this regex statement at http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation for sentence boundary disambiguation, but am not able to use it in a Ruby split statement. I'm not too good with regex, so maybe I am missing something? This is the statement:
((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
and this is what I tried in Ruby, but no go:
text.split("((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])")
This should work in Ruby 1.9, or in Ruby 1.8 if you compiled it with the Oniguruma regex engine (which is standard in Ruby 1.9):
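# The outer group is non-capturing, (?:...), because String#split adds captured groups to its result
result = text.split(/(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/)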
The difference is that your code passes a literal string to split(), while this code passes a literal regex.
It won't work using the classic Ruby regex engine (which is standard in Ruby 1.8) because it doesn't support lookbehind.
I also modified the regular expression. I replaced (\s|\r\n) with \s+. My regex also splits sentences that have multiple spaces between them (typing two spaces after a sentence is common in many cultures) and/or multiple line breaks between them (delimiting paragraphs).
When working with Unicode text, a further improvement would be to replace a-z with \p{Ll}\p{Lo}, A-Z with \p{Lu}\p{Lt}\p{Lo}, and 0-9 with \p{N} in the various character classes in your regex. The character class with punctuation symbols can be expanded similarly. That'll need a bit more research because there's no Unicode property for end-of-sentence punctuation.
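A sketch of that Unicode variant, assuming Ruby 1.9 or later: UNICODE_SENTENCE_BOUNDARY and split_sentences are hypothetical names, and the outer group is written as non-capturing so that split does not add empty captures to the result.

```ruby
# Hypothetical names. The pattern swaps a-z, A-Z and 0-9 for the Unicode
# properties suggested above; the punctuation class [.?!] is left as it was.
UNICODE_SENTENCE_BOUNDARY =
  /(?:(?<=[\p{Ll}\p{Lo}\p{N})][.?!])|(?<=[\p{Ll}\p{Lo}\p{N}][.?!]"))\s+(?="?[\p{Lu}\p{Lt}\p{Lo}])/

def split_sentences(text)
  text.split(UNICODE_SENTENCE_BOUNDARY)
end

split_sentences("Élan vital. Être ou ne pas être? Ça va.")
# splits before "Être" and "Ça", which the ASCII-only [A-Z] lookahead would miss
```

Because \p{Lo} appears on both sides of the boundary, letters from caseless scripts count as both a possible sentence end and a possible sentence start.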

Resources