Is this the best way to unescape unicode escape sequences in Ruby? - ruby

I have some text that contains Unicode escape sequences like \u003C. This is what I came up with to unescape it:
string.gsub(/\u(....)/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}
Is it correct? (i.e. it seems to work with my tests, but can someone more knowledgeable find a problem with it?)

Your regex, /\u(....)/, has some problems.
First of all, \u doesn't work the way you think it does, in 1.9 you'll get an error and in 1.8 it will just match a single u rather than the \u pair that you're looking for; you should use /\\u/ to find the literal \u that you want.
Secondly, your (....) group is much too permissive, that will allow any four characters through and that's not what you want. In 1.9, you want (\h{4}) (four hexadecimal digits) but in 1.8 you'd need ([\da-fA-F]{4}) as \h is a new thing.
So if you want your regex to work in both 1.8 and 1.9, you should use /\\u([\da-fA-F]{4})/. This gives you the following in 1.8 and 1.9:
>> s = 'Where is \u03bc pancakes \u03BD house? And u1123!'
=> "Where is \\u03bc pancakes \\u03BD house? And u1123!"
>> s.gsub(/\\u([\da-fA-F]{4})/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}
=> "Where is μ pancakes ν house? And u1123!"
Using pack and unpack to mangle the hex number into a Unicode character is probably good enough but there may be better ways.

Related

How do I write regexes for German character classes like letters, vowels, and consonants?

For example, I set up these:
L = /[a-z,A-Z,ßäüöÄÖÜ]/
V = /[äöüÄÖÜaeiouAEIOU]/
K = /[ßb-zBZ&&[^#{V}]]/
So that /(#{K}#{V}{2})/ matches "ᄚ" in "azAZᄚ".
Are there any better ways of dealing with them?
Could I put those constants in a module in a file somewhere in my Ruby installation folder, so I can include/require them inside any new script I write on my computer? (I'm a newbie and I know I'm muddling this terminology; Please correct me.)
Furthermore, could I get just the meta-characters \L, \V, and \K (or whatever isn't already set in Ruby) to stand for them in regexes, so I don't have to do that string interpolation thing all the time?
You're starting pretty well, but you need to look through the Regexp class code that is installed by Ruby. There are tricks for writing patterns that build themselves using String interpolation. You write the bricks and let Ruby build the walls and house with normal String tricks, then turn the resulting strings into true Regexp instances for use in your code.
For instance:
LOWER_CASE_CHARS = 'a-z'
UPPER_CASE_CHARS = 'A-Z'
CHARS = LOWER_CASE_CHARS + UPPER_CASE_CHARS
DIGITS = '0-9'
CHARS_REGEX = /[#{ CHARS }]/
DIGITS_REGEX = /[#{ DIGITS }]/
WORDS = "#{ CHARS }#{ DIGITS }_"
WORDS_REGEX = /[#{ WORDS }]/
You keep building from small atomic characters and character classes and soon you'll have big regular expressions. Try pasting those one by one into IRB and you'll quickly get the hang of it.
A small improvement on what you do now would be to use regex unicode support for categories or scripts.
If you mean L to be any letter, use \p{L}. Or use \p{Latin} if you want it to mean any letter in a Latin script (all German letters are).
I don't think there are built-ins for vowels and consonants.
See \p{L} match your example.

ruby remove variable length string from regular expression leaving hyphen

I have a string such as this: "im# -33.870816,151.203654"
I want to extract the two numbers including the hyphen.
I tried this:
mystring = "im# -33.870816,151.203654"
/\D*(\-*\d+\.\d+),(\-*\d+\.\d+)/.match(mystring)
This gives me:
33.870816,151.203654
How do I get the hyphen?
I need to do this in ruby
Edit: I should clarify, the "im# " was just an example, there can be any set of characters before the numbers. the numbers are mostly well formed with the comma. I was having trouble with the hyphen (-)
Edit2: Note that the two nos are lattidue, longitude. That pattern is mostly fixed. However, in theory, the preceding string can be arbitrary. I don't expect it to have nos. or hyphen, but you never know.
How about this?
arr = "im# -33.2222,151.200".split(/[, ]/)[1..-1]
and arr is ["-33.2222", "151.200"], (using the split method).
now
arr[0].to_f is -33.2222 and arr[1].to_f is 151.2
EDIT: stripped "im#" part with [1..-1] as suggested in comments.
EDIT2: also, this work regardless of what the first characters are.
If you want to capture the two numbers with the hyphen you can use this regex:
> str = "im# -33.870816,151.203654"
> str.match(/([\d.,-]+)/).captures
=> ["33.870816,151.203654"]
Edit: now it captures hyphen.
This one captures each number separetely: http://rubular.com/r/NNP2OTEdiL
Note: Using String#scan will match all ocurrences of given pattern, in this case
> str.scan /\b\s?([-\d.]+)/
=> [["-33.870816"], ["151.203654"]] # Good, but flattened version is better
> str.scan(/\b\s?([-\d.]+)/).flatten
=> ["-33.870816", "151.203654"]
I recommend you playing around a little with Rubular. There's also some docs about regegular expressions with Ruby:
http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UJ
http://www.regular-expressions.info/ruby.html
http://www.ruby-doc.org/core-1.9.3/Regexp.html
Your regex doesn't work because the hyphen is caught by \D, so you have to modify it to catch only the right set of characters.
[^0-9-]* would be a good option.

How to read only English characters

I am reading a file that sometimes has Chinese and characters of languages other than English.
How can I write a regex that only reads English words/letters?
Should it just be /^[a-zA-Z]+/?
If I do the above, then words like "eété" will still be picked but I don't want that:
"été".match(/^[a-zA-Z]+/) => #nil good I didn't want that word
"eété".match(/^[a-zA-Z]+/) => #not nil tricked into picking something I did not want
The only truly English letter that comes to mind is wynn ƿ.
One could make an argument for eth ð and thorn þ, but it would be a much weaker argument than can be made for wynn.
Other than those, English typically uses the Latin alphabet, albeit with certain modifications. Wynn possibly excepted, there is no such thing as an English letter, only a Latin letter.
There certainly exist regular expressions that demand that base characters be in the Latin or Common scripts, such as for example
(?:[\p{Script=Latin}\p{Script=Common}]\pM*+)+
but as you haven’t specified whether you’re using a 7- or 8-bit version of Ruby or a 21-bit version, I’m not sure what to tell you.
You need $, which means end of line:
/^[a-zA-Z]+$/
Or if you use such filtration:
strings.select { |s| /^[a-zA-Z]+$/ =~ s }
# which is equal to strings.grep /^[a-zA-Z]+$/
you can use negative filtration method with slightly simplier regex:
strings.reject { |s| /[^a-zA-Z]/ =~ s }
where [^a-zA-Z] means any non-english character.
Sometimes it's useful to use the Iconv library to deal with non-ASCII:
require 'iconv'
utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8") # !> encoding option isn't portable: TRANSLIT//IGNORE
utf8_to_ascii_translit = Iconv.new("ASCII//TRANSLIT", "UTF8") # !> encoding option isn't portable: TRANSLIT
utf8_to_ascii_ignore = Iconv.new("ASCII//IGNORE", "UTF8") # !> encoding option isn't portable: IGNORE
resume = "Résumé"
utf8_to_latin1.iconv(resume) # => "R\xE9sum\xE9"
utf8_to_ascii_translit.iconv(resume) # => "R'esum'e"
utf8_to_ascii_ignore.iconv(resume) # => "Rsum"
Notice that Ruby is warning that the option choices are not portable. That means there might be some damage to the string being processed; The "//TRANSLIT" and "//IGNORE" options can degrade the string but for our purpose it's OK.
James Gray wrote a nice article about Encoding Conversion With iconv, which is useful for understanding what Iconv can do, along with dealing with UTF-8 and Unicode characters.

Remove all but some special characters

I am trying to come up with a regex to remove all special characters except some. For example, I have a string:
str = "subscripción gustaría♥"
I want the output to be "subscripción gustaría".
The way I tried to do is, match anything which is not an ascii character (00 - 7F) and not special character I want and replace it with blank.
str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'')
This doesn't work. The last special character is not removed.
Can someone help? (This is ruby 1.8)
Update: I am trying to make the question a little more clear. The string is utf-8 encoded. And I am trying to whitelist the ascii characters plus ó and í and blacklist everything else.
Oniguruma has support for all the characters you care about without having to deal with codepoints. You can just add the unicode characters inside the character class you're whitelisting, followed by the 'u' option.
ruby-1.8.7-p248 > str = "subscripción gustaría♥"
=> "subscripci\303\263n gustar\303\255a\342\231\245"
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
=> nil
str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')
The question is a bit vague. There is not a word about encoding of the string. Also, you want to white-list characters or black list? Which ones?
But you get the idea, decide what you want, and then use proper ranges as colleagues here already proposed. Some examples:
if str = "subscripción gustaría♥" is utf-8
then you can blacklist all char above the range (excl. whitespaces):
str.gsub(/[^\x{0021}-\x{017E}\s]/,'')
if string is in ISO-8859-1 codepage you can try to match all quirky characters like the "heart" from the beginning of ASCII range:
str.gsub(/[\x01-\x1F]/,'')
The problem is here with regex, has nothing to do with Ruby. You probably will need to experiment more.
It is not completely clear which characters you want to keep and which you want to delete. The example string's character is some Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using ruby 1.8 and your regular expressions point that way).
Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):
puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'')
This next example specifies that characters 0xA1 and 0xC3 should be deleted.
puts str.gsub(/[\xA1\xC3]/,'')
I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my mac but works on linux.

Convert non-breaking spaces to spaces in Ruby

I have cases where user-entered data from an html textarea or input is sometimes sent with \u00a0 (non-breaking spaces) instead of spaces when encoded as utf-8 json.
I believe that to be a bug in Firefox, as I know that the user isn't intentionally putting in non-breaking spaces instead of spaces.
There are also two bugs in Ruby, one of which can be used to combat the other.
For whatever reason \s doesn't match \u00a0.
However [^[:print:]], which definitely should not match) and \xC2\xA0 both will match, but I consider those to be less-than-ideal ways to deal with the issue.
Are there other recommendations for getting around this issue?
Use /\u00a0/ to match non-breaking spaces. For instance s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces.
Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
See also: Ruby Regexp documentation
If you cannot use \s for Unicode whitespace, that’s a bug in the Ruby regex implementation, because according to UTS#18 “Unicode Regular Expressions” Annex C on Compatibility Properties a \s, is absolutely required to match any Unicode whitespace code point.
There is no wiggle-room allowed since the two columns detailing the Standard Recommendation and the POSIX Compatibility are the same for the \s case. You cannot document your way around this: you are out of compliance with The Unicode Standard, in particular, with UTS#18’s RL1.2a, if you do not do this.
If you do not meet RL1.2a, you do not meet the Level 1 requirements, which are the most basic and elementary functionality needed to use regular expressions on Unicode. Without that, you are pretty much lost. This is why standards exist. My recollection is that Ruby also fails to meet several other Level 1 requirements. You may therefore wish to use a programming language that meets at least Level 1 if you actually need to handle Unicode with regular expressions.
Note that you cannot use a Unicode General Category property like \p{Zs} to stand for \p{Whitespace}. That’s because the Whitespace property is a derived property, not a general category. There are also control characters included in it, not just separators.
Actual functioning IRB code examples that answer the question, with latest Rubies (May 2012)
Ruby 1.9
require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]"
doc = '<html><body> </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text
s.each_codepoint {|c| print c, ' ' } #=> 32 160 32
s.strip.each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\s+/,'').each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\u00A0/,'').strip.empty? #true
Ruby 1.8
require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.8.7 (2012-02-08 patchlevel 358) [x86_64-linux]"
doc = '<html><body> </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text # " \302\240 "
s.gsub(/\s+/,'') # "\302\240"
s.gsub(/\302\240/,'').strip.empty? #true
For whatever reason \s doesn't match \u00a0.
I think the "whatever reason" is that is not supposed to. Only the POSIX and \p construct character classes are Unicode aware. The character-class abbreviations are not:
Sequence As[...] Meaning
\d [0-9] ASCII decimal digit character
\D [^0-9] Any character except a digit
\h [0-9a-fA-F] Hexadecimal digit character
\H [^0-9a-fA-F] Any character except a hex digit
\s [ \t\r\n\f] ASCII whitespace character
\S [^ \t\r\n\f] Any character except whitespace
\w [A-Za-z0-9\_] ASCII word character
\W [^A-Za-z0-9\_] Any character except a word character
For the old versions of ruby (1.8.x), the fixes are the ones described in the question.
This is fixed in the newer versions of ruby 1.9+.
While not related to Ruby (and not directly to this question), the core of the problem might be that Alt+Space on Macs produces a non-breaking space.
This can cause all kinds of weird behaviour (especially in the terminal).
For those who are interested in more details, I wrote "Why chaining commands with pipes in Mac OS X does not always work" about this topic some time ago.

Resources