Unicode & Ruby - Expected behavior? - ruby

Ruby Version: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-darwin14]
Readline Version: 6.2
I'm working with some emojis and many of them behave correctly with the exception of 2. The 🌭 and 🍾 emojis. Here is some terminal output:
(byebug) "🌭"
"\u{1F32D}"
(byebug) "🛍"
"🛍"
(byebug) "🍾"
"\u{1F37E}"
Can someone tell me what's going on here? Is it just some encoding screwiness with irb? I might be snow-blind since I've been wrestling with this for so long so if there's any more information required to answer this please let me know.

Ruby may show a string with various backslash encodings for various reasons, one of which is irregular characters. For example:
"
"
# => "\n"
'"'
# => "\""
This doesn't mean the string contains an actual backslash, but rather that the version shown by inspect contains one. This is a long tradition dating back at least to the era of C in the 1970s where \n and such have been understood to mean "newline character".
In the case of emoji you might find that some are displayed and others aren't. This may be an interaction between the version of Ruby you're using and the terminal settings. As emoji are constantly being introduced you might find older ones display properly but Ruby's not confident enough with new ones to render them as-is, perhaps concerned that's an invalid Unicode character. Rather than showing something blank or the infamous question mark character, it shows the literal code for the character.

Related

Ruby 1.9.2 Unicode - Unicode Escaped Characters Being Dropped

I have a string in UTF-8 (according to the .encoding.name & .valid_encoding?) and there's an escaped unicode character in it (\u009A)
"Hammarskj\u009Ald"
This SHOULD print out as "Hammarskjšld", but it just drops the grapheme. EG:
puts "Hammarskj\u009Ald"
p "Hammarskj\u009Ald"
Results in the text:
Hammarskjld
"Hammarskj\u009Ald"
It also (if I save the data in the database) drops it when its save as well. I've searched for a while, but I can't quite figure out how to unescape it (which is what I THINK I need to do). A lot of the info out there is for 1.8.7, and some of the stuff for 1.9.2 isn't quite what I need.
Anyone have any idea on how to do what I want? I seem to have a valid UTF-8 string, that all I want to do is save in the database (intact), but it always drops the escaped unicode.
Are you sure it's dropped, and not just not displayed? Maybe it is just the problem of your font having a non-displaying zero-width character in that code point.
When you take it out of the database and p'ed or inspected, if you're seeing the escaped character, it means it's there, not dropped. It's your printing out that's the issue.

Ruby hexacode to unicode conversion

I crawled a website which contains unicode, an the results look something like, if in code
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
May I know how do I do it in Ruby to convert it back to the original Unicode text which is in UTF-8 format?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
short answer: you should be able to 'puts a', and see the string printed out. for me, at least, I can print out that string in both 1.8.7 and 1.9.2
long answer:
First thing: it depends on if you're using ruby 1.8.7, or 1.9.2, since the way strings and encodings were handled changed.
in 1.8.7:
strings are just lists of bytes. when you print them out, if your OS can handle it, you can just 'puts a' and it should work correctly. if you do a[0], you'll get the first byte. if you want to get each character, things are pretty darn tricky.
in 1.9.2
strings are lists of bytes, with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. if not, you'll have to set it (as per Mike Lewis's answer). if you do a[0], you'll get the first character (the heart). if you want each byte, you can do a.bytes.
If your OS, for whatever reason, is giving you those literal ascii characters,my previous answer is obviously invalid, disregard it. :P
here's what you can do:
a.gsub(/\\u([a-z0-9]+)/){|p| [$1.to_i(16)].pack("U")}
this will scan for the ascii string '\u' followed by a hexadecimal number, and replace it with the correct unicode character.
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.

Convert non-breaking spaces to spaces in Ruby

I have cases where user-entered data from an html textarea or input is sometimes sent with \u00a0 (non-breaking spaces) instead of spaces when encoded as utf-8 json.
I believe that to be a bug in Firefox, as I know that the user isn't intentionally putting in non-breaking spaces instead of spaces.
There are also two bugs in Ruby, one of which can be used to combat the other.
For whatever reason \s doesn't match \u00a0.
However [^[:print:]], which definitely should not match) and \xC2\xA0 both will match, but I consider those to be less-than-ideal ways to deal with the issue.
Are there other recommendations for getting around this issue?
Use /\u00a0/ to match non-breaking spaces. For instance s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces.
Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
See also: Ruby Regexp documentation
If you cannot use \s for Unicode whitespace, that’s a bug in the Ruby regex implementation, because according to UTS#18 “Unicode Regular Expressions” Annex C on Compatibility Properties a \s, is absolutely required to match any Unicode whitespace code point.
There is no wiggle-room allowed since the two columns detailing the Standard Recommendation and the POSIX Compatibility are the same for the \s case. You cannot document your way around this: you are out of compliance with The Unicode Standard, in particular, with UTS#18’s RL1.2a, if you do not do this.
If you do not meet RL1.2a, you do not meet the Level 1 requirements, which are the most basic and elementary functionality needed to use regular expressions on Unicode. Without that, you are pretty much lost. This is why standards exist. My recollection is that Ruby also fails to meet several other Level 1 requirements. You may therefore wish to use a programming language that meets at least Level 1 if you actually need to handle Unicode with regular expressions.
Note that you cannot use a Unicode General Category property like \p{Zs} to stand for \p{Whitespace}. That’s because the Whitespace property is a derived property, not a general category. There are also control characters included in it, not just separators.
Actual functioning IRB code examples that answer the question, with latest Rubies (May 2012)
Ruby 1.9
require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]"
doc = '<html><body> </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text
s.each_codepoint {|c| print c, ' ' } #=> 32 160 32
s.strip.each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\s+/,'').each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\u00A0/,'').strip.empty? #true
Ruby 1.8
require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.8.7 (2012-02-08 patchlevel 358) [x86_64-linux]"
doc = '<html><body> </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text # " \302\240 "
s.gsub(/\s+/,'') # "\302\240"
s.gsub(/\302\240/,'').strip.empty? #true
For whatever reason \s doesn't match \u00a0.
I think the "whatever reason" is that is not supposed to. Only the POSIX and \p construct character classes are Unicode aware. The character-class abbreviations are not:
Sequence As[...] Meaning
\d [0-9] ASCII decimal digit character
\D [^0-9] Any character except a digit
\h [0-9a-fA-F] Hexadecimal digit character
\H [^0-9a-fA-F] Any character except a hex digit
\s [ \t\r\n\f] ASCII whitespace character
\S [^ \t\r\n\f] Any character except whitespace
\w [A-Za-z0-9\_] ASCII word character
\W [^A-Za-z0-9\_] Any character except a word character
For the old versions of ruby (1.8.x), the fixes are the ones described in the question.
This is fixed in the newer versions of ruby 1.9+.
While not related to Ruby (and not directly to this question), the core of the problem might be that Alt+Space on Macs produces a non-breaking space.
This can cause all kinds of weird behaviour (especially in the terminal).
For those who are interested in more details, I wrote "Why chaining commands with pipes in Mac OS X does not always work" about this topic some time ago.

Whats up with the ÂŁ symbol

I'm a bit confused with the 'ÂŁ' symbol in Ruby.
In JRuby if I do :
puts 'ÂŁ40'
in a .rb file I run this, I get
£40
In JRuby IRB I get :
>> pung = 'h40'
=> "h40"
>> pung.gsub!('h', 'ÂŁ')
pung.gsub!('h', 'ÂŁ')
=> "\24340"
The pound symbol is output as \243.
In pure Ruby IRB, I cant even enter the ÂŁ symbol.. The cursor jumps to the left three spaces when I hit the ÂŁ key!
trying .toutf8 or toutf16 bring up even stranger characters!
Whats going on!??!? Why cant I just output a simple ÂŁ?
Sometimes this is a problem with the way your console pastes the character. For example, the unicode character sequence may include a character the console uses to do backspace or arrow left. This is probably the issue with the IRB console not receiving your character ok.
For the script, it looks like JRuby's doing what it's supposed to. The issue with the console should probably be reported as a bug, however, since we do want IRB to support entering unicode characters. Pop over to JRuby's bug tracker at http://bugs.jruby.org and provide show a simple session or provide steps to reproduce (which should be easy).
Most likely, the symbol is a Unicode symbol and you are converting it (perhaps unintentionally). If you can't enter the pound sterling symbol, make sure your console supports Unicode.
What do you get when you do ÂŁ.class ? String? Unicode::String? Perhaps explicitly declaring the character as a Unicode::String or Unicode::Character will give different results.
'\243' is the octal escape sequence for 'ÂŁ'.

What options do exist now to implement UTF8 in Ruby and RoR?

Following the development of Ruby very closely I learned that detailed character encoding is implemented in Ruby 1.9. My question for now is: How may Ruby be used at the moment to talk to a database that stores all data in UTF8?
Background: I am involved in a new project where Ruby/RoR is at least an option. But the project needs to rely on an internationalized character set (it's spread over many countries), preferably UTF8.
So how do you deal with that? Thanks in advance.
Ruby 1.8 works fine with UTF-8 strings for basic operations with the strings. Depending on your application's need, some operations will either not work or not work as expected.
Eg:
1) The size of strings will give you bytes, not characters since the mult-byte support is not there yet. But do you need to know the size of your strings in characters?
2) No splitting a string at a character boundary. But do you need this? Etc.
3) Sorting order will be funky if sorted in Ruby. The suggestion of using the db to sort is a good idea.
etc.
Re poster's comment about sorting data after reading from db: As noted, results will probably not match users' expectations. So the solution is to sort on the db. And it will usually be faster, anyhow--databases are designed to sort data.
Summary: My Ruby 1.8.6 RoR app works fine with international Unicode characters processed and stored as UTF-8 on modern browsers. Right to left languages work fine too. Main issues: be sure that your db and all web pages are set to use UTF-8. If you already have some data in your db, then you'll need to go through a conversion process to change it to UTF-8.
Regards,
Larry
"Unicode ahoy! While Rails has always been able to store and display unicode with no beef, it’s been a little more complicated to truncate, reverse, or get the exact length of a UTF-8 string. You needed to fool around with KCODE yourself and while plenty of people made it work, it wasn’t as plug’n’play easy as you could have hoped (or perhaps even expected).
So since Ruby won’t be multibyte-aware until this time next year, Rails 1.2 introduces ActiveSupport::Multibyte for working with Unicode strings. Call the chars method on your string to start working with characters instead of bytes." Click Here for more
Although I haven't tested it, the character-encodings library (currently in alpha) adds methods to the String class to handle UTF-8 and others. Its page on RubyForge is here. It is designed for Ruby 1.8.
It is my experience, however, that, using Ruby 1.8, if you store data in your database as UTF-8, Ruby will not get in the way as long as your character encoding in the HTTP header is UTF-8. It may not be able to operate on the strings, but it won't break anything. Example:
file.txt:
¡Hola! ¿Como estás? Leí el artículo. ¡Fue muy excellente!
Pardon my poor Spanish; it was the best example of Unicode I could come up with.
in irb:
str = File.read("file.txt")
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\n"
str += "Foo is equal to bar."
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar."
str = " " + str + " "
=> " \302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar. "
str.strip
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar."
Basically, it will just treat the UTF-8 as ASCII with odd characters in it. It will not sort lexigraphically if the code points are out of order; however, it will sort by code point. Example:
"\302" <=> "\301"
=> -1
How much are you planning on operating on the data in the Rails app, anyway? Most sorting etc. is usually done by your database engine.

Resources