What options exist now to implement UTF-8 in Ruby and RoR?

Following the development of Ruby closely, I have learned that thorough character-encoding support is being implemented in Ruby 1.9. My question for now is: how can Ruby be used today to talk to a database that stores all its data in UTF-8?
Background: I am involved in a new project where Ruby/RoR is at least an option. But the project needs to rely on an internationalized character set (it spans many countries), preferably UTF-8.
So how do you deal with that? Thanks in advance.

Ruby 1.8 works fine with UTF-8 strings for basic operations. Depending on your application's needs, some operations will either not work or not work as expected.
For example:
1) The size of strings will give you bytes, not characters, since multi-byte support is not there yet. But do you need to know the size of your strings in characters? (See the sketch after this list.)
2) You cannot split a string at a character boundary. But do you need this?
3) Sorting order will be funky if the sort is done in Ruby. The suggestion of using the db to sort is a good one.
And so on.
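To make point 1 concrete, here is a minimal sketch (assuming a UTF-8 source file; the string is my own example):
s = "héllo"          # "é" occupies 2 bytes in UTF-8
s.size               # => 6 in Ruby 1.8 (bytes); 5 in Ruby 1.9 (characters)
s.scan(/./mu).size   # => 5 in both: the /u flag makes the regexp UTF-8-aware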
Re the poster's comment about sorting data after reading from the db: as noted, the results will probably not match users' expectations, so the solution is to sort in the db. That will usually be faster anyhow; databases are designed to sort data.
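For instance, with the Rails 1.x/2.x-era ActiveRecord API (a sketch; the Product model and name column are hypothetical):
# Let the database's collation decide the order:
sorted = Product.find(:all, :order => "name ASC")
# rather than sorting in Ruby:
# Product.find(:all).sort_by { |p| p.name }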
Summary: My Ruby 1.8.6 RoR app works fine with international Unicode characters processed and stored as UTF-8 on modern browsers. Right-to-left languages work fine too. Main issues: be sure that your db and all web pages are set to use UTF-8. If you already have some data in your db, you'll need to run a conversion process to change it to UTF-8.
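On the web-page side, a minimal sketch for a Rails 2.x-era app (the filter name is my own invention):
class ApplicationController < ActionController::Base
  before_filter :set_utf8_charset

  private

  # Tell browsers the response body is UTF-8 encoded.
  def set_utf8_charset
    response.headers["Content-Type"] = "text/html; charset=utf-8"
  end
end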
Regards,
Larry

"Unicode ahoy! While Rails has always been able to store and display unicode with no beef, it’s been a little more complicated to truncate, reverse, or get the exact length of a UTF-8 string. You needed to fool around with KCODE yourself and while plenty of people made it work, it wasn’t as plug’n’play easy as you could have hoped (or perhaps even expected).
So since Ruby won’t be multibyte-aware until this time next year, Rails 1.2 introduces ActiveSupport::Multibyte for working with Unicode strings. Call the chars method on your string to start working with characters instead of bytes." (From the Rails 1.2 announcement.)
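A quick sketch of that chars proxy (Rails 1.2 on Ruby 1.8; I believe Rails 1.2 sets $KCODE to 'u' by default, which this relies on):
s = "Résumé"
s.length          # => 8  (Ruby 1.8 counts bytes)
s.chars.length    # => 6  (ActiveSupport::Multibyte counts characters)
s.chars.reverse   # reverses by character instead of scrambling bytes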

Although I haven't tested it, the character-encodings library (currently in alpha) adds methods to the String class to handle UTF-8 and other encodings. Its project page is hosted on RubyForge. It is designed for Ruby 1.8.
In my experience, however, with Ruby 1.8, if you store data in your database as UTF-8 and the character encoding in your HTTP headers is UTF-8, Ruby will not get in the way. It may not be able to operate on the strings, but it won't break anything. Example:
file.txt:
¡Hola! ¿Como estás? Leí el artículo. ¡Fue muy excellente!
Pardon my poor Spanish; it was the best example of Unicode I could come up with.
in irb:
str = File.read("file.txt")
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\n"
str += "Foo is equal to bar."
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar."
str = " " + str + " "
=> " \302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar. "
str.strip
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar."
Basically, it will just treat the UTF-8 as ASCII with odd characters in it. It will not sort lexicographically if the code points are out of order; however, it will sort by code point. Example:
"\302" <=> "\301"
=> -1
How much are you planning on operating on the data in the Rails app, anyway? Most sorting etc. is usually done by your database engine.

Related

Oracle PL/SQL SQL Injection Test from Unicode to Windows-1252

I have a DB using windows-1252 character encoding and dynamic SQL that does simple single quote escaping like this...
l_str := REPLACE(TRIM(someUserInput),'''','''''');
Because the DB is Windows-1252, when the notorious Unicode character 'MODIFIER LETTER APOSTROPHE' (U+02BC) is sent, it gets converted.
Example: The front end app submits this...
TESTʼEND
But it ends up searching on this...
and someColumn like '%TESTÊ¼END%'
What I want to know is: since the ʼ was converted into Ê¼ (which luckily is safe and just yields wrong search results), is there any scenario where a non-Windows-1252 character can be converted into something that WILL break this, making SQL injection possible?
I know about bind variables, and I know the DB should be Unicode as well; that's not what I'm asking here. I need proof that what you see above is not safe. I have searched for days and cannot find a way to cause SQL injection when doing simple single-quote escaping like this with a Windows-1252 DB. Thanks!
Oh, and always assume the column being searched is a varchar, not a number. I am aware of the issues and how things change when dealing with numbers. So assume this is always the case:
l_str := REPLACE(TRIM(someUserInput),'''','''''');
...
... and someVarcharColumn like '%'||l_str||'%'
Putting the argument about using bind variables aside, since you said you wanted proof that it could break without bind variables:
Here's what's going on in your example -
The Unicode character 'MODIFIER LETTER APOSTROPHE' (U+02BC) in UTF-8 is made up of 2 bytes - 0xCA 0xBC.
Of those, 0xCA is 'LATIN CAPITAL LETTER E WITH CIRCUMFLEX' in Windows-1252, which looks like Ê,
and 0xBC is 'VULGAR FRACTION ONE QUARTER', which looks like ¼.
This happens because your client probably uses an encoding that supports multi-byte characters but your DB doesn't. You would want to make sure that the encoding in both database and client is the same to avoid these issues.
Coming back to the question: is it possible that dynamic SQL without bind variables can be injected into because of these special Unicode characters? The answer is probably yes.
All you need to break that dynamic SQL via this encoding difference is a multibyte character, one of whose bytes is 0x27, an apostrophe.
I said 'probably' because a quick search on fileformat.info for 0x27 didn't give me anything back; not sure if I'm using that site right. (In UTF-8 specifically this can't happen by design: every byte of a multibyte sequence has its high bit set, so 0x27 can never appear inside one.) That doesn't mean it isn't possible, though; a different client could use a different encoding.
I would recommend never using dynamic SQL where input parameter values are concatenated in without bind variables, irrespective of whatever encoding you choose. You're just setting yourself up for many problems going forward, apart from the performance penalty of a hard parse every single time. (A bind-variable sketch follows the edits below.)
Edit: And of course, most importantly, there is nothing stopping your client from sending an actual apostrophe instead of the Unicode multibyte character, and that would be your definitive proof that the SQL is not safe and can be injected into.
Edit 2: I missed your first part, where you replace one apostrophe with two. That should technically take care of the multibyte characters too, but I'd still advise against this approach.
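Since the surrounding discussion is Ruby, here is a hedged sketch of the bind-variable approach using the ruby-oci8 gem (the connection details and table/column names are hypothetical):
require 'oci8'

conn = OCI8.new('scott', 'tiger', '//dbhost/XE')  # hypothetical credentials
user_input = "TEST\u02BCEND"

# The value travels separately from the SQL text, so no quote trickery
# or charset conversion can change the statement's structure.
cursor = conn.exec("SELECT * FROM some_table WHERE some_column LIKE :1",
                   "%#{user_input}%")
while row = cursor.fetch
  p row
end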
Your problem is not really about SQL injection; the problem is the character set of your front-end app.
Your front-end app sends the text in UTF-8, but the database "thinks" it is a Windows-1252 string.
Set your client's NLS_LANG value to AMERICAN_AMERICA.AL32UTF8 (you may choose a different territory and/or language); then it should look better.
Then your front-end app sends the string in UTF-8 and the database recognizes it as UTF-8. It will be converted to Windows-1252 internally. In case you enter a string that is not supported by CP1252 (e.g. the Cyrillic capital letter Ж), it will end up as a replacement character like ¿, which should be fine in terms of SQL injection.
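With the ruby-oci8 driver, for example (an assumption on my part; any Oracle client honors NLS_LANG the same way), the variable must be set before the client library is loaded:
# NLS_LANG must be in the environment before the Oracle client initializes.
ENV['NLS_LANG'] = 'AMERICAN_AMERICA.AL32UTF8'
require 'oci8'

conn = OCI8.new('scott', 'tiger', '//dbhost/XE')  # hypothetical credentials
# Strings now travel as UTF-8; the database converts them to its own
# Windows-1252 character set, replacing unmappable characters.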
See this answer to get more information about database and client character sets.

Ruby, problems comparing strings with UTF-8 characters

I have these 2 UTF-8 strings:
a = "N\u01b0\u0303"
b = "N\u1eef"
They look pretty different, but they are the same once they are rendered:
irb(main):039:0> puts "#{a} - #{b}"
Nữ - Nữ
The a version is the one I have stored in the DB. The b version is the one coming from the browser in a POST request. I don't know why the browser sends a different combination of UTF-8 characters, and it does not happen always; I can't reproduce the issue in my dev environment. It happens in production, in a percentage of the total requests.
The problem is that when I try to compare them, the comparison returns false:
irb(main):035:0> a == b
=> false
I've tried different things like forcing encoding:
irb(main):022:0> c.force_encoding("UTF-8") == a.force_encoding("UTF-8")
=> false
Another interesting fact is:
irb(main):005:0> a.chars
=> ["N", "ư", "̃"]
irb(main):006:0> b.chars
=> ["N", "ữ"]
How can I compare these kinds of strings?
This is an issue with Unicode equivalence.
The a version of your string consists of the character ư (U+01B0: LATIN SMALL LETTER U WITH HORN), followed by U+0303 COMBINING TILDE. This second character, as the name suggests, is a combining character, which when rendered is combined with the previous character to produce the final glyph.
The b version of the string uses the character ữ (U+1EEF, LATIN SMALL LETTER U WITH HORN AND TILDE) which is a single character, and is equivalent to the previous combination, but uses a different byte sequence to represent it.
In order to compare these strings you need to normalize them, so that they both use the same byte sequences for these types of characters. Current versions of Ruby have this built in (in earlier versions you needed to use a third party library).
So currently you have
a == b
which is false, but if you do
a.unicode_normalize == b.unicode_normalize
you should get true.
If you are on an older version of Ruby, there are a couple of options. Rails has a normalize method as part of its multibyte support, so if you are using Rails you can do:
a.mb_chars.normalize == b.mb_chars.normalize
or perhaps something like:
ActiveSupport::Multibyte::Unicode.normalize(a) == ActiveSupport::Multibyte::Unicode.normalize(b)
If you’re not using Rails, then you could look at the unicode_utils gem, and do something like this:
UnicodeUtils.nfkc(a) == UnicodeUtils.nfkc(b)
(nfkc refers to the normalization form; note that String#unicode_normalize defaults to :nfc, while the Rails helpers historically defaulted to :kc, so pick a form explicitly if you need them to agree.)
There are various ways to normalize Unicode strings (i.e. whether you use the decomposed or precomposed versions), and the examples above just use the defaults. I'll leave researching the differences to you; a short sketch of the explicit forms follows.
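On Ruby 2.2+, picking the form explicitly (a small sketch reusing the strings from the question):
a = "N\u01b0\u0303"  # decomposed: ư followed by a combining tilde
b = "N\u1eef"        # precomposed: ữ as a single code point

a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)  # => true
a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)  # => true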
You can see these are distinct character sequences: U+01B0 followed by U+0303 in the first case, and U+1EEF in the second. The first case uses a modifier, the combining tilde.
Wikipedia has a section on this:
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.
and
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text.
It seems that Ruby supports this normalization, but only as of Ruby 2.2:
http://ruby-doc.org/stdlib-2.2.0/libdoc/unicode_normalize/rdoc/String.html
a = "N\u01b0\u0303".unicode_normalize
b = "N\u1eef".unicode_normalize
a == b # true
Alternatively, if you are using Ruby on Rails, there appears to be a built-in method for normalization.

Any ruby gems to do (Chinese) Transliterate (Romanization), especially for URL?

Generally speaking, it takes Unicode text and tries to represent it in
US-ASCII characters (universally displayable, unaccented characters)
by attempting to transliterate the pronunciation expressed by the text
in some other writing system to Roman letters.
For example:
"一二三".ooxx => "yi-er-san"
Searching http://rubygems.org/search?utf8=%E2%9C%93&query=pinyin turns up some gems, but none of them works robustly for this.
Doing this perfectly is almost impossible, since some Chinese characters have two or more pronunciations; for example, 银行 = yin hang, but 不行 = bu xing (the final character is identical, pronounced hang in one context and xing in the other). Other than that, you could probably roll your own using the Unicode database, which I think has pronunciation info as well. If you want to be fancier, there are some open-source input methods that have the mappings, and they have them for words too, so that if 银行 appears together, the method knows the second character is hang, not xing. OpenVanilla might have databases you can work with (it's OSS).
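A toy sketch of that roll-your-own idea (the lookup table is a hypothetical three-character subset; a real one would come from the Unihan database and would need word-level context for characters like 行):
# Hypothetical, tiny character-to-pinyin table, for illustration only.
PINYIN = { "一" => "yi", "二" => "er", "三" => "san" }

def romanize(text)
  text.chars.map { |c| PINYIN.fetch(c, c) }.join("-")
end

romanize("一二三")  # => "yi-er-san"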

Split utf8 string regardless of ruby version

str = "é-du-Marché"
I get the first char via
str.split(//).first
How can I get the rest of the string, regardless of my Ruby version?
String does not have a first method, so you need a split in addition. When you do the split in Unicode mode (specifically UTF-8), you have access to the first (and the other) characters.
My solution:
puts RUBY_VERSION
str = "é-du-Marché"
p str.split(//u, 2)
Test with ruby 1.9.2:
1.9.2
["\u00E9", "-du-March\u00E9"]
Test with ruby 1.8.6:
1.8.6
["\303\251", "-du-March\303\251"]
With first and last you get your results:
str.split(//u, 2).first is the first character
str.split(//u, 2).last is the string after the first character.
str[1..-1] should normally return everything after the first character.
The first number is the starting index, set to 1 to skip the first character; the second is the ending index, and -1 means the last character, so Ruby counts from the back.
Note that multibyte characters only work this way in Ruby 1.9. If you wish to mimic this behavior in older versions, you'll have to loop over the bytes yourself and figure out what needs to be removed from the data, because Ruby 1.8 does not support this.
UPDATE:
You could try this as well, but I can't guarantee that it will work for every multibyte char:
str = "é-du-Marché"
substring = str.mb_chars[1..-1]
mb_chars is a proxy class that directs calls to the appropriate implementation when dealing with UTF-8, UTF-16, or UTF-32 encoded characters (i.e. multibyte chars).
More detailed info can be found here: http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html
But I do not know if this exists in older Rails versions.
UPDATE2:
Ruby 1.8 treats any string as just a bunch of bytes; calling size on it returns the number of bytes used to store the data. To determine the characters regardless of the encoding, try this:
char_array = str.scan(/./m)
substring = char_array[1..-1].join
This should normally do the trick. Have a look at http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18 which explains how to treat multibyte data in older Ruby versions.
UPDATE3:
Playing around with the scan & join operations brings me closer to your problem and its solution. I honestly don't have the time right now to get a full solution working, but if you play with scan(/./mu), you process the data as UTF-8, which is supported by all Ruby versions; a sketch follows.
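Assuming the data really is UTF-8, something like this behaves the same on 1.8 and 1.9:
str = "é-du-Marché"

chars = str.scan(/./mu)    # one array element per character on 1.8 and 1.9
first = chars.first        # => "é"
rest  = chars[1..-1].join  # => "-du-Marché"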

Ruby hexacode to unicode conversion

I crawled a website which contains Unicode, and the results look something like this in code:
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
How can I convert it back in Ruby to the original Unicode text, which is in UTF-8 format?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
short answer: you should be able to 'puts a' and see the string printed out. For me, at least, that string prints in both 1.8.7 and 1.9.2.
long answer:
First thing: it depends on if you're using ruby 1.8.7, or 1.9.2, since the way strings and encodings were handled changed.
in 1.8.7:
strings are just lists of bytes. when you print them out, if your OS can handle it, you can just 'puts a' and it should work correctly. if you do a[0], you'll get the first byte. if you want to get each character, things are pretty darn tricky.
in 1.9.2
strings are lists of bytes, with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. if not, you'll have to set it (as per Mike Lewis's answer). if you do a[0], you'll get the first character (the heart). if you want each byte, you can do a.bytes.
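To make the difference concrete (a small sketch; the heart is U+2665, whose first UTF-8 byte is 0xE2, i.e. 226):
s = "♥ hello"
s[0]           # Ruby 1.8: 226 (the first byte, a Fixnum)
               # Ruby 1.9: "♥" (the first character)
s.bytes.first  # => 226 on both (1.8.7+): the first UTF-8 byte of the heart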
If your OS, for whatever reason, is giving you those literal ASCII characters, my previous answer is obviously invalid; disregard it. :P
here's what you can do:
a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.to_i(16)].pack("U") }
this will scan for the ASCII string '\u' followed by four hexadecimal digits and replace each occurrence with the corresponding Unicode character.
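Applied to the string from the question, it should reproduce the original text (the code points decode to Korean syllables plus a heart):
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.to_i(16)].pack("U") }
# => "♥ 오 빠! 죽 기 전 에"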
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.
