Why does Ruby Integer method 'chr' use ASCII-8bit, not UTF-8 by default? - ruby

According to its source code https://www.rubydoc.info/stdlib/core/Integer:chr, this method uses ASCII encoding if no arguments are provided, and indeed it gives different results when called with and without arguments:
irb(main):002:0> 255.chr
=> "\xFF"
irb(main):003:0> 255.chr 'utf-8'
=> "ÿ"
Why does this happen? Isn't Ruby supposed to use UTF-8 everywhere by default? At least all strings seem to be encoded with UTF-8:
irb(main):005:0> "".encoding
=> #<Encoding:UTF-8>

Why does this happen?
For characters from U+0000 to U+007F (127), the vast majority of single-octet and variable-length character encodings agree on the encoding. In particular, they all agree on being strict supersets of ASCII.
In other words: for characters up to and including U+007F, ASCII, the entire ISO8859 family, the entire DOS codepage family, the entire Windows family, as well as UTF-8 are actually identical. So, for characters between U+0000 and U+007F, ASCII is the logical choice:
0.chr.encoding
#=> #<Encoding:US-ASCII>
127.chr.encoding
#=> #<Encoding:US-ASCII>
However, for anything above 127, more or less no two character encodings agree. In fact, the overwhelming majority of characters above 127 don't even exist in the overwhelming majority of character sets, and thus don't have an encoding in the vast majority of character encodings.
In other words: it is practically impossible to find a single default encoding for characters above 127.
Therefore, the encoding that is chosen by Ruby is Encoding::BINARY, which is basically a pseudo-encoding that means "this isn't actually text, this is unstructured unknown binary data". (For hysterical raisins, this encoding is also aliased to ASCII-8BIT, which I find absolutely horrible, because ASCII is 7 bit, period, and anything using the 8th bit is by definition not ASCII.)
128.chr.encoding
#=> #<Encoding:ASCII-8BIT>
255.chr.encoding
#=> #<Encoding:ASCII-8BIT>
Note also that, when called without an argument, Integer#chr is limited to a single octet, i.e. to a range from 0 to 255, so multi-octet or variable-length encodings are not really required here.
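The behaviour described above can be checked with a minimal sketch like the following. It assumes Encoding.default_internal is left at its default of nil, and the variable names are only for illustration:

```ruby
# Assumes Encoding.default_internal is nil (Ruby started without -E or -U).
ascii_char  = 65.chr                    # within the ASCII range
binary_char = 255.chr                   # above 127, no argument given
utf8_char   = 255.chr(Encoding::UTF_8)  # explicit encoding

puts ascii_char.encoding == Encoding::US_ASCII   # true
puts binary_char.encoding == Encoding::BINARY    # true
puts utf8_char.bytes.inspect                     # [195, 191]

# With an explicit encoding the receiver is treated as a code point,
# so values above 255 are accepted as well.
hiragana_a = 0x3042.chr(Encoding::UTF_8)
puts hiragana_a.bytes.inspect                    # [227, 129, 130]

# Without an argument, values above 255 are out of range.
out_of_range =
  begin
    256.chr
    false
  rescue RangeError
    true
  end
puts out_of_range                                # true
```

Comparing against the Encoding constants rather than the printed name sidesteps the question of whether a given Ruby version displays the binary encoding as ASCII-8BIT or BINARY.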
Isn't Ruby supposed to use UTF-8 everywhere by default?
Which encoding are you talking about? Ruby has about a half dozen of them.
For the vast majority of encodings, your statement is incorrect.
the locale encoding is the default encoding of the environment
the filesystem encoding is the encoding that is used for file paths: the value is determined by the file system
the external encoding of an IO object is the encoding that text that is read is assumed to be in and that text that is written is transcoded to: the default is the locale encoding
the internal encoding of an IO object is the encoding that Strings that are written to the IO object must be in and that Strings that are read from the IO object are transcoded into: the default is the default internal encoding, whose default value, in turn, is nil, meaning no transcoding occurs
the script encoding is the encoding that a Ruby script is read in, and String literals in the script inherit this encoding: it is set with a magic comment at the beginning of the script, and the default is UTF-8
So, as you can see, there are many different encodings, and many different defaults, and only one of them is UTF-8. And none of those encodings are actually relevant to your question, because 128.chr is neither a String literal nor an IO object. It is a String object that is created by the Integer#chr method using whatever encoding it sees fit.
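To see which values these encodings take on a particular machine, a small sketch like this one can help. The hash name and keys are made up for illustration, and the printed values depend on the environment:

```ruby
# Collect the various default encodings in one place.
defaults = {
  locale:     Encoding.find('locale'),      # encoding of the environment
  filesystem: Encoding.find('filesystem'),  # encoding used for file paths
  external:   Encoding.default_external,    # default external encoding for IO
  internal:   Encoding.default_internal,    # nil unless configured
  script:     __ENCODING__                  # encoding of this source file
}

defaults.each do |name, encoding|
  puts format('%-10s %s', name, encoding.inspect)
end

# None of the above decides what Integer#chr returns for 128:
puts 128.chr.encoding.inspect
```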

Related

How do I filter out invisible characters without affecting Japanese character set?

I noticed that some of my input is getting U+2028. I don't know what this is, but how can I prevent this with consideration of UTF-8 and English/Japanese characters?
The character U+2028 is LINE SEPARATOR, one of the Unicode separator (whitespace) characters.
Selecting only the Japanese characters is (I am afraid) quite tricky in Unicode, because CJK characters are spread across many blocks and planes, even though Ruby's Regexp supports Unicode script properties such as \p{Hiragana}. However, if your only interest is Japanese and ASCII, the NKF library is useful. Here is an example:
require 'nkf'
orig = "b2αÇ()あ相〜\u2028\u3000_━●★】"
p orig
p NKF.nkf('-w -E', NKF.nkf('-e', orig))
# =>
# "b2αÇ()あ相〜\u2028 _━●★】"
# "b2α()あ相〜 _━●★】"
As you can see, the Unicode character U+2028 is filtered out, whereas the Greek character "α" is preserved because it is included in the Japanese JIS-X-0208 character set. Note that accented letters like "Ç" are filtered out, because they are not included. The so-called hankaku-kana are converted into zenkaku-kana by this method. The JIS-X-0212 character set is not supported, either.
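If the goal is only to drop the separators themselves, or to experiment with the script properties mentioned above, a narrower sketch is possible without NKF. This is not the method of the answer, just an illustration; the sample string and variable names are made up, and the source file is assumed to be UTF-8:

```ruby
input = "b2あ相ア\u2028Ç"

# Remove only LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029).
without_separators = input.delete("\u2028\u2029")
puts without_separators          # b2あ相アÇ

# Keep ASCII plus the Hiragana, Katakana and Han scripts; drop everything else.
japanese_or_ascii = input.gsub(/[^\p{Hiragana}\p{Katakana}\p{Han}\p{ASCII}]/, '')
puts japanese_or_ascii           # b2あ相ア
```

Note that the second filter is stricter than it may look: punctuation such as 。 belongs to none of these three scripts, so it would be dropped as well.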
A solution for your specific case.
I have come up with other solutions (for Ruby 2) in addition to the solution with the NKF library. The comparison as described below is in a way interesting, as they are slightly different from one another. This is a major revision, and so I am posting it as a separate answer. I am also describing the background about this at the end of this post.
I am assuming the original input is in UTF-8 encoding except for the first section (if not, convert it to the UTF-8 first to apply any of the examples).
Solutions to filtering out illegitimate characters
"illegitimate" means the character code that is not included in the encoding defined for a String instance.
In Ruby 2, such a String should usually have the encoding ASCII-8BIT. However, some may wrongly have the UTF-8 encoding.
If it has the encoding ASCII-8BIT and you want to get a legitimate UTF-8 String:
s1 = String.new("あ\x99", encoding: 'ASCII-8BIT') # An example ASCII-8BIT
# => "\xE3\x81\x82\x99"
s1.encoding # => #<Encoding:ASCII-8BIT>
s1.valid_encoding? # => true because 'ASCII-8BIT' accepts anything.
s1.force_encoding('UTF-8')
# => s1=="あ\x99"
s1.valid_encoding? # => false
s2 = s1.encode('UTF-8', invalid: :replace, replace: '')
# => "あ"
s2.valid_encoding? # => true
If it is wrongly tagged with the encoding UTF-8 and you want to filter out the illegitimate bytes:
s1 = String.new("あ\x99", encoding: 'UTF-8') # An example 'UTF-8'
# => "あ\x99"
s1.encoding # => #<Encoding:UTF-8>
s1.valid_encoding? # => false
s2 = s1.encode('UTF-8', invalid: :replace, replace: '')
# => "あ"
s2.valid_encoding? # => true
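As an alternative to the encode call used above, Ruby 2.1 and later also provide String#scrub, which is not part of the original answer. A minimal sketch, assuming a UTF-8 source file and using illustrative variable names:

```ruby
broken = "あ\x99"                 # UTF-8 string with one stray byte
puts broken.valid_encoding?       # false

removed = broken.scrub('')        # drop the invalid bytes
marked  = broken.scrub            # replace them with U+FFFD instead
puts removed.valid_encoding?      # true
puts marked.codepoints.inspect    # [12354, 65533]

# The same works for data that arrives tagged as binary:
# re-tag it as UTF-8 first, then scrub.
binary   = "あ\x99".b
retagged = binary.force_encoding(Encoding::UTF_8).scrub('')
puts retagged == removed          # true
```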
Solutions to filtering out "non-Japanese" characters
All the following methods are to filter out "non-Japanese" characters.
Basically, "non-Japanese" characters are those that are not included in one or more of the traditional standards of the Japanese character set.
See the next section for the detailed background of the definition of the "non-Japanese" characters.
The strategy here is to convert the encoding of the original String to a Japanese JIS encoding (referred to as ISO-2022-JP or EUC-JP; basically JIS-X-0208) and to convert back to UTF-8.
Use String#encode
Ruby-2 built-in String#encode does the exact job.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
print "Orig:"; p orig
print "Enc: "; p orig.encode('ISO-2022-JP', undef: :replace, replace: '').encode('UTF-8')
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": filtered out
Euro-sign: filtered out
Latin1: filtered out
JISX0212: filtered out
CJK Compatibility: filtered out
Circled Digit: filtered out
Unicode Line Separator: filtered out
Use NKF library
The NKF library is one of the standard libraries that come with the official Ruby release.
The library is traditional and has been used for decades; cf., NKF stands for Network Kanji Filter.
It does a very similar, though slightly different, job to/from Ruby Encoding.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
require 'nkf'
print "NKF: "; p NKF.nkf('-w -E', NKF.nkf('-e', orig))
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": converted into "zenkaku" (aka full-width)
Euro-sign: filtered out
Latin1: filtered out
JISX0212: filtered out
CJK Compatibility: preserved
Circled Digit: preserved
Unicode Line Separator: filtered out
Use iconv Gem
The iconv Ruby gem no longer comes with standard Ruby (it was part of the standard library up to Ruby 1.9.3 and was removed in Ruby 2.0). But you can easily install it with the gem command: gem install iconv
It can handle ISO-2022-JP-2, unlike the two methods above, which may be handy (n.b., the encoding ISO-2022-JP-2 is defined in Ruby's Encoding, but by default Ruby defines no conversion to or from it). Once it is installed, the following is an example.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
require 'iconv'
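output = ''
Iconv.open('iso-2022-jp-2', 'utf-8') do |cd|
  cd.discard_ilseq = true  # silently drop characters that cannot be converted
  # Convert the text, then append the closing sequence returned by iconv(nil).
  output = cd.iconv(orig) << cd.iconv(nil)
end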
s2 = Iconv.conv('utf-8', 'iso-2022-jp-2', output)
print "Icon:"; p s2
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": preserved
Euro-sign: preserved
Latin1: preserved
JISX0212: preserved
CJK Compatibility: filtered out
Circled Digit: preserved
Unicode Line Separator: filtered out
Summary
Here are the outputs of the above-mentioned three methods:
Orig:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡㌔③\u2028ハンカク"
Enc: "b2◇〒α()あ相〜 _8D━●★】$£"
NKF: "b2◇〒α()あ相〜 _8D━●★】$£㌔③ハンカク"
Icon:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡③ハンカク"
All the code snippets above here are available as a gist in Github for convenience — download or git clone and run it.
Background
What is an invalid character? The character U+2028 from the question, for example, is a legitimate Unicode character (Line Separator) and is perfectly valid in UTF-8. So, there is no general reason to filter such characters out, though some individual situations may require it.
What is an English character? The lower- and upper-case letters (52 in total) probably are. Then, how about the dollar sign ($)? Pound sign (£)? Euro sign (€)? The dollar sign is an ASCII character, whereas neither the pound sign nor the Euro sign is. The pound sign is included in the traditional Latin-1 (ISO-8859-1) character set, whereas the Euro sign is not. As such, what is an English character is not a trivial question.
You may define ASCII (or Latin-1, or whatever) as the only English character set, but that definition is somewhat arbitrary.
What is a Japanese character? OK, Hiragana and Katakana are unique to Japanese. How about Kanji? Do you accept simplified Chinese characters, which are not used in Japan, as Kanji? How about symbols? OK, a few symbols, such as 。 (U+3002; Ideographic Full Stop) and 「 (U+300C; Left Corner Bracket) are essential punctuation in Japanese text. But is there any reason to regard characters like ▼ (Black Down-Pointing Triangle), which has been used widely among Japanese-language computer users for decades, as Japanese-specific? Perhaps not. They are just symbols that can be used anywhere in the world. And worse, the line is not clear-cut; for example, although it is perhaps fair to argue that the Postal Mark 〒 is Japanese-specific, it is not essential punctuation like the full stop but just a symbol fairly popularly used in Japan. I would not be surprised if a very similar symbol is actually used elsewhere in the world, unknown to me.
As with the argument about ASCII and Latin-1 for English characters, you could define the traditionally used characters included in the JIS (X 0208) character set as the valid Japanese characters. Again, it is inevitably arbitrary. For example, the Pound sign (£) is included in it, whereas the Euro sign is not. The diamond mark ◇ (White Diamond) is included, whereas the heart mark ♡ (White Heart Suit) is not. Or, what about the so-called "zenkaku" (aka full-width) characters, which are just duplicates of the ASCII letters and the Arabic numerals 0 to 9?
After all, Unicode is the unified set of the characters used in the world regardless of language (well, ideally at least, though you may argue the real Unicode falls short of that ideal). In this sense there is no definitive way to filter out non-English or non-Japanese characters. Consequently, the original question about filtering out U+2028 is one of those arbitrary demands coming from some specific situation, even though it may well be a popular demand in fact (and hence my answer).
The only definitive thing you can do is to filter out illegitimate characters for the chosen character encoding, such as UTF-8, as described in the first section of this answer. The rest is, really, up to each individual's need in their specific situation.
Background of the "Japanese" character sets
The Japanese character set was traditionally defined, officially, in the JIS standards. Specifically, JIS-X-0208 and the much less popular JIS-X-0212 (often casually called "補助漢字") are the two standards (n.b., they have specific revisions such as 1983 and 1990). Unfortunately, in practice, NEC, Microsoft and Apple adopted their own variations (broadly called Shift_JIS or SJIS, though each has its own variation). Due to the popularity of their OSs, these were (and to some extent still are(!)) more widely used in Japan than the strict official ones, before the era in which UTF-8 became widely accepted.
Note that all of them accept ASCII at least. So, it has always been safe to use ASCII in pretty much any situation (except for some in the early 80s or before).
Unicode is very inclusive, containing pretty much any of the characters that have been defined in any of these character sets. That means any of the characters that once stirred hot debate (whether you should or should not use them) can now be legitimately used in any of the Unicode encodings; I mean legitimate as far as the character encoding is concerned.
I presume this confused practical situation has led to the results shown above, which differ slightly from one another depending on which method you use. Pick your favourite, depending on your need!

Convert UTF-8 to CP1252 ruby 2.2

How to keep all characters converting from UTF-8 to CP1252 on ruby 2.2
this code:
file = 'd:/1 descrição.txt'
puts file.encode('cp1252')
Gives this error:
`encode': U+0327 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 (Encoding::UndefinedConversionError)
My application needs to be cp1252, but I can't find any way to keep all the characters.
I can't replace these characters, because later I will use this info to read the file from the file system.
puts file.encode('cp1252', undef: :replace, replace: '')
> d:/1 descricao.txt
ps: It is a ruby script not a ruby on rails application
UTF-8 covers the entire range of Unicode, but CP1252 only includes a subset of it. Obviously this means that there are characters that can be encoded in UTF-8 but not in CP1252. This is the problem you are facing.
In your example it looks like the string only contains characters that should work in CP1252, but clearly it doesn’t.
The character in the error message, U+0327 is a combining character, and is not representable in CP1252. It combines with the preceding c to produce ç. ç can also be represented as a single character (U+00E7), which is representable in CP1252.
One option might be normalisation, which will convert the string into a form that is representable in CP1252.
file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')
(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)
This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.
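The difference between the two forms of ç can be made visible with a short sketch. The variable names are illustrative, and the decomposed string is built with an explicit escape so that it survives copy and paste:

```ruby
decomposed = "c\u0327"                          # "c" + COMBINING CEDILLA
composed   = decomposed.unicode_normalize(:nfc) # single code point U+00E7

show = ->(s) { s.codepoints.map { |cp| format('U+%04X', cp) }.join(' ') }
puts show.call(decomposed)                      # U+0063 U+0327
puts show.call(composed)                        # U+00E7

# The composed form has a slot in CP1252 ...
puts composed.encode('Windows-1252').bytes.inspect  # [231]

# ... while the combining mark in the decomposed form does not.
conversion_failed =
  begin
    decomposed.encode('Windows-1252')
    false
  rescue Encoding::UndefinedConversionError
    true
  end
puts conversion_failed                          # true
```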

How to use internal/external encoding when importing a YAML file?

How can I load a YAML file regardless of its encoding?
My YAML file can be encoded in UTF-8 or ANSI (that's what Notepad++ calls it - I guess it's Windows-1252):
:key1:
:key2: "ä"
utf8.yml is encoded in UTF-8, ansi.yml is encoded in ANSI. I load the files as follows:
# encoding: utf-8
Encoding.default_internal = "utf-8"
utf8_load = YAML::load(File.open('utf8.yml'))
utf8_load_file = YAML::load_file('utf8.yml')
ansi_load = YAML::load(File.open('ansi.yml'))
ansi_load_file = YAML::load_file('ansi.yml')
It seems like Ruby doesn't recognize the encoding correctly:
utf8_load [:key1][:key2].encoding #=> "UTF-8"
utf8_load_file [:key1][:key2].encoding #=> "UTF-8"
ansi_load [:key1][:key2].encoding #=> "UTF-8"
ansi_load_file [:key1][:key2].encoding #=> "UTF-8"
because the bytes aren't the same:
utf8_load [:key1][:key2].bytes #=> [195, 164]
utf8_load_file [:key1][:key2].bytes #=> [195, 164]
ansi_load [:key1][:key2].bytes #=> [239, 191, 189]
ansi_load_file [:key1][:key2].bytes #=> [239, 191, 189]
If I omit Encoding.default_internal = "utf-8", the bytes are also different:
utf8_load [:key1][:key2].bytes #=> [195, 131, 194, 164]
utf8_load_file [:key1][:key2].bytes #=> [195, 164]
ansi_load [:key1][:key2].bytes #=> [195, 164]
ansi_load_file [:key1][:key2].bytes #=> [239, 191, 189]
What happens actually when I don't set the default_internal to utf-8?
Which encodings do the strings in both examples have?
How can I load a file even if I don't know its encoding?
I believe officially YAML only supports UTF-8 (and maybe UTF-16). There have historically been all sorts of encoding confusions in YAML libraries. I think you are going to run into trouble trying to have YAML in something other than a Unicode encoding.
What happens actually when I don't set the default_internal to utf-8?
Encoding.default_internal controls the encoding your input will be converted to when it is read in, at least by those operations that respect Encoding.default_internal (not everything does). Rails seems to set it to UTF-8. So even if you don't set Encoding.default_internal to UTF-8 yourself, it might be UTF-8 already anyway.
If Encoding.default_internal is nil, then the operations that respect it will not convert input upon reading it in; they leave the input in the encoding it was believed to originate in.
If you set it to something else, like say "WINDOWS-1252" Ruby would automatically convert your stuff to WINDOWS-1252 when it read it in with File.open, which would possibly confuse YAML::load when you pass the string that's now encoded and tagged as WINDOWS-1252 to it. Generally there's no good reason to do this, so leave Encoding.default_internal alone.
Note: The Ruby docs say:
"You should not set ::default_internal in Ruby code as strings created before changing the value may have a different encoding from strings created after the change. Instead you should use ruby -E to invoke Ruby with the correct default_internal."
See also: http://ruby-doc.org/core-1.9.3/Encoding.html#method-c-default_internal
Which encodings do the strings in both examples have?
I don't really know. One would have to look at the bytes and try to figure out if they are legal bytes for various plausible encodings, and beyond being legal, if they mean something likely to be intended.
For example take: "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ". That's a perfectly legal UTF-8 string, but as humans we know it's probably not intended, and is probably garbage, quite likely from the result of an encoding misinterpretation. But a computer has no way to know that, it's perfectly legal UTF-8, and, hey, maybe someone actually did mean to write "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ", after all, I just did, when writing this post!
So you can try to interpret the bytes according to various encodings and see if any of them make sense.
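A rough sketch of that kind of trial, using the single byte 0xE4 (what "ä" looks like in an ANSI file) and a hand-picked list of candidate encodings. The list and the variable names are assumptions for illustration only:

```ruby
raw = "\xE4".b   # the raw byte as it would come from the ANSI file

candidates = %w[UTF-8 Windows-1252 ISO-8859-1]

# Keep every candidate under which the bytes are at least well-formed.
plausible = candidates.select do |name|
  raw.dup.force_encoding(name).valid_encoding?
end
puts plausible.inspect   # ["Windows-1252", "ISO-8859-1"]

# Decoding under each surviving candidate shows what a human would have to judge.
plausible.each do |name|
  decoded = raw.dup.force_encoding(name).encode('UTF-8')
  puts "#{name}: #{decoded}"
end
```

Validity only rules candidates out. Here two encodings survive and happen to agree on the result, but in general choosing among the survivors needs human judgement or heuristics.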
You're really just guessing at this point. Which means...
How can I load a file even if I don't know its encoding?
Generally, you can not. You need to know and keep track of encodings. There's no real way to know what the bytes mean without knowing their encoding.
If you have some legacy data for which you've lost this, you've got to try to figure it out. Manually, or with some code that tries to guess likely encodings based on heuristics. Here's one Ruby gem Charlock Holmes that tries to guess, using the ICU library heuristics (this particular gem only works on MRI).
What Ruby says in response to string.encoding is just the encoding the string is tagged with. The string can be tagged with the wrong encoding, the bytes in the string don't actually mean what is intended in the encoding it's tagged with... in which case you'll get garbage.
Ruby will do the right things with your string instead of creating garbage only if the string's encoding tag is correct. The string's encoding tag is determined by Encoding.default_external for most input operations by default (Encoding.default_external usually starts out as UTF-8, or ASCII-8BIT which really means the null encoding, binary data, not tagged with an encoding), or by passing an argument to File.open: File.open("something", "r:UTF-8") or, meaning the same thing, File.open("something", "r", :encoding => "UTF-8"). The actual bytes are determined by whatever is in the file. It's up to you to tell Ruby the correct encoding to interpret those bytes as text meaning what they were intended to mean.
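The difference between a wrong tag and a correct one can be sketched with a temporary file. The file contents and variable names below are made up for illustration, and the mode string "r:Windows-1252:UTF-8" means: external encoding Windows-1252, transcode to UTF-8 on read:

```ruby
require 'tempfile'

# Write the single byte 0xE4, which is "ä" in Windows-1252.
file = Tempfile.new('ansi')
file.binmode
file.write("\xE4".b)
file.close

# Wrong tag: the byte is read as-is and merely labelled UTF-8.
mis_tagged = File.read(file.path, mode: 'r:UTF-8')
puts mis_tagged.valid_encoding?   # false

# Correct tag plus transcoding to UTF-8 while reading.
transcoded = File.read(file.path, mode: 'r:Windows-1252:UTF-8')
puts transcoded.bytes.inspect     # [195, 164]

file.unlink
```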
There were a couple posts recently to reddit /r/ruby that try to explain how to troubleshoot and workaround encoding issues that you may find helpful:
http://www.justinweiss.com/articles/how-to-get-from-theyre-to-theyre/
http://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/
Also, this is my favorite article on understanding encoding generally: http://kunststube.net/encoding/
For YAML files in particular, if I were you, I'd just make sure they are all in UTF-8. Life will be much easier and you won't have to worry about it. If you have some legacy ones that have become corrupted, it's going to be a pain to fix them, but that's what you've got to do, unless you can just rewrite them from scratch. Try to fix them to be in valid and correct UTF-8, and from here on out keep all your YAML in UTF-8.
The YAML specification says in "5.1. Character Set":
To ensure readability, YAML streams use only the printable subset of the Unicode character set. The allowed character range explicitly excludes the C0 control block #x0-#x1F (except for TAB #x9, LF #xA, and CR #xD which are allowed), DEL #x7F, the C1 control block #x80-#x9F (except for NEL #x85 which is allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF.
Note that these ranges refer to Unicode code points, not to the raw bytes of a file. Windows-1252 puts printable characters such as curly quotes and the Euro sign at bytes 0x80-0x9F, so if such a file is mislabelled as ISO-8859-1, those bytes turn into characters from the C1 control block #x80-#x9F, the resulting YAML does not meet the spec, and the YAML generator didn't do its job correctly. "ä" is a different case: in Windows-1252 it is the single byte 0xE4, which is not valid UTF-8 on its own, and that is why it comes back as the bytes 239, 191, 189 (U+FFFD, the replacement character) when the file is read as UTF-8.
On output, a YAML processor must only produce acceptable characters. Any excluded characters must be presented using escape sequences. In addition, any allowed characters known to be non-printable should also be escaped. This isn’t mandatory since a full implementation would require extensive character property tables.
These days, by default, Ruby uses UTF-8, however YAML isn't limited to that. The spec goes on to say in "5.2. Character Encodings":
On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported.
If a character stream begins with a byte order mark, the character encoding will be taken to be as indicated by the byte order mark. Otherwise, the stream must begin with an ASCII character. This allows the encoding to be deduced by the pattern of null (#x00) characters.
So, UTF-8, 16 and 32 are supported, but Ruby will assume UTF-8. If the BOM is present you'll see it when you view the file in an editor. I haven't tried loading a UTF-16 or 32 file to see what Ruby's YAML does, so that's left as an experiment.

Convert a unicode string to characters in Ruby?

I have the following string:
l\u0092issue
My question is how to convert it to utf8 characters ?
I have tried that
1.9.3p484 :024 > "l\u0092issue".encode('utf-8')
=> "l\u0092issue"
You seem to have got your encodings into a bit of a mix up. If you haven’t already, you should first read Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) which provides a good introduction into this type of thing. There is a good set of articles on how Ruby handles character encodings at http://graysoftinc.com/character-encodings/understanding-m17n-multilingualization. You could also have a look at the Ruby docs for String and Encoding.
In this specific case, the string l\u0092issue means that the second character is the character with the unicode codepoint 0x92. This codepoint is PRIVATE USE TWO (see the chart), which basically means this position isn’t used.
However, looking at the Windows CP-1252 encoding, position 0x92 is occupied by the character ’, so if this is the missing character then the string would be l’issue, which looks a lot more likely even though I don’t speak French.
What I suspect has happened is your program has received the string l’issue encoded in CP-1252, but has assumed it was encoded in ISO-8859-1 (ISO-8859-1 and CP-1252 are quite closely related) and re-encoded it to UTF-8 leaving you with the string you now have.
The real fix for you is to be careful about the encodings of any strings that enter (and leave) your program, and how you manage them.
To transform your string to l’issue, you can encode it back to ISO-8859-1, then use force_encoding to tell Ruby the real encoding of CP-1252, and then you can re-encode to UTF-8:
2.1.0 :001 > s = "l\u0092issue"
=> "l\u0092issue"
2.1.0 :002 > s = s.encode('iso-8859-1')
=> "l\x92issue"
2.1.0 :003 > s.force_encoding('cp1252')
=> "l\x92issue"
2.1.0 :004 > s.encode('utf-8')
=> "l’issue"
This is only really a demonstration of what is going on though. The real solution is to make sure you’re handling encodings correctly.
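For completeness, the three steps can be wrapped in a small helper. The method name repair_cp1252 is hypothetical, and the sketch assumes the damage really came from CP-1252 text being read as ISO-8859-1, as suspected above:

```ruby
# Undo a CP-1252 -> ISO-8859-1 mix-up on a UTF-8 string.
# Hypothetical helper; raises if the string contains characters
# that do not exist in ISO-8859-1.
def repair_cp1252(str)
  str.encode('ISO-8859-1')          # back to the original single bytes
     .force_encoding('Windows-1252') # re-tag with the real encoding
     .encode('UTF-8')                # transcode properly this time
end

fixed = repair_cp1252("l\u0092issue")
puts fixed                                          # l’issue
puts fixed.codepoints.include?(0x2019)              # true
```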
That is encoded as UTF-8 (unless you changed the original string encoding). Ruby is just showing you the escape sequences when you inspect the string (which is what IRB does there). \u0092 is the escape sequence for this character.
Try puts "l\u0092issue" to see the rendered character, if your terminal font supports it.

What does #encoding BINARY mean in the Ruby code?

There is a line #encoding BINARY in the beginning of the code, what does it mean?
http://ruby.runpaint.org/encoding
Ruby defines an encoding named ASCII-8BIT, with an alias of BINARY, which does not correspond to any known encoding. It is intended to be associated with binary data, such as the bytes that make up a PNG image, so has no restrictions on content. One byte always corresponds with one character. This allows a String, for instance, to be treated as bag of bytes rather than a sequence of characters. ASCII-8BIT, then, effectively corresponds to the absence of an encoding, so methods that expect an encoding name recognise nil as a synonym.
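The "one byte is one character" point can be seen in a minimal sketch, assuming the source file itself is UTF-8 (the default) and using illustrative variable names:

```ruby
text   = "あ"      # UTF-8 text: one character, three bytes
binary = text.b    # copy of the same bytes, tagged as binary

puts text.length       # 1
puts text.bytesize     # 3
puts binary.length     # 3
puts binary.bytes == text.bytes                          # true

# BINARY and ASCII-8BIT are two names for the same Encoding object.
puts binary.encoding == Encoding::BINARY                 # true
puts Encoding::BINARY.equal?(Encoding::ASCII_8BIT)       # true
```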
That line is how we tell the Ruby interpreter to expect a certain character set in the source file.
James Edward Gray II has a great series on dealing with character encodings in Ruby. In particular, "Ruby 1.9's Three Default Encodings" might be good reading if you want to understand the details.

Resources