Is UTF-8 the default encoding in Ruby 2?

Matz wrote in his book that in order to use UTF-8, you must add a coding comment on the first line of your script. He gives us an example:
# -*- coding: utf-8 -*- # Specify Unicode UTF-8 characters
# This is a string literal containing a multibyte multiplication character
s = "2x2=4"
# The string contains 6 bytes which encode 5 characters
s.length # => 5: Characters: '2' 'x' '2' '=' '4'
s.bytesize # => 6: Bytes (hex): 32 c3 97 32 3d 34
When he invokes bytesize, it returns 6 since the multiplication symbol × is outside the ASCII set and is encoded in UTF-8 as two bytes.
I tried the exercise and without specifying the coding comment, it recognized the multiplication symbol as two bytes:
'×'.encoding
=> #<Encoding:UTF-8>
'×'.bytes.to_a.map {|dec| dec.to_s(16) }
=> ["c3", "97"]
So it appears UTF-8 is the default encoding. Is this a recent addition in Ruby 2? His examples were from Ruby 1.9.

Yes. UTF-8 has been the default source encoding only since Ruby 2.0.
Since you know his examples were from Ruby 1.9, check the features newly added in the later versions of Ruby. The list is not that long.
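A quick way to confirm this yourself is to check `__ENCODING__`, which reports the source encoding of the current file. A minimal sketch (no magic comment anywhere):

```ruby
# On Ruby 2.0+ this file is treated as UTF-8 without a coding comment.
puts __ENCODING__          # source encoding of this file

s = "2×2=4"                # contains the multibyte × character
puts s.length              # 5 characters
puts s.bytesize            # 6 bytes (× takes two bytes in UTF-8)
puts s.encoding            # encoding tagged on the literal
```

On Ruby 1.9 the same file without a `# coding:` comment would be treated as US-ASCII and fail to parse the × literal.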

Related

Invalid UTF-8 Ruby strings

I'm running into some strange behaviour and inconsistency in the way that Ruby (v2.5.3) deals with encoded strings versus the YAML parser. Here's an example:
"\x80" # Returns "\x80"
"\x80".bytesize # Returns 1
"\x80".bytes # Returns [128]
"\x80".encoding # Returns UTF-8
YAML.load('{value: "\x80"}')["value"] # Returns "\u0080"
YAML.load('{value: "\x80"}')["value"].bytesize # Returns 2
YAML.load('{value: "\x80"}')["value"].bytes # Returns [194, 128]
YAML.load('{value: "\x80"}')["value"].encoding # Returns UTF-8
My understanding of UTF-8 is that any single-byte value above 0x7F should be encoded into two bytes. So my questions are the following:
Is the one byte string "\x80" valid UTF-8?
If so, why does YAML convert it into a two-byte pattern?
If not, why is Ruby claiming the encoding is UTF-8 but containing an invalid byte sequence?
Is there a way to make the YAML parser and the Ruby string behave in the same way as each other?
It is not valid UTF-8
"\x80".valid_encoding?
# false
Ruby claims it is UTF-8 because all String literals are tagged UTF-8 by default, even if that makes them invalid.
I don't think you can force the YAML parser to return invalid UTF-8, but to get Ruby to convert that byte the same way, you can do this:
"\x80".b.ord.chr('utf-8')
# "\u0080"
.b is only available in Ruby 2+. You need to use force_encoding otherwise.
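On Ruby 2.1+ there is also String#scrub, which replaces invalid byte sequences rather than reinterpreting them. A sketch contrasting the two approaches:

```ruby
s = "\x80"                 # tagged UTF-8, but not a valid UTF-8 sequence
s.valid_encoding?          # => false

# Option 1 (Ruby 2.1+): replace the invalid byte with a marker.
s.scrub("?")               # => "?"

# Option 2: treat the raw byte value as a codepoint, matching YAML's output.
s.b.ord.chr("utf-8")       # => "\u0080"
```

Which one you want depends on whether the 0x80 byte is garbage to discard or a codepoint that was serialized as a raw byte.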

Ruby incompatible character encodings

I am currently trying to write a script that iterates over an input file and checks data on a website. If it finds the new data, it prints out to the terminal that it passes; if it doesn't, it tells me it fails. And vice versa for deleted data. It was working fine until the input file I was given contained the "™" character. When Ruby gets to that line, it spits out an error:
PDAPWeb.rb:73:in `include?': incompatible character encodings: UTF-8 and IBM437
(Encoding::CompatibilityError)
The offending line is a simple check to see if the text exists on the page.
if browser.text.include?(program_name)
Where the program_name variable is a parsed piece of information from the input file. In this instance, the program_name contains the 'TM' character mentioned before.
After some research I found that adding the line # encoding: utf-8 to the beginning of my script could help, but so far has not proven useful.
I added this to my program_name variable to see if it would help (and it allowed my script to run without errors), but now it does not properly find the TM character when it should.
program_name = record[2].gsub("\n", '').force_encoding("utf-8").encode("IBM437", replace: nil)
This seemed to convert the TM character to this: Γäó
I thought maybe i had IBM437 and utf-8 parts reversed, so I tried the opposite
program_name = record[2].gsub("\n", '').force_encoding("IBM437").encode("utf-8", replace: nil)
and am now receiving this error when attempting to run the script
PDAPWeb.rb:48:in `encode': U+2122 from UTF-8 to IBM437 (Encoding::UndefinedConve
rsionError)
I am using ruby 1.9.3p392 (2013-02-22) and I'm not sure if I should upgrade as this is the standard version installed in my company.
Is my encoding incorrect and causing it to convert the TM character with errors?
Here’s what it looks like is going on. Your input file contains a ™ character, and it is in UTF-8 encoding. However when you read it, since you don’t specify the encoding, Ruby assumes it is in your system’s default encoding of IBM437 (you must be on Windows).
This is basically the same as this:
>> input = "™"
=> "™"
>> input.encoding
=> #<Encoding:UTF-8>
>> input.force_encoding 'ibm437'
=> "\xE2\x84\xA2"
Note that force_encoding doesn’t change the actual string, just the label associated with it. This is the same outcome as in your case, only you arrive here via a different route (by reading the file).
The web page also has a ™ symbol, and is also encoded as UTF-8, but in this case Ruby has the encoding correct (Watir probably uses the headers from the page):
>> web_page = '™'
=> "™"
>> web_page.encoding
=> #<Encoding:UTF-8>
Now when you try to compare these two strings you get the compatibility error, because they have different encodings:
>> web_page.include? input
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and IBM437
from (irb):11:in `include?'
from (irb):11
from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'
If either of the two strings only contained ASCII characters (i.e. code points less than 128) then this comparison would have worked. UTF-8 and IBM437 are both supersets of ASCII, and are only incompatible if they both contain characters outside of the ASCII range. This is why you only started seeing this behaviour when the input file had a ™.
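That ASCII-compatibility rule is easy to demonstrate in irb. A sketch (the variable names are illustrative):

```ruby
# Two strings with different encodings compare fine while both are pure ASCII:
a = "Program Name".force_encoding("IBM437")
b = "Some page text with Program Name in it"   # UTF-8 literal
b.include?(a)          # => true — both sides contain only ASCII bytes

# If the IBM437 side gained a non-ASCII character (like ™ read from the file),
# the same include? call would raise Encoding::CompatibilityError.
```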
The fix is to inform Ruby what the actual encoding of the input file is. You can do this with the already loaded string:
>> input.force_encoding 'utf-8'
=> "™"
You can also do this when reading the file, e.g. (there are a few ways of reading files, they all should allow you to explicitly specify the encoding):
input = File.read("input_file.txt", :encoding => "utf-8")
# now input will be in the correct encoding
Note in both of these the string isn’t being changed, it still contains the same bytes, but Ruby now knows its correct encoding.
Now the comparison should work okay:
>> web_page.include? input
=> true
There is no need to encode the string. Here’s what happens if you do. First if you correct the encoding to UTF-8 then encode to IBM437:
>> input.force_encoding("utf-8").encode("IBM437", replace: nil)
Encoding::UndefinedConversionError: U+2122 from UTF-8 to IBM437
from (irb):16:in `encode'
from (irb):16
from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'
IBM437 doesn’t include the ™ character, so you can’t encode a string containing it to this encoding without losing data. By default Ruby raises an exception when this happens. You can force the encoding by using the :undef option, but the symbol is lost:
>> input.force_encoding("utf-8").encode("IBM437", :undef => :replace)
=> "?"
If you go the other way, first using force_encoding to IBM437 then encoding to UTF-8 you get the string Γäó:
>> input.force_encoding("IBM437").encode("utf-8", replace: nil)
=> "Γäó"
The string is already in IBM437 encoding as far as Ruby is concerned, so force_encoding doesn’t do anything. The UTF-8 representation of ™ is the three bytes 0xe2 0x84 0xa2, and when interpreted as IBM437 these bytes correspond to the three characters seen here which are then converted into their UTF-8 representations.
(These two outcomes are the other way round from what you describe in the question, hence my comment above. I’m assuming that this is just a copy-and-paste error.)
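The whole answer boils down to the difference between relabeling and converting, which can be condensed into a few lines:

```ruby
s = "™"                                    # UTF-8, bytes e2 84 a2
relabeled = s.dup.force_encoding("IBM437") # force_encoding: same bytes, new label
relabeled.bytes == s.bytes                 # => true
converted = s.encode("UTF-16LE")           # encode: bytes actually transformed
converted.bytes                            # => [0x22, 0x21], the UTF-16LE form of U+2122
```

Use force_encoding when the label is wrong, and encode when you genuinely need the data in a different encoding.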

Converting gsub() pattern from ruby 1.8 to 2.0

I have a ruby program that I'm trying to upgrade from ruby 1.8 to ruby 2.0.0-p247.
This works just fine in 1.8.7:
begin
ARGF.each do |line|
# a collection of peculiarities, appended as they appear in data
line.gsub!("\x92", "'")
line.gsub!("\x96", "-")
puts line
end
rescue => e
$stderr << "exception on line #{$.}:\n"
$stderr << "#{e.message}:\n"
$stderr << line
end
But under ruby 2.0, this results in an exception when it encounters a 96 or 92 in a data file that otherwise contains what appears to be ASCII:
invalid byte sequence in UTF-8
I have tried all manner of things: double backslashes, using a regex object instead of the string, force_encoding(), etc. and am stumped.
Can anybody fill in the missing puzzle piece for me?
Thanks.
=============== additions: 2013-09-25 ============
Changing \x92 to \u2019 did not fix the problem.
The program does not error until it actually hits a 92 or 96 in the input file, so I'm confused as to how the character pattern in the string is the problem when there are hundreds of thousands of lines of input data that are matched against the patterns without incident.
It's not the regex that's throwing the exception, it's the Ruby compiler. \x92 and \x96 are how you would represent ’ and – in the windows-1252 encoding, but Ruby expects the string to be UTF-8 encoded. You need to get out of the habit of putting raw byte values like \x92 in your string literals. Non-ASCII characters should be specified by Unicode escape sequences (in this case, \u2019 and \u2013).
It's a Unicode world now, stop thinking of text in terms of bytes and think in terms of characters instead.
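If the data file really does contain raw Windows-1252 bytes, another option is to convert whole lines to UTF-8 before substituting. A hedged sketch (assuming the input is Windows-1252, which the 0x92/0x96 bytes suggest):

```ruby
# Relabel the bytes as Windows-1252, convert to UTF-8, then substitute
# by character instead of by raw byte.
line = "don\x92t \x96 ok".force_encoding("Windows-1252").encode("UTF-8")
line.gsub!("\u2019", "'")   # right single quote  ’ -> '
line.gsub!("\u2013", "-")   # en dash             – -> -
line                        # => "don't - ok"
```

This keeps every string in the program valid UTF-8, so later operations won't trip over invalid byte sequences.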

Confusion with ARGF#set_encoding

ARGF.set_encoding says:
If single argument is specified, strings read from ARGF are tagged with the encoding specified.
If two encoding names separated by a colon are given, e.g. "ascii:utf-8", the read string is converted from the first encoding (external encoding) to the second encoding (internal encoding), then tagged with the second encoding.
So I tried the below:
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding('ascii')
p ARGF.readlines($/)
output:
D:\Rubyscript\My ruby learning days>ruby true.rb a.txt
"2.0.0"
#<Encoding:IBM437>
["Hi! How are you?\n", "I am doing good,thanks."]
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding(ARGF.external_encoding,'ascii')
p ARGF.readlines($/)
output:
D:\Rubyscript\My ruby learning days>ruby true.rb a.txt
"2.0.0"
#<Encoding:IBM437>
["Hi! How are you?\n", "I am doing good,thanks."]
No encoding change is found. So please advise me on the correct approach.
The encodings IBM437 and ASCII (and UTF-8) have the same byte sequences for ASCII characters, so you won't see the difference from String#inspect. However, you can check the String#encoding value for the input strings.
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding(ARGF.external_encoding,'ascii')
p ARGF.readlines($/).map{|s| s.encoding}
In Ruby (1.9 and higher version), String is a byte sequence tagged with some encoding. You can get the encoding from String#encoding.
So the Chinese word "中" can be represented different ways:
e4 b8 ad # tagged with encoding UTF-8
d6 d0 # tagged with encoding GBK
2d 4e # tagged with encoding UTF-16le
I will always write my script in UTF-8; that is, the internal encoding for my script is UTF-8. Sometimes I want to process a text file (e.g. named "a.txt" with content "中") encoded with GBK. Then I can set the external encoding and the internal encoding for the IO object, and Ruby will do the conversion for me.
ARGF.set_encoding('GBK', 'UTF-8')
str = ARGF.readline
puts str.encoding
# run: $ ruby script.rb a.txt
Ruby reads "\xd6\xd0" from "a.txt", and since I have specified the external encoding as GBK, it tags the data with encoding GBK. I have specified the internal encoding as UTF-8, so Ruby does a conversion from the GBK byte sequence to UTF-8, which results in "\xe4\xb8\xad" tagged UTF-8. This string has the same encoding as other strings in my script, so I can use it with ease.
This is useful because a lot of String methods fail when the two String operands have different, incompatible encodings. For example:
# encoding: utf-8
a = "中" # tagged with UTF-8
b = "中".encode('gbk') # tagged with GBK
puts a + b
#=> Encoding::CompatibilityError: incompatible character encodings: UTF-8 and GBK
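When conversion at read time isn't an option, the usual fix for that error is to convert one operand to the other's encoding before combining them. A minimal sketch:

```ruby
a = "中"                  # tagged UTF-8
b = "中".encode("GBK")    # tagged GBK, bytes d6 d0
a + b.encode("UTF-8")     # => "中中" — convert first, then concatenate
```

This is exactly what the internal-encoding argument to set_encoding automates: every string entering the script is converted once, up front.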

How do I translate or strip character sequences like "\xC2\xBB" in my strings?

How do I translate or strip character sequences like "\xC2\xBB" in my strings in Ruby 1.9.2?
You will usually see hex bytes like that when the string is using an encoding that does not handle those bytes. If you know what encoding the string is supposed to be using, you can use String#force_encoding to re-interpret the bytes according to your desired encoding.
# Under a UTF-8 locale:
ruby-1.9.2-head :013 > "\xC2\xBB".force_encoding(Encoding::UTF_8)
=> "»"
# Under the “C” locale:
ruby-1.9.2-head :007 > "\xC2\xBB".force_encoding(Encoding::UTF_8)
=> "\u00BB"
Both result in the same UTF-8 encoded string internally. When under the C locale, Ruby prints an escaped version to avoid printing binary data to the terminal (which, according to the locale setting, might not support it).
If the string is already using the appropriate encoding, then you should re-encode the string to your desired output encoding before using it:
# Under a UTF-8 locale:
ruby-1.9.2-head :026 > "\xC2\xBB".force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
=> "»"
# Under the “C” locale:
ruby-1.9.2-head :014 > "\xC2\xBB".force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
=> "\u00C2\u00BB"
Above, I use String#force_encoding to make sure the bytes in the string are flagged as ISO 8859-1 (because, for instance, a header accompanying the bytes said that they represented an ISO 8859-1 encoded string) and then use String#encode to re-encode it as UTF-8 (the desired output encoding).
Finally, if you really just want to strip out anything that is not ASCII, you could use the negated [:ascii:] character class with String#gsub:
ruby-1.9.2-head :030 > "foo\xC2\xBBbar".force_encoding(Encoding::UTF_8).gsub(/[[:^ascii:]]/,'')
=> "foobar"
