How to consistently get ASCII-8BIT encoding in Ruby? - ruby

Ruby seems a bit inconsistent in its handling of encodings:
irb -E BINARY:BINARY
irb(main):001:0> "hi".encoding
=> #<Encoding:ASCII-8BIT>
So that "works". Now what about plain ruby?
ruby -E BINARY:BINARY -e 'p "hi".encoding'
#<Encoding:US-ASCII>
That doesn't work. Furthermore, when p "hi".encoding is placed in x.rb, the output of ruby -E BINARY:BINARY x.rb is:
#<Encoding:UTF-8>
How do I get ASCII-8BIT literals when invoking ruby?

String literals have the same encoding as the script encoding. Instead of 'hi'.encoding you can use the keyword __ENCODING__ to retrieve it. The script encoding can be changed by putting a magic comment at the beginning of your script:
# encoding: ASCII-8BIT
p __ENCODING__ # => #<Encoding:ASCII-8BIT>
The -E flag of ruby doesn't affect the encoding of string literals. It only changes the external and internal encodings. You can read about the various types of encodings and their purposes in the Encoding documentation.
Back to the encoding of string literals: Even though irb claims its -E flag is the "Same as ruby -E" that isn't true. It uses the external encoding as script encoding. irb already has several limitations. This could be one of them. It's at least a documentation bug.
Besides the magic comment there's another discouraged way to set the script encoding via ruby: the -K flag and the n (none) kcode. ruby -Kne "p __ENCODING__" should print #<Encoding:ASCII-8BIT>. However -K also changes the external encoding.
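If what you need is binary strings rather than a binary script encoding, individual strings can also be relabelled at run time, however ruby was invoked. A minimal sketch, assuming Ruby 2.0 or later for String#b; the helper name binary_literal is made up for illustration:

```ruby
# Hypothetical helper: return a copy of +str+ tagged ASCII-8BIT (BINARY).
# String#b (Ruby 2.0+) copies the bytes and changes only the encoding label.
def binary_literal(str)
  str.b
end

# The script encoding still depends on the magic comment or the defaults.
p __ENCODING__

# The relabelled copy is binary regardless of the script encoding.
puts binary_literal("hi").encoding == Encoding::BINARY               # true

# Equivalent on Ruby 1.9, where String#b does not exist yet.
puts "hi".dup.force_encoding("ASCII-8BIT").encoding == Encoding::BINARY  # true
```

This does not change what "hi".encoding reports for a bare literal; only the magic comment (or -Kn) does that.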

Related

Ruby incompatible character encodings

I am currently trying to write a script that iterates over an input file and checks data on a website. If it finds the new data, it prints out to the terminal that it passes, if it doesn't it tells me it fails. And vice versa for deleted data. It was working fine until the input file I was given contains the "™" character. Then when ruby gets to that line, it is spitting out an error:
PDAPWeb.rb:73:in `include?': incompatible character encodings: UTF-8 and IBM437 (Encoding::CompatibilityError)
The offending line is a simple check to see if the text exists on the page.
if browser.text.include? (program_name)
Where the program_name variable is a parsed piece of information from the input file. In this instance, the program_name contains the 'TM' character mentioned before.
After some research I found that adding the line # encoding: utf-8 to the beginning of my script could help, but so far has not proven useful.
I added this to my program_name variable to see if it would help (and it allowed my script to run without errors), but now it is not properly finding the TM character when it should.
program_name = record[2].gsub("\n", '').force_encoding("utf-8").encode("IBM437", replace: nil)
This seemed to convert the TM character to this: Γäó
I thought maybe I had the IBM437 and utf-8 parts reversed, so I tried the opposite
program_name = record[2].gsub("\n", '').force_encoding("IBM437").encode("utf-8", replace: nil)
and am now receiving this error when attempting to run the script
PDAPWeb.rb:48:in `encode': U+2122 from UTF-8 to IBM437 (Encoding::UndefinedConversionError)
I am using ruby 1.9.3p392 (2013-02-22) and I'm not sure if I should upgrade as this is the standard version installed in my company.
Is my encoding incorrect and causing it to convert the TM character with errors?
Here’s what it looks like is going on. Your input file contains a ™ character, and it is in UTF-8 encoding. However when you read it, since you don’t specify the encoding, Ruby assumes it is in your system’s default encoding of IBM437 (you must be on Windows).
This is basically the same as this:
>> input = "™"
=> "™"
>> input.encoding
=> #<Encoding:UTF-8>
>> input.force_encoding 'ibm437'
=> "\xE2\x84\xA2"
Note that force_encoding doesn’t change the actual string, just the label associated with it. This is the same outcome as in your case, only you arrive here via a different route (by reading the file).
The web page also has a ™ symbol, and is also encoded as UTF-8, but in this case Ruby has the encoding correct (Watir probably uses the headers from the page):
>> web_page = '™'
=> "™"
>> web_page.encoding
=> #<Encoding:UTF-8>
Now when you try to compare these two strings you get the compatibility error, because they have different encodings:
>> web_page.include? input
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and IBM437
from (irb):11:in `include?'
from (irb):11
from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'
If either of the two strings only contained ASCII characters (i.e. code points less than 128) then this comparison would have worked. UTF-8 and IBM437 are both supersets of ASCII, and two strings in these encodings are only incompatible if they both contain characters outside of the ASCII range. This is why you only started seeing this behaviour when the input file had a ™.
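The ASCII-only rule can be checked directly with Encoding.compatible?, which returns nil when two strings cannot be combined. A small sketch; the helper comparable? and the sample strings are invented for illustration:

```ruby
# Hypothetical helper: true when Ruby can compare or concatenate the two strings.
def comparable?(a, b)
  !Encoding.compatible?(a, b).nil?
end

page      = "Acme\u2122 Portal"                     # UTF-8 with a non-ASCII character
ascii_ibm = "Acme".dup.force_encoding("IBM437")     # ASCII-only bytes, IBM437 label
tm_ibm    = "\u2122".dup.force_encoding("IBM437")   # non-ASCII bytes, IBM437 label

puts comparable?(page, ascii_ibm)  # true: one side is ASCII-only
puts comparable?(page, tm_ibm)     # false: both sides contain non-ASCII bytes
puts page.include?(ascii_ibm)      # true
```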
The fix is to inform Ruby what the actual encoding of the input file is. You can do this with the already loaded string:
>> input.force_encoding 'utf-8'
=> "™"
You can also do this when reading the file, e.g. (there are a few ways of reading files, they all should allow you to explicitly specify the encoding):
input = File.read("input_file.txt", :encoding => "utf-8")
# now input will be in the correct encoding
Note in both of these the string isn’t being changed, it still contains the same bytes, but Ruby now knows its correct encoding.
Now the comparison should work okay:
>> web_page.include? input
=> true
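To reproduce the whole sequence without a Windows machine, the file read can be simulated with a temporary file. This is only a sketch: write_sample_input, the "Acme" text and page_text (standing in for browser.text) are assumptions, not part of the original script.

```ruby
require "tempfile"

# Hypothetical stand-in for the input file: one UTF-8 line ending in a trademark sign.
def write_sample_input
  file = Tempfile.new("input_file")
  file.binmode                  # write the bytes exactly as given
  file.write("Acme\u2122\n")
  file.close
  file
end

sample    = write_sample_input
page_text = "Welcome to Acme\u2122"   # stands in for browser.text, which is UTF-8

# Same bytes on disk, two different labels.
mislabeled = File.read(sample.path, encoding: "IBM437").chomp
correct    = File.read(sample.path, encoding: "UTF-8").chomp

begin
  page_text.include?(mislabeled)
rescue Encoding::CompatibilityError => e
  puts "mislabeled read fails: #{e.message}"
end

puts page_text.include?(correct)         # true
puts mislabeled.bytes == correct.bytes   # true: only the label differs
```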
There is no need to encode the string. Here’s what happens if you do. First if you correct the encoding to UTF-8 then encode to IBM437:
>> input.force_encoding("utf-8").encode("IBM437", replace: nil)
Encoding::UndefinedConversionError: U+2122 from UTF-8 to IBM437
from (irb):16:in `encode'
from (irb):16
from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'
IBM437 doesn’t include the ™ character, so you can’t encode a string containing it to this encoding without losing data. By default Ruby raises an exception when this happens. You can force the encoding by using the :undef option, but the symbol is lost:
>> input.force_encoding("utf-8").encode("IBM437", :undef => :replace)
=> "?"
If you go the other way, first using force_encoding to IBM437 then encoding to UTF-8 you get the string Γäó:
>> input.force_encoding("IBM437").encode("utf-8", replace: nil)
=> "Γäó"
The string is already in IBM437 encoding as far as Ruby is concerned, so force_encoding doesn’t do anything. The UTF-8 representation of ™ is the three bytes 0xe2 0x84 0xa2, and when interpreted as IBM437 these bytes correspond to the three characters seen here which are then converted into their UTF-8 representations.
(These two outcomes are the other way round from what you describe in the question, hence my comment above. I’m assuming that this is just a copy-and-paste error.)
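The byte-level explanation can be verified with a short round trip. A sketch with two made-up helper names, misread_as_ibm437 and repair_ibm437_mojibake:

```ruby
# Misread UTF-8 bytes as IBM437, then convert the result to UTF-8 (produces mojibake).
def misread_as_ibm437(utf8_text)
  utf8_text.dup.force_encoding("IBM437").encode("UTF-8")
end

# Undo the damage: recover the original bytes, then label them UTF-8 again.
def repair_ibm437_mojibake(garbled)
  garbled.encode("IBM437").force_encoding("UTF-8")
end

tm = "\u2122"
puts tm.bytes.map { |b| format("0x%02x", b) }.join(" ")  # 0xe2 0x84 0xa2

garbled = misread_as_ibm437(tm)
puts garbled == "\u0393\u00E4\u00F3"           # true: the three characters Γ, ä, ó
puts repair_ibm437_mojibake(garbled) == tm     # true: the round trip restores ™
```

The repair only works while every garbled character still exists in IBM437; once a "?" has been substituted via :undef => :replace, the original bytes are gone.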

Can I programmatically convert "Iâ€™d" to "I’d" using Ruby?

I can't seem to find the right combination of String#encode shenanigans.
I think I'd got confused on this one so I'll post this here to hopefully help anyone else who is similarly confused.
I was trying to do my encoding in an irb session, which gives you
irb(main):002:0> 'Iâ€™d'.force_encoding('UTF-8')
=> "Iâ€™d"
And if you try using encode instead of force_encoding then you get
irb(main):001:0> 'Iâ€™d'.encode('UTF-8')
=> "Iâ€™d"
This is with irb set to use an input and output encoding of UTF-8. In my case, converting that string the way I want involves telling Ruby that the source string is in Windows-1252 encoding. You can do this with the -E argument, in which you specify `inputencoding:outputencoding', and then you get this
$ irb -EWindows-1252:UTF-8
irb(main):001:0> 'Iâ€™d'
=> "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d"
That looks wrong unless you pipe it out, which gives this
$ ruby -E Windows-1252:UTF-8 -e "puts 'Iâ€™d'"
I’d
Hurrah. I'm not sure about why Ruby showed it as "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d" (something to do with the code page of the terminal?) so if anyone can comment with further insight that would be great.
I expect your script is using the encoding cp1251 and you have ruby >= 1.9.
Then you can use force_encoding:
#encoding: cp1251
#works also with encoding: binary
source = 'Iâ€™d'
puts source.force_encoding('utf-8') #-> I’d
If my expectations are wrong: Which encoding do you use and which ruby version?
A little background:
Problems with encoding are difficult to analyse. There may be conflicts between:
Encoding of the source code (That's defined by the editor).
Expected encoding of the source code (that's defined with #encoding on the first line). This is used by ruby.
Encoding of the string (see e.g. section String encodings in http://nuclearsquid.com/writings/ruby-1-9-encodings/ )
Encoding of the output shell
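When the garbled text is already in memory as a valid UTF-8 string (the characters I, â, €, ™, d), the same relabelling idea can be sketched without magic comments or -E. This assumes the damage came from reading UTF-8 bytes as Windows-1252; the helper name fix_cp1252_mojibake is invented here:

```ruby
# Hypothetical helper: undo UTF-8 text that was decoded as Windows-1252.
# encode recovers the original bytes; force_encoding relabels them as UTF-8.
def fix_cp1252_mojibake(text)
  text.encode("Windows-1252").force_encoding("UTF-8")
end

garbled = "I\u00E2\u20AC\u2122d"   # I, a-circumflex, euro sign, trademark sign, d
fixed   = fix_cp1252_mojibake(garbled)

puts fixed == "I\u2019d"     # true: U+2019 is the right single quotation mark
puts fixed.valid_encoding?   # true
```

encode raises Encoding::UndefinedConversionError if the text contains a character that Windows-1252 cannot represent, so this is not a general-purpose repair.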

What encoding are Ruby Strings in?

Is it true that Ruby Strings are just a sequence of Unicode characters? If so, what specific encoding e.g. is it UTF-8, etc.?
The default encoding of a String is the same as the source file.
The default encoding of the source file is UTF-8 in Ruby 2.0 or later, or US-ASCII in Ruby 1.9 or earlier. You can specify the encoding by adding
# encoding: utf-8
at the beginning of a source file.
By default, Ruby strings are indeed UTF-8, as can be verified by the String#encoding method:
llama@llama:~$ irb
irb(main):001:0> 'foo'.encoding
=> #<Encoding:UTF-8>
You can get a list of available encodings via Encoding::list:
irb(main):002:0> Encoding.list
=> [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>, #<Encoding:US-ASCII>, (etc...)]
And change the encoding of a string with String#force_encoding:
irb(main):003:0> 'foo'.force_encoding(Encoding::US_ASCII).encoding
=> #<Encoding:US-ASCII>
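Note that force_encoding only changes the label, while String#encode converts the bytes. A small sketch of the difference; hex_bytes is a made-up helper and ISO-8859-1 is just a convenient second encoding:

```ruby
# Hypothetical helper: hex dump of a string's bytes.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

word      = "caf\u00E9"                              # UTF-8 because of the \u escape
relabeled = word.dup.force_encoding("ISO-8859-1")    # same bytes, new label
converted = word.encode("ISO-8859-1")                # new bytes, same characters

puts hex_bytes(word)       # 63 61 66 C3 A9
puts hex_bytes(relabeled)  # 63 61 66 C3 A9
puts hex_bytes(converted)  # 63 61 66 E9
puts relabeled.length      # 5: each byte is now read as one character
puts converted.length      # 4
```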

Ruby: ARGV breaks accented characters

# encoding: utf-8
foo = "Résumé"
p foo
> "Résumé"
# encoding: utf-8
ARGV.each do |argument|
p argument
end
test.rb Résumé > "R\xE9sum\xE9"
Why does this occur, and how can I get ARGV to return "Résumé"?
I have chcp 65001 set already and am using ruby 1.9.2p290 (2011-07-09) [i386-mingw32]
EDIT After asking around on irc, I was instructed to do chcp 1252>NUL which fixed the problem.
For some reason, Windows doesn't use UTF-8 in your console. So, although Ruby expects UTF-8 encoded string, it gets Windows-1252 encoded string.
So you have several possibilities (which I can't test as I, fortunately, don't use Windows):
Persuade Windows to use UTF-8 in your console. I don't know if chcp should work and, if so, why it doesn't.
Tell Ruby to use Windows-1252 instead of UTF-8 as default
Convert ARGV from Windows-1252 to UTF-8 manually:
Example:
>> argument = "R\xE9sum\xE9"
=> "R\xE9sum\xE9"
>> argument.force_encoding('windows-1252').encode('utf-8')
=> "Résumé"
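The manual conversion can be wrapped in a small helper and applied to every argument. A sketch, assuming the console really delivers Windows-1252 bytes; argv_to_utf8 is a made-up name:

```ruby
# Hypothetical helper: reinterpret command-line arguments that arrived as
# Windows-1252 bytes and convert them to UTF-8. The source encoding is an
# assumption; pass a different one if the console uses another code page.
def argv_to_utf8(args, source_encoding = "Windows-1252")
  # dup first, so frozen or shared argument strings are left untouched
  args.map { |arg| arg.dup.force_encoding(source_encoding).encode("UTF-8") }
end

raw = ["R\xE9sum\xE9".b]                               # the bytes shown in the question
puts argv_to_utf8(raw) == ["R\u00E9sum\u00E9"]         # true
```

In a script this would be called once near the top, for example args = argv_to_utf8(ARGV).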

Ruby 1.9 -Ku, mem_cache_store and invalid multibyte escape error

Originally this bug was posted here: https://rails.lighthouseapp.com/projects/8994/tickets/5713-ruby-19-ku-incompatible-with-mem_cache_store
And now, as we've run into the same issue, I'll copy the question from that ticket here, hoping someone already has an answer:
When Ruby 1.9 is started in unicode mode (-Ku), mem_cache_store.rb fails to parse:
/usr/local/ruby19/bin/ruby -Ku /usr/local/ruby-1.9.2-p0/lib/ruby/gems/1.9.1/gems/activesupport-3.0.0/lib/active_support/cache/mem_cache_store.rb
/usr/local/ruby-1.9.2-p0/lib/ruby/gems/1.9.1/gems/activesupport-3.0.0/lib/active_support/cache/mem_cache_store.rb:32: invalid multibyte escape: /[\x00-\x20%\x7F-\xFF]/
Our case is practically identical: when you set config.action_controller.cache_store to :mem_cache_store and try to run tests, the console, or the server, you receive this in return:
/Users/%username%/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.1/lib/active_support/cache/mem_cache_store.rb:32: invalid multibyte escape: /[\x00-\x20%\x7F-\xFF]/
Any ideas how this can be avoided?..
Ruby 1.9 in unicode mode will attempt to interpret the regular expression as unicode. To avoid this you need to pass the regular expression option "n" for "no encoding":
ESCAPE_KEY_CHARS = /[\x00-\x20%\x7F-\xFF]/n
Now we have our raw 8-bit encoding (the only thing Ruby 1.8 speaks) as intended:
ruby-1.9.2-p136 :001 > ESCAPE_KEY_CHARS = /[\x00-\x20%\x7F-\xFF]/n.encoding
=> #<Encoding:ASCII-8BIT>
Hopefully the Rails team fixes this; for now you have to edit the file.
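To see the "n" option at work outside of Rails, here is a minimal sketch. The escape_key helper is hypothetical and is not the ActiveSupport implementation; it only shows that the regular expression matches raw bytes once the string is labelled as binary too.

```ruby
# The /n flag makes the regexp ASCII-8BIT, so \x7F-\xFF are read as raw bytes.
ESCAPE_KEY_CHARS = /[\x00-\x20%\x7F-\xFF]/n

# Hypothetical helper: percent-escape every byte matched by ESCAPE_KEY_CHARS.
# The key is relabelled as binary first so that the regexp and the string
# have compatible encodings.
def escape_key(key)
  key.to_s.b.gsub(ESCAPE_KEY_CHARS) { |m| format("%%%02X", m.getbyte(0)) }
end

puts ESCAPE_KEY_CHARS.encoding == Encoding::ASCII_8BIT  # true
puts escape_key("user 1")                               # user%201
puts escape_key("caf\u00E9")                            # caf%C3%A9
```

String#b requires Ruby 2.0 or later; on 1.9 the equivalent is key.to_s.dup.force_encoding("ASCII-8BIT").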
