Ruby: ARGV breaks accented characters

# encoding: utf-8
foo = "Résumé"
p foo
> "Résumé"
# encoding: utf-8
ARGV.each do |argument|
  p argument
end
test.rb Résumé
> "R\xE9sum\xE9"
Why does this occur, and how can I get ARGV to return "Résumé"?
I have chcp 65001 set already and am using ruby 1.9.2p290 (2011-07-09) [i386-mingw32].
EDIT: After asking around on IRC, I was told to run chcp 1252>NUL, which fixed the problem.

For some reason, your Windows console isn't using UTF-8. So although Ruby expects a UTF-8 encoded string, it receives a Windows-1252 encoded one.
So you have several options (which I can't test as I, fortunately, don't use Windows):
Persuade Windows to use UTF-8 in your console. I don't know whether chcp should work or, if it should, why it doesn't.
Tell Ruby to use Windows-1252 instead of UTF-8 as the default encoding.
Convert ARGV from Windows-1252 to UTF-8 manually:
Example:
>> argument = "R\xE9sum\xE9"
=> "R\xE9sum\xE9"
>> argument.force_encoding('windows-1252').encode('utf-8')
=> "Résumé"
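The same conversion could be applied to every argument at once. This is a minimal sketch, assuming the console really delivers Windows-1252 bytes as in the question; the helper name arguments_to_utf8 is hypothetical and not part of Ruby:

```ruby
# Hypothetical helper: reinterpret each argument's bytes as Windows-1252,
# then transcode them to UTF-8. The source encoding is an assumption and
# must match the console's actual code page.
def arguments_to_utf8(arguments, source_encoding = 'Windows-1252')
  arguments.map do |argument|
    # dup first so the caller's strings (e.g. ARGV entries) are not mutated
    argument.dup.force_encoding(source_encoding).encode('UTF-8')
  end
end

# Usage in a script:
#   arguments_to_utf8(ARGV).each { |argument| p argument }
```

If the bytes are not actually Windows-1252, the result will be valid UTF-8 but show the wrong characters, so it is worth checking the code page first.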

Related

Ruby irb utf-8 encoding problem on windows 10 terminal input

I want to use Ruby with terminal input on Windows. Why can't the Ruby community solve this UTF-8 issue on Windows? Is it hard? I wonder how Python, Java, and other languages handled it; I can work with UTF-8 in Python on Windows with no pain.
With Ruby 3.0.1:
x = gets.chomp
çağrı
=> "\x87a\xA7r\x8D"
puts x
�a�r�
=> nil
x.valid_encoding?
=> false
I looked at https://bugs.ruby-lang.org/issues/16604, but it did not help.
With Ruby 3.0, the default external encoding (i.e. the assumed encoding of any data read from outside the Ruby process, such as from your shell when using gets) changed to UTF-8 on Windows. This was a response to various encoding issues occurring on Windows.
The data you are reading there from your shell, however, is not UTF-8 encoded. Instead, it appears your shell uses some different encoding, e.g. cp850.
A possible workaround would be to instruct Ruby to assume the locale encoding of your environment which you can set with the -E switch on the command invocation, e.g.:
irb -E locale
or by setting Encoding.default_external manually in your script to the correct encoding of your environment.
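As a rough illustration of the second option, the sketch below sets Encoding.default_external and then reads a line. It assumes the console code page is CP857 (substitute your own), and it uses a pipe carrying the raw bytes from the question as a stand-in for real console input; the method name read_sample_line is hypothetical.

```ruby
# Assumption: the console code page is CP857; substitute your own.
CONSOLE_ENCODING = 'CP857'

# Tell Ruby how to interpret bytes read from outside the process.
Encoding.default_external = CONSOLE_ENCODING

# Stand-in for console input: a pipe carrying the raw bytes from the question.
def read_sample_line
  reader, writer = IO.pipe
  writer.binmode                       # write the bytes untouched
  writer.write("\x87a\xA7r\x8D\n".b)   # the question's bytes for the typed word
  writer.close
  # The line is read using the default external encoding set above,
  # then transcoded to UTF-8 for use inside the program.
  reader.gets.chomp.encode('UTF-8')
ensure
  reader.close if reader
end

puts read_sample_line
```

In a real script the pipe would be replaced by a plain gets; only the Encoding.default_external assignment is the part described above.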
On a Turkish Windows PC, the cmd shell uses the CP857 encoding.
You can check this in the cmd > preferences section.
Here is a practical solution, with contributions from Holger.
irb(main):005:0> x = gets.chomp
Here is the Turkish chars ğĞüÜşŞiİıIöÖçÇ
=> "Here is the Turkish chars \xA7\xA6\x81\x9A\x9F\x9Ei\x98\x8DI\x94\x99\x87\x80"
irb(main):006:0> x.force_encoding "CP857"
=> "Here is the Turkish chars \xA7\xA6\x81\x9A\x9F\x9Ei\x98\x8DI\x94\x99\x87\x80"
irb(main):007:0> x.valid_encoding?
=> true
irb(main):008:0> x.encode("UTF-8", undef: :replace)
=> "Here is the Turkish chars ğĞüÜşŞiİıIöÖçÇ"

How to consistently get ASCII-8BIT encoding in Ruby?

Ruby seems a bit inconsistent in its handling of encodings:
irb -E BINARY:BINARY
irb(main):001:0> "hi".encoding
=> #<Encoding:ASCII-8BIT>
So that "works". Now what about plain ruby?
ruby -E BINARY:BINARY -e 'p "hi".encoding'
#<Encoding:US-ASCII>
That doesn't work. Furthermore, when p "hi".encoding is placed in x.rb, the output of ruby -E BINARY:BINARY x.rb is:
#<Encoding:UTF-8>
How do I get ASCII-8BIT literals when invoking ruby?
String literals have the same encoding as the script encoding. Instead of 'hi'.encoding you can use the keyword __ENCODING__ to retrieve it. The script encoding can be changed by putting a magic comment at the beginning of your script:
# encoding: ASCII-8BIT
p __ENCODING__ # => #<Encoding:ASCII-8BIT>
The -E flag of ruby doesn't affect the encoding of string literals. It only changes the external and internal encodings. You can read about the various types of encodings and their purposes in the Encoding documentation.
Back to the encoding of string literals: Even though irb claims its -E flag is the "Same as ruby -E" that isn't true. It uses the external encoding as script encoding. irb already has several limitations. This could be one of them. It's at least a documentation bug.
Besides the magic comment there's another discouraged way to set the script encoding via ruby: the -K flag and the n (none) kcode. ruby -Kne "p __ENCODING__" should print #<Encoding:ASCII-8BIT>. However -K also changes the external encoding.
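To see these encodings side by side, a small report like the one below may help. It is only a sketch: the method name encoding_report is hypothetical, and String#b (available since Ruby 2.0) is not mentioned in the answer above; it is shown here as one more way to get an ASCII-8BIT copy of a single string without changing the script encoding.

```ruby
# Collect the encodings discussed above in one place.
def encoding_report
  literal = "hi"
  {
    script:   __ENCODING__,              # set by the magic comment (or the default)
    literal:  literal.encoding,          # follows the script encoding
    binary:   literal.b.encoding,        # String#b returns an ASCII-8BIT copy
    external: Encoding.default_external  # this is what -E changes
  }
end

p encoding_report
```

Running the file with different -E values should change only the external entry, while a magic comment changes the script and literal entries.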

What encoding are Ruby Strings in?

Is it true that Ruby Strings are just a sequence of Unicode characters? If so, what specific encoding e.g. is it UTF-8, etc.?
The default encoding of a String is the same as the source file.
The default encoding of the source file is UTF-8 in Ruby 2.0 or later, or US-ASCII in Ruby 1.9 or earlier. You can specify the encoding by adding
# encoding: utf-8
in the beginning of a source file.
By default, Ruby strings are indeed UTF-8, as can be verified by the String#encoding method:
llama@llama:~$ irb
irb(main):001:0> 'foo'.encoding
=> #<Encoding:UTF-8>
You can get a list of available encodings via Encoding::list:
irb(main):002:0> Encoding.list
=> [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>, #<Encoding:US-ASCII>, (etc...)]
And change the encoding of a string with String#force_encoding:
irb(main):003:0> 'foo'.force_encoding(Encoding::US_ASCII).encoding
=> #<Encoding:US-ASCII>
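Note that String#force_encoding only relabels the existing bytes, whereas String#encode converts them. The following sketch contrasts the two on a single character; the helper names are hypothetical, and ISO-8859-1 is chosen only as a convenient example target.

```ruby
# Relabel: same bytes, different interpretation.
def relabel_as_latin1(text)
  text.dup.force_encoding(Encoding::ISO_8859_1)
end

# Transcode: same characters, different bytes.
def transcode_to_latin1(text)
  text.encode(Encoding::ISO_8859_1)
end

sample = "\u00E9"                     # "é", stored as two UTF-8 bytes
p sample.bytes                        # [195, 169]
p relabel_as_latin1(sample).bytes     # [195, 169] (now read as two characters)
p transcode_to_latin1(sample).bytes   # [233]
```

For ASCII-only strings such as 'foo' the two approaches produce identical bytes, which is why the difference is easy to miss.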

Encoding files with utf-8 using Ruby

I can't get Ruby to use UTF-8 when writing files.
A script like this
# encoding: UTF-8
puts "ą"
works fine,
but this one
# encoding: UTF-8
File.open("test.txt", "w:UTF-8") do |f|
  f.write "ą"
end
makes the console print
task.rb: 4: invalid multibyte char (UTF-8)
despite the fact that all commands turning on utf-8 encoding are applied.
I'm using ruby 2.0.0-p451 from rubyinstaller for windows.
OK, everything works now; I just changed the file encoding in Notepad++ from ANSI to UTF-8.
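One way to check for this situation before running a script is to test whether the file's raw bytes are valid UTF-8. A minimal sketch follows; the helper name utf8_file? is hypothetical:

```ruby
# Returns true when the file's raw bytes form valid UTF-8.
# The file is read in binary mode so Ruby does not interpret the bytes first.
def utf8_file?(path)
  File.binread(path).force_encoding(Encoding::UTF_8).valid_encoding?
end

# Usage: utf8_file?('task.rb')
```

A file saved as "ANSI" that contains a character such as ą will typically fail this check, because the single byte used by the legacy code page is not a valid UTF-8 sequence.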

How to solve this iconv exception?

The following code:
def convertToUTF8 str
  str = Iconv.conv('ASCII//IGNORE', 'UTF8', str).gsub("\x00", "") # line 585
  str.chomp!
  str
end
Raises exception:
myfile.rb:585: in `conv': invalid encoding ("ASCII", "UTF8") (Iconv::InvalidEncoding)
This error occurs on my MacBook Pro, but the same code runs on my peer's MacBook Pro, so I suppose it is an environment problem.
I tried the following code:
# The following two lines are in bar.rb
require 'iconv'
puts Iconv.list
Then I executed bar.rb:
$ ruby bar.rb | grep UTF # ruby version is 1.9.1-p376
UTF-8
UTF-8-MAC
UTF8-MAC
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UNICODE-1-1-UTF-7
UTF-7
CSUNICODE11UTF7
But if I switch to Ruby 2.0, the list contains "UTF8".
So the question becomes: how do I install "UTF8" for Ruby 1.9.1?
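The list above does include "UTF-8" (with a hyphen), so spelling the name that way in the Iconv.conv call is one thing to try. Another option that avoids Iconv's name list entirely is String#encode, available since Ruby 1.9. The sketch below is an approximation of the original method, not a drop-in replacement: the name utf8_to_ascii is hypothetical, and replacing unconvertible characters with an empty string is assumed to be close enough to //IGNORE.

```ruby
# Convert UTF-8 text to US-ASCII, dropping anything that cannot be represented.
# Note the direction: like the Iconv call above, this goes from UTF-8 to ASCII.
def utf8_to_ascii(str)
  ascii = str.encode('US-ASCII', 'UTF-8',
                     invalid: :replace,   # malformed byte sequences
                     undef:   :replace,   # characters with no ASCII equivalent
                     replace: '')         # drop them instead of inserting "?"
  ascii.delete("\x00").chomp              # strip NUL bytes and the trailing newline
end

# Usage: utf8_to_ascii("R\u00E9sum\u00E9\n")
```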

Resources