Ruby irb UTF-8 encoding problem on Windows 10 terminal input

I want to use Ruby with terminal input on Windows. Why hasn't the Ruby community solved this UTF-8 issue on Windows? Is it hard? I wonder how Python, Java, and other languages handled it. With Python on Windows, UTF-8 works with no pain.
With Ruby 3.0.1:
x = gets.chomp
çağrı
=> "\x87a\xA7r\x8D"
puts x
�a�r�
=> nil
x.valid_encoding?
=> false
I looked at https://bugs.ruby-lang.org/issues/16604, but it did not help.

With Ruby 3.0, the default external encoding (i.e. the assumed encoding of any data read from outside the Ruby process, such as from your shell when using gets) changed to UTF-8 on Windows. This was a response to various issues occurring with encoding on Windows.
The data you are reading there from your shell, however, is not UTF-8 encoded. Instead, it appears your shell uses some different encoding, e.g. cp850.
A possible workaround would be to instruct Ruby to assume the locale encoding of your environment which you can set with the -E switch on the command invocation, e.g.:
irb -E locale
or by setting Encoding.default_external manually in your script to the correct encoding of your environment.
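As a rough sketch of the second option, the snippet below sets Encoding.default_external and then transcodes what was read to UTF-8. It assumes the console code page is CP857 (substitute your own), and it simulates console input with a temporary file holding the bytes from the question, since an example cannot type into a real console.

```ruby
require "tempfile"

# Assumption: the console code page is CP857; replace with your own (e.g. "CP850").
console_encoding = "CP857"

previous = Encoding.default_external
Encoding.default_external = console_encoding

# Simulated console input: the raw bytes from the question.
file = Tempfile.new("console_input")
file.binmode
file.write("\x87a\xA7r\x8D".b)
file.close

raw = File.read(file.path)          # tagged with Encoding.default_external
line = raw.encode(Encoding::UTF_8)  # transcode to UTF-8 for the rest of the program

# Clean up and restore the previous default before printing.
file.unlink
Encoding.default_external = previous

puts line.valid_encoding?
puts line
```

Data read after the assignment is tagged with the new default, so the assignment belongs at the very top of a script, before any input is read.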

On Turkish Windows PCs, the cmd shell uses the CP857 encoding.
You can see this in the cmd > preferences section.
Here is a practical solution, with thanks to Holger for his contribution.
irb(main):005:0> x = gets.chomp
Here is the Turkish chars ğĞüÜşŞiİıIöÖçÇ
=> "Here is the Turkish chars \xA7\xA6\x81\x9A\x9F\x9Ei\x98\x8DI\x94\x99\x87\x80"
irb(main):006:0> x.force_encoding "CP857"
=> "Here is the Turkish chars \xA7\xA6\x81\x9A\x9F\x9Ei\x98\x8DI\x94\x99\x87\x80"
irb(main):007:0> x.valid_encoding?
=> true
irb(main):008:0> x.encode("UTF-8", undef: :replace)
=> "Here is the Turkish chars ğĞüÜşŞiİıIöÖçÇ"
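The two steps in the session above can also be combined. A minimal sketch, assuming the console code page is CP857: the two-argument form of String#encode treats the receiver's bytes as the given source encoding and returns a new UTF-8 string, leaving the original untouched. The variable names are illustrative.

```ruby
# Raw bytes as returned by gets on a CP857 console (taken from the session above).
raw = "Here is the Turkish chars \xA7\xA6\x81\x9A\x9F\x9Ei\x98\x8DI\x94\x99\x87\x80".b

# encode(destination, source): interpret the bytes as CP857 and transcode to UTF-8.
utf8 = raw.encode("UTF-8", "CP857")

puts utf8.encoding
puts utf8.valid_encoding?
puts utf8
```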

Related

How to consistently get ASCII-8BIT encoding in Ruby?

Ruby seems a bit inconsistent in its handling of encodings:
irb -E BINARY:BINARY
irb(main):001:0> "hi".encoding
=> #<Encoding:ASCII-8BIT>
So that "works". Now what about plain ruby?
ruby -E BINARY:BINARY -e 'p "hi".encoding'
#<Encoding:US-ASCII>
That doesn't work. Furthermore, when p "hi".encoding is placed in x.rb, the output of ruby -E BINARY:BINARY x.rb is:
#<Encoding:UTF-8>
How do I get ASCII-8BIT literals when invoking ruby?
String literals have the same encoding as the script encoding. Instead of 'hi'.encoding you can use the keyword __ENCODING__ to retrieve it. The script encoding can be changed by putting a magic comment at the beginning of your script:
# encoding: ASCII-8BIT
p __ENCODING__ # => #<Encoding:ASCII-8BIT>
The -E flag of ruby doesn't affect the encoding of string literals. It's only for changing the external and internal encoding. You can read about the various types of encodings and their purpose in the Encoding documentation.
Back to the encoding of string literals: Even though irb claims its -E flag is the "Same as ruby -E" that isn't true. It uses the external encoding as script encoding. irb already has several limitations. This could be one of them. It's at least a documentation bug.
Besides the magic comment there's another discouraged way to set the script encoding via ruby: the -K flag and the n (none) kcode. ruby -Kne "p __ENCODING__" should print #<Encoding:ASCII-8BIT>. However -K also changes the external encoding.
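If only some strings need to be binary, rather than every literal in the file, there are per-string alternatives that do not depend on the script encoding. A small sketch using String#b, String#force_encoding, and String.new from core Ruby (this is an addition, not part of the answer above):

```ruby
# String#b returns a copy tagged ASCII-8BIT (BINARY), leaving the literal untouched.
copy = "hi".b

# force_encoding retags in place; dup first so a frozen literal is not modified.
retagged = "hi".dup.force_encoding(Encoding::BINARY)

# String.new without arguments starts out as ASCII-8BIT.
buffer = String.new
buffer << "hi".b

p __ENCODING__       # script encoding, which is what plain literals use
p copy.encoding
p retagged.encoding
p buffer.encoding
```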

Can I programmatically convert "Iâ€™d" to "I’d" using Ruby?

I can't seem to find the right combination of String#encode shenanigans.
I think I'd got confused on this one so I'll post this here to hopefully help anyone else who is similarly confused.
I was trying to do my encoding in an irb session, which gives you
irb(main):002:0> 'Iâ€™d'.force_encoding('UTF-8')
=> "Iâ€™d"
And if you try using encode instead of force_encoding then you get
irb(main):001:0> 'Iâ€™d'.encode('UTF-8')
=> "Iâ€™d"
This is with irb set to use UTF-8 for both input and output. In my case, converting the string the way I want means telling Ruby that the source string is in Windows-1252. You can do this with the -E argument, which takes the form external:internal (the encoding of the input, followed by the encoding to convert it to), and then you get this:
$ irb -EWindows-1252:UTF-8
irb(main):001:0> 'Iâ€™d'
=> "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d"
That looks wrong unless you pipe it out, which gives this
$ ruby -E Windows-1252:UTF-8 -e "puts 'Iâ€™d'"
I’d
Hurrah. I'm not sure about why Ruby showed it as "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d" (something to do with the code page of the terminal?) so if anyone can comment with further insight that would be great.
I expect your script is using the encoding cp1251 and you have ruby >= 1.9.
Then you can use force_encoding:
#encoding: cp1251
#works also with encoding: binary
source = 'Iâ€™d'
puts source.force_encoding('utf-8') #-> I’d
If my assumptions are wrong: which encoding do you use, and which Ruby version?
A little background:
Problems with encoding are difficult to analyse. There may be conflicts between:
Encoding of the source code (That's defined by the editor).
Expected encoding of the source code (that's defined with #encoding on the first line). This is used by ruby.
Encoding of the string (see e.g. section String encodings in http://nuclearsquid.com/writings/ruby-1-9-encodings/ )
Encoding of the output shell
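For the conversion asked about in the question, one common approach (not taken from either answer above) is to reverse the mis-decoding in code: encode the garbled text back to Windows-1252 to recover the original bytes, then retag those bytes as UTF-8. A sketch, assuming the input is UTF-8 text that was produced by reading UTF-8 bytes as Windows-1252; fix_mojibake is a hypothetical helper name.

```ruby
# Hypothetical helper: undo "UTF-8 bytes decoded as Windows-1252" mojibake.
def fix_mojibake(text)
  bytes = text.encode("Windows-1252")   # back to the original byte sequence
  fixed = bytes.force_encoding("UTF-8") # retag those bytes as UTF-8
  fixed.valid_encoding? ? fixed : text  # fall back if the guess was wrong
rescue EncodingError
  text                                  # contains characters outside Windows-1252
end

garbled = "I\u00E2\u20AC\u2122d"  # the garbled form: I, â, €, ™, d
puts fix_mojibake(garbled)
```

The fallback branches mean that text which is already correct is returned unchanged instead of raising or being damaged.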

What encoding are Ruby Strings in?

Is it true that Ruby Strings are just a sequence of Unicode characters? If so, what specific encoding e.g. is it UTF-8, etc.?
The default encoding of a String is the same as the source file.
The default encoding of the source file is UTF-8 in Ruby 2.0 or later, or US-ASCII in Ruby 1.9 or earlier. You can specify the encoding by adding
# encoding: utf-8
at the beginning of the source file.
By default, Ruby strings are indeed UTF-8, as can be verified by the String#encoding method:
llama@llama:~$ irb
irb(main):001:0> 'foo'.encoding
=> #<Encoding:UTF-8>
You can get a list of available encodings via Encoding::list:
irb(main):002:0> Encoding.list
=> [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>, #<Encoding:US-ASCII>, (etc...)]
And change the encoding of a string with String#force_encoding:
irb(main):003:0> 'foo'.force_encoding(Encoding::US_ASCII).encoding
=> #<Encoding:US-ASCII>
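Note that force_encoding only changes the encoding tag, while String#encode converts the bytes. A short sketch of the difference, using "é" as an arbitrary example:

```ruby
s = "\u00E9"                                   # "é" as a UTF-8 string
p s.encoding == Encoding::UTF_8                # => true
p s.bytes                                      # => [195, 169]

retagged = s.dup.force_encoding("ISO-8859-1")  # same bytes, different label
p retagged.bytes                               # => [195, 169]
p retagged.length                              # => 2

converted = s.encode("ISO-8859-1")             # bytes actually transcoded
p converted.bytes                              # => [233]
p converted.length                             # => 1
```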

Encoding issue when using Nokogiri replace

I have this code:
# encoding: utf-8
require 'nokogiri'
s = "<a href='/path/to/file'>Café Verona</a>".encode('UTF-8')
puts "Original string: #{s}"
@doc = Nokogiri::HTML::DocumentFragment.parse(s)
links = @doc.css('a')
only_text = 'Café Verona'.encode('UTF-8')
puts "Replacement text: #{only_text}"
links.first.replace(only_text)
puts @doc.to_html
However, the output is this:
Original string: <a href='/path/to/file'>Café Verona</a>
Replacement text: Café Verona
CafÃ© Verona
Why does the text in @doc end up with the wrong encoding?
I tried with and without encode('UTF-8') or using Document instead of DocumentFragment, but it's the same problem.
I'm using Nokogiri v1.5.6 with Ruby 1.9.3p194.
Seems that if you pass a nokogiri text object it does the thing ;)
links.first.replace Nokogiri::XML::Text.new(only_text, @doc)
I can't duplicate the problem, but I have two different things to try:
Instead of using:
s = "<a href='/path/to/file'>Café Verona</a>".encode('UTF-8')
Try:
s = "<a href='/path/to/file'>Café Verona</a>"
Your string is already UTF-8 encoded, because of your statement # encoding: utf-8. That's why you put that in the script, to tell Ruby the literal string is in UTF-8. It's possible that you're double-encoding it, though I don't think Ruby will -- it should silently ignore the second attempt because it's already UTF-8.
Another thing I wonder about is, output like:
CafÃ© Verona
is an indicator that the language/character-set encoding of your system and your terminal aren't right. Trying to output UTF-8 strings on a system set to something else can get mismatches in the terminal and/or browser. Windows systems are typically Win-1252, ISO-8859-1 or something similar, not UTF-8. On my Mac OS system I have these environment variables set:
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
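Output where "é" turns into two characters usually means that UTF-8 bytes were interpreted as a single-byte encoding somewhere along the way. The sketch below reproduces that effect with plain strings; it does not involve Nokogiri, and it assumes ISO-8859-1 as the wrongly assumed encoding.

```ruby
original = "Caf\u00E9 Verona"           # UTF-8, "é" is the two bytes C3 A9

# Pretend the bytes are ISO-8859-1, then convert "again" to UTF-8.
misread = original.b.force_encoding("ISO-8859-1").encode("UTF-8")

p original.bytes.length   # => 12
p misread.length          # => 12 (one character per original byte)
puts misread

# Reversing the two steps recovers the original text.
recovered = misread.encode("ISO-8859-1").force_encoding("UTF-8")
p recovered == original   # => true
```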
"Open iso-8859-1 encoded html with nokogiri messes up accents" might be useful too.

Ruby: ARGV breaks accented characters

# encoding: utf-8
foo = "Résumé"
p foo
> "Résumé"
# encoding: utf-8
ARGV.each do |argument|
p argument
end
test.rb Résumé
> "R\xE9sum\xE9"
Why does this occur, and how can I get ARGV to return "Résumé"?
I have chcp 65001 set already and am using ruby 1.9.2p290 (2011-07-09) [i386-mingw32]
EDIT After asking around on irc, I was instructed to do chcp 1252>NUL which fixed the problem.
For some reason, Windows doesn't use UTF-8 in your console. So, although Ruby expects a UTF-8 encoded string, it gets a Windows-1252 encoded string.
So you have several possibilities (which I can't test as I, fortunately, don't use Windows):
Persuade Windows to use UTF-8 in your console. I don't know if chcp should work and, if so, why it doesn't.
Tell Ruby to use Windows-1252 instead of UTF-8 as default
Convert ARGV from Windows-1252 to UTF-8 manually:
Example:
>> argument = "R\xE9sum\xE9"
=> "R\xE9sum\xE9"
>> argument.force_encoding('windows-1252').encode('utf-8')
=> "Résumé"
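Applied to the whole argument list, the manual conversion might look like the sketch below. It assumes the console passes arguments as Windows-1252; utf8_args is a hypothetical helper name.

```ruby
# Hypothetical helper: treat each argument's bytes as source_encoding
# and return UTF-8 copies, leaving the original strings untouched.
def utf8_args(args, source_encoding = "Windows-1252")
  args.map { |arg| arg.encode("UTF-8", source_encoding) }
end

# Stand-in for ARGV as received in the question.
sample_argv = ["R\xE9sum\xE9".b]
p utf8_args(sample_argv)

# In a real script: arguments = utf8_args(ARGV)
```

Using the two-argument form of String#encode avoids mutating the ARGV strings, which may be frozen.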
