Ruby and Accented Characters - ruby

Summary of the wall of text below: How can I display accented characters (so they work via puts, etc) in Ruby?
Hello! I am writing a program for my class which will display some sentences in Spanish. When I try to use accented characters in Ruby, they do not display correctly (in the NetBeans output window (which displays accented characters in Java fine) or in the Command Prompt).
At first, some of my code didn't even run because the accented characters in my arrays were throwing off the Ruby interpreter (I guess?). I got errors saying that Ruby was expecting a closing bracket.
But I did some research, and found a solution, to add the following line of code to the beginning of my Ruby file:
# coding: utf-8
In NetBeans, my program ran regardless of this line. But I needed to add this line to get my program to run successfully in Command Prompt. (I don't know why.)
I'm still, however, having a problem actually displaying the characters to the screen. A word such as "será" will display in the NetBeans output window as "seré". And in the command prompt it draws little pipe characters (that I don't know how to type).
Doing some more research, I heard about:
$KCODE = 'UTF-8'
but I'm not having any luck with this.
I'm using Ruby 1.8 and 1.9 (I go back and forth between different machines).
Thanks,
Derek

The Windows 7 command prompt uses raster fonts by default, and these don't support Unicode. First, change the cmd font to Lucida Console or Consolas. Then change the command prompt's code page with chcp 65001. You can do that manually or add this line to your Ruby program:
# encoding: utf-8
`chcp 65001` #change cmd encoding to unicode
puts 'será test '
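If changing the code page is not an option, another approach is to transcode the string to whatever code page the console already uses before printing it. The following is only a minimal sketch: it assumes Ruby 1.9 or later (String#encode does not exist in 1.8) and a console on code page 437 (check yours with chcp). The helper name to_console and its default argument are hypothetical, not something from the answer above.

```ruby
# encoding: utf-8

# Hypothetical helper: transcode a UTF-8 string to the console's code page.
# Characters that the code page lacks become "?" instead of raising an error.
def to_console(text, codepage = "IBM437")
  text.encode(codepage, :undef => :replace, :invalid => :replace, :replace => "?")
end

encoded = to_console("será test")
p encoded.encoding      # the target encoding chosen above
p encoded.bytes.to_a    # the raw bytes that would be written to the console
```

On a real console you would call puts to_console(...) instead of inspecting the bytes.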

Related

What is the Windows command line parameter encoding?

What encoding does Windows use for command line parameters passed to programs started in a cmd.exe window?
The encoding of command line parameters doesn't seem to be affected by the console code page set using chcp (I set it to UTF-8, code page 65001 and use the Lucida Console font.)
If I paste an EN DASH, encoded as hex E28093, from a UTF-8 file into a command line, it is displayed correctly in the cmd.exe window. However, it seems to be translated to a hex 96 (an ANSI representation) when it is passed to the program. If I paste Cyrillic characters into a command line, they are also displayed correctly, but appear in the program as question marks (hex 3F.)
If I copy a command line and paste it into a text file, the resulting file is UTF-8; it contains the same encoding of the EN DASH and Cyrillic characters as the source file.
It appears the characters pasted into the cmd.exe window are captured and displayed using the code page selected with chcp, but some ANSI code page is used to translate the characters into a different encoding before passing them as parameters to a program. Characters that cannot be converted apparently are silently converted to question marks.
So, if I want to correctly handle command line parameters in a program, I need to know exactly what the encoding of the parameters is. For example, if I wish to compare command line parameters with known UTF-8 data read from a file, I need to convert the parameters from the correct encoding to UTF-8. Thanks.
If your goal is to compare Unicode characters then you should call GetCommandLineW in your program (or use wmain so that argv uses wchar_t) and then convert this UTF-16LE command line string to UTF-8 or vice versa.
GetCommandLineA probably converts the Unicode source string with CP_ACP.
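The behaviour described in the question can be reproduced outside of Windows with a small Ruby sketch. It assumes the ANSI code page is Windows-1252 (an assumption; the real value depends on the system locale), and the helper names to_ansi and utf16le_to_utf8 are made up for illustration. It shows the lossy ANSI conversion and the UTF-16LE to UTF-8 conversion that the answer recommends.

```ruby
# encoding: utf-8

# Simulate the lossy conversion to an ANSI code page (assumed: Windows-1252).
# Characters with no mapping are replaced by "?".
def to_ansi(text)
  text.encode("Windows-1252", :undef => :replace, :replace => "?")
end

# Convert raw UTF-16LE bytes, as a wide-character API would hand them over,
# into a UTF-8 string.
def utf16le_to_utf8(raw)
  raw.dup.force_encoding("UTF-16LE").encode("UTF-8")
end

en_dash  = "\u2013"   # EN DASH, UTF-8 bytes E2 80 93
cyrillic = "\u0416"   # CYRILLIC CAPITAL LETTER ZHE, absent from Windows-1252

p to_ansi(en_dash).bytes.to_a    # single ANSI byte for the dash
p to_ansi(cyrillic)              # replaced by a question mark

wide = en_dash.encode("UTF-16LE").force_encoding("BINARY")
p utf16le_to_utf8(wide) == en_dash
```

Once the parameters are in UTF-8, they can be compared byte for byte with UTF-8 data read from a file.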

problems in Perl script with localized user name on Windows

I have a perl script to start a script file with default program.
system("Start C:\\Temp\\test.jsx");
It works fine with English user names, but when I change the user name to ai𥹖Ц中 it doesn't work.
Also, no error message appears, so I'm not able to debug it.
perl on Windows uses so called ANSI functions to interface with the outside world. That means, if you use interesting characters (for example, certain Turkish letters on a US-English Windows install), perl cannot see them. As I wrote on my blog:
You can't pass characters that are outside of the Windows code page to perl on the command line. It doesn't matter whether you have set the code page to 65001 and use the -CA command line argument to perl: Because perl uses main instead of wmain as the entry point, it never sees anything other than characters in the ANSI code page.
For example:
$ chcp 65001
$ perl -CAS -E "say for @ARGV" şey
sey
That's because ş does not appear in CP 437 which is what my laptop is using. By the time it reaches the internals of perl, it has already become s.
So, there is not much you can do with a stock perl. I was working on a set of patches, but things intervened. I may still get around to it this summer.
Now, in your case, you are passing "interesting characters" to an external program via system. The same problem applies. Because perl is using the ANSI versions of functions used to spawn processes etc, the program spawned will not see a Unicode environment. So, if you are trying to use Korean or Japanese programs with a system code page that does not include them, I am not sure what will happen.
There is not much you can do once perl is running. The environment, command line arguments, everything lives in the ANSI world from that point on. There may be funky work-arounds, but for that one would need to know exactly how 'ai𥹖Ц中' gets from your perl program to the external program.

IRB won't read input after pasting over 1565 characters on Windows?

Though I can paste any size string into Ruby's IRB prompt if it's running in a Unix shell (like Bash on Mac or Linux), when I try to paste my clipboard contents into IRB running on a Windows command/PowerShell prompt, if my clipboard contains larger than 1,565 characters IRB fails to receive the input. Not only that, after the paste attempt, IRB will not receive any further input from my keyboard (stdin).
Does anyone else have this issue or know a fix to it so I can paste longer strings into IRB? If you'd like to recreate it, try to paste this integer (1,566 characters long) into the IRB prompt:
012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890012345678901234
I have recreated this issue on two Windows 7 machines, one running Ruby 2.1.8p440 and one running Ruby 2.2.0.
EDIT:
Sometimes I need to increase the size of the string larger than 1,566-1,700 characters to recreate the bug, but it never seems to handle anything larger than 2000 characters.

Windows Raster Fonts Encoding Error

I'm writing an interpreter and I have come across a peculiar problem involving character sets. ( I think ).
When I create a file on my Mac called, hello.rd and I run the command;
file -I hello.rd
I get this output:
hello.rd: text/plain; charset=utf-8
That shows me the file is UTF-8 which it should be. The source file looks like this;
print "Hello World á"
And the output in the terminal is:
Hello World á
This is all the way I want / expect it to be. The problem arises when I execute the code on Windows. When I execute the same code on Windows I get this output:
As you can see the á isn't output correctly. I changed the codepage to 65001 and it made no difference, but when I used the Lucida Console font, the characters displayed correctly. But what I can't understand is, why I can type the letter á in the terminal using my keyboard and it displays, but it won't display from my files.
So what I did next was I created a file on my Windows PC called test123.rd and saved this text in it:
print "Hello World á ã ß"
When I execute that on my Mac I get the incorrect output this time, I get:
Hello World ? ? ?
And on my PC I still get the incorrect output, I get this:
I used the file -I command on my Mac on the file test123.rd and I got this output:
test123.rd: text/plain; charset=iso-8859-1
I assume test123.rd displays incorrectly on OS X because its character set isn't UTF-8, but I don't understand why it displays incorrectly on Windows as well.
Does anyone have any idea how to solve the problem, without changing the font of the Windows CMD?
Type cmd /? to see how to switch unicode on, then choose a unicode font. Also see chcp /?.
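For the interpreter itself, one option is to normalize source files to UTF-8 when reading them, so that files saved as ISO-8859-1 on Windows and files saved as UTF-8 on the Mac are treated the same way. The sketch below is only an illustration in Ruby (1.9 or later). The helper name read_as_utf8 is hypothetical, and it assumes that any input which is not valid UTF-8 is ISO-8859-1, which will not hold for every file.

```ruby
# encoding: utf-8

# Hypothetical helper: return UTF-8 text from raw bytes that are either
# already UTF-8 or ISO-8859-1 (an assumption; other encodings are not handled).
def read_as_utf8(raw)
  text = raw.dup.force_encoding("UTF-8")
  return text if text.valid_encoding?
  raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Bytes as they would appear in a file saved as ISO-8859-1 (á is the byte E1).
latin1_bytes = "Hello World \xE1".dup.force_encoding("BINARY")
puts read_as_utf8(latin1_bytes)
```

With a real file, the raw bytes could come from File.binread(path). Displaying the result correctly still depends on the console's code page and font.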

Perl on Windows: Problems with Encoding

I have a problem with my Perl scripts. In UNIX-like systems it prints out all Unicode characters like ä properly to the console. In the Windows commandline, the characters are broken to senseless glyphs. Is there a simple way to avoid this? I'm using use utf8;.
Thanks in advance.
use utf8; simply tells Perl your source is encoded using UTF-8.
It's not working on unix either. There are some strings that won't print properly (print chr(0xE9);), and most that do will print a "Wide character" warning (print chr(0x2660);). You need to decode your inputs and encode your outputs.
On unix systems, that's usually
use open ':std', ':encoding(UTF-8)';
On Windows systems, you'll need to use chcp to find the console's code page. (437 for me.)
use open ':std', ':encoding(cp437)'; # Encoding used by console
use open IO => ':encoding(cp1252)'; # Encoding used by files
