Tcl: how to print UTF-8 text to stdout on Windows

My little cross-platform Tcl/Tk pet project uses UTF-8 as the text encoding everywhere:
encoding system utf-8
fconfigure stderr -encoding utf-8
fconfigure stdout -encoding utf-8
#puts "汉语"
puts "foo"
This works great on Linux and macOS, where the standard encoding is UTF-8:
$ tclsh utf8test.tcl
foo
Unfortunately, it doesn't work at all on Windows, where the terminal typically uses cp1252, and it consequently outputs garbage:
$ wine tclsh85.exe utf8test.tcl
潦൯
As far as I understand, this is because we cannot really change the encoding of the Windows terminal, and, according to this SF issue, I need to detect for myself whether the output is going to a terminal or somewhere else.
I have no idea how to make this decision.
So how can I output printable characters to the console in a project that otherwise uses UTF-8 everywhere?
Edit 1:
My original example printed some non-ASCII code points (the Chinese characters 汉语).
However, the problem persists even if I output a single ASCII string (foo).
Since most of the text I'm outputting is ASCII anyway, I would like to at least fix the output for that.
I thought that cp1252 was backwards compatible with ASCII...

Output to the real console on Windows is special: it has its own channel implementation, because the low-level API takes Unicode (I think it is actually UTF-16, but that isn't important here) and Tcl uses the low-level I/O API on Windows where it can. (That would also explain why even plain ASCII comes out as CJK-looking garbage once you force utf-8 on the channel: pairs of bytes end up being interpreted as single UTF-16 code units.) I do not know whether Tcl has automatically detected your console correctly, and I know nothing about how that API interacts with Wine's implementation of it and whatever is going on in the host environment.
Console output is supposed to Just Work™ if you don't modify the encoding on its channel. Indeed, the usual rule is: don't change encodings except on files and sockets (where you know the encoding might not be the system one).
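One way to follow that rule (a minimal sketch of my own, not something the answer above prescribes) is to force UTF-8 only where the terminal actually expects it and to leave the standard channels untouched on Windows:
# Sketch: only force utf-8 on the std channels off Windows; on Windows,
# leave them alone and let Tcl's console channel do its own conversion.
if {$tcl_platform(platform) ne "windows"} {
    fconfigure stdout -encoding utf-8
    fconfigure stderr -encoding utf-8
}
puts "foo"
puts "汉语"
This does not distinguish a real console from redirected output on Windows, which is the part the SF issue says you still have to detect yourself.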

Related

RStudio: keeping special characters in a script

I wrote a script with German special characters e.g. ü.
However, whenever I close R and reopen the script the characters are substituted:
Before "für"; "hinzufügen"; "Ø" - After "für"; "hinzufügen"; "Ã".
I tried to remedy it using "Save with Encoding..." and choosing UTF-8, as stated here, but it did not work.
What am I missing?
You don't say what OS you're using, but this kind of thing really only happens on Windows nowadays, so I'll assume that.
The problem is that Windows has a local encoding that is not UTF-8. It is commonly something like Latin1 in English-speaking countries. I'm not sure what encoding people use in German-speaking countries, if that's where you are. From the junk you saw, it looks as though you saved the file in UTF-8, then read it using your local encoding. The encodings for writing and reading have to match if you want things to work.
In RStudio you can try "Reopen with encoding..." and specify UTF-8, and you'll probably get your original back, as long as you haven't saved it after the bad read. If you did that, you've got a much harder cleanup to do.

Is there a UTF-8 replacement for BASH?

I often need to write simple BASH scripts on my computer for manipulating files. BASH seems to have difficulty working with UTF-8 content.
Are there any versions of BASH which are fully UTF-8 compatible?
Is there a replacement for BASH, which uses a similar or identical syntax, but is UTF-8 compatible?
Bash itself shouldn't have any problems using UTF-8. Most likely your problems are caused by another program, e.g. the terminal emulator or the editor. Make sure that your environment is set up to use UTF-8; for more information, see for example here.
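For example, something along these lines before running the script (a sketch; the exact locale name depends on which UTF-8 locales are installed on your system):
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
locale charmap   # should now report UTF-8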
I take it your problem is that the usual sed/awk/grep etc. don't support Unicode, so the usual Stack Overflow solutions don't work for you?
Bash itself is very limited without external programs.
To do what you want, you probably have to code in a more capable programming language than bash.
UTF-8 itself is not very convenient for processing: you typically decode it into 2-byte or 4-byte characters (i.e. convert to UTF-16 or UTF-32), process those, and then convert back to UTF-8 for storage.

Batch convert to UTF8 using Ruby

I'm encountering a little problem with my file encodings.
Sadly, I'm still not fully on top of everything where encoding matters, although I have learned plenty since I began using Ruby 1.9.
My problem at hand: I have several files to be processed, which are expected to be in UTF-8 format. But I do not know how to batch convert those files properly; e.g. in Ruby, I open a file, encode the string to UTF-8 and save it in another place.
Unfortunately that doesn't seem to work - the file is still in ANSI.
At least that's what my Notepad++ says.
I find it odd though, because the string was clearly encoded to UTF-8, and I even set the File.open parameter :encoding to 'UTF-8'. My shell is set to CP65001, which I believe also corresponds to UTF-8.
Any suggestions?
Many thanks!
Edit: What's more, in Notepad++ I can convert manually like this:
Selecting everything,
copying,
setting the encoding to UTF-8 (at this point, \x escape sequences become visible),
pasting everything back from the clipboard.
Done! The escape characters vanish and the file can be processed.
Unfortunately that doesn't seem to work - the file is still in ANSI. At least that's what my Notepad++ says.
UTF-8 was designed to be a superset of ASCII: every ASCII character is encoded with exactly the same byte in UTF-8. For this reason it's not possible to distinguish ASCII from UTF-8 unless the text contains "special" (non-ASCII) characters, which are represented using multiple bytes in UTF-8.
It's quite possible that your conversion is actually working; you can double-check by running your program on a file containing such special characters.
Also, one of the best utilities for converting between encodings is iconv, which also has Ruby bindings.
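As a rough sketch of the Ruby side (the source encoding Windows-1252 and the directory names are assumptions; adjust them to match your files):
# Sketch: convert every .txt file from Windows-1252 ("ANSI") to UTF-8.
Dir.mkdir('converted') unless File.directory?('converted')
Dir.glob('input/*.txt') do |path|
  text = File.read(path, :encoding => 'Windows-1252')
  File.open(File.join('converted', File.basename(path)), 'w:UTF-8') do |f|
    f.write(text.encode('UTF-8'))
  end
end
The :encoding option on the read side only tells Ruby how to interpret the incoming bytes; it's the encode('UTF-8') call together with the 'w:UTF-8' mode that determines what ends up on disk.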

Is there any way for a bash (or any other shell) script to detect whether the current terminal supports Unicode characters?

I'd like to use Unicode characters if they are supported by the terminal, and fall back to ASCII characters if the user's terminal can't display them correctly. Is there any relatively easy way to do this in a shell script?
First, you're probably confusing Unicode with a particular encoding. Suppose you know that the terminal supports Unicode characters; you still don't know how to print them!
You're probably thinking about something like UTF-8, the most popular Unicode encoding out there.
To get the encoding of the current locale, use
locale charmap
This is the encoding of the current locale, and theoretically it may differ from the encoding used by the terminal, but in that case something is broken on the user's side.
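A script can then branch on that value; for example (just a sketch, with placeholder fallback characters):
# Use a Unicode bullet only if the locale's charmap is UTF-8.
if [ "$(locale charmap)" = "UTF-8" ]; then
    bullet="•"
else
    bullet="*"
fi
printf '%s item\n' "$bullet"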
In the script, print
:set encoding=utf-8
If you want your terminal to support Unicode, start a new terminal with the -u8 option; type in a terminal:
xterm -u8

Unicode characters in a Ruby script?

I would like to write a Ruby script which writes Japanese characters to the console. For example:
puts "こんにちは・今日は"
However, I get an exception when running it:
jap.rb:1: Invalid char `\377' in expression
jap.rb:1: Invalid char `\376' in expression
Is it possible to do? I'm using Ruby 1.8.6.
You've saved the file in the UTF-16LE encoding, the one Windows misleadingly calls “Unicode”. This encoding is generally best avoided because it's not an ASCII-superset: each code unit is stored as two bytes, with ASCII characters having the other byte stored as \0. This will confuse an awful lot of software; it is unusual to use UTF-16 for file storage.
What you are seeing with \377 and \376 (octal for \xFF and \xFE) is the U+FEFF Byte Order Mark sequence put at the front of UTF-16 files to distinguish UTF-16LE from UTF-16BE.
Ruby 1.8 is totally byte-based; it makes no attempt to read Unicode characters from a script. So you can only save source files in ASCII-compatible encodings. Normally, you'd want to save your files as UTF-8 (without BOM; the UTF-8 faux-BOM is another great Microsoft innovation that breaks everything). This'd work great for scripts on the web producing UTF-8 pages.
And if you wanted to be sure the source code would be tolerant of being saved in any ASCII-compatible encoding, you could encode the string to make it more resilient (if less readable):
puts "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"
However! Writing to the console is itself a big problem. What encoding is used to send characters to the console varies from platform to platform. On Linux or OS X, it's UTF-8. On Windows, it's a different encoding for every installation locale (as selected on “Language for non-Unicode applications” in the “Regional and Language Options” control panel entry), but it's never UTF-8. This setting is—again, misleadingly—known as the ANSI code page.
So if you are using a Japanese Windows install, your console encoding will be Windows code page 932 (a variant of Shift-JIS). If that's the case, you can save the text file from a text editor using “ANSI” or explicitly “Japanese cp932”, and when you run it in Ruby you'll get the right characters out. Again, if you wanted to make the source withstand misencoding, you could escape the string in cp932 encoding:
puts "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"
But if you run it on a machine in another locale, it'll produce different characters. You will be unable to write Japanese to the default console from Ruby on a Western Windows installation (code page 1252).
(Whilst Ruby 1.9 improves Unicode handling a lot, it doesn't change anything here. It's still a bytes-based application using the C standard library IO functions, and that means it is limited to Windows's local code page.)
