Can I avoid using CP1252 on Windows?

I would like all my toolkit to use UTF-8 but find that some tools on Windows seem to use CP1252 (which appears to be Windows-specific). Does this create output which is incompatible and if so at which codepoints? If so, can I do anything about it?
(I don't completely understand the issues so I'd be grateful for basic education on these encodings).

It is very unlikely that tools hard-code code page 1252 on Windows. Much more likely, it simply happens to be the default code page on your machine. 1252 is used in Western Europe and the Americas. It is configured in Control Panel, Regional and Language Options. The name has changed between Windows versions; on Windows 7 it is in the Administrative tab, under Change System Locale.
Yes, many tools use the default code page unless they have a good reason to choose another encoding. A BOM is such a good reason. Notable examples are Notepad (unless you change the Encoding in the File + Open dialog to something other than ANSI) and C/C++ compilers. There typically isn't anything special you need to do to use the default code page. Guessing the correct code page for a text file when you don't have a BOM is impossible to do accurately. Google "bush hid the facts" for a very amusing war story.

Six years old and still relevant: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Now, about your question: Yes, there are still tools out there that choke on UTF-8 files. But more and more tools are "getting it". If you're developing your own stuff, you might want to look into Python 3 where all strings are Unicode. The philosophy is to convert all your inputs into Unicode (if necessary) as early as possible, and reconvert them to a target encoding as late as possible. There are toolkits out there that will do a good job of guessing the encoding of a particular file (for example, Mark Pilgrim's chardet, a port of Mozilla's encoding detector). This is nice if you're working with files that don't specify an encoding.
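If you only need a rough check rather than a full detector like chardet, one common heuristic is to test whether the bytes form valid UTF-8: text that validates and contains bytes above 127 is almost certainly UTF-8, because legacy-codepage text rarely forms valid multi-byte sequences by accident. A minimal C sketch of that idea (the function name looks_like_utf8 is made up; it ignores corner cases such as overlong sequences):

    #include <stddef.h>

    /* Rough heuristic: returns 1 if the buffer is well-formed UTF-8.
       It does not reject overlong or surrogate encodings - it is a sketch,
       not a full detector like Mozilla's. */
    int looks_like_utf8(const unsigned char *p, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            if (p[i] < 0x80) { i++; continue; }          /* plain ASCII byte */
            int extra;
            if      ((p[i] & 0xE0) == 0xC0) extra = 1;   /* 2-byte sequence */
            else if ((p[i] & 0xF0) == 0xE0) extra = 2;   /* 3-byte sequence */
            else if ((p[i] & 0xF8) == 0xF0) extra = 3;   /* 4-byte sequence */
            else return 0;                               /* invalid lead byte */
            if (i + extra >= len) return 0;              /* truncated sequence */
            for (int k = 1; k <= extra; k++)
                if ((p[i + k] & 0xC0) != 0x80) return 0; /* not a continuation byte */
            i += extra + 1;
        }
        return 1;
    }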

CP1252 and UTF-8 encode all characters below 128 (the ASCII range) identically; they differ above that. So if you stick to English and stay away from diacritical marks, the output will be the same.
Most of the Windows tools will use whatever is set as the current user's current codepage, which will default to 1252 for US Windows. You can change that to another codepage pretty easily. But UTF-8 is NOT one of the available codepage options for Windows. (I wish it was).
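To make the ASCII point concrete, here is a tiny illustrative C snippet showing the difference: plain ASCII bytes are identical in both encodings, but 'é' is the single byte 0xE9 in CP1252 and the two-byte sequence 0xC3 0xA9 in UTF-8 (which a CP1252 viewer would display as 'Ã©'):

    #include <stdio.h>

    int main(void)
    {
        /* 'A' (0x41) is encoded identically in CP1252 and UTF-8.  */
        /* 'é' differs: one byte in CP1252, two bytes in UTF-8.    */
        const unsigned char cp1252_e_acute[] = { 0xE9 };
        const unsigned char utf8_e_acute[]   = { 0xC3, 0xA9 };

        printf("CP1252 e-acute: %02X\n", cp1252_e_acute[0]);
        printf("UTF-8  e-acute: %02X %02X\n", utf8_e_acute[0], utf8_e_acute[1]);
        return 0;
    }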

Some utilities under Windows will understand the UTF-8 byte-order mark at the start of a file. Unfortunately I don't know how to determine if this will work except to try it.
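If you want to check whether a particular file already starts with the UTF-8 byte-order mark before feeding it to such a utility, a minimal C sketch (the helper name has_utf8_bom is just for illustration):

    #include <stdio.h>

    /* Returns 1 if the file begins with the UTF-8 BOM (EF BB BF), 0 otherwise. */
    int has_utf8_bom(const char *path)
    {
        unsigned char buf[3] = { 0 };
        FILE *f = fopen(path, "rb");
        if (!f)
            return 0;
        size_t n = fread(buf, 1, 3, f);
        fclose(f);
        return n == 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
    }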

UTF-8 is supported on Windows, but not as the current codepage. You can convert to and from UTF-8, but you cannot set it as the current codepage.
First, do not waste time trying to set the codepage - that approach will remind you of the myth of Sisyphus. You can't really solve the problem with codepages; you have to use Unicode.
The only real solution is to build your application as Unicode, so it uses UTF-16 internally, and to convert to/from UTF-8 on input/output. This is quite simple to do because fopen supports reading and writing UTF-8.
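A minimal sketch of that fopen-based approach, assuming the Microsoft CRT (the ccs= mode flag is MSVC-specific, and the file name is illustrative). On disk the data is UTF-8, while in memory you work with wide (UTF-16) strings:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* "ccs=UTF-8" opens the stream in Unicode mode: wide strings in memory,
           UTF-8 bytes on disk (the CRT may also write a BOM). */
        FILE *f = fopen("example.txt", "wt, ccs=UTF-8");
        if (!f)
            return 1;
        fwprintf(f, L"Gr\u00FC\u00DFe\n");   /* "Grüße", written to disk as UTF-8 */
        fclose(f);
        return 0;
    }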
Regarding the use of other Windows tools with UTF-8 files, you should not worry: if a tool can work with ASCII, it will generally work with UTF-8 (even if it cannot interpret the Unicode characters themselves, it will at least be able to load/parse the files).
BTW, you forgot to specify which programming language you are using and which Windows tools you are considering.
Also, if you are interested in more internationalization material, please visit my blog.i18n.ro

Related

Encodings: OS, keyboard input etc

I'm trying to understand how operating systems deal with different encodings. I have read that Windows uses UTF-16 internally. If I type text into a text editor, is it going to be saved in UTF-16 on the hard disk on Windows? Is the text typed in (and temporarily stored in RAM until saved) encoded in the OS's internal encoding unless explicitly converted by a text editor with such capability? If I create a UTF-8 SQL database, but fill it with text using my keyboard on Windows, is the OS pushing UTF-16 encoded text inside or does the system at some point realise that it should be UTF-8? When I make webpages I'm told that it's best to use UTF-8. So I make sure that my text editors are set to that, but how do I know that the input from the keyboard/OS is UTF-8?
The Unicode charset can be represented with many encodings, chief among them UTF-8, UTF-16 and UTF-32. The software you use will translate between different charsets and encodings internally as appropriate, though you might have to select the input and output encodings for permanent storage (files) yourself.
Because Windows created a whole new suite of wide APIs back when people thought UTF-16 would stay UCS-2 forever, most Windows components use UTF-16 internally. Also, the narrow APIs are generally more restricted, where they exist at all, and generally use some obsolete ANSI codepage, not UTF-8. Thus, if your editor was originally developed on Windows, or uses a standard Windows control for text display and editing, your text will (mostly transparently to you) be UTF-16 in memory.
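If you ever need to hand UTF-8 data to those wide APIs yourself, the usual bridge is MultiByteToWideChar with CP_UTF8. A minimal C sketch (the wrapper name utf8_to_utf16 is made up for illustration):

    #include <windows.h>

    /* Converts a NUL-terminated UTF-8 string to UTF-16 for the wide Win32 APIs.
       Returns the number of wchar_t written (including the terminator), or 0 on failure. */
    int utf8_to_utf16(const char *utf8, wchar_t *out, int out_len)
    {
        return MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                   utf8, -1, out, out_len);
    }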
Most text editors (including Windows Notepad) still save text as UTF-8 by default. Microsoft editors (and much other Windows software), though, tend to prefix Unicode text files with a BOM, which can make non-BOM-aware software choke.

Script for correcting linux filenames on windows and vice versa?

As far as I know, Unix-like systems use UTF-8 for encoding filenames, while Windows systems use their own single-byte Windows encodings.
I work with archives containing Japanese filenames quite often. When I open such an archive created on Windows, the Japanese characters come out garbled, because the filename encoding is wrong.
The same thing happens when I create an archive on my Linux machine and someone then opens it under Windows.
So I thought this should be quite a common problem, and, because the filenames are recoverable, there must already be a correcting .sh script for Linux and a .bat script for Windows.
But after googling for quite a long time I still have not found anything.
Do such scripts exist at all? If not, what difficulties may have stopped people from creating them?
Update
I would be happy with a solution that works for most Linux systems and most Windows systems.
Windows uses UTF-16 (which stores most characters in two bytes) for filenames. Your problem is most likely that you are using single-byte ANSI versions of whatever archive tool you are using.
Until you give more details of the code and tools you are using, it's hard to give specific advice. However, there are no limitations on using the full range of Unicode characters in modern Windows file systems.
Thank you for your input. The case indeed looks too complex for a simple bash script; I'll need to use a programming language.
I don't see anything like a "close question" button, so I'll use this answer to do so.
Take a look at the convmv tool, available for Unix systems.

How will I convert characters? Or other solutions

I found out (through my other question) that my IME outputs Hangul Compatibility Jamo (U+3130 – U+318F) instead of regular Hangul Jamo (U+1100 – U+11FF).
So I tried asking a question on Super User about other IMEs; no replies yet.
Should I just convert it myself? What exactly does that entail? Is it too complicated? Any ideas on how to? Any help would be appreciated.
Language: Delphi
OS: WinXP
IME: Korean Input System (IME 2002)
There is no reason you could not write an interesting experimental editor control with its own built-in Unicode Compose feature. However, before you do that, you might look for a way to change the configuration of the IME. This seems to be a really interesting corner case you have to work with. I was already surprised by your other question - that Windows has the ability to handle raw input from keyboards.
I found that source code for something that says it is the Korean IME is available for Windows CE. You might learn something by studying it, even though it is for Windows CE rather than XP.
http://msdn.microsoft.com/en-us/library/ee491900.aspx

Why isn't UTF-8 allowed as the "ANSI" code page?

The Windows _setmbcp function allows any valid code page...
(except UTF-7 and UTF-8, which are not supported)
OK, not supporting UTF-7 makes sense: Characters have non-unique representations and that introduces complexity and security risks.
But why not UTF-8?
As I understand it, the "ANSI" versions of the Windows API functions convert their arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output to "ANSI". This is what I've been doing manually. So why can't Windows do it for me?
The "ANSI" codepage is basically legacy: Windows 9X era. All modern software should be Unicode (that is, UTF-16) based anyway.
Basically, when the ANSI code page machinery was originally designed, UTF-8 hadn't even been invented, so support for multi-byte encodings was rather haphazard (i.e. most ANSI code pages are single-byte, with the exception of some East Asian code pages which use one or two bytes per character). Adding support for "proper" multi-byte encodings was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.
_setmbcp() is a VC++ RTL function, not a Win32 API function. It only affects how the RTL interprets strings. It has no effect whatsoever on Win32 API A functions. When they call their W counterparts internally, the A functions always use MultiByteToWideChar() and WideCharToMultiByte() specifying codepage 0 (CP_ACP) to use the system default Ansi codepage for the conversions.
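As a rough sketch of that A-to-W forwarding pattern (illustrative only, not actual Windows source code; the fixed-size buffer is a simplification):

    #include <windows.h>

    /* What an -A function does, conceptually: convert the argument from the
       system default ANSI code page (CP_ACP) to UTF-16, then call the -W version. */
    BOOL MySetWindowTextA(HWND hwnd, const char *text)
    {
        wchar_t wide[512];
        if (!MultiByteToWideChar(CP_ACP, 0, text, -1, wide, 512))
            return FALSE;
        return SetWindowTextW(hwnd, wide);
    }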
Michael Kaplan, an internationalization expert from Microsoft, tried to answer this on his blog.
Basically his explanation is that even though the "ANSI" versions of Windows API functions are meant to handle different code pages, historically there was an implicit expectation that character encodings would require at most two bytes per code point. UTF-8 doesn't meet that expectation, and changing all of those functions now would require a massive amount of testing.
The reason is exactly as stated in jamesdlin's answer and the comments below it: MBCS is the same as DBCS in Windows, and some functions don't work with characters longer than 2 bytes.
Microsoft said that a UTF-8 locale might break some functions, as they were written to assume multi-byte encodings use no more than 2 bytes per character; thus code pages with more bytes per character, such as UTF-8 (and also GB 18030, cp54936), could not be set as the locale.
https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
So UTF-8 was allowed in functions like read/write, but not as a locale.
However, Microsoft has finally fixed that, so now we can use UTF-8 as a locale. In fact, MS has even started recommending the ANSI (-A) APIs again, instead of the Unicode (-W) versions it recommended before. There are some new options in MSVC: /execution-charset:utf-8 and /utf-8 to set the charset, and you can also set the ActiveCodePage property in the appxmanifest of a UWP app.
Since Windows 10 insider build 17035 (before those options were introduced), a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox has also been available for setting the locale code page to UTF-8.
To open that dialog box, open the Start menu, type "region" and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative.
After enabling it, you can call setlocale() to switch to a UTF-8 locale:
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use "UTF-8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".utf8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
UTF-8 Support
You can also use this in older Windows versions
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
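A minimal sketch of the setlocale call quoted above, assuming the UCRT from Windows SDK 17134 or later; compile with /utf-8 so the narrow string literal really contains UTF-8 bytes (the \u escapes below just spell "Grüße"):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* ".UTF8" keeps the default locale but makes the CRT treat
           char strings as UTF-8 instead of the ANSI code page. */
        if (!setlocale(LC_ALL, ".UTF8"))
            return 1;

        const char *utf8 = "Gr\u00FC\u00DFe";    /* "Grüße" stored as UTF-8 */
        wchar_t wide[32];
        size_t n = mbstowcs(wide, utf8, 32);     /* now converts from UTF-8 */
        if (n == (size_t)-1)
            return 1;
        printf("converted %u wide characters\n", (unsigned)n);
        return 0;
    }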
See also
Is it possible to set “locale” of a Windows application to UTF-8?

i18n shell in windows

Is there an i18n shell in Windows that supports a large character set? Testing my application in Windows results in some math characters not being rendered correctly. The Lucida font in cmd.exe and PowerShell does not have a wide enough selection of glyphs.
Unicode UTF-8 would be the most preferable, followed by the other Unicode encodings.
I'm not sure if this is a problem in the font or the console itself but you could try installing the DejaVu Sans Mono font and see if that provides the necessary characters.
CMD.EXE supports it just fine; the issue is that it doesn't allow many other fonts by default, and Lucida Console, usually the only TrueType font there, has no fonts defined in its font fallback chain. See http://www.siao2.com/2008/03/19/8323216.aspx and the screenshots I link to in the comments for that blog post.
You may want to see http://www.siao2.com/2006/10/19/842895.aspx on how to make more fonts appear amongst those you can choose as the main console font.
Also, make sure that your application really uses a Unicode codepage for its output - http://illegalargumentexception.blogspot.com/2009/04/i18n-unicode-at-windows-command-prompt.html probably explains the issue better than I could (or, at the very least, as well as I could).
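On the "make sure your application really uses a Unicode codepage" point, a small illustrative C sketch is to switch the console's output code page to UTF-8 before printing (the chosen font still has to contain the glyphs, which is where a font like DejaVu Sans Mono helps):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Tell the console to interpret this program's output bytes as UTF-8. */
        SetConsoleOutputCP(CP_UTF8);
        printf("\xE2\x88\x9A" "2 \xE2\x89\x88 1.414\n");  /* "√2 ≈ 1.414" as raw UTF-8 bytes */
        return 0;
    }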
I just found that ActiveState Tcl does a really good job with tkcon.
When starting tkcon.tcl, I just have to type:
encoding system utf-8
It works well and even has tab completion. Of course, it is a Tcl shell and not a system shell.
It seems to be able to find characters for all of the symbols I am currently using in the test suite for my application.
While working under Windows, I use the DejaVu Sans Mono font along with Console for getting better Unicode (UTF-8) support.
