Why does Windows use ANSI code pages instead of Unicode?

When I run the chcp command in a cmd.exe window, it reports the code page Windows is using.
But I thought Windows uses the Unicode character set.
So, my questions are:
Why does Windows use ANSI codepages instead of Unicode?
Does Windows use UTF-16 or UCS-2? How can I check this (by a command or an MSDN link)?
Are UTF-16 and UCS-2 just encodings, or are they also character sets?
Do UTF-8, UTF-16, UTF-32, etc. differ in the size of the character set they cover?
I'm confused; could somebody please define these terms?

Historical reasons, and backwards compatibility. Windows itself is a Unicode-based OS, and has been since the NT days. But many legacy (and even current) apps are not written for Unicode. Unicode-enabled apps do not use ANSI codepages, unless they need to convert runtime data between ANSI and Unicode.
Microsoft switched to UTF-16 in Windows 2000. Before that, it used UCS-2. See Unicode in Microsoft Windows.
Both UTF-16 and UCS-2 are just encodings of the same Unicode character set. UTF-16 was invented to support encoding codepoints above U+FFFF, which UCS-2 cannot handle.
All UTFs (including many you haven't named) are just encodings of the same Unicode character set. The number in the name is the size in bits of the encoded code units (UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, etc.).
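To make this concrete, here is a minimal sketch (plain standard C++, compiled as C++11..17, no Windows APIs; the code point U+20AC EURO SIGN is just an example) that prints the code units the same character occupies in UTF-8 and in UTF-16:

    #include <cstdio>
    #include <string>

    int main() {
        // The same code point, U+20AC, in two encodings of the same character set.
        std::string    utf8  = u8"\u20AC"; // three 8-bit code units: E2 82 AC
        std::u16string utf16 = u"\u20AC";  // one 16-bit code unit:   20AC

        std::printf("UTF-8  uses %zu code units:", utf8.size());
        for (unsigned char b : utf8) std::printf(" %02X", b);
        std::printf("\nUTF-16 uses %zu code unit:  %04X\n", utf16.size(), (unsigned)utf16[0]);
    }

Both spell the same character from the same character set; only the size and number of the code units differ.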

Related

Unicode filenames on FAT-32?

As far as I understand, NTFS supports Unicode filenames (UTF-16, as Microsoft claims?).
But the official MSDN documentation is very vague about which codepage(s) are used to store filenames (filepaths) on FAT-32.
Here it says that OEM code page (CP437 I assume) is used to store filenames: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317748.aspx
But here it turns out that there can be different OEM codepages with CP437 being one of them: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752.aspx
And we all know that utilities like mount support many more codepages for FAT than just the OEM codepage set.
So what is the actual codepage for FAT-32 filenames? Does it depend on the system codepage at the time the FAT volume was created? Can FAT support true Double Byte Character Set codepages like UTF-16? Or are Multi Byte Character Set codepages like UTF-8 the limit?
And a more specific question:
What happens when I use the CreateFileW function (which, as MSDN states, uses UTF-16 for filenames) to create a file on a FAT-32 volume?
You might have to experiment here. This is a great question, and I'm not 100% confident, but:
So what is the actual codepage for FAT-32 filenames? It depends on the system codepage at the time when FAT volume was created?
The "OEM codepage", whatever that is for the system.
Can FAT support true Double Byte Character Set codepages like UTF-16? Or Multi Byte Character Set codepages like UTF-8 is the limit?
No, I don't believe FAT is directly capable of either UTF-16 or UTF-8. That said, Microsoft stores the Unicode filename out of band, so a file has two filenames. (This is also how you can get filenames longer than 8.3 characters.)
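If you want to see both names programmatically, here is a hedged sketch using FindFirstFileW: WIN32_FIND_DATAW carries the long name in cFileName and the 8.3 name (when one was generated) in cAlternateFileName. The search pattern is only an example:

    #include <windows.h>
    #include <cwchar>

    int main() {
        WIN32_FIND_DATAW fd;
        HANDLE h = FindFirstFileW(L"asdf*.txt", &fd); // example pattern
        if (h == INVALID_HANDLE_VALUE) return 1;
        do {
            // Long (Unicode) name and short 8.3 name of the same directory entry.
            std::wprintf(L"long: %ls   short: %ls\n",
                         fd.cFileName,
                         fd.cAlternateFileName[0] ? fd.cAlternateFileName : L"(none)");
        } while (FindNextFileW(h, &fd));
        FindClose(h);
    }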
And a more specific question: what happens when I use the CreateFileW function (which, as MSDN states, uses UTF-16 for filenames) to create a file on a FAT-32 volume?
The Unicode filename, as passed to CreateFileW, is stored directly in the out-of-band long filename. For the ordinary (short) directory entry it is re-encoded into the OEM codepage (whatever that happens to be on the system). If it cannot be converted into the OEM codepage, or exceeds 8.3 characters, Windows will call the file something like FILENA~1.TXT.
Some citations for these answers:
First, this page tells us that the OEM code page != the Windows code page:
Non-Unicode applications that create FAT files sometimes have to use the standard C runtime library conversion functions to translate between the Windows code page character set and the OEM code page character set. With Unicode implementations of the file system functions, it is not necessary to perform such translations.
On a typical American system, the OEM code page is CP437, but the Windows code page is Windows-1252. (The FooA calls, I believe, use the Windows code page, which depends on the locale.)
If you have a FAT volume available, you can see this in action. The character "Σ" (U+03A3) is not present in Windows-1252, but it is in CP437. You can see both the short and long filenames with dir /X. With a file named asdfΣ.txt, you'll see:
ASDFΣ.TXT asdfΣ.txt
However, with a file named "asdfΛ.txt" (Λ is not present in either CP437 or Windows-1252), you'll see:
ASDF~1.TXT asdf?.txt
(You'll likely see ?, because cmd.exe's font cannot display a Λ.)
For information about long filenames, see this Wikipedia article.
Also, interestingly, if you name a file "asdf©.txt", you might get:
ASDFC.TXT asdfc.txt
… I'm not 100% sure here, but I think Windows cleverly decided to substitute "c" for ©, and did likewise for displaying it. If you change the font to something not raster based, like Consolas, you'll see:
ASDFC.TXT asdf©.txt
And this is why you should use the FooW functions.
The basic FAT or FAT32 directory entries support only short names (the old DOS 8.3 format) in the current OEM codepage. However, VFAT (FAT with long-filename support), which Windows uses, can store an additional, so-called long filename for each file, in UTF-16.

Ensuring consistency when encoding to UTF8 from extended ASCII

Maybe this is a non-issue but I look to the collected wisdom of SO to help me find out.
We're trying to ensure encodings are consistent across platforms. The way to go is clearly UTF-8. However, some platforms unfortunately use extended ASCII (typically some form of Windows codepage). We're concerned that when encoding something with, say, an umlaut from a Windows codepage to UTF-8, there are multiple possible choices within UTF-8 for that character.
On a different platform (Linux, Mac OS), how do we ensure that the UTF8 character chosen there is consistent?
As I said, maybe this is a non-issue. Maybe there is some standard mapping I'm unaware of. We haven't seen any problems but a colleague just raised the concern so I'm on the hunt for information.
Thank you all in advance.
As long as you properly convert the original text to Unicode first and then use UTF-8 to store/transfer the data, there should be no problems.
The Unicode Consortium has compiled a set of mapping tables. Nominally informational, they constitute a de facto standard. Moreover, many of the mappings there reflect formal standards, as it has become normal to define any new character encoding in terms of Unicode, i.e. by specifying the Unicode number (and/or Unicode name) of each character.
Once a character has been mapped to Unicode (i.e., to a Unicode code point, or Unicode number), its encoding in each Unicode encoding, such as UTF-8, has been defined unambiguously.
So the issue is how you ensure that the conversion routines you use work according to those tables. Using ICU can be regarded as safe in this respect.
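For illustration, a minimal sketch of what that might look like with ICU's C++ API, assuming ICU is available and that the source encoding is Windows-1252 (substitute whatever your platform actually uses):

    #include <unicode/unistr.h>
    #include <string>

    // Sketch: legacy bytes -> Unicode (via ICU's mapping tables) -> UTF-8.
    std::string to_utf8(const char* bytes, const char* sourceEncoding = "windows-1252") {
        icu::UnicodeString u(bytes, sourceEncoding); // bytes -> Unicode
        std::string utf8;
        u.toUTF8String(utf8);                        // Unicode -> UTF-8
        return utf8;
    }

Because both hops go through ICU's tables, the same input bytes produce the same UTF-8 on every platform you build this on.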
P.S. There is no extended ASCII. There are various character encodings, some of which coincide with ASCII in the range from 0 to 0x7F, some don’t.

"Windows uses UTF-16 as its internal encoding", what exactly does this mean?

Excuse me if the question is stupid; it has me rather confused. Suppose I have an application (in C, C++, .NET, or Java) on my Windows XP machine, and this application gets data from a remote machine that contains Chinese characters. If the Chinese characters come out as junk, is it correct to say that Windows has nothing to do with the issue, because Windows uses UTF-16 and can handle Chinese characters properly?
On the other hand, if Windows used ASCII as its internal encoding, would that mean that no application on it could ever display Chinese characters correctly?
Thanks in advance.
The Windows NT kernel uses UNICODE_STRING for many (or is it most?) named objects (e.g. files). The encoding is UTF-16.
Many of the user-mode callable APIs come in pairs of almost identical functions, where one of the pair accepts Unicode strings and the other ANSI strings. The ANSI versions end up converting names from ANSI to Unicode.
For example, when you call C's fopen() function, which accepts 8-bit non-Unicode file names, it ends up invoking CreateFileA() (ANSI), and that eventually calls NtCreateFile(), which accepts Unicode file names. One of NtCreateFile()'s parameters, the OBJECT_ATTRIBUTES structure, contains a pointer to a UNICODE_STRING structure.
If you, on the other hand, call MSVC++'s _wfopen() function, it will reach NtCreateFile() through CreateFileW() (Unicode) without the conversion.
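As a hedged illustration of the two paths (the file names are made up for the example):

    #include <cstdio>
    #include <cwchar>

    int main() {
        // Narrow CRT call: the bytes are interpreted in the current ANSI code page
        // on their way through CreateFileA -> NtCreateFile.
        FILE* a = std::fopen("ansi_name.txt", "w");
        if (a) std::fclose(a);

        // Wide CRT call (MSVC-specific): the UTF-16 name reaches
        // CreateFileW -> NtCreateFile without any conversion.
        FILE* w = _wfopen(L"unicode_name_\u03A3.txt", L"w");
        if (w) std::fclose(w);
    }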
To store any text in memory and display it on screen, the OS needs to handle that text in some encoding behind the scenes. What encoding that is specifically shouldn't matter to you. It could handle it as HTML encoded ASCII for all you know, as long as the APIs accept certain text and it outputs the right thing.
"Windows uses UTF-16 internally" means Windows happens to store and handle text internally as UTF-16. It also supports Chinese text. These two things aren't necessarily connected. Yes, using UTF-16 internally makes it easier to support Chinese, which is probably why the Windows engineers chose to go with UTF-16.

Windows API: ANSI and Wide-Character Strings -- Is it UTF8 or ASCII? UTF-16 or UCS-2 LE?

I'm not quite pro with encodings, but here's what I think I know (though it may be wrong):
ASCII is a 7-bit, fixed-length encoding, with the characters you can find in ASCII charts.
UTF8 is an 8-bit, variable-length encoding. All characters can be written in UTF8.
UCS-2 LE/BE are fixed-length, 16-bit encodings that support most common characters.
UTF-16 is a 16-bit, variable-length encoding. All characters can be written in UTF16.
Are those above all correct?
Now, for the questions:
Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
(I had more questions, but this is enough... I forgot some of them anyway...)
These are a lot of questions, so any links to explanations about how all these connect (aside from reading the Unicode standard, which won't help with the Windows API anyway) would also be greatly appreciated.
Thank you!
Are those above all correct?
Yes, if you don't assume the existence of characters not encoded in Unicode (for most practical applications, this assumption is fine).
Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
They take byte strings (i.e., strings whose code unit is a byte, which is always an octet on Windows) encoded in the current "ANSI"/MBCS/legacy encoding. "ANSI" is the historical term for these encodings, but it is not correct. For Western Windows systems, this encoding is usually Windows-1252.
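As a sketch of the A/W split (MessageBox is used here only because it needs no window handle; SetWindowTextA/W pair up the same way, and Windows-1252 is assumed as the "ANSI" code page):

    #include <windows.h>

    int main() {
        // "A": a byte string, interpreted in the current ANSI code page
        // (0xE9 is 'é' in Windows-1252).
        MessageBoxA(nullptr, "caf\xE9", "A version", MB_OK);

        // "W": the same text as a UTF-16 string, independent of any code page.
        MessageBoxW(nullptr, L"caf\u00E9", L"W version", MB_OK);
    }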
Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
Since Windows 2000, most of them support UTF-16. The name "wide" and the rest of the Microsoft terminology (e.g., "Unicode" meaning "UTF-16" or "UCS") were chosen before the modern Unicode standard unified the terminology.
In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
Every other encoding that WideCharToMultiByte supports is a "multi-byte encoding" in this context, including Windows-1251 and UTF-8.
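In other words, the CodePage parameter is what selects the "multi-byte" encoding. A small sketch (buffer sizes are arbitrary, purely for illustration):

    #include <windows.h>

    void demo(const wchar_t* wide) {
        char ansi[64], utf8[64];
        // Current "ANSI" code page (e.g. Windows-1252 on Western systems).
        WideCharToMultiByte(CP_ACP,  0, wide, -1, ansi, sizeof(ansi), nullptr, nullptr);
        // UTF-8: same input string, different "multi-byte" output bytes.
        WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, sizeof(utf8), nullptr, nullptr);
    }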
Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
LPWSTR is a pointer to wchar_t which is always a 16-bit unsigned integer on Windows. Which characters can be displayed is unrelated to the encoding as long as that encoding can encode all Unicode characters. Windows is generally able to display non-BMP characters, but not everywhere (e.g., the console cannot).
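A quick sketch of what that means in practice on Windows, where wchar_t is 16 bits; U+1F600 is just an example of a non-BMP character:

    #include <cwchar>

    int main() {
        const wchar_t* s = L"\U0001F600";                    // one code point...
        std::wprintf(L"code units: %zu\n", std::wcslen(s));  // ...but two code units
        std::wprintf(L"%04X %04X\n", (unsigned)s[0], (unsigned)s[1]); // D83D DE00, a surrogate pair
    }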
Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
I don't really know, but I don't think they differ much. I suppose you could just try converting some non-BMP character to UTF-8 and check whether the result is correct.
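A sketch of that test (U+1F600 is the example non-BMP character; the expected UTF-8 is F0 9F 98 80; note that the ".UTF8" locale name is only honoured by newer Windows CRTs):

    #include <windows.h>
    #include <clocale>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const wchar_t* s = L"\U0001F600";

        char viaApi[8] = {};
        WideCharToMultiByte(CP_UTF8, 0, s, -1, viaApi, sizeof(viaApi), nullptr, nullptr);

        // wcstombs converts according to the CRT locale; whether it handles the
        // surrogate pair correctly is exactly what you'd be testing.
        std::setlocale(LC_ALL, ".UTF8");
        char viaCrt[8] = {};
        std::wcstombs(viaCrt, s, sizeof(viaCrt));

        for (int i = 0; i < 4; ++i)
            std::printf("%02X/%02X ", (unsigned char)viaApi[i], (unsigned char)viaCrt[i]);
        std::printf("\n");
    }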
Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
File paths are indeed opaque arrays of UTF-16 characters, meaning that Windows doesn't perform any kind of translation when storing or reading file names (like Linux and unlike Mac OS X). But Windows still has its weird mostly-undefined case insensitive behavior which causes much trouble because file names that are treated equivalent aren't necessarily equal. That breaks many invariants; for example, on Linux without interference from other threads, if you successfully create two files A and a in some directory, you'll end up with two distinct files, while on Windows you get only one file (and in general, an unpredictable number of files).
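A small sketch of that behaviour (file names are examples; the second CREATE_NEW typically fails because Windows treats the two names as equivalent):

    #include <windows.h>
    #include <cstdio>

    int main() {
        HANDLE h1 = CreateFileW(L"A.txt", GENERIC_WRITE, 0, nullptr,
                                CREATE_NEW, FILE_ATTRIBUTE_NORMAL, nullptr);
        HANDLE h2 = CreateFileW(L"a.txt", GENERIC_WRITE, 0, nullptr,
                                CREATE_NEW, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h2 == INVALID_HANDLE_VALUE && GetLastError() == ERROR_FILE_EXISTS)
            std::printf("'a.txt' collided with 'A.txt'\n");
        if (h1 != INVALID_HANDLE_VALUE) CloseHandle(h1);
        if (h2 != INVALID_HANDLE_VALUE) CloseHandle(h2);
    }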
What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
ANSI is the American standardization organization. Using this word when referring to encodings is a misnomer, but a frequent one, so you should be aware of it. I prefer the term legacy 8-bit encoding, because I think that's essentially what it is: a non-Unicode encoding that is kept only for compatibility with legacy (Windows 9x) applications. On Western systems, this is usually Windows-1252, which is a proper superset of ASCII.
Wide strings used to be UCS-2. From Windows 2000, wide strings are UTF-16. Good to know if you need to maintain some old legacy system.
*A functions use the active ANSI codepage.
*W functions use UTF-16.
Multi-byte refers to whatever is passed in the CodePage parameter. It is most commonly either the active ANSI codepage or UTF-8.
LPWSTR is a UTF-16 string which may or may not be null-terminated (see MSDN)
I don't know anything about wcstombs, I always use WideCharToMultiByte.
File paths are in UTF-16. In fact all text is UTF-16 internally in Windows.
For ANSI encoding you will need to read up on that in some detail. You could do worse than to start with Wikipedia and follow the links from there.
I hope that helps and that if I've got anything wrong, anyone who knows more please do edit this to correct any errors!
First of all you'll find plenty of information in this SO topic.
ASCII is a charset, not an encoding. Now, there are a number of 8-bit charsets, one of which is set as the system default (you can change it in Regional Settings). *A functions accept 8-bit characters in that charset. UTF-8 is not a charset, but an encoding of the Unicode charset. *W functions, as I understand it, use UTF-16 rather than UCS-2.

Converting Multibyte characters to UTF-8

My application has to write data to an XML file which will be read by an SWF file. The SWF expects the data in the XML to be in UTF-8 encoding. I have to convert some multibyte characters in my app (Simplified Chinese, Japanese, Korean, etc.) to UTF-8.
Are there any API calls which would allow me to do this? I would prefer not to use any third-party DLLs. I need to do this both on Windows and on Mac, and would prefer system APIs if available.
Thanks
jbsp72
UTF-8 is a multibyte encoding (well, a variable byte-length encoding, to be precise). Stating that you need to convert from a multibyte encoding is not enough; you need to specify which multibyte encoding your source is:
I have to convert some Multibyte characters in my app (Chinese simplified, Japanese, Korean etc..) to UTF-8.
If your original string is in a multibyte encoding (Chinese/Arabic/Thai/etc.) and you need to convert it to another multibyte encoding (UTF-8), one way is to convert to wide characters (UTF-16) first, then convert back to multibyte:
multibyte(chinese/arabic/thai/etc) -> widechar(UTF-16) -> multibyte(UTF-8)
If your original string is already in Unicode (UTF-16), you can skip the first conversion in the above illustration.
You can look up the code page identifiers on MSDN.
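A hedged sketch of that two-step conversion on Windows, using only MultiByteToWideChar and WideCharToMultiByte; the source code page is whatever your data really is (e.g. CP_ACP for the active ANSI code page, or 932 for Shift-JIS):

    #include <windows.h>
    #include <string>

    std::string MultiByteToUtf8(const std::string& in, UINT codepage) {
        // Step 1: legacy bytes -> UTF-16 (query the required length first).
        int wlen = MultiByteToWideChar(codepage, 0, in.data(), (int)in.size(), nullptr, 0);
        std::wstring wide(wlen, L'\0');
        MultiByteToWideChar(codepage, 0, in.data(), (int)in.size(), &wide[0], wlen);

        // Step 2: UTF-16 -> UTF-8.
        int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, nullptr, 0, nullptr, nullptr);
        std::string utf8(ulen, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, &utf8[0], ulen, nullptr, nullptr);
        return utf8;
    }

These two calls are Windows-only; on the Mac you would need the platform's native string conversions instead, which is what the Chrome files mentioned in the next answer wrap.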
Google Chrome has some string conversion implementations for Windows, Linux, and Mac. You can see them here or here. The files are under src/base:
+ sys_string_conversions.h
+ sys_string_conversions_linux.cc
+ sys_string_conversions_win.cc
+ sys_string_conversions_mac.mm
The code uses BSD license so you can use it for commercial projects.
