How to enable pseudo-locales in Windows for testing? - windows

Windows Vista introduced the concept of three pseudo-locales:
Pseudo-locale        Locale name  LCID
===================  ===========  ======
Base                 qps-ploc     0x0501
Mirrored             qps-plocm    0x09ff
East Asian-language  qps-ploca    0x05fe
Enabling the Base locale is useful because you can check that your application is using the current locale for formatting items such as dates, times, numbers, and currency.
For example, when the current locale is set to Base, a date will be formatted as:
[Шěđлеśđαỳ !!!], 8 ōf [Μäŕςћ !!] ōf 2006
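You don't even need to switch the user locale to see this: the NLS *Ex functions accept a locale name directly. Here is a minimal sketch (mine, not from the original post), assuming Windows Vista or later where qps-ploc is built in:

// Sketch: format the current date with the Base pseudo-locale by name.
#include <windows.h>
#include <stdio.h>

int main()
{
    wchar_t buffer[256];
    int len = GetDateFormatEx(L"qps-ploc",   // locale name; no registry change needed
                              DATE_LONGDATE, // the locale's long date format
                              nullptr,       // nullptr = current local date
                              nullptr,       // nullptr = use the locale's format picture
                              buffer, 256,
                              nullptr);      // reserved calendar parameter
    if (len > 0)
        wprintf(L"%s\n", buffer);            // something like the pseudo date above
    return 0;
}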
Builds of Windows are actually done in the pseudo language first, and then localized into English:
Engineering Windows 7 for a Global Market
Pseudo-Localization
To prevent common globalization bugs, pseudo-localized builds were created. Pseudo-localization is a process that creates a localized product in an artificial language. That language is identical to English except that each character is written with a different character that visually resembles the English character. Except for being entirely machine generated, we create the pseudo-localized builds exactly the same way as we create the localized builds. Because even monolingual US software developers can read pseudo-localized text, it has proven to be an excellent way to find globalization problems early in the development cycle. In the Windows 7 beta, some UI elements were still in their pseudo-localized form, causing some interesting theories about what the meaning might be. We hope we have solved the mystery with this blog post. :-)
Control Panel Dialog in Pseudo-localized Windows 7
Another value in using these locales: they test that your application doesn't assume that a 16-bit LANGID is made up of:
an 8-bit primary language ID
an 8-bit sublanguage ID
when in reality a LANGID is made up of:
a 10-bit primary language ID
a 6-bit sublanguage ID
or graphically:
+------------------+-----------------------+
|  Sublanguage ID  |  Primary Language ID  |
+------------------+-----------------------+
 15              10 9                     0  bit
These three pseudo-locales finally walk off the end of the 8th bit (something Microsoft has been wary of doing, for fear of breaking buggy applications).
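The winnt.h macros make the correct split explicit. A small sketch (mine) showing why the 8/8 assumption breaks on the pseudo-locales:

// Sketch: decompose the Mirrored pseudo-locale's LANGID (0x09ff).
#include <windows.h>
#include <stdio.h>

int main()
{
    LANGID langid = 0x09FF;                  // qps-plocm

    printf("primary=0x%03X sub=0x%02X\n",
           PRIMARYLANGID(langid),            // low 10 bits -> 0x1FF
           SUBLANGID(langid));               // high 6 bits -> 0x02

    // The wrong 8/8 assumption yields meaningless values:
    printf("wrong primary=0x%02X wrong sub=0x%02X\n",
           langid & 0xFF, (langid >> 8) & 0xFF);

    // MAKELANGID reassembles the LANGID from the correct parts.
    printf("reassembled=0x%04X\n",
           MAKELANGID(PRIMARYLANGID(langid), SUBLANGID(langid)));
    return 0;
}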
How do I enable pseudo-locales in Windows?
See also
MSDN: Pseudo-Locales
MSDN: Using Pseudo-Locales for Localization Testing
MSDN Blogs: Pseudo Locales in Windows Vista Beta 2
MSDN Blogs: One of my colleagues is the "Pseudo Man" (a rich source of puns in conversation!)
MSDN Blogs: Walking off the end of the eighth bit

How do I enable pseudo-locales in Windows?
Initially the four pseudo-locales are not visible in the Control Panel (archive.org):
Note that NLS does not automatically enumerate the pseudo-locales or expose them in the regional and language options portion of the Control Panel. They are only enumerable if values are set in the registry.
You enable them by adding registry values under the following key:
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Nls\Locale]
"00000501"="1" // qps-ploc (Windows Vista and later)
"000005fe"="7" // qps-ploca (Windows Vista and later)
"00000901"="1" // qps-Latn-x-sh (Windows 10 and later)
"000009ff"="d" // qps-plocm (Windows Vista and later)
Which can be done in RegEdit:
Then you can go to Regional and Language Options in the Control Panel:
and select the pseudo-locale:
The three different pseudo-locales are for testing three kinds of locales:
Base: The qps-ploc locale is used for English-like pseudo-localizations. Its strings are longer versions of English strings, using non-Latin and accented characters instead of the normal script. Additionally, simple Latin strings should sort in reverse order with this locale (see the sketch after this list).
Mirrored: qps-plocm is used for right-to-left pseudo data, which is another area of interest for testing.
East Asian: qps-ploca is intended to utilize the large CJK character repertoire, which is also useful for testing.
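The reverse-sort behaviour of qps-ploc mentioned above is easy to verify programmatically. A hedged sketch (mine, not from the original answer), assuming the locale is available on the machine:

// Sketch: under qps-ploc, simple Latin strings should sort in reverse order,
// so "apple" is expected to compare as greater than "banana".
#include <windows.h>
#include <stdio.h>

int main()
{
    int r = CompareStringEx(L"qps-ploc", 0,
                            L"apple", -1,
                            L"banana", -1,
                            nullptr, nullptr, 0);
    switch (r)
    {
    case CSTR_LESS_THAN:    printf("apple < banana\n");  break;
    case CSTR_EQUAL:        printf("apple == banana\n"); break;
    case CSTR_GREATER_THAN: printf("apple > banana\n");  break; // expected here
    default:                printf("CompareStringEx failed: %lu\n", GetLastError());
    }
    return 0;
}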
Warning: Do not try to change the "System Locale":
to a new pseudo-locale:
Otherwise after the reboot:
Windows will fail to start:
And the only fix will be to manually edit the registry from the Recovery Console; restoring the old en-US locale.
Warning
Pseudo-locales are meant to find localization bugs in your own software. Unfortunately, they will also let you find bugs in other people's software, including Microsoft's:
SQL Server Management Studio¹ crashes when presented with other locales (Microsoft Connect):
Microsoft Excel will no longer let you enter functions (the comma used to separate parameters no longer works)
Visual Studio will no longer let you edit comma separated properties
The SQL Server Management Studio diagram designer reports an error
.NET has a bug in the date and time formatting, showing 22////11////2011 4::::42::::53 P̰̃M]
Windows Event Viewer:
Task Scheduler:
SQL Server Management Studio:
Good luck with getting Microsoft to dogfood their own product.
¹ 10.50.1617.0
Update 4//10/2012:
Trying to Edit top 200 rows of a table in SQL Server Management Studio:
Executed SQL statement SELECT TOP (200) ...
Error Source: Microsoft.SqlServer.Management.DataTools
Error Message: Object reference not set to an instance of an object
This is fixed by changing the Negative sign symbol from -- to -.
Bonus Reading
Pseudo Locales in Windows Vista Beta 2 (archive.is)
How do you test your app for Iñtërnâtiônàlizætiøn? (Internationalization?)
Michael Kaplan: One of my colleagues is the "Pseudo Man" (a rich source of puns in conversation!) (RIP) (archive.is)
https://en.wikipedia.org/wiki/Pseudolocalization
MSDN: Using pseudo-locales for localizability testing archive

You can also change Internet Explorer's Accept-Language setting to request the qps-ploc language:
You can use this to test that your web site supports the pseudo-locale, and to check for any missing localizations:
You can see I missed two bits of text in this sample web site.

It looks like, rather than fixing the localization bugs in .NET, SQL Server, Excel, etc., Microsoft changed the pseudo-locale in Windows 10 to mask the bugs:
Item                   Windows 7                  Windows 10
=====================  =========================  =====================
Locale Identifier      0x0501 (1281)              0x0501 (1281)
Locale Name            qps-ploc                   qps-ploc
Example Number         --123,,4567,,8901          -123,,4567,,8901
Example Currency       --$$123,,4567,,8901..00    -$123,,4567,,8901.000
Example Float          --123,,4567,,8901..00      -123,,4567,,8901.000
Example Date           9//08//2015                9/8/2015
Example Time           9::51::17 АΜ               9:45:09
Example DateTime       9//08//2015 9::51::17 АΜ   9/8/2015 9:45
LOCALE_SLANGUAGE       Pseudo Language (Pseudo)   Pseudo (Pseudo)
LOCALE_SENGLANGUAGE    Pseudo Language            Pseudo
LOCALE_SDECIMAL        ..                         .
LOCALE_SCURRENCY       $$                         $
LOCALE_SMONDECIMALSEP  ..                         .
LOCALE_SDATE           //                         /
LOCALE_STIME           ::                         :
LOCALE_SSHORTDATE      d//MM//yyyy                d/MM/yy
LOCALE_STIMEFORMAT     h::mm::ss tt               H:mm:ss
LOCALE_ITIME           0                          1
LOCALE_ICENTURY        1                          0
LOCALE_SNEGATIVESIGN   --                         -
I can understand not wanting to fix your bugs because it's too hard. But you should have been forced to wear your shame for all to see.
Instead you cop out and try to hide your failure. That's just bad.

Windows 10 1803
Unfortunately, as of Windows 10 1803, it appears to no longer be possible to enable these locales: archive
For Windows 10, version 1803, editing the Windows Registry like this has no effect. But you can still call the non-enumerating NLS APIs with the names of the pseudo-locales (see the code examples above) to populate your user interface (UI).
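"Calling the non-enumerating NLS APIs with the names of the pseudo-locales" boils down to passing the locale name string to the *Ex functions (the code examples the quote refers to live on the linked MSDN page). A small sketch (mine):

// Sketch: even when qps-ploc is no longer enumerated (Windows 10 1803+),
// the locale is still built in and can be used by name.
#include <windows.h>
#include <stdio.h>

int main()
{
    wchar_t name[256], decimal[8];

    // Display name of the locale, e.g. "Pseudo (Pseudo)".
    if (GetLocaleInfoEx(L"qps-ploc", LOCALE_SLOCALIZEDDISPLAYNAME, name, 256) > 0)
        wprintf(L"display name: %s\n", name);

    // Pseudo decimal separator ("." on Windows 10, ".." on Windows 7, per the table above).
    if (GetLocaleInfoEx(L"qps-ploc", LOCALE_SDECIMAL, decimal, 8) > 0)
        wprintf(L"decimal separator: %s\n", decimal);
    return 0;
}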
According to "Unable to use psuedo locales after 1803 Win 10 update" (archive):
Hi all, I broke how the pseudo locales enumerate, my bad, very sorry about that :(
Note that they still work as they are built-in to Windows, “just” that they don't show up in the enumeration - so they don't show up in the drop down - so that makes them a bit trickier to use. I'm working to find a workaround.
Basically, if you copy the registry values from Computer\HKEY_CURRENT_USER\Control Panel\International (not the subkeys) from a machine using the appropriate pseudo locale, then that should be used for further processes, even if it is not enumerated.
Shawn Steele (MSFT)
[Шěđлеśđαỳ !!!], 18 ōf [Јúłў !!] ōf 2018
Registry values for manual config
Here are the exported values from a 1607 system. They can be put into a .reg file for easy import.
If using a .reg file, the following header is required:
Windows Registry Editor Version 5.00
Pseudo (Pseudo) [qps-ploc]
HKCU_Control Panel_Internaltional - qps-ploc - W7.reg (before MS gave up on fixing their localization bugs)
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Control Panel\International]
"Locale"="00000501"
"LocaleName"="qps-ploc"
"s1159"="АΜ"
"s2359"="P̰̃M]"
"sCountry"="Pseudo"
"sCurrency"="$$"
"sDate"="//"
"sDecimal"=".."
"sGrouping"="4;0"
"sLanguage"="ENU"
"sList"=",,"
"sLongDate"="dddd, d 'ōf' MMMM 'ōf' yyyy"
"sMonDecimalSep"=".."
"sMonGrouping"="4;0"
"sMonThousandSep"=",,"
"sNativeDigits"="0123456789"
"sNegativeSign"="--"
"sPositiveSign"="++"
"sShortDate"="d//MM//yyyy"
"sThousand"=",,"
"sTime"="::"
"sTimeFormat"="h::mm::ss tt"
"sShortTime"="h:mm tt"
"sYearMonth"="MMMM yyyy"
"iCalendarType"="1"
"iCountry"="61"
"iCurrDigits"="3"
"iCurrency"="0"
"iDate"="1"
"iDigits"="3"
"NumShape"="1"
"iFirstDayOfWeek"="0"
"iFirstWeekOfYear"="0"
"iLZero"="1"
"iMeasure"="1"
"iNegCurr"="1"
"iNegNumber"="1"
"iPaperSize"="1"
"iTime"="0"
"iTimePrefix"="0"
"iTLZero"="0"
Pseudo (Pseudo Asia) [qps-ploca]
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Control Panel\International]
"Locale"="000005FE"
"LocaleName"="qps-ploca"
"s1159"="午前"
"s2359"="午後"
"sCountry"="Pseudo Asia"
"sCurrency"="¥"
"sDate"="/"
"sDecimal"="."
"sGrouping"="3;0"
"sLanguage"="JPN"
"sList"=","
"sLongDate"="yyyy'年'M'月'd'日'"
"sMonDecimalSep"="."
"sMonGrouping"="3;0"
"sMonThousandSep"=","
"sNativeDigits"="0123456789"
"sNegativeSign"="-"
"sPositiveSign"=""
"sShortDate"="yyyy/MM/dd"
"sThousand"=","
"sTime"=":"
"sTimeFormat"="H:mm:ss"
"sShortTime"="H:mm"
"sYearMonth"="yyyy'年'M'月'"
"iCalendarType"="1"
"iCountry"="81"
"iCurrDigits"="0"
"iCurrency"="0"
"iDate"="2"
"iDigits"="2"
"NumShape"="1"
"iFirstDayOfWeek"="6"
"iFirstWeekOfYear"="0"
"iLZero"="1"
"iMeasure"="0"
"iNegCurr"="1"
"iNegNumber"="1"
"iPaperSize"="9"
"iTime"="1"
"iTimePrefix"="0"
"iTLZero"="0"
Pseudo (Pseudo Mirrored) [qps-plocm]
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Control Panel\International]
"Locale"="000009FF"
"LocaleName"="qps-plocm"
"s1159"="ص"
"s2359"="م"
"sCountry"="Pseudo Mirrored"
"sCurrency"="ر.س.‏"
"sDate"="/"
"sDecimal"="."
"sGrouping"="3;0"
"sLanguage"="ARA"
"sList"=";"
"sLongDate"="dd/MMMM/yyyy"
"sMonDecimalSep"="."
"sMonGrouping"="3;0"
"sMonThousandSep"=","
"sNativeDigits"="٠١٢٣٤٥٦٧٨٩"
"sNegativeSign"="-"
"sPositiveSign"=""
"sShortDate"="dd/MM/yy"
"sThousand"=","
"sTime"=":"
"sTimeFormat"="hh:mm:ss tt"
"sShortTime"="hh:mm tt"
"sYearMonth"="MMMM, yyyy"
"iCalendarType"="23"
"iCountry"="966"
"iCurrDigits"="2"
"iCurrency"="2"
"iDate"="1"
"iDigits"="2"
"NumShape"="0"
"iFirstDayOfWeek"="5"
"iFirstWeekOfYear"="0"
"iLZero"="1"
"iMeasure"="0"
"iNegCurr"="3"
"iNegNumber"="3"
"iPaperSize"="9"
"iTime"="0"
"iTimePrefix"="0"
"iTLZero"="1"
Pseudo (Pseudo Selfhost) [qps-Latn-x-sh]
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Control Panel\International]
"Locale"="00000901"
"LocaleName"="qps-Latn-x-sh"
"s1159"="AM"
"s2359"="PM"
"sCountry"="Pseudo Selfhost"
"sCurrency"="J$"
"sDate"="/"
"sDecimal"="."
"sGrouping"="3;0"
"sLanguage"="ENJ"
"sList"=","
"sLongDate"="dd MMMM, yyyy"
"sMonDecimalSep"="."
"sMonGrouping"="3;0"
"sMonThousandSep"=","
"sNativeDigits"="0123456789"
"sNegativeSign"="-"
"sPositiveSign"=""
"sShortDate"="dd/MM/yyyy"
"sThousand"=","
"sTime"=":"
"sTimeFormat"="HH:mm:ss"
"sShortTime"="HH:mm"
"sYearMonth"="MMMM, yyyy"
"iCalendarType"="1"
"iCountry"="1"
"iCurrDigits"="2"
"iCurrency"="0"
"iDate"="1"
"iDigits"="2"
"NumShape"="1"
"iFirstDayOfWeek"="6"
"iFirstWeekOfYear"="0"
"iLZero"="1"
"iMeasure"="0"
"iNegCurr"="1"
"iNegNumber"="1"
"iPaperSize"="1"
"iTime"="1"
"iTimePrefix"="0"
"iTLZero"="1"

Related

How do I balance script-oriented OpenType features with other OpenType features using DirectWrite?

Full disclosure: I'm working on my libui GUI framework's text API. This wraps DirectWrite on Windows, Core Text on OS X, and Pango (which uses HarfBuzz for OpenType shaping) on other Unixes. One of the text formatting attributes I want to specify is a collection of OpenType features to use, which all three provide; DirectWrite's is IDWriteTypography.
Now, when you draw some text with these libraries, by default you'll get a few useful OpenType features enabled, such as the standard ligatures (liga) like the f+i ligature. I thought this was font-specific, but it turns out this is specific to the script of the text being shaped. Microsoft provides guidelines for all the scripts supported by OpenType (under "Script-specific Development"), and I can see rather complex logic for doing it all in HarfBuzz itself to confirm it.
On Core Text and Pango, if I enable other attributes, they'll be added on top of these defaults. But with DirectWrite, in particular IDWriteTextLayout::SetTypography(), doing so removes the defaults:
The program that produces this output can be found here.
Obviously my first option would be to ask how to get the default features on DirectWrite. Someone did so already on this site, though, and the answer seems to be "no".
I am guessing that DirectWrite is allowing me to be in complete control of the list of features to apply to some text. This is nice, except that I can't do this with the other APIs unless I explicitly disable the default features somehow! Of course, I don't know if this list will ever change, so hardcoding it might not be the best idea.
Even if hardcoding is an option, I could just grab HarfBuzz's list for each script, but a) it's rather complicated b) there are multiple possible shapers for a script, depending on (I think) version compatibility (for instance, Myanmar).
So why not use HarfBuzz's lists to recreate the default list of features for DirectWrite anyway? It seems to want to be accurate to other shapers anyway, so this should work, right? Well, I would need to do two things: figure out what script to use, and figure out which attributes to use on which characters for scripts where the position of a character in the word matters.
DirectWrite provides an interface IDWriteTextAnalyzer that provides facilities to perform shaping. I could use this, but it seems the script data is returned in a DWRITE_SCRIPT_ANALYSIS structure, and the description for the script ID says "The zero-based index representation of writing system script.".
This doesn't help, so I wrote a program to just dump the script numbers for text I type in. Running it on the input string
لللللللللللللاااااااااالا abcd محمد ابن بطوطة‎‎ Отложения датского яруса
yields the output
0 - 26 script 3 shapes 0
26 - 5 script 49 shapes 0
31 - 14 script 3 shapes 0
45 - 2 script 1 shapes 1
47 - 25 script 22 shapes 0
I cannot match these script numbers to anything in any of the Windows headers: if there is a defined number for Arabic, Latin, or Cyrillic in any API, they don't match these. And even if I did get a mapping between script and script number, that still doesn't give me the data to apply intra-word features.
What about Uniscribe? Well, the documentation for the equivalent SCRIPT_ANALYSIS type says that its script ID is an "[opaque] value" whose "value for this member is undefined and applications should not rely on its value being the same from one release to the next". And while I can get a language code to identify the script by, there's still no defined value other than LANG_ENGLISH for "Western" (Latin?) scripts. Are the DirectWrite values the same as the Uniscribe ones? And it seems like I can at least figure out the initial and final states of words by looking at the fLinkBefore and fLinkAfter fields, but is this enough to properly apply attributes per-script?
HarfBuzz does have an experimental DirectWrite backend that isn't intended to be used by real programs; I'm not yet sure whether it has the same feature-clobbering I specified above. If I find out, I'll update this part here.
Finally, if I enter the following equivalent test case to the first one above in something like kaxaml:
<Page
  xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
  xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
  <Grid>
    <FlowDocumentPageViewer>
      <FlowDocument FontFamily="Constantia" FontSize="48">
        <Paragraph>
          afford afire aflight 1/4<LineBreak/>
          <Run Typography.Fraction="1">afford afire aflight 1/4</Run>
        </Paragraph>
      </FlowDocument>
    </FlowDocumentPageViewer>
  </Grid>
</Page>
I see the ligatures being applied properly, even in the latter case:
(The fraction at the end is just to prove that that attribute is being applied.) If I assume XAML uses DirectWrite, then that proves my first option (simply overlaying my custom attributes on top of the defaults) should be possible... (I make this assumption based on the idea that XAML provides a strikingly similar API to Direct2D for drawing 2D graphics, and has a lot of holes filled in where I had to manually write a lot of glue code to do the same things with vanilla Direct2D, so I assume whatever is possible in XAML is possible with Direct2D, and by extension DirectWrite since they were technically introduced together...)
At this point I'm completely lost. I want to at least be predictable across platforms, and I'm not sure how programs are even supposed to, let alone going to, use OpenType features directly or not anyway. Am I making bad expectations of text layout APIs? Will I have to drop IDWriteTextLayout and do all the text shaping and layout myself if I want this?
Or do I have to drop vanilla Windows 7 support and upgrade to the Platform Update DirectWrite feature set? Or even Windows 7 entirely?
After some discussions with Peter Sikking and Ebrahim Byagowi, I went and debugged a more general-purpose program I built quickly to test things, and I figured out what's going on internally.
First, however, I will say this applies to Uniscribe and DirectWrite equally.
As it turns out, DirectWrite is always providing a set of default OpenType features, regardless of what feature set I use! The situation is that the list of default features provided differs depending on whether I load my own features or not, and depending on the shaping engine. For the latn script in horizontal writing mode and for English, this is done with the "generic engine".
If I don't provide any features, the generic engine will load script-specific features. For horizontal latn, this list is
locl
ccmp
rlig
rclt
calt
liga
clig
If I do provide features, the generic engine will use the same default list for all scripts:
locl
ccmp
rclt
rlig
mark
mkmk
dist
So I don't know what to do about this. I could probably just provide liga and a few others myself in libui code (marked as a HACK of course), but this is still weird. I'm not sure what the motivation is either. Either way, this explains the behavior I'm seeing.
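For reference, re-adding the defaults by hand looks roughly like the following. This is a hedged sketch (mine, not libui code) that hard-codes the horizontal latn default list from above before adding one custom feature:

// Sketch: IDWriteTextLayout::SetTypography() replaces the default features,
// so this re-adds the observed horizontal latn defaults plus 'onum'.
#include <dwrite.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

HRESULT ApplyFeatures(IDWriteFactory *factory, IDWriteTextLayout *layout,
                      DWRITE_TEXT_RANGE range)
{
    ComPtr<IDWriteTypography> typography;
    HRESULT hr = factory->CreateTypography(&typography);
    if (FAILED(hr)) return hr;

    // The default list observed above for horizontal latn.
    const UINT32 defaults[] = {
        DWRITE_MAKE_OPENTYPE_TAG('l','o','c','l'),
        DWRITE_MAKE_OPENTYPE_TAG('c','c','m','p'),
        DWRITE_MAKE_OPENTYPE_TAG('r','l','i','g'),
        DWRITE_MAKE_OPENTYPE_TAG('r','c','l','t'),
        DWRITE_MAKE_OPENTYPE_TAG('c','a','l','t'),
        DWRITE_MAKE_OPENTYPE_TAG('l','i','g','a'),
        DWRITE_MAKE_OPENTYPE_TAG('c','l','i','g'),
    };
    for (UINT32 tag : defaults) {
        DWRITE_FONT_FEATURE f = { static_cast<DWRITE_FONT_FEATURE_TAG>(tag), 1 };
        hr = typography->AddFontFeature(f);
        if (FAILED(hr)) return hr;
    }

    // The feature we actually wanted on top of the defaults (old-style figures).
    DWRITE_FONT_FEATURE onum = {
        static_cast<DWRITE_FONT_FEATURE_TAG>(DWRITE_MAKE_OPENTYPE_TAG('o','n','u','m')), 1 };
    hr = typography->AddFontFeature(onum);
    if (FAILED(hr)) return hr;

    return layout->SetTypography(typography.Get(), range);
}

Whether the default list stays stable across Windows versions is exactly the concern raised in the question, so treat it as observed data, not a contract.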
Supposing your question is, in general, about programming, or at least concerns programming, I will try to answer some of your individual questions.
would I have to drop the use of IDWriteTextLayout entirely in my code if I want to be able to add typographical features on top of the defaults?
It depends. If the IDWriteTextLayout interface suits your project in every way except the ease of varying DirectWrite's default typographic features, learn what you need to about typography and create an IDWriteTypography instance suitable for your needs. Developing a custom text layout for the program may require substantial time and effort, especially if the program is supposed to render bidirectional text, complex scripts, inline objects, etc.
It may happen that the tasks of your project require developing a text layout engine for reasons other than just controlling the typographic features used in rendered text. For example, your manager/customer may ask for an implementation of customized line-breaking opportunities or a glyph advance justification algorithm. In this scenario, you will use the IDWriteTextAnalyzer::GetGlyphs method. This method has the parameters DWRITE_TYPOGRAPHIC_FEATURES ** features, const UINT32 * featureRangeLengths, UINT32 featureRanges, and these parameters enable you to supersede the set of "default" typographic features for a range of the text to be rendered (see my answer to the other question What are the default typography settings used by IDWriteTextLayout?). Only the affected features will be altered; the other features keep their "default" values. Moreover, if you omit these parameters in a GetGlyphs call for the next text range (for example, use values of NULL, NULL, 0), the features altered in the previous GetGlyphs call will not be altered by the call for this next range.
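As a rough illustration of the parameter shapes involved (mine, not from the answer), here is how the per-range feature arrays for GetGlyphs are put together; the GetGlyphs call itself, with its many other parameters, is omitted:

// Sketch: build the features / featureRangeLengths / featureRanges arguments
// that IDWriteTextAnalyzer::GetGlyphs accepts for one text range.
#include <dwrite.h>

void BuildFeatureRanges()
{
    // Turn standard ligatures off for one 11-character range.
    DWRITE_FONT_FEATURE noLiga = { DWRITE_FONT_FEATURE_TAG_STANDARD_LIGATURES, 0 };

    DWRITE_TYPOGRAPHIC_FEATURES range0 = { &noLiga, 1 };      // feature list + count

    const DWRITE_TYPOGRAPHIC_FEATURES *features[]     = { &range0 };
    const UINT32                       rangeLengths[] = { 11 };
    const UINT32                       rangeCount     = 1;

    // These would be passed as the features, featureRangeLengths and
    // featureRanges arguments of GetGlyphs; features not listed here keep
    // their default behaviour for that range.
    (void)features; (void)rangeLengths; (void)rangeCount;
}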
the documentation for the equivalent SCRIPT_ANALYSIS type says that its script ID is an "[opaque] value" whose "value for this member is undefined and applications should not rely on its value being the same from one release to the next". And while I can get a language code to identify the script by, there's still no defined value other than LANG_ENGLISH for "Western" (Latin?) scripts.
Strictly speaking, this is not an interrogative statement, but I guess you are dissatisfied with how these Unicode script IDs are defined and how one can use the API with so vaguely defined structures and constants.
It may be off topic, but I will risk a hypothesis on the origin of the "Unicode script ID" values. As of 2010-07-17, Unicode, Inc. had published The Unicode Standard, version 6.0. The standard contained the document
http://www.unicode.org/Public/6.0.0/ucd/PropertyValueAliases.txt, with a section containing a list of scripts. The list goes like this:
# Script (sc)
sc ; Arab ; Arabic
sc ; Armi ; Imperial_Aramaic
etc.
The Arabic script is #1, the Cyrillic script is #20, and the Latin script is #47 in this list. Furthermore, elsewhere I saw this list starting with the scripts Common and Inherited, which places the Arabic script 3rd, the Cyrillic 22nd, and the Latin 49th. These ordinals are familiar to you, aren't they?
Fortunately, we need not rely on the "Unicode script ID" values; we need script properties, not script IDs or abbreviations. The API is self-consistent in that it gives the actual script properties for a text range when we pass the number derived from an AnalyzeScript call to the GetScriptProperties method.

how to operate typeperf on non-English versions of windows?

I found this nice log-creating command line:
typeperf "\Processor(_Total)\% Processor Time"
So far it has worked well for me on an English-language version of Windows 7 (or similar).
When trying the very same thing on a German-language Windows 7, it simply did not work.
How can the same functionality be triggered with that tool on a German (or other-language) Windows 7?
The best line so far for German Windows is this:
Get-Counter '\Prozessor(_Total)\Prozessorzeit (%)'
It produces multi-line output, with the value in question typically printed with a comma as the decimal separator (in contrast to the English dot); for 100% no decimal separator is given at all. Parsing the results down to the value looks a bit difficult.
Having a more generic solution would still be nicer. The web page linked below helped me a bit in understanding what the key problem is.
https://social.technet.microsoft.com/Forums/de-DE/25bc6907-cf2c-4dc8-8687-974b799ba754/powershell-ausgabesprache-umstellen?forum=powershell_de
So far I am not sure if it's possible to make it truly generic using such helpers, e.g. the keyword listing - but I am not that deep into what PowerShell offers; rather, I am skilled in cmd.exe.
Maybe this helps further: you can find out the corresponding counter name in your language by comparing these registry values.
English:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Perflib\009\Counter
Current language:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Perflib\CurrentLanguage\Counter
Each value holds the list of counter IDs and names, so you can match them up and find the right localized name.
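If you can use the PDH C API instead of typeperf, a different way around the localized names is PdhAddEnglishCounter (Vista and later), which accepts the English counter path on any UI language. A hedged sketch (mine, not from the answers above):

// Sketch: read "\Processor(_Total)\% Processor Time" by its English name,
// regardless of the Windows display language. Link against pdh.lib.
#include <windows.h>
#include <pdh.h>
#include <stdio.h>
#pragma comment(lib, "pdh.lib")

int main()
{
    PDH_HQUERY query = nullptr;
    PDH_HCOUNTER counter = nullptr;

    if (PdhOpenQueryW(nullptr, 0, &query) != ERROR_SUCCESS)
        return 1;
    if (PdhAddEnglishCounterW(query, L"\\Processor(_Total)\\% Processor Time",
                              0, &counter) != ERROR_SUCCESS)
        return 1;

    // A percentage counter needs two samples taken over an interval.
    PdhCollectQueryData(query);
    Sleep(1000);
    PdhCollectQueryData(query);

    PDH_FMT_COUNTERVALUE value;
    if (PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, nullptr, &value) == ERROR_SUCCESS)
        wprintf(L"%.2f\n", value.doubleValue);   // numeric value, no locale parsing needed

    PdhCloseQuery(query);
    return 0;
}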

Is it possible to set PercentPositivePattern in Windows UI?

In an application that is internationalized, I have code like this:
_comboBlah.Items.Add(pct.ToString("P0", formatInfo));
where formatInfo is normally from the CultureInfo for the current UI language (which might well be different from the CurrentCulture). I was surprised that when "en" is the UI language (and incidentally also the CurrentCulture), the value of formatInfo.PercentPositivePattern is 0 (meaning that the number and percent sign will be separated by a space). Since this is not the normal way to format percentages in English (U.S.), I went into Regional and Language Settings to see if I could see why it was set this way and change it to format without the space (PercentPositivePattern = 1). I couldn't find any way to set this in the Windows UI. Is there a way? Does anyone know why this is the default? (It's not the default in Excel.) Is there any way around this besides changing it programmatically when "en" is the UI language? Is there any hope that MS got it right for other built-in locales?

Unicode Normalization in Windows

I've been using "Unicode strings" in Windows for as long as I've known about Unicode (i.e. since graduating). However, it has always mystified me that the Win32 API mentions "Unicode" very loosely. In particular, the "Unicode" variant meant by MSDN is UTF-16 (although the "wide char" terminology comes from the fact that it used to be UCS-2, which is not Unicode). However, it makes almost no mention of Unicode normalization.
MSDN has a few pages about Unicode and Unicode normalization forms, and functions to change the normalization form. The page on normalization even says:
Win32 and the .NET Framework support all four normalization forms.
However, I haven't found anywhere in the docs what normalization form is used (or understood) by the Win32 API.
Question 1: what normalization form is used by default for user input (such as an Edit control) and conversion through MultiByteToWideChar()?
Question 2: must the strings passed to Win32API functions be in a particular normalization form, or are the kernel and file system normalization-agnostic?
From the MSDN article Using Unicode Normalization to Represent Strings.
Windows, Microsoft applications, and the .NET Framework generally generate characters in form C using normal input methods. For most purposes on Windows, form C is the preferred form. For example, characters in form C are produced by Windows keyboard input. However, characters imported from the Web and other platforms can introduce other normalization forms into the data stream.
Update: I've included some specific details relating to Question #2.
In regards to the file system, normalization is not required - based on the article Naming Files, Paths, and Namespaces.
There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. Any normalization that your application requires should be performed with this in mind, external of any calls to related Windows file I/O API functions.
In regards to SQL Server, no normalization is required - nor is data normalized when saved in the database. That said, when comparing strings, SQL Server 2000 uses its own string normalization mechanism inside of indexes; but I cannot find specific details on what that is. A SQL Server 2005 article states the same.
One important change in SQL Server 7.0 was the provision of an operating system–independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 would be consistent. This string comparison code was based on the same code that Windows 2000 uses for its own string normalization, and is encapsulated to be the same on all computers and in all versions of SQL Server.
what normalization form is used by default for user input
Depends on your keyboard layout/IME. It's possible to generate normal form C, D, or a crazy mixture of both if you want.
Keyboard layouts tend towards NFC because in the pre-Unicode days they'd've usually been outputting a single byte character in the local code page for each keypress. However there are exceptions.
For example, using the Windows Vietnamese keyboard layout, some diacritics are typed as a single keypress combined with the letter (e.g. circumflex â) and some are typed as a combining diacritical (e.g. grave à). The grapheme a-with-circumflex-and-grave would be typed as a-circumflex followed by combining-grave, ầ, which would be 0xE2,0xCC in Vietnamese code page 1258, and would come out as U+00E2,U+0300 in Unicode.
This isn't in normal form C (which would be ầ U+1EA7 Latin small letter A with circumflex and grave) nor D (which would be ầ U+0061,U+0302,U+0300).
There is generally a cultural preference for NFC in the Windows world and on the web, and for NFD in the Apple world. But it's not rigorously enforced and you should expect to cope with any mixture of combined and decomposed characters.
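When you do need a single form for comparison or hashing, the Win32 NormalizeString function (declared in winnls.h, linked via Normaliz.lib, Vista and later) converts between forms. A small sketch (mine, not from the answers), using the Vietnamese example above:

// Sketch: compose U+00E2 + U+0300 (a-circumflex + combining grave) into the
// single NFC code point U+1EA7.
#include <windows.h>
#include <stdio.h>
#pragma comment(lib, "normaliz.lib")

int main()
{
    const wchar_t mixed[] = L"\x00E2\x0300";

    wchar_t composed[64] = {};
    int written = NormalizeString(NormalizationC, mixed, -1, composed, 64);
    if (written <= 0)
        return 1;               // in real code, call with a 0-length buffer first to size it

    wprintf(L"first code point: U+%04X\n", composed[0]);   // expected: U+1EA7

    // IsNormalizedString lets you skip the conversion when it isn't needed.
    wprintf(L"input already NFC? %d\n",
            IsNormalizedString(NormalizationC, mixed, -1));
    return 0;
}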
are the kernel and file system normalization-agnostic?
Yes, the kernel and filesystem don't know anything about normalisation and will quite happily allow you to have files with the names ầ.txt, ầ.txt and ầ.txt in the same folder.
First of all, thanks for an excellent question. I found the answer in Michael Kaplan's blog:
But since all of the methods of text input on Windows tend to use the same normalization form already (form C), ...

Flex 4 Combo - using IME

I am trying to use an IME (for hiragana input) in a Flex 4 Spark combo.
On creationComplete I am setting the following:
cbx_text.textInput.imeMode = IMEConversionMode.JAPANESE_HIRAGANA;
And to check, tracing the following:
trace(cbx_text.textInput.enableIME); returns true;
trace(cbx_text.textInput.imeMode); returns JAPANESE_HIRAGANA;
However, when I select the text input and start to type some text I am unable to switch to hiragana.
I can set it to work on a textinput component with no problems.
<s:TextInput imeMode="JAPANESE_HIRAGANA"></s:TextInput>
Has anyone had any experience with this?
Any insights much appreciated.
Although I haven't had any experience with IME, I took a quick look at the documentation: http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/flash/system/IME.html
Could it be that it's not enabled application-wide? That maybe what returns true is only valid for the component you are tracing from?
Obvious questions first:
Are you certain the TextInput is a member of cbx_text? I know this seems silly, but it's best to eliminate the obvious first.
Do you have an IME enabled on your computer? For example, do you regularly type in hiragana on your computer and have the appropriate language pack enabled?
Are you sending the IME the string appropriately? IME.setCompositionString() for Windows computers?
Does your OS support the use of IMEs? Linux only supports the following methods:
Capabilities.hasIME
IME.enabled <= Can set or return value.
Try tracing hasIME and see if it's installed. Again, we're shotgunning here – trying to track down any possibility of a problem.
When all else fails, go to the source:
http://livedocs.adobe.com/flex/3/html/help.html?content=18_Client_System_Environment_6.html

Resources