Go Unicode combining characters (grapheme clusters) and MS Windows Console cmd.exe - go

In the following code, the ü is not the single Unicode character U+00FC but is a single grapheme cluster composed of two Unicode characters, the plain ASCII u U+0075 followed by the combining diaeresis U+0308.
fmt.Println("Jürgen Džemal")
fmt.Println("Ju\u0308rgen \u01c5emel")
If I run it in the go playground, it works as expected.
If I run it in a MS Windows 10 "Command Prompt" window, it doesn't visually combine the combining character with the prior character.
However when I cut and paste the text into here it appears correctly:
C:\> ver
Microsoft Windows [Version 10.0.17134.228]
C:\> test
Jürgen Džemal
Jürgen Džemel
On screen, in the "Command Prompt" window it looked more like:
Ju¨rgen Džemel
Changing the code page (chcp) from 850 to 65001 made no difference. Changing fonts (Consolas, Courier etc) made no difference.
In the past I have experienced problems that were fundamentally because Microsoft require Windows programs to use a different API to output characters to STDOUT depending on whether STDOUT is attached to a console or to a file. I don't know if this is a different manifestation of the same issue.
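For reference, here is a minimal sketch of how a Go program might check whether a stream is attached to a console or has been redirected. It uses only the standard library; the helper name isCharDevice is made up for illustration, and the assumption that the Windows console is reported as a character device should be verified on the target system.

```go
package main

import (
	"fmt"
	"os"
)

// isCharDevice reports whether f looks like a console/terminal rather
// than a regular file or pipe. Hypothetical helper, not part of any API.
func isCharDevice(f *os.File) bool {
	fi, err := f.Stat()
	if err != nil {
		// A nil or closed file cannot be a console.
		return false
	}
	return fi.Mode()&os.ModeCharDevice != 0
}

func main() {
	if isCharDevice(os.Stdout) {
		fmt.Println("stdout looks like a console")
	} else {
		fmt.Println("stdout looks redirected to a file or pipe")
	}
}
```

Running the program once directly and once with output redirected to a file should show which branch is taken in each case.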
Is there something I can do to make this Unicode grapheme-cluster appear correctly?

As eryksun and Peter commented,
The Windows console (conhost.exe) doesn't support combining codes. You'll have to first normalize to an equivalent string that uses precomposed characters.
you can use golang.org/x/text/unicode/norm to do the normalization (e.g. norm.NFC.String("Jürgen Džemal"))
I tried this
s := "Ju\u0308rgen \u01c5emel"
fmt.Println(s) // dieresis not combined with u by conhost.exe
s = norm.NFC.String(s)
fmt.Println(s) // shows correctly
The output looked roughly like this (the original post showed a screenshot; this is a text approximation for screen readers):
Ju¨rgen Džemel
Jürgen Džemel
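To see why the two lines differ, here is a small standard-library sketch that lists the code points in a string and flags nonspacing combining marks (Unicode category Mn). The function name describe is only illustrative.

```go
package main

import (
	"fmt"
	"unicode"
)

// describe returns one entry per code point in s, marking nonspacing
// combining marks (category Mn). Illustrative helper name.
func describe(s string) []string {
	var out []string
	for _, r := range s {
		kind := "base"
		if unicode.Is(unicode.Mn, r) {
			kind = "combining"
		}
		out = append(out, fmt.Sprintf("U+%04X %s", r, kind))
	}
	return out
}

func main() {
	fmt.Println(describe("u\u0308")) // decomposed: two code points
	fmt.Println(describe("\u00fc"))  // precomposed: one code point
}
```

The decomposed string contains a base letter followed by a combining mark, which is the sequence conhost.exe fails to render as a single glyph.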
Note that Unicode has four different normalised forms but NFC is the most used on the Internet in web-pages and is also appropriate for this situation.
There are other methods in this package that may be more efficient or more useful.
I have read that some visible characters can only be represented in Unicode with combining characters; in other words, no precomposed character exists for them. A more thorough approach would be needed to do something appropriate with those. The complications of Unicode (or, perhaps more accurately, of human languages and their typography) sometimes seem almost without end.
References
https://blog.golang.org/normalization
https://godoc.org/golang.org/x/text/unicode/norm
https://learn.microsoft.com/en-us/windows/desktop/intl/using-unicode-normalization-to-represent-strings
For example, several characters used in writing Lithuanian have double diacritics and have only decomposed forms. An example is lowercase U with macron and tilde ("ū̃", U+016b U+0303, where the first code point is a lowercase U with macron and the second is a combining tilde).
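For characters like this, normalization cannot help, so some fallback is needed. One possible (lossy) fallback, sketched here under the assumption that dropping the mark is acceptable for console display, is to strip any nonspacing marks before printing. The name stripMarks is hypothetical, and this is only one of several reasonable choices.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripMarks removes nonspacing combining marks (category Mn) from s.
// This loses information; it is meant only as a display fallback.
func stripMarks(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Mn, r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// U+016B followed by U+0303 has no precomposed form.
	fmt.Println(stripMarks("\u016b\u0303"))
}
```

It would normally be applied after NFC normalization, so that only marks with no precomposed equivalent are dropped.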

Related

Ruby gets method makes command prompt crash

I am currently making an app that deals with Ancient Greek (Unicode extended) characters. To put it simply, the user types a word in Greek and the program analyses it morphologically and shows its full declension, i.e. all its possible forms. Pretty simple, but I am new to this. I made the same thing for Latin (a language with no special Unicode characters) a week ago and it works perfectly.
I work with Ruby 3.0.2 and the Command Prompt that comes with its installer. I can write code using Greek Unicode letters (like puts "ἀγαθός") and they are displayed in the Command Prompt without problems. UTF-8 works fine there. I think the default codepage is UTF-8 for the .rb file.
The problem appears when the Command Prompt tries to recognize the letters the user has typed. To test recognition, the program echoes back the letters the user has written (I typed them using the Windows Polytonic Greek keyboard). To my dismay, these letters come back only as squares with question marks inside them. After a while, I couldn't type a single Greek character (not even a basic, non-extended one) because the command prompt crashes when I do.
One workaround I tried was to change the codepage to Windows-1253 before running the program; it supports common Greek characters but not the extended Unicode ones. That made it possible to type common Greek characters, and the command prompt recognizes them well. But of course it still crashes if I dare to type a single alpha with spiritus asper.
But I really would like to use UTF-8 for everything, and I don't know why the program is doing this. Of course Windows Powershell does the same.
I hope I have explained the problem well. Sorry if my language is not appropriate, but I hope you have got the point. Thanks!

Does it make sense to use wchar_t/wmain in a Windows C++ console application?

I have been writing a new command line application in C++. One platform we support is, of course, Windows.
The Windows console, by default, uses the OEM code pages depending on the locale (for example, on my machine it is CP437 / DOS.Western). I think, if it was a Windows Cyrillic version, it would have been CP866, and so on. These OEM code pages contain only 256 characters.
I think what this means is the Windows console translates the input key strokes into characters based on the default code page. (And, depending on the currently selected fonts, if there is a corresponding glyph, it is displayed).
In such a case, does it make sense to use wmain/wchar_t and wide char types in my application?
Is there any advantage of using wide types? Or is there any grave problem if just char * is used?
When wide char types are used, what is the encoding of the command line arguments and environment strings (wchar_t * argv[] and wchar_t * envp[])? Are they converted to UTF-16 by the Windows CRT, or are they untouched?
Thanks for your contributions.
You seem to be assuming that Windows internally works in the specified codepage. That's not true. Windows internally works in Unicode (UTF-16). For legacy software that uses char instead of wchar_t, input and output are translated into the specified codepage.
I think what this means is the Windows console translates the input key strokes into characters based on the default code page
This is not correct. The mapping of key strokes to (Unicode) characters is defined by the keyboard layout. This is totally independent of the code page. E.g you could use a Chinese keyboard layout on a system using a Cyrillic code page.
Not only does it make sense to use wchar_t, it is the recommended way.
Yes, there is an advantage: your program can process all characters supported by Windows. If you use char, you can't handle any characters that are not in the current code page.
They are not converted - they stay what they are, namely UTF-16 characters.
Unfortunately, the command prompt itself is an 'ANSI' application, so it suffers from all of the limitations of 'ANSI', and this affects your application if you use it from the command prompt. However, a console application can be used in other ways, without a command prompt window, and then it can support Unicode fully.
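To illustrate what "Windows internally works in UTF-16" means for string data, here is a small sketch (in Go rather than C++, using only the standard library) that converts a UTF-8 string to UTF-16 code units. The helper name toUTF16 is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// toUTF16 converts a Go (UTF-8) string into UTF-16 code units, the
// form that wchar_t-based Windows APIs work with.
func toUTF16(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

func main() {
	fmt.Printf("%04X\n", toUTF16("\u00fc"))     // fits in one code unit
	fmt.Printf("%04X\n", toUTF16("\U0001F600")) // needs a surrogate pair
}
```

Characters outside the Basic Multilingual Plane take two code units, so a wchar_t on Windows is not always a whole character.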

Does the Windows console support ANSI?

Does the Windows console support ANSI control characters?
It doesn't support many ANSI control characters by default (which is also mentioned in the Wikipedia article http://en.wikipedia.org/wiki/ANSI_escape_code), but there are ways to make that possible.
Look into the answers to this question: How to load ANSI escape codes or get coloured file listing in WinXP cmd shell?
You might happen upon something useful.
I assume you're referring to ASCII control characters.
The answer is "some". You can read backspace keypresses, for example, and you can pipe-in things like the ASCII "Bell" character.
However, if you mean that the Windows console automatically resolves escaped characters, such as converting "\a" into "Bell", then no, you have to do that yourself.
Note that I am talking about entering keypresses directly into the console and not about batch files; for that, see @ProblemFactory's answer.
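As a quick reference for the control characters discussed here, this sketch names a few ASCII control bytes using Go's escape syntax, where "\a" is Bell (0x07) and "\b" is Backspace (0x08). The helper controlName is hypothetical and covers only a handful of codes.

```go
package main

import "fmt"

// controlName returns a short name for a few common ASCII control
// bytes, or "other" for anything it does not know about.
func controlName(b byte) string {
	switch b {
	case '\a':
		return "BEL"
	case '\b':
		return "BS"
	case '\t':
		return "HT"
	case '\n':
		return "LF"
	case '\r':
		return "CR"
	case 0x1b:
		return "ESC"
	}
	return "other"
}

func main() {
	for _, b := range []byte("\a\b\x1b") {
		fmt.Printf("0x%02X %s\n", b, controlName(b))
	}
}
```

The compiler, not the console, turns the escape sequence in the source into the control byte; the console only ever sees the byte.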

How many valid utf8 characters are there?

I know that this is a little vague, so for context, think of it as "a character you could tweet," or something like that. My question is how many valid unicode characters are there that a browser or a service that supports utf8 could resolve, in such a way that a utf8 browser could copy and paste it around without any issues.
I guess what I don't want is the full character space, because I know a lot of it is reserved for command characters or reserved characters that wouldn't be shown (unless I'm super wrong!).
UTF-8 isn't the important factor, since all of the standard Unicode encodings (UTF-8, UTF-16, UTF-32) encode the same character space, just in different ways.
From your explanation I see you don't just want the 1,112,064 valid Unicode code points?
Unicode 6.0 and ISO/IEC 10646:2010 define 109,449 characters, but a handful of those are what you're calling "control characters". Which ones do or don't fall into that category depends on how you're counting. Copying and pasting may result in some characters being treated as identical to one another, or ignored altogether, depending on the OS and the programs doing the copying and pasting.
However because Unicode is forward compatible, some systems will correctly preserve characters which haven't yet been assigned. After all, just because you're running Windows XP and you copy and paste a document with characters that weren't standardised until 2009 doesn't mean you expect them to vanish. There could be a million or so extra possible characters by this way of thinking, although their visual appearance may be indistinguishable in some places.
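The 1,112,064 figure mentioned above is the full code space (0x110000 values) minus the 2,048 surrogate code points. A short sketch that checks this by brute force with the standard library (the function name countScalarValues is just for illustration):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countScalarValues counts every code point that can legally be
// encoded as UTF-8: the range 0..0x10FFFF excluding surrogates.
func countScalarValues() int {
	n := 0
	for r := rune(0); r <= utf8.MaxRune; r++ {
		if utf8.ValidRune(r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countScalarValues())
}
```

This counts encodable code points, not assigned characters, so it is an upper bound on what could ever be copied and pasted.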

ASCII in Windows XP and Ubuntu Linux

I've made a program in MSVC++ which outputs memory contents (in ASCII). The ASCII I see in the Windows console seems to match what I see in various ASCII tables (smiley, diamond, club, right arrow etc). This program needs to compile under Linux (which it does), but the ASCII output looks completely different. A few symbols are the same but the rest are very different. Is there any way to change how the terminal displays ASCII codes?
EDIT: The program executes correctly, it's just the ASCII that is being displayed differently.
ASCII defines character codes from 0x00 through 0x7f. Everything else (0x80-0xff) is not part of the ASCII standard and depends on what the operating system defines as the characters to display. However, the characters you mention (smiley, diamond, club, etc) are the representations of the ASCII "control characters" that don't normally have a visual representation. Windows lets you print such characters and see the glyphs it has defined for them, but your Linux is probably interpreting the control characters as formatting control codes (which they are) instead of printing corresponding glyphs.
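If the goal is to get the same symbols on Linux, one option is to map the control bytes to the Unicode glyphs that code page 437 shows for them and print those instead. Below is a rough sketch in Go (not the asker's C++), covering only bytes 0x01 through 0x1F; the names cp437Low and cp437Glyph are made up for this example.

```go
package main

import "fmt"

// cp437Low holds the glyphs code page 437 displays for bytes
// 0x01..0x1F, in order.
var cp437Low = []rune("☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼")

// cp437Glyph returns the CP437 display glyph for a control byte in
// the range 0x01..0x1F, and false for any other byte.
func cp437Glyph(b byte) (rune, bool) {
	if b >= 0x01 && b <= 0x1F {
		return cp437Low[b-1], true
	}
	return 0, false
}

func main() {
	for _, b := range []byte{0x01, 0x03, 0x1A, 0x41} {
		if g, ok := cp437Glyph(b); ok {
			fmt.Printf("0x%02X -> %c\n", b, g)
		} else {
			fmt.Printf("0x%02X -> (not a mapped control byte)\n", b)
		}
	}
}
```

A terminal using UTF-8 and a font with these glyphs should then display the same symbols as the Windows console.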
What you are seeing is the "extended" character set that IBM initially included when PCs were first unleashed upon the world. Yes, we are going back to the age of mighty dinosaurs, so bear with me. These characters live above $7F and the interpretation of their symbols on the screen can even be influenced by the font chosen. Most linux distros are now using UTF-8 (or something close) and as such, the fonts installed may have completely different symbols, or even missing glyphs. In cases where you are comparing "ASCII" representations (which is a misnomer, as it's not really true ASCII) of the same data, it may or may not exactly match, as you must have the same "glyph" renderings in both display fonts to correctly see similar representations. Try getting both your Windows and Linux installs to use the same font if possible, and then see if there is a change.
If your browser supports Unicode (and you have the correct fonts installed), you will see them below.
You can copy and paste them into an editor with Unicode support (such as Notepad) and save the file as UTF-16BE.
If you then open the file in a hex editor, you will see the Unicode code for each visible glyph.
For example, the first ASCII character, NUL, is shown here with the visible glyph U+2639.
In C, C++ or Java you can write it as \u2639.
It is not a null character, only a visual representation of it.
http://en.wikipedia.org/wiki/Code_page_437
☹☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~⌂ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñÑªº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■⓿
