Relevant RFCs/standards for implementing a Unicode POSIX terminal emulator?

What are the relevant standards, man pages, RFCs or other pieces of documentation when implementing a POSIX-style unicode terminal emulator?
The scope of this question spans everything from handling multi-codepoint Unicode characters and other Unicode pitfalls, through the behaviour of the terminal when resizing and the control sequences themselves, to the RGB values associated with certain color codes.
While articles such as the Wikipedia page on ANSI escape sequences might suffice for using a terminal emulator, writing one that will behave correctly for all applications, which includes correctly handling invalid, unknown or user-defined inputs, requires actual standards documentation.
My best sources so far are ECMA-48, man 3 termios, and the source code of various other terminal emulators.

Obviously, you already added The Unicode Standard to your list of sources. :-)
By a POSIX-style Unicode terminal emulator, do you mean a terminal emulator accepting the whole Unicode character set (or a large subset of it) and running on a POSIX-compliant operating system? Since POSIX has restricted itself to 8-bit chars since 2001, this pretty much means a UTF-8 terminal emulator: a restricted case of such emulators where you do not have to deal with various charsets and encodings (definitely a good thing), but where characters are basically multi-byte, which in turn may call for functions like wcwidth(3) (which is not strictly POSIX, only XPG, by the way). More generally, rendering problems can be arbitrarily complex with Unicode, including BiDi, Indic scripts with reordering vowels, and so on.
If you mean something else, please elaborate.
Otherwise, since an emulator also relies on a keyboard, you may encounter interesting content on Wikipedia.
Another source of documentation with a lot of information you might use is the Microsoft Go Global site.
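To illustrate the wcwidth(3) point above, here is a minimal sketch (my own, not taken from any particular emulator) of computing the display width of a UTF-8 string by decoding it with mbrtowc(3) and summing per-character widths, assuming a UTF-8 locale. A real emulator also has to handle combining marks, double-width CJK cells and invalid sequences more carefully.

#define _XOPEN_SOURCE 700
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Display width of a UTF-8 string: decode with mbrtowc(3), sum wcwidth(3).
   Combining marks report width 0; invalid or truncated input returns -1. */
int display_width(const char *utf8)
{
    mbstate_t st = {0};
    wchar_t wc;
    size_t n;
    int width = 0;

    while ((n = mbrtowc(&wc, utf8, MB_CUR_MAX, &st)) != 0) {
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                     /* invalid or incomplete sequence */
        int w = wcwidth(wc);
        if (w > 0)
            width += w;                    /* zero-width marks add nothing */
        utf8 += n;
    }
    return width;
}

int main(void)
{
    setlocale(LC_CTYPE, "");               /* pick up the user's UTF-8 locale */
    printf("%d\n", display_width("héllo")); /* expected output: 5 */
    return 0;
}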

Related

How do I transmit a unicode character above 0xffff to a Windows (Possibly Win32?) Application?

I'm using Notepad++ as my test case, but I was also able to replicate the exact same behavior on a program I wrote that (indirectly) uses the Win32 API, so I'm trusting at this time that the behavior is specific to Windows, and not to the individual programs.
At any rate, I'm trying to use alt-codes to transmit unicode text to a program. Specifically, I'm using the "Alt Plus" method described in the first point here. For any and all alt-codes below 0xffff, this method works perfectly fine. But the moment I try to transmit an alt-code above this threshold, the input only retains the last four digits I typed.
From notepad++ using UCS-2 Big Endian:
To generate this output, I used the following keystrokes:
[alt↓][numpad+][numpad1][numpad2][numpad3][numpad4][alt↑]
[alt↓][numpad+][numpad1][numpad2][numpad3][numpad5][alt↑]
[alt↓][numpad+][numpad1][numpad2][numpad3][numpad4][numpad5][alt↑]
Knowing that the first two bytes are encoding-specific data, it's clear that the first two characters transmitted were successful and correct, but the third was truncated to only the last 2 bytes. Specifically, it did not try to perform a codepoint→UTF-16 conversion; it simply dropped the higher bits.
I observed identical behavior on the program I wrote.
So my questions then are in two parts:
Is this a hard constraint of Windows (or at least the Win32 API), or is it possible to alter this behavior to split higher codepoints into multiple UTF-16 (or UTF-8) characters?
Is there a better solution for transmitting unicode to the application that doesn't involve writing a manual parser in the software itself?
To expand on that second point: In the program I wrote, I was able to implement the intended functionality by discarding the "Text Handling" code and writing, by hand, a unicode parser in the "Key Handling" code which would, if the [alt] key were held down, save numpad inputs to an internal buffer and convert them to a codepoint when [alt] was released. I can use this as a solution if I must, but I'd like to know if there's a better solution, especially on the OS side of things, rather than a software solution.
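For reference, the heart of such a hand-rolled fallback is the codepoint→UTF-16 step that gets skipped here. A minimal sketch (the helper name is mine, not from the program described above) of turning a buffered code point into one UTF-16 code unit or a surrogate pair:

#include <stddef.h>
#include <stdint.h>

/* Encode a Unicode code point as UTF-16. Returns the number of code units
   written (1 or 2), or 0 if the value is not a valid Unicode scalar value. */
static size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                          /* out of range or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;             /* BMP character: one code unit */
        return 1;
    }
    cp -= 0x10000;                         /* supplementary plane: surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}

For alt-code +12345 this yields the pair 0xD808 0xDF45, which is what the third keystroke sequence above should have produced instead of the truncated 0x2345.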

Is it a bad idea to use Flex/Bison as the parser for the backend of a terminal emulator?

I am implementing a terminal emulator similar to the default terminal emulator for OS X, Terminal.app.
I open the terminal connection with openpty and then use Flex to parse the incoming input (normal text and control sequences) and Bison to call callbacks based on the tokens produced by Flex (insert string, cursor forward sequence, etc.).
Along with normal text tokens, I have implemented around 30 escape sequences without any outstanding problems.
I made Flex/Bison re-entrant because I needed multiple terminal windows to work simultaneously.
I did some workarounds to make Flex/Bison read continuous output, based on another question of mine: How to detect partial unfinished token and join its pieces that are obtained from two consequent portions of input?
So far it looks like Flex/Bison do their job; however, I suspect that sooner or later I will encounter something revealing that they should not be used as a tool to parse terminal input.
The question is: what problems can Flex/Bison cause if they are used instead of a hand-written parser for terminal input? Can performance be a concern?
Abbreviations are sometimes helpful:
DPDA "deterministic push-down automata"
DFA "deterministic finite automata"
What #ejp is saying is that bison is not needed in your solution, because there is only one way that the tokens from the lexical analyzer can be interpreted. The stack which was mentioned is used for saving the state of the machine while looking at the alternative ways that the input can be interpreted.
In considering whether flex (and possibly bison) would be a good approach, I would be more concerned with how you are going to solve the problem of control characters which can be interspersed within control sequences.
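To make that concern concrete, here is a minimal sketch (my own, not a complete VT implementation) of the kind of hand-written state machine usually used instead: C0 control characters may arrive in the middle of a CSI sequence and must be executed immediately without aborting the sequence, which is awkward to express as a Flex/Bison grammar.

enum state { GROUND, ESC, CSI };

static void execute_control(unsigned char c)  { (void)c;     /* handle BEL, BS, LF, CR, ... */ }
static void print_char(unsigned char c)       { (void)c;     /* put a glyph on the grid */ }
static void dispatch_csi(unsigned char final) { (void)final; /* act on the collected parameters */ }

void feed(unsigned char c)
{
    static enum state st = GROUND;

    /* C0 controls (except ESC) are executed immediately in any state,
       and the current CSI state is preserved. */
    if (c < 0x20 && c != 0x1B) {
        execute_control(c);
        return;
    }

    switch (st) {
    case GROUND:
        if (c == 0x1B) st = ESC;
        else           print_char(c);
        break;
    case ESC:
        st = (c == '[') ? CSI : GROUND;          /* other ESC sequences omitted */
        break;
    case CSI:
        if (c >= 0x40 && c <= 0x7E) {            /* final byte ends the sequence */
            dispatch_csi(c);
            st = GROUND;
        }
        /* else: collect parameter/intermediate bytes (omitted) */
        break;
    }
}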
Further reading:
Control Bytes, Characters, and Sequences

Within Windows MBCS, what languages have 2 byte characters and what characters are they?

I have a legacy application that uses Windows' old MBCS. The software is international, and uses code pages to make it work for other languages. I've read that Chinese contains multibyte characters. My question is, which ones, and how do I generate them on a computer in the USA? I need this for testing.
I think the MBCS standard differs between Japan, China, and Korea; it depends on each country's language.
It can be used with the Windows OS sold in each country, for example Windows 7 or XP.
You should change the language option in the Control Panel.
What you should be writing nowadays are Unicode applications, which don't have to worry about MBCS encodings. I mean, sure, there are Unicode characters that use variable-length encodings, like surrogates in UTF-16, but you shouldn't have to do anything special to make these work. If you want to test them with your app, just look up a table of Unicode characters on the web.
In your case, you're actually working with a legacy non-Unicode application. These use the default system codepage. The only multi-byte character sets (MBCS) supported by legacy Windows applications are double-byte character sets (DBCS), in particular Chinese, Japanese, and Korean:
Japanese Shift-JIS (932)
Simplified Chinese GBK (936)
Korean (949)
Traditional Chinese Big5 (950)
Since you are asking this question, I'm assuming that you don't speak any of these languages and don't have your system configured to use any of them. That means you will need to change your system's default codepage to one of these. You might want to do that in a VM. To do so, open the "Region" control panel (how to find it depends on your version of Windows), select the "Administrative" tab, and click "Change system locale." You'll need to reboot after making this change.
I've heard that you can use Microsoft's AppLocale utility to change the codepage for an individual application, but it does have some limitations and compatibility problems. I've never tried it myself. I also don't think it works on newer versions of Windows; the last supported versions are Windows XP/Server 2003. I would recommend sticking with an appropriately-localized VM.
Again, you can find tables of characters supported by these codepages online (see links below), or by using the Character Map utility on a localized installation. As Hans suggested in a comment, an even easier way to do it might be copying and pasting Simplified Chinese text (e.g., for CP 936) from a webpage on the Internet.
As far as the technical implementation goes, a DBCS encodes characters in two bytes. The first (lead) byte signals that it and the byte to follow are to be interpreted as a single character. MBCS-aware functions (with the _mbs prefix in Microsoft's string-manipulation headers) recognize this and process the characters accordingly. The lead bytes are specifically reserved and defined for each codepage. For example, CP 936 (Simplified Chinese) uses 0x81 through 0xFE as lead bytes, while CP 932 (Japanese) uses 0x81 through 0x9F as lead bytes. If you use the string functions designed to deal with MBCS, you shouldn't have a problem. You will only have difficulty if you were careless enough to have fallen back to naïve ASCII-style string manipulation, iterating through bytes and treating them as individual characters.
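As a small illustration of that lead-byte handling (a sketch of mine, not code from the legacy app in question), this walks a CP 936 string and counts characters using the Win32 IsDBCSLeadByteEx function; the _mbs* functions do essentially this for you with the active code page:

#include <windows.h>
#include <stdio.h>

/* Count characters in a CP 936 (Simplified Chinese) byte string by
   recognizing DBCS lead bytes: one lead byte plus one trail byte form
   a single character; everything else is a single-byte character. */
static int count_chars_cp936(const char *s)
{
    int chars = 0;
    while (*s) {
        if (IsDBCSLeadByteEx(936, (BYTE)*s) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        chars++;
    }
    return chars;
}

int main(void)
{
    /* "\xC4\xE3\xBA\xC3" is "ni hao" in CP 936: two double-byte characters,
       followed here by the single-byte character 'A'. */
    printf("%d\n", count_chars_cp936("\xC4\xE3\xBA\xC3A"));   /* prints 3 */
    return 0;
}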
If at all feasible, though, you should strongly consider upgrading the app to support Unicode. Obviously there is no guarantee that it will be easy, but it won't be any harder than fixing a lack of support for MBCS codepages in a legacy non-Unicode application, and as a bonus, the time you spend doing so will pay many more dividends.

What events occur when I enter text into a field? What text encoding is my input in?

I'm using the keyboard to enter multi-lingual text into a field in a form displayed by a Web browser. At an O/S-agnostic and browser-agnostic level, I think the following events take place (please correct me if I'm wrong, because I think I am):
On each keypress, there is an interrupt indicating a key was pressed
The O/S (or the keyboard driver?) determines the keycode and converts that to some sort of keyboard event (character, modifiers, etc).
The O/S' window manager looks for the currently-focused window (the browser) and passes the keyboard event to it
The browser's GUI toolkit looks for the currently-focused element (in this case, the field I'm entering into) and passes the keyboard event to it
The field updates itself to include the new character
When the form is sent, the browser encodes the entered text before sending it to the form target (what encoding?)
Before I go on, is this what actually happens? Have I missed or glossed over anything important?
Next, I'd like to ask: how is the character represented at each of the above steps? At step 1, the keycode could be a device-specific magic number. At step 2, the keyboard driver could convert that to something the O/S understands (for example, the USB HID spec: http://en.wikipedia.org/wiki/USB_human_interface_device_class). What about at subsequent steps? I think the encodings at steps 3 and 4 are OS-dependent and application-dependent (browser), respectively. Can they ever be different, and if yes, how is that problem resolved?
The reason I'm asking is I've run into a problem that is specific to a site that I started to use recently:
Things appear to be working until step 6 above, where the form with the entered text gets submitted, after which the text is mangled beyond recognition. While it's pretty obvious the site isn't handling Unicode input correctly, the incident has led me to question my own understanding of how things work, and now I'm here.
Anatomy of a character from key press to application:
1 - The PC Keyboard:
PC keyboards are not the only type of keyboard, but I'll restrict myself to them.
PC keyboards, surprisingly enough, do not understand characters; they understand keyboard buttons. This allows the same hardware used by a US keyboard to be used for QWERTY or Dvorak, and for English or any other language that uses the US 101/104-key format (some languages have extra keys).
Keyboards use standard scan codes to identify the keys and, to make matters more interesting, keyboards can be configured to use a specific set of codes:
Set 1 - used by the old XT keyboards
Set 2 - the set in current use
Set 3 - used by PS/2 keyboards, which no one uses today
Sets 1 and 2 use make and break codes (i.e. press-down and release codes). Set 3 uses make and break codes only for some keys (like Shift) and only make codes for letters; this allows the keyboard itself to handle key repeat when a key is held down. That is good for offloading key-repeat processing from the PS/2's 8086 or 80286 processor, but rather bad for gaming.
You can read more about all this here, and I also found a Microsoft specification for scan codes in case you want to build and certify your own 104-key Windows keyboard.
In any case, we can assume a PC keyboard using set 2, which means it sends the computer one code when a key is pressed and another when it is released.
By the way, the USB HID spec does not specify the scan codes sent by the keyboard; it only specifies the structures used to send those scan codes.
Now since we're talking about hardware this is true for all operating systems, but how every operating system handles these codes may differ. I'll restrict myself to what happens in Windows, but I assume other operating systems should follow roughly the same path.
2 - The Operating System
I don't know exactly how Windows handles the keyboard, which parts are handled by drivers, which by the kernel and which in user mode; but suffice it to say the keyboard is periodically polled for changes to key state, and the scan codes are translated and converted to WM_KEYDOWN/WM_KEYUP messages, which contain virtual key codes.
To be precise, Windows also generates WM_SYSKEYUP/WM_SYSKEYDOWN messages; you can read more about them here.
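As a small aside (my own example, not part of the original answer), the scan-code-to-virtual-key translation is exposed directly through the Win32 MapVirtualKey function, which is handy for poking at what the active layout does:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Scan code 0x1E is the key labelled 'A' on a US keyboard (set 1 codes,
       which is what the Win32 API works with). Map it to a virtual-key code,
       then to the untranslated character for the active keyboard layout. */
    UINT vk = MapVirtualKeyW(0x1E, MAPVK_VSC_TO_VK);
    UINT ch = MapVirtualKeyW(vk, MAPVK_VK_TO_CHAR);
    printf("scan 0x1E -> VK 0x%02X -> '%c'\n", vk, (char)ch);
    return 0;
}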
3 - The Application
For Windows, that is it: the application gets the raw virtual key codes, and it is up to it to decide whether to use them as-is or translate them to a character code.
Nowadays nobody writes good honest C Windows programs, but once upon a time programmers used to roll their own message-pump handling code, and most message pumps would contain code similar to:
MSG msg;
while (GetMessage(&msg, NULL, 0, 0) != 0)
{
    TranslateMessage(&msg);   /* turns key-down messages into WM_CHAR where appropriate */
    DispatchMessage(&msg);    /* routes the message to the window procedure */
}
TranslateMessage is where the magic happens. The code in TranslateMessage would keep track of the WM_KEYDOWN (and WM_SYSKEYDOWN) messages and generate WM_CHAR messages (and WM_DEADCHAR, WM_SYSCHAR, WM_SYSDEADCHAR.)
WM_CHAR messages contain the UTF-16 (actually UCS-2, but let's not split hairs) code for the character, translated from the WM_KEYDOWN message by taking into account the active keyboard layout at the time.
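One practical wrinkle worth illustrating (a sketch of mine, not code from this answer): with a Unicode window class, a character outside the BMP arrives as two consecutive WM_CHAR messages carrying a surrogate pair, so a handler that wants whole code points has to pair them up:

#include <windows.h>

static LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wp, LPARAM lp)
{
    static WCHAR high = 0;                 /* pending high surrogate, if any */

    if (msg == WM_CHAR) {
        WCHAR cu = (WCHAR)wp;
        if (cu >= 0xD800 && cu <= 0xDBFF) {
            high = cu;                     /* first half: wait for the low surrogate */
        } else if (cu >= 0xDC00 && cu <= 0xDFFF && high != 0) {
            UINT32 cp = 0x10000 + (((UINT32)(high - 0xD800) << 10) | (cu - 0xDC00));
            (void)cp;                      /* placeholder: pass cp to the text-handling code */
            high = 0;
        } else {
            /* cu is a BMP code point; handle it directly */
            high = 0;
        }
        return 0;
    }
    if (msg == WM_DESTROY) {
        PostQuitMessage(0);
        return 0;
    }
    return DefWindowProcW(hwnd, msg, wp, lp);
}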
What about applications written before Unicode? Those applications used the ANSI version of RegisterClassEx (i.e. RegisterClassExA) to register their windows. In this case TranslateMessage generated WM_CHAR messages with an 8-bit character code based on the keyboard layout and the active culture.
4 - 5 - Dispatching and displaying characters.
In modern code using UI libraries it is entirely possible (though unlikely) not to use TranslateMessage and have custom translation of WM_KEYDOWN events. Standard Window controls (widgets) understand and handle WM_CHAR messages dispatched to them, but UI libraries/VMs running under windows can implement their own dispatch mechanism and many do.
Hope this answers your question.
Your description is more or less correct.
However, it is not essential to understanding what is wrong with the site.
The question marks instead of characters indicate that a translation between encodings has taken place, as opposed to a misrepresentation of encodings (which would probably result in gibberish).
The characters used to represent letters can be encoded in different ways. For example, 'a' is 0x61 in ASCII but 0x81 in EBCDIC. This you probably know; what people tend to forget is that ASCII is a 7-bit code containing only English characters. Since PC computers use bytes as their storage unit, early on the unused top 128 places in the ASCII code were used to represent letters of other alphabets. But which one? Cyrillic? Greek? etc.
DOS used code page numbers to specify which symbols were used. Most (all?) of the DOS code pages left the lower 128 symbols unchanged, so English looked like English no matter what code page was used; but try to use a Greek code page to read a Russian text file and you'd end up with gibberish made of Greek letters and symbols.
Later, Windows added its own encodings, some of them variable-length (as opposed to DOS code pages, in which each character was represented by a single-byte code), and then Unicode came along, introducing the concept of code points.
With code points, each character is assigned an abstract number; the code point is just a number, not specifically a 16-bit (or any other fixed-width) number.
Unicode also defines encodings that turn code points into bytes. UCS-2 is a fixed-length encoding that encodes code point numbers as 16-bit numbers. What happens to code points that need more than 16 bits? Simple: they cannot be encoded in UCS-2.
When translating from an encoding that supports a specific code point to one that doesn't, the character is replaced with a substitute character, usually the question mark.
So if I get a transmission in UTF-16 with the Hebrew character aleph 'א' and translate it to, say, Latin-1, which has no such character (or, formally, Latin-1 has no way to represent the Unicode code point U+05D0), I'll get a question mark character '?' instead.
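A minimal sketch of that substitution on Windows (my own example, using the Win32 WideCharToMultiByte API and ISO 8859-1 as the Latin-1 code page):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const wchar_t aleph[] = L"\x05D0";     /* Hebrew aleph, U+05D0 */
    char out[8] = {0};
    BOOL used_default = FALSE;

    /* Convert UTF-16 to code page 28591 (ISO 8859-1, "Latin-1"). The code
       page has no aleph, so the default replacement character is used. */
    WideCharToMultiByte(28591, 0, aleph, -1, out, sizeof out,
                        NULL, &used_default);

    printf("result: '%s', replacement used: %d\n", out, used_default);  /* '?' and 1 */
    return 0;
}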
What is happening on the web site is exactly that: it is getting your input just fine, but the input is being translated into an encoding that does not support all the characters in it.
Unfortunately, unlike encoding misrepresentations, which can be fixed by manually specifying the encoding of the page, there is nothing you can do to fix this on the client.
A related problem is using fonts that do not have the characters shown. In this case you'd see a blank square instead of the character. This problem can be fixed on the client by overriding the site CSS or installing appropriate fonts.

Should I happily stay with UTF-8 or should I be ready to change the encoding?

I've built (or I'm building) an application that supports a wide variety of languages. I'm using UTF-8 right now because as I see it, it supports all languages in this world. (?)
However, the Wikipedia article states that while UTF-8 currently uses only 10% of its potential space, there's a possibility that in the future UTF-8 won't be enough?
Should I write my application (which happens to be a web application) to support other character sets as well? Am I worrying over nothing?
I would definitely stay with UTF-8 for the foreseeable future - it covers just about anything you will typically need, and it's the standard encoding used in most web services and other standards. I don't see any reason for going elsewhere, unless it would not support something that you absolutely need.
I don't see that note in the main Wikipedia article on UTF-8, but in any case, given that whatever encoding you choose, it will eventually be outdated, do not spend undue (read: any) time thinking about this - choose something which does the job now and is standard enough that migrating from it will be easy at such time as you need to. And that's UTF-8.
Unless you have a very specific requirement, like displaying ancient hieroglyphs, you will be fine with UTF-8.
When you encounter a language that you are required to support and that UTF-8 can't help you with, then do the work to switch to something else.
Realistically, this won't happen and you will just save yourself a sack of time.
