UTF-8 on Windows with Ada

It is my understanding that, by default, Character is Latin_1, Wide_Character is UCS-2, and Wide_Wide_Character is UCS-4, but that GNAT accepts pragma Wide_Character_Encoding(UTF8); or the -gnatW8 switch, and that those characters and their strings will then be UTF-8 encoded instead.
At least on Linux and FreeBSD, the results fit with my expectations. But on Windows the results are odd.
For either the Wide or Wide_Wide variants, once a character moves beyond the ASCII set, I get a garbled mess; I believe this is called mojibake by some. So I figured it was a codepage issue. After all, the default codepage in Windows, and therefore what the Console Host loads with, is 437, which isn't the UTF-8 codepage. After chcp 65001, instead of the mess of extra characters, there's an immediate exception: ADA.IO_EXCEPTIONS.DEVICE_ERROR : a-ztexio.adb:1295. Looking at where the exception occurred, it seems to be in the putc binding to fputc(). But this is Standard_Output; shouldn't an EOF never happen here?
Is there some kind of special consideration Windows needs? How can I get UTF-8 output?
edit:
I tried piping the output into a text file. The supposedly UTF-8 encoded program still generates mojibake in the file. I'm not sure why this would immediately throw an exception in the console, though.
So then I tried directly opening and writing to a file instead of the console/pipe. Oddly this works exactly as it should. The text is completely correct.
I've never seen this kind of behavior with any other language, so it should still be possible to get proper UTF-8 at the console, right?

The deficiency that so many others (not just here) describe in the Windows Console Host has either been fixed or never existed in the first place. Based on this document, I feel it was probably always widely misunderstood. Windows doesn't treat the console like files, and it's easy to fall into that trap.
Using this very straightforward code, along with what Windows needs and expects behind the scenes...
It correctly produces the following, as long as either pragma Wide_Character_Encoding(UTF8); or -gnatW8 is used.
Piping the output of this test program into a file works as it should. Similarly, piping it into another program works as it should. And taking the file produced by the piped output and piping that into another program also works as it should.
Full UTF-8 behavior as one would expect under Linux, on Windows.
What needs to be done is twofold. In the package initializer, the Console Host needs to be told what it's working with, which can be done like this.
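A rough C sketch of the kind of call meant here, assuming the initializer binds SetConsoleCP and SetConsoleOutputCP (CP_UTF8 is code page 65001, the same value chcp 65001 selects); this is an illustration, not the answer's actual Ada binding:

#include <windows.h>

/* Assumed sketch: tell the console host that both input and output
   use the UTF-8 code page. */
static void console_init_utf8(void)
{
    SetConsoleCP(CP_UTF8);        /* input code page  */
    SetConsoleOutputCP(CP_UTF8);  /* output code page */
}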
Character output is then done through fputwc. According to the MS Docs, fputc should never be used for Unicode on Windows, which is part of the problem GNAT has. String output and character/string input are handled similarly.
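And a self-contained C sketch of the output side; it uses the Microsoft CRT's _O_U8TEXT mode rather than the answer's Ada bindings, so treat it as an illustration of the fputwc route, not as the original code:

#include <fcntl.h>    /* _O_U8TEXT */
#include <io.h>       /* _setmode, _fileno */
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Put stdout into the CRT's UTF-8 wide mode. From here on, only the
       wide functions (fputwc, fwprintf) may be used on stdout, which is
       exactly the restriction the MS docs describe for fputc. */
    _setmode(_fileno(stdout), _O_U8TEXT);

    const wchar_t *s = L"Grüße, κόσμε\n";
    for (const wchar_t *p = s; *p != L'\0'; ++p)
        fputwc(*p, stdout);

    return 0;
}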

Based on others' comments and some further research to confirm, I'm pretty sure this is a deficiency of the Windows Console Host.
edit: don't listen to this

Related

Flicker free console updates with virtual terminal sequences

In a C# console application for Windows, I'm using the Windows Console API WriteConsoleOutput (via P/Invoke) to write an entire buffer in a single operation to prevent flickering. This works fine.
Microsoft recommends using virtual terminal sequences to interact with the console. These sequences are great, as they offer much better output, such as colors, etc.
But, as I understand it, WriteConsoleOutput cannot be used with escape sequences (see CHAR_INFO).
My question is,
How can I use virtual terminal sequences to write to the console flicker-free?
I'd like to update different parts of the screen with different characters and colors. Doing this by chaining a lot of Console.Write() and Console.SetCursorPosition will cause a lot of flickering and reduce framerate.
What is the virtual terminal equivalent of writing an entire buffer?
I hate to answer my own question, but I have found a solution after a couple of days' experimenting.
The answer to this question:
What is the virtual terminal equivalent of writing an entire buffer?
There is none.
Not exactly.
I haven't found anything similar to WriteConsoleOutput, which renders a pre-populated buffer of an arbitrary size.
However, virtual terminals have the concept of alternate screen buffers, which can be created by CreateConsoleScreenBuffer and switched with SetConsoleActiveScreenBuffer.
Using that, I came up with a solution for my first question:
How can I use virtual terminal sequences to write to the console flicker-free?
Basically, what I do is I create a new buffer and then use WriteConsole to write VT escape sequences to the back buffer before I switch the buffers.
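A rough C sketch of that approach using the underlying Win32 calls (the real code is C# via P/Invoke, and error handling is kept minimal here):

#include <windows.h>
#include <string.h>

int main(void)
{
    /* Create a second screen buffer, enable VT processing on it, draw a
       whole frame into it with WriteConsole, then flip it to the front. */
    HANDLE back = CreateConsoleScreenBuffer(GENERIC_READ | GENERIC_WRITE,
                                            0, NULL,
                                            CONSOLE_TEXTMODE_BUFFER, NULL);
    if (back == INVALID_HANDLE_VALUE)
        return 1;

    DWORD mode = 0;
    GetConsoleMode(back, &mode);
    SetConsoleMode(back, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING);

    /* One pre-escaped string: home the cursor, pick a 256-color
       foreground, write some text, then reset the attributes. */
    const char *frame =
        "\x1b[H\x1b[38;5;208mhello from the back buffer\x1b[0m";
    DWORD written = 0;
    WriteConsoleA(back, frame, (DWORD)strlen(frame), &written, NULL);

    /* The flip makes the whole frame visible at once, so no flicker. */
    SetConsoleActiveScreenBuffer(back);
    Sleep(2000);
    return 0;
}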
The thing I don't like about this solution is the call to WriteConsole. Either there will be a lot of calls (writing characters/sequences one by one), or there will be a few calls writing long pre-escaped strings.
In order to test flickering, I created a single string of 120 x 30 characters, with each character given a 256-color code. This produced a string of over 41,000 characters, which was used as input to WriteConsole.
This actually seems to work pretty well!
This is the best solution I have found so far. If you find a better one, please write your own answer here!

Strange thing in clear command

Today I was just experimenting with the Linux shell. Here is what I did.
I wrote the output of clear to a file:
clear > clear.txt
Now I have some (non-empty) content on the shell. Then I try to cat the contents of clear.txt:
cat clear.txt
To my surprise, the entire screen got cleared. Can someone explain to me why? If this is true, why can't we do the same for all commands?
clear works very simply by transmitting a sequence of control codes which are interpreted by your terminal. The magic of actually clearing the screen is handled by your terminal (or terminal emulator, or console interface, or whatever you happen to be using) in response to receiving these control codes.
See also e.g. https://en.wikipedia.org/wiki/ANSI_escape_code
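You can send the same kind of control codes yourself; a minimal C sketch (assuming an xterm-compatible terminal, and using the same sequence shown further down):

#include <stdio.h>

int main(void)
{
    /* The same control codes clear emits on xterm-like terminals:
       ESC[3J clears the scrollback, ESC[H homes the cursor, and
       ESC[2J clears the visible screen. The terminal does the actual
       clearing; the program only sends bytes. */
    fputs("\033[3J\033[H\033[2J", stdout);
    fflush(stdout);
    return 0;
}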
Well, it's not a real explanation, but this is how it is supposed to work:
http://man7.org/linux/man-pages/man1/clear.1.html
clear writes to the standard output. You can redirect the standard output to a file (which prevents clear from actually clearing the screen), and later cat the file to the screen, clearing it at that point.
It is a visual representation of the clear command. Opening the text file in an editor yields:
^[[3J^[[H^[[2J
source: https://unix.stackexchange.com/questions/400142/terminal-h2j-caret-square-bracket-h-caret-square-bracket-2-j

Emacs: some programs only work in ansi-term, some programs only work in shell

Relative Emacs newbie here, just trying to adapt my programming workflow to fit with Emacs. So far I've discovered shell-pop, and I'm quite enjoying on-demand terminals that pop up when needed for banging out the odd command.
What I understand so far about Emacs is that shell is a "dumb" terminal that doesn't support any ANSI control codes, which makes it incompatible with things like ncurses that attempt to draw complex UIs on a terminal emulator. This is why you can't use less or top or similar in shell-mode.
However, I seem to be having trouble with ansi-term; it's not the be-all, end-all that it's cracked up to be. Sure, it has no problems running less or git log or even nano, but there are a few things that can't quite seem to display properly when they're running in an ansi-term, such as apt-get and nosetests. I'm not quite sure what the name for it is, but apt-get's output is characterised by live-updating what is displayed on the very last line, and then having unchanging lines of text scroll out above that line. It seems to be halfway between something like less and something dumber, like cat. Somehow ansi-term doesn't like this at all, and I get very garbled output, where it seems to output everything on one line only, or just generally lose its place and output things all over, randomly. In the case of nosetests, it starts off okay, but if any library spews anything to STDERR, the output all goes to hell in a similar way.
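To illustrate the output style I mean, a program doing this looks roughly like the following C sketch (not apt-get's actual code):

#include <stdio.h>
#include <unistd.h>   /* usleep */

int main(void)
{
    /* The last line is rewritten in place with '\r', while finished
       lines would scroll away above it as ordinary newline-terminated
       text. */
    for (int pct = 0; pct <= 100; pct += 10) {
        fprintf(stderr, "\rProgress: %3d%%", pct);
        fflush(stderr);
        usleep(100000);
    }
    fprintf(stderr, "\rProgress: done\n");
    return 0;
}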
With some fiddling it seems possible to fix this by mashing C-l and RET, but it's not always reliable.
Does anybody know what's going on here? Is there some way to fix ansi-term so that it can display everything properly? Or is there perhaps some other mode that I don't know about that is way better? Ideally I'd like something that "just works" as effortlessly as, e.g., GNOME Terminal, which can run all of the above-mentioned programs without a single hiccup.
Thanks!
I resolved this issue by commenting out my entire .emacs.el and then, line by line, uncommenting it and restarting Emacs. I discovered that the following line alone was responsible for the issue:
'(fringe-mode 0 nil (fringe))
(this line disables the fringes from inside custom-set-variables).
I guess this is a bug in Emacs: disabling the fringe causes term-mode to garble its output really badly whenever any output line exceeds $COLUMNS columns.
Anyway, I don't really like the fringes much at all, and it seems I was able to at least disable the left fringe without triggering this issue:
(set-fringe-mode (cons 0 8))
Maybe apt-get does different things based on the $TERM environment variable. What happens if you set TERM=dumb? If that makes things work, then you can experiment with different values until you find one that supports enough features but still works.
Note that git 2.0.1 (June 25th, 2014) now better detects dumb terminals when displaying verbose messages.
That might help Emacs better display some of the messages received from git, but the fringe-mode bug reported above is certainly the main cause.
See commit 38de156 by Michael Naumov (mnaoumov)
sideband.c: do not use ANSI control sequence on non-terminal
Diagnostic messages received on the sideband #2 from the server side are sent to the standard error with ANSI terminal control sequence "\033[K" that erases to the end of line appended at the end of each line.
However, some programs (e.g. GitExtensions for Windows) read and interpret and/or show the message without understanding the terminal control sequences, resulting in the sequences being shown to their end users.
To help these programs, squelch the control sequence when the standard error stream is not being sent to a tty.
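The shape of that fix, as a standalone C sketch (not git's actual sideband.c):

#include <stdio.h>
#include <unistd.h>   /* isatty, fileno */

/* Append the "erase to end of line" sequence only when stderr really is
   a terminal, so programs capturing the stream never see raw escapes. */
static void show_remote_message(const char *msg)
{
    if (isatty(fileno(stderr)))
        fprintf(stderr, "%s\033[K\n", msg);
    else
        fprintf(stderr, "%s\n", msg);
}

int main(void)
{
    show_remote_message("remote: Compressing objects: 100%");
    return 0;
}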

In what situation should I use ASCII to transfer a file over FTP? (I'm not asking the diff between ascii xfer and bin xfer)

I understand the difference between ASCII mode and binary mode when it comes to FTP, but what I don't understand is why there is even a need for ASCII mode at all. Is this just a legacy thing that used to save time by eliminating the most significant bit, thereby increasing the overall transfer speed by 1/8th? Or is there some hidden use for it that I don't know about?
I've encountered many problems because I would forget to switch the mode to bin when transferring text between different OSes. I don't understand why "bin" isn't just the default for everything, especially with today's much faster internet speeds.
Knowwutimean, Vern?
ASCII mode exists so you can get the right answer when you upload a text file to a remote system without having to know what the line termination or character set conventions are for that system. It was more important when transferring text files was more often done via FTP than, say, email.
To address your practical problem: check the documentation for both your FTP client and server(s) to see if there's a way to set ASCII mode by default. Often this is as simple as some kind of "profile" that sends some FTP commands every time you connect.
To address your philosophical problem: FTP is a 40 year old protocol that has its fair share of historical baggage. One day you'll be very glad that some protocol you depend on was standardized long ago and you can still access some old data.
I, for one, vote to eliminate ASCII mode from FTP servers. Any EOL translation can be done by the applications consuming the files, and many apps today understand both EOL types anyway. At a minimum, I'd like to see servers switch to using binary by default, and only use ASCII if requested.
One scenario of practical use of ASCII mode is uploading PHP, Perl, or similar scripts from a Windows development machine to a Unix server. Using binary mode would require a separate conversion of the line-ending sequences, while with ASCII mode the conversion is performed "automatically".
Update: there's one more scenario that we have come across: when transferring data to/from mainframes that use EBCDIC encoding, ASCII mode tells the server to perform the conversion between encodings.
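For illustration only, the line-ending translation ASCII mode performs amounts to rewriting the protocol's CRLF line endings into the receiving system's native convention, roughly like this C sketch (file names are placeholders; real servers do this inside the transfer code, not as a separate pass):

#include <stdio.h>

int main(void)
{
    FILE *in  = fopen("received_ascii_data.txt", "rb");  /* placeholder */
    FILE *out = fopen("stored_native.txt", "wb");         /* placeholder */
    if (in == NULL || out == NULL)
        return 1;

    int c;
    while ((c = fgetc(in)) != EOF) {
        if (c == '\r') {
            int next = fgetc(in);
            if (next == '\n') {
                fputc('\n', out);          /* CRLF -> LF (Unix native) */
                continue;
            }
            fputc('\r', out);              /* lone CR: keep it */
            if (next == EOF)
                break;
            c = next;
        }
        fputc(c, out);
    }

    fclose(in);
    fclose(out);
    return 0;
}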
Here's a practical example of a problem that comes from using a binary FTP connection. In PHP there are two types of comments:
// a single line comment like this
/* a block comment like this */
The block comment has a start and an end, but the single-line comment just ends at the end of the line.
If you upload a PHP file with single-line comments using a binary connection, the PHP will stop running as soon as it hits the single-line comment. It doesn't recognise the end of the line as the end of the comment, so it effectively comments out the rest of your PHP script.
If, however, you use FTP in ASCII mode, it will correctly read the end of the line and will run your PHP code as expected.

Win32 Console -- Backspace to Last Line

I'm writing a command interpreter like BASH, and a \ followed by a newline implies a continuation of the input stream; how can I implement that in Win32?
If I use the console mode with ENABLE_LINE_INPUT, then the user can't press backspace in order to go back to the previous line; Windows prevents him from doing so. But if I don't set ENABLE_LINE_INPUT, then I have to manually reposition the cursor, which is rather tedious given that (1) the user might have redirected the input stream, and that (2) it might be prone to race conditions, and I'd rather have Windows do it if I can.
Any way to have my newline and eat it too?
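For reference, the mode switch in question looks roughly like this in C; it's a sketch of the raw-input branch (ENABLE_LINE_INPUT cleared), not a solution to the backspace problem:

#include <windows.h>

int main(void)
{
    HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
    DWORD mode = 0;

    /* If GetConsoleMode fails, input is redirected (a pipe or a file),
       so there is no console mode to change; read it as a plain stream. */
    if (!GetConsoleMode(in, &mode))
        return 0;

    /* Clear line input (and echo): every key event now reaches the
       program, which must do its own line editing and echoing. */
    SetConsoleMode(in, mode & ~(ENABLE_LINE_INPUT | ENABLE_ECHO_INPUT));

    INPUT_RECORD rec;
    DWORD n;
    while (ReadConsoleInput(in, &rec, 1, &n) && n == 1) {
        if (rec.EventType == KEY_EVENT && rec.Event.KeyEvent.bKeyDown) {
            WCHAR ch = rec.Event.KeyEvent.uChar.UnicodeChar;
            if (ch == L'\r')
                break;   /* caller decides whether a trailing '\' means "continue" */
            /* echo and edit manually here */
        }
    }

    SetConsoleMode(in, mode);   /* restore the original mode */
    return 0;
}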
Edit:
If this would require undocumented CSRSS port requests, then I'm still interested!
Assuming you want this to run in a window, as the command prompt does by default, rather than full screen, you could create a GUI application with a large textbox. Users would type into the textbox, and you could parse whatever was entered and output to the same box (effectively emulating the Win32 console).
This way whatever rules you want to set for how the console behaves is completely up to you.
I might be mistaken in saying this, but I believe that the Win32 console from XP onward works exactly like this, and it just listens for output on stdout; there shouldn't be any reason you can't do the same.
Hope this was helpful.
