How is '\x1A' special in Windows?

How is '\x1A' special in Windows? - windows

Reading some references1, 2, I learned that the modifier b in the second argument in fopen(3) has no effect in POSIX systems, while it prevents special handling for \n and \x1A in Windows (See below).
I well know how \n (LF) is special in Windows as text files use CRLF for line break (i.e. printf("\n") actually prints \r\n), but how is \x1A (SUB) special?
fopen("D:\\foo.txt", "rb");
^

\x1A is Ctrl+Z, which used to be used as the end-of-file marker in MS-DOS (maybe even as far back as CP/M).
The Microsoft documentation makes no mention of Ctrl+Z under the "b" mode (only under the "t" mode), so this could be cargo cult programming. I don't have a Windows box handy right now, so I can't easily check.

Related

An obscure one: Documented VT100 'soft-wrap' escape sequence?

When connected to a remote BASH session via SSH (with the terminal type set to vt100), the console command line will soft-wrap when the cursor hits column 80.
What I am trying to discover is if the <space><carriage return> sequence that gets sent at this point is documented anywhere?
For example sending the following string
std::string str = "0123456789" // 1
"0123456789"
"0123456789" // 3
"0123456789"
"0123456789" // 5
"012345678 9"
"0123456789_" // 7
"0123456789"
"0";
gets the following response back from the host (Linux Mint as it happens)
01234567890123456789012345678901234567890123456789012345678<WS><WS><CR>90123456789_01234567890

The behaviour observed is not really part of bash; rather, it is part of the behaviour of the readline library. It doesn't happen if you simply use echo (which is a bash builtin) to output enough text to force an automatic line wrap, nor does it happen if bash produces an error message which is wider than the console. (Try, for example, the command . with an argument of more then 80 characters not corresponding to any existing file.)
So it's not an official "soft-wrap sequence", nor is it part of any standard. Rather, it's a pragmatic solution to one of the many irritating problems related to console display management.
There is an ambiguity in terminal implementation of line wrapping:
The terminal wraps after a character is inserted at the rightmost position.
The terminal wraps just before the next character is sent.
As a result, it is not possible to reliably send a newline after the last column position. If the terminal had already wrapped (option 1 above), then the newline will create an extra blank line. Otherwise (option 2), the following newline will be "eaten".
These days, almost all terminals follow some variant of option 2, which was the behaviour of the DEC VT-100 terminal. In the vocabulary of the terminfo terminal description database, this is called xenl: the "eat-newline-glitch".
There are actually two possible subvariants of option 2. In the one actually implemented by the VT-100 (and xterm), the cursor ends up in an anomalous state at the end of the line; effectively, it is one character position off the screen, so you can still backspace the cursor in the same line. Other historic terminals "ate" the newline, but positioned the cursor at the beginning of the next line anyway, so that a backspace would not be possible. (Unless the terminal has the bw capability.)
This creates a problem for programs which need to accurately keep track of the cursor position, even for apparently simple applications like echoing input. (Obviously, the easiest way to echo input is to let the terminal do that itself, but that precludes being able to implement extra control characters like tab completion.) Suppose the user has entered text right up to the right margin, and then types the backspace character to delete the last character typed. Normally, you could implement a backspace-delete by outputting a cub1 (move left 1) code and then an el (clear to end of line). (It's more complicated if the deletion is in the middle of a line, but the principle is the same.)
However, if the cursor could possibly be at the beginning of the next line, this won't work. If you knew the cursor was at the beginning of the next, you could move up and then to the right before doing the el, but that wouldn't work if the cursor was still on the same line.
Historically, what was considered "correct" was to force the cursor to the next line with a hard return. (Following quote is taken from the file terminfo.src found in the ncurses distribution. I don't know who wrote it or when):
# Note that the <xenl> glitch in vt100 is not quite the same as on the Concept,
# since the cursor is left in a different position while in the
# weird state (concept at beginning of next line, vt100 at end
# of this line) so all versions of vi before 3.7 don't handle
# <xenl> right on vt100. The correct way to handle <xenl> is when
# you output the char in column 80, immediately output CR LF
# and then assume you are in column 1 of the next line. If <xenl>
# is on, am should be on too.
But there is another way to handle the issue which doesn't require you to even know whether the terminal has the xenl "glitch" or not: output a space character, after which the terminal will definitely have line-wrapped, and then return to the leftmost column.
As it turns out, this trick has another benefit if the terminal emulator is xterm (and probably other such emulators), which allows you to select a "word" by double-clicking on it. If the automatic line wrap happens in the middle of a word, it would be ideal if you could still select the entire word even though it is split over two lines. If you follow the suggestion in the terminfo file above, then xterm will (quite reasonably) treat the split word as two words, because they have an explicit newline between them. But if you let the terminal wrap automatically, xterm treats the result as a single word. (It does this despite the output of the space character, presumably because the space character was overwritten.)
In short, the SPCR sequence is not in any way a standardized feature of the VT100 terminal. Rather, it is a pragmatic response to a specific feature of terminal descriptions combined with the observed behaviour of a specific (and common) terminal emulator. Variants of this code can be found in a variety of codebases, and although as far as I know it is not part of any textbook or formal documentation, it is certainly part of terminal-handling folkcraft [note 2].
In the case of readline, you'll find a comment in the code which is much more telegraphic than this answer: [note 1]
/* If we're at the right edge of a terminal that supports xn, we're
ready to wrap around, so do so. This fixes problems with knowing
the exact cursor position and cut-and-paste with certain terminal
emulators. In this calculation, TEMP is the physical screen
position of the cursor. */
(xn is the short form of xenl.)
Notes
The comment is at line 1326 of display.c in the current view of the git repository as I type this answer. In future versions it may be at a different line number, and the provided link will therefore not work. If you notice that it has changed, please feel free to correct the link.
In the original version of this answer, I described this procedure as "part of terminal handling folklore", in which I used the word "folklore" to describe knowledge passed down from programmer to programmer rather than being part of the canon of academic texts and international standards. While "folklore" is often used with a negative connotation, I use it without such prejudice. "lore" (according to wiktionary) refers to "all the facts and traditions about a particular subject that have been accumulated over time through education or experience", and is derived from an Old Germanic word meaning "teach". Folklore is therefore the accumulated education and experience of the "folk", as opposed to the establishment: in Eric S. Raymond's analogy of the Cathedral and the Bazaar, folklore is the knowledge base of the Bazaar.
This usage raised the eyebrows of at least one highly-skilled practitioner, who suggested the use of the word "esoteric" to describe this bit of information about terminal-handling. "Esoteric" (again according to wiktionary) applies to information "intended for or likely to be understood by only a small number of people with a specialized knowledge or interest, or an enlightened inner circle", being derived from the Greek ἐσωτερικός, "inner circle". (In other words, the knowledge of the Cathedral.)
While the semantic discussion is, at least, amusing, I changed the text by using the hopefully less emotionally-charged word "folkcraft".

There is more than one reason for making line-wrapping a special case (and "folklore" seems an inappropriate term):
The xterm FAQ That description of wrapping is odd, say more? is one of many places discussing vt100 line-wrapping.
vim and screen both take care to not use cursor-addressing to avoid the wrapping, since that would interfere with selecting a wrapped line in xterm. Instead (and the sample seems to show bash doing this too) they send a series of printable characters which step across the margin before sending other control sequences which would prevent the line-wrapping flag from being set in xterm. This is noted in xterm's manual page:
Logical words and lines selected by double- or triple-clicking may wrap
across more than one screen line if lines were wrapped by xterm itself
rather than by the application running in the window.
As for "comments in code" - there certainly are, to explain to maintainers what should not be changed. This from Sven Mascheck's XTerm resource file gives a good explanation:
! Wether this works also with _wrapped_ selections, depends on
! - the terminal emulator: Neither MIT X11R5/6 nor Suns openwin xterm
! know about that. Use the 'xfree xterm' or 'rxvt'. Both compile on
! all major platforms.
! - It only works if xterm is wrapping the line itself
! (not always really obvious for the user, though).
! - Among the different vi's, vim actually supports this with a
! clever and little hackish trick (see screen.c):
!
! But before: vim inspects the _name_ of the value of TERM.
! This must be similar to "xterm" (like "xterm-xfree86", which is
! better than "xterm-color", btw, see his FAQ).
! The terminfo entry _itself_ doesn't matter here
! (e.g.: 'xterm' and 'vs100' are the same entry, but with
! the latter it doesn't work).
!
! If vim has to wrap a word, it appends a space at the first part,
! this space will be wrapped by xterm. Going on with writing, vim
! in turn then positions the cursor again at the _beginning_ of this
! next line. Thus, the space is not visible. But xterm now believes
! that the two lines are actually a single one--as xterm _has_ done
! some wrapping also...
The comment which #rici quotes came from the terminfo file which Eric Raymond incorporated from SCO in 1995. The history section of the terminfo source refers to this. Some of the material in that is based on the BSD termcap sources, but differs, as one would notice when comparing the BSD termcap in this section with ncurses. The four paragraphs beginning with the "not quite" are the same (aside from line-wrapping) with the SCO file. Here is a cut/paste from that file:
# # --------------------------------
#
# dec: DEC (DIGITAL EQUIPMENT CORPORATION)
#
# Manufacturer: DEC (DIGITAL EQUIPTMENT CORP.)
# Class: II
#
# Info:
# Note that xenl glitch in vt100 is not quite the same as concept,
# since the cursor is left in a different position while in the
# weird state (concept at beginning of next line, vt100 at end
# of this line) so all versions of vi before 3.7 don't handle
# xenl right on vt100. The correct way to handle xenl is when
# you output the char in column 80, immediately output CR LF
# and then assume you are in column 1 of the next line. If xenl
# is on, am should be on too.
#
# I assume you have smooth scroll off or are at a slow enough baud
# rate that it doesn't matter (1200? or less). Also this assumes
# that you set auto-nl to "on", if you set it off use vt100-nam
# below.
#
# The padding requirements listed here are guesses. It is strongly
# recommended that xon/xoff be enabled, as this is assumed here.
#
# The vt100 uses rs2 and rf rather than is2/tbc/hts because the
# tab settings are in non-volatile memory and don't need to be
# reset upon login. Also setting the number of columns glitches
# the screen annoyingly. You can type "reset" to get them set.
#
# smkx and rmkx, given below, were removed.
# smkx=\E[?1h\E=, rmkx=\E[?1l\E>,
# Somtimes smkx and rmkx are included. This will put the auxilliary keypad in
# dec application mode, which is not appropriate for SCO applications.
vt100|vt100-am|dec vt100 (w/advanced video),
If you compare the two, the ncurses version has angle brackets added around the terminfo capability names, and a minor grammatical change was made in the first sentence. But the author of the comment clearly was not Raymond.

Terminal overwriting same line when too long

In my terminal, when I'm typing over the end of a line, rather than start a new line, my new characters overwrite the beginning of the same line.
I have seen many StackOverflow questions on this topic, but none of them have helped me. Most have something to do with improperly bracketed colors, but as far as I can tell, my PS1 looks fine.
Here it is below, generated using bash -x:
PS1='\[\033[01;32m\]\w \[\033[1;36m\]☔︎ \[\033[00m\] '
Yes, that is in fact an umbrella with rain; I have my Bash prompt update with the weather using a script I wrote.
EDIT:
My BashWeather script actually can put any one of a few weather characters, so it would be great if we could solve for all of these, or come up with some other solution:
☂☃☽☀︎☔︎
If the umbrella with rain is particularly problematic, I can change that to the regular umbrella without issue.

The symbol being printed ☔︎ consists of two Unicode codepoints: U+2614 (UMBRELLA WITH RAIN DROPS) and U+FE0E (VARIATION SELECTOR-15). The second of these is a zero-length qualifier, which is intended to enforce "text style", as opposed to "emoji style", on the preceding symbol. If you're viewing this with a font can distinguish the two styles, the following might be the emoji version: ☔︉ Otherwise, you can see a table of text and emoji variants in Working Group document N4182 (the umbrella is near the top of page 3).
In theory, U+FE0E should be recognized as a zero-length codepoint, like any other combining character. However, it will not hurt to surround the variant selector in PS1 with the "non-printing" escape sequence \[…\].
It's a bit awkward to paste an isolated variant selector directly into a file, so I'd recommend using bash's unicode-escape feature:
WEATHERCHAR=$'\u2614\[\ufe0e\]'
#...
PS1=...${WEATHERCHAR}...
Note that \[ and \] are interpreted before parameter expansion, so WEATHERCHAR as defined above cannot be dynamically inserted into the prompt. An alternative would be to make the dynamically-inserted character just the $'\u2614' umbrella (or whatever), and insert the $'\[\ufe0e\]' in the prompt template along with the terminal color codes, etc.
Of course, it is entirely possible that the variant indicator isn't needed at all. It certainly makes no useful difference on my Ubuntu system, where the terminal font I use (Deja Vu Sans Mono) renders both variants with a box around the umbrella, which is simply distracting, while the fonts used in my browser seem to render the umbrella identically with and without variants. But YMMV.

This almost works for me, so should probably not be considered a complete solution. This is a stripped down prompt that consists of only an umbrella and a space:
PS1='\342\230\[\224\357\270\] '
I use the octal escapes for the UTF-8 encoding of the umbrella character, putting the last three bytes inside \[...\] so that bash doesn't think they take up space on the screen. I initially put the last four bytes in, but at least in my terminal, there is a display error where the umbrella is followed by an extra character (the question-mark-in-a-diamond glyph for missing characters), so the umbrella really does occupy two spaces.
This could be an issue with bash and 5-byte UTF-8 sequences; using a character with a 4-byte UTF-encoding poses no problem:
# U+10400 DESERET CAPITAL LETTER LONG I
# (looks like a lowercase delta)
PS1='\360\220\220\200 '

What does \n\r mean?

When reading from a pseudo-terminal via java, I'm seeing "\n\r" in the text. What is that representative of? Note its not "\r\n" which I'm familiar with.

\n is a line feed (ASCII code 10), \r is a carriage return (ASCII code 13).
Different operating systems use different combinations of these characters to represent the end of a line of text. Unix-like operating systems (Linux, Mac OS X) usually use only \n. MS-DOS and Windows use \r\n (carriage return, followed by a line feed).
The code you're using uses \n\r (line feed, carriage return). There are operating systems that use that sequence, but probably it's a mistake and it should have been \r\n.
See Newline on Wikipedia.
If you're programming in Java and you want to know what the newline sequence is for the operating system that your program is running on, you can get the system property line.separator:
String newline = System.getProperty("line.separator");

How to escape unicode characters in bash prompt correctly

I have a specific method for my bash prompt, let's say it looks like this:
CHAR="༇ "
my_function="
prompt=\" \[\$CHAR\]\"
echo -e \$prompt"
PS1="\$(${my_function}) \$ "
To explain the above, I'm builidng my bash prompt by executing a function stored in a string, which was a decision made as the result of this question. Let's pretend like it works fine, because it does, except when unicode characters get involved
I am trying to find the proper way to escape a unicode character, because right now it messes with the bash line length. An easy way to test if it's broken is to type a long command, execute it, press CTRL-R and type to find it, and then pressing CTRL-A CTRL-E to jump to the beginning / end of the line. If the text gets garbled then it's not working.
I have tried several things to properly escape the unicode character in the function string, but nothing seems to be working.
Special characters like this work:
COLOR_BLUE=$(tput sgr0 && tput setaf 6)
my_function="
prompt="\\[\$COLOR_BLUE\\] \"
echo -e \$prompt"
Which is the main reason I made the prompt a function string. That escape sequence does NOT mess with the line length, it's just the unicode character.

The \[...\] sequence says to ignore this part of the string completely, which is useful when your prompt contains a zero-length sequence, such as a control sequence which changes the text color or the title bar, say. But in this case, you are printing a character, so the length of it is not zero. Perhaps you could work around this by, say, using a no-op escape sequence to fool Bash into calculating the correct line length, but it sounds like that way lies madness.
The correct solution would be for the line length calculations in Bash to correctly grok UTF-8 (or whichever Unicode encoding it is that you are using). Uhm, have you tried without the \[...\] sequence?
Edit: The following implements the solution I propose in the comments below. The cursor position is saved, then two spaces are printed, outside of \[...\], then the cursor position is restored, and the Unicode character is printed on top of the two spaces. This assumes a fixed font width, with double width for the Unicode character.
PS1='\['"`tput sc`"'\] \['"`tput rc`"'༇ \] \$ '
At least in the OSX Terminal, Bash 3.2.17(1)-release, this passes cursory [sic] testing.
In the interest of transparency and legibility, I have ignored the requirement to have the prompt's functionality inside a function, and the color coding; this just changes the prompt to the character, space, dollar prompt, space. Adapt to suit your somewhat more complex needs.

#tripleee wins it, posting the final solution here because it's a pain to post code in comments:
CHAR="༇"
my_function="
prompt=\" \\[`tput sc`\\] \\[`tput rc`\\]\\[\$CHAR\\] \"
echo -e \$prompt"
PS1="\$(${my_function}) \$ "
The trick as pointed out in #tripleee's link is the use of the commands tput sc and tput rc which save and then restore the cursor position. The code is effectively saving the cursor position, printing two spaces for width, restoring the cursor position to before the spaces, then printing the special character so that the width of the line is from the two spaces, not the character.

(Not the answer to your problem, but some pointers and general experience related to your issue.)
I see the behaviour you describe about cmd-line editing (Ctrl-R, ... Cntrl-A Ctrl-E ...) all the time, even without unicode chars.
At one work-site, I spent the time to figure out the diff between the terminals interpretation of the TERM setting VS the TERM definition used by the OS (well, stty I suppose).
NOW, when I have this problem, I escape out of my current attempt to edit the line, bring the line up again, and then immediately go to the 'vi' mode, which opens the vi editor. (press just the 'v' char, right?). All the ease of use of a full-fledged session of vi; why go with less ;-)?
Looking again at your problem description, when you say
my_function="
prompt=\" \[\$CHAR\]\"
echo -e \$prompt"
That is just a string definition, right? and I'm assuming your simplifying the problem definition by assuming this is the output of your my_function. It seems very likely in the steps of creating the function definition, calling the function AND using the values returned are a lot of opportunities for shell-quoting to not work the way you want it to.
If you edit your question to include the my_function definition, and its complete use (reducing your function to just what is causing the problem), it may be easier for others to help with this too. Finally, do you use set -vx regularly? It can help show how/wnen/what of variable expansions, you may find something there.
Failing all of those, look at Orielly termcap & terminfo. You may need to look at the man page for your local systems stty and related cmds AND you may do well to look for user groups specific to you Linux system (I'm assuming you use a Linux variant).
I hope this helps.

Emacs showing ^M in a process buffer

At the moment, I have a process-buffer which is utf-8-auto (emacs modeline reports the buffer as utf-8-auto-dos) with CRLF style newlines. When I write multi-line text into the buffer via a process-send-region or process-send-string each line is suffixed with ^M.
What makes this problem odd is that text written to the process-buffer directly from the process, does not contain ^M's.
It doesn't seem to make any difference where the source text comes from, in fact, even a multi-line region marked and sent that already appears in the process buffer (that doesn't contain ^M) will have them when sent.
(Note the source text for the process-send-region will always come from a Emacs buffer, process-send-string, when multi-line will be from the Windows clipboard interface to the killring, or again from an Emacs buffer to killring.)
I should also add that the incoming text to the buffer is parsed by a after-change-functions hook (to do some colorisation based on input) so a last resort I'd do an additional regexp-replace-in-string on this incoming text as part of that hook function, I'd like to avoid that because it seems wrong, but I'll add it as a hacky solution if nothing else works.
Addendum
I updated the encoding settings for the buffer and the process to use utf-8-dos instead of utf-8-auto and the ^M's vanished.
So in the buffer setup part of my app, I did...
(switch-to-buffer "sock-buffer")
(set-process-coding-system (get-process sock-process) 'utf-8-dos 'utf-8-dos)
(set-buffer-file-coding-system 'utf-8-dos nil)
(set-buffer-process-coding-system 'utf-8-dos 'utf-8-dos)
Then reduced this to just...
(switch-to-buffer "sock-buffer")
(set-buffer-process-coding-system 'utf-8-dos 'utf-8-dos)
And everything worked fine.

This is because those files are in DOS/Windows line endings. You can use C-x [Enter] f unix [Enter] to convert them to the Unix encoding.
^L is a page break. I've seen them some times to separate different parts of source code (for old-fashioned listings in a text printer), or in text documentation to insert an actual "new page" command.
As of the update, here you can see that you have to select set-process-coding-system to the correct coding system.

Alternately to the dos2unix approach, you could use one of the MULE commands in Emacs, or (my favorite), since these characters are mistakenly treated as part of the text, you can replace them using the command to replace a string in the text: M-% C-q C-M RETURN
M-% is the query-replace command.
C-q means "let me type the next character without interpreting it as the RETURN key".

I believe you see those because of the inconsistencies in your newlines (e.g. windows newlines vs *nux ones), you should probably try dos2unix

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio