Are VHDL character substitutions ever used in real life? - vhdl

VHDL allows the following substitutions, presumably because some computers might not support the vertical bar (or pipe symbol) (|) or the hash (or pound sign / number sign) (#):
case A|B can be written as case A!B
16#fff# can be written as 16:fff:
Any computer nowadays supports the vertical bar and the hash symbol, so I figured nobody uses these substitutions... Until somebody requested support for the exclamation mark.
My question: is this a lone case or are other people also using the exclamation mark as substitute for the vertical bar? Anybody using the colon?
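For anyone who has to accept both spellings in a tool, here is a minimal Python sketch of reading a based integer literal written with either `#` or `:`. The function name `parse_based_integer` is made up for illustration, and the sketch assumes plain integer literals only (no fractional part, no exponent). It also rejects mixed delimiters, since both occurrences have to be replaced together.

```python
import re

# base, opening delimiter, digits (underscores allowed), same delimiter again
_BASED = re.compile(r"(\d+)([#:])([0-9A-Fa-f_]+)\2", re.ASCII)

def parse_based_integer(literal):
    """Parse e.g. '16#fff#' or '16:fff:' into an int (hypothetical helper)."""
    match = _BASED.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a based integer literal: {literal!r}")
    base = int(match.group(1))
    if not 2 <= base <= 16:
        raise ValueError(f"base out of range: {base}")
    # int() raises ValueError itself if a digit is not valid in this base
    return int(match.group(3).replace("_", ""), base)
```

With this, `parse_based_integer("16#fff#")` and `parse_based_integer("16:fff:")` give the same value, while `"16#fff:"` is rejected.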

Data point 1: Not me :)
And I've never seen it as far as I recall in any code - nor was I taught it at any point (in fact, this is the first I knew of those substitutions).
I had a quick look in Ashenden's Designer's Guide to VHDL, and the ! alternative is not even mentioned when the | is introduced for case statements.

These are inherited from Ada (in which they have been obsolescent since Ada95). The Ada83 Rationale says "For portability reasons, it is possible to write any program in a 56 character subset of the ISO character set." Here "ISO character set" must be understood as ISO-646, aka ASCII. (Strictly speaking, ISO-646 has provision for replacing some characters in national variants, and ASCII can be understood as the US national variant of ISO-646.)
There is a third replacement: % can be used instead of " as the string delimiter (both occurrences must be replaced).
I seem to remember that EBCDIC uses | or ! for the same code point depending on the variant.


What are valid date-time separators in RFC3339 strings?

I'm quite confused as to what's allowed as the time separator/designator in the RFC3339 standard. By time separator I mean the sequence of characters that draw the line between date and time.
The standard states in section 5.6 different things that are unclear or conflicting. First of all, it says that the production rule for a full datetime is this:
date-time = full-date "T" full-time
Meaning that the delimiter between the date and the time is an uppercase T. Right after comes this:
NOTE: Per [ABNF] and ISO8601, the "T" and "Z" characters in this
syntax may alternatively be lower case "t" or "z" respectively
Meaning the upper case T may be a lower case t. It conflicts with the ABNF, but OK, it still sounds to me within the realm of reasonable. Then the following is stated:
NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of
readability, to specify a full-date and full-time separated by
(say) a space character.
Which is very confusing. Does this allow not only a space character but anything, which is what the "(say)" implies? Or does "this syntax" refer to ISO8601, so that the note unnecessarily describes a detail of that other standard?
In other words, are the following valid RFC3339 strings?
2020-09-07 20:26:03.623359300+02:00
2020-09-07hey johnny20:26:03.623359300+02:00
2020-09-07💩20:26:03.623359300+02:00
Meaning the upper case T may be a lower case t. It conflicts with the ABNF, [...]
It does not. See 2.3 Terminal Values of RFC 2234:
Literal text strings are interpreted as a concatenated set of
printable characters.
NOTE: ABNF strings are case-insensitive and
the character set for these strings is us-ascii.
So it is allowed to use t here.
NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of
readability, to specify a full-date and full-time separated by
(say) a space character.
Which is very confusing. Does this allow not only a space character
but anything?
This "deviation" is used for readability to the user when displayed. So when the value is displayed to the user in some kind, it can be displayed as:
2020-09-07 20:26:03.623359300+02:00
2020-09-07, 20:26:03.623359300+02:00
That way it might be easier for the user to see the clear space between the date and time, so they don't have to look for the T or t character to find the separation. It is indeed a vague sentence, as it basically means the application can do anything.
To answer your question: These listed date formats are not valid according to RFC 3339.
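As a rough illustration of the strict reading, here is a small Python sketch that checks only the overall shape of a date-time string, accepting T or t as the separator (and Z or z for UTC). The name `looks_like_rfc3339` is made up for this example, and the check is deliberately shallow: it does not validate field ranges such as month 13 or hour 25.

```python
import re

# Shape check only; an assumption-laden sketch, not a full RFC 3339 validator.
_RFC3339_SHAPE = re.compile(
    r"\d{4}-\d{2}-\d{2}"          # full-date
    r"[Tt]"                       # the separator; ABNF literals are case-insensitive
    r"\d{2}:\d{2}:\d{2}"          # partial-time
    r"(?:\.\d+)?"                 # optional fractional seconds
    r"(?:[Zz]|[+-]\d{2}:\d{2})",  # time-offset
    re.ASCII,
)

def looks_like_rfc3339(text):
    """Return True if text has the date-time shape with a T/t separator."""
    return _RFC3339_SHAPE.fullmatch(text) is not None
```

Under this check, the three strings from the question are all rejected, while the same timestamp written with T or t between date and time is accepted.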
Short answer: T (or t as discouraged alternative).
After reading on this as much as I could, it turns out the time separator must be a T or t. What made me think this way is, first of all, this thread in the GNU lists where F. Alexander Njemz contacted the authors of RFC3339, Graham Klyne and Chris Newman, asking if T is mandatory and got this response from Mr. Klyne:
In short: "yes"
Per section 5.5, the intent in this draft was to specify a timestamp format using
elements from and compatible with 8601, but eliminating as far as
reasonable any variations that could make timestamp data harder to
process. This includes making the 'T' mandatory in date+time values.
#g
Just for clarity's sake, this is stated in the section 5.5:
Simplicity is achieved by making most fields and punctuation
mandatory.
This clearly clashes with a non-mandatory T and strongly makes me think that "this syntax" in that problematic passage refers to ISO8601 and not RFC3339.
For those who want to read more, here are some links regarding the confusion created by this specific point:
https://lists.gnu.org/archive/html/bug-coreutils/2006-05/msg00014.html
http://validator.w3.org/feed/docs/error/InvalidRFC3339Date.html
https://www.rfc-editor.org/errata/eid5783
Plus of course divergent implementations. For instance, the developers of GNU Date chose to use a space character:
$ date --rfc-3339=seconds
2020-09-14 14:53:51+02:00

What is the meaning of special character sequences like `\027[0K`?

I found this commit from facebook infer, and I have no idea what \027[0K and \027[%iA mean.
What do these special strings mean? And (I think) if there are more strings like this, where can I find the full documentation about this?
Those are escape sequences to tell your terminal what to do.
For example, the sequence of characters represented by \027[0K (where \027 is the decimal ASCII value of the Esc character) tells the terminal to "clear line from cursor to the end."
One helpful document/guide on this subject can be found at https://shiroyasha.svbtle.com/escape-sequences-a-quick-guide-1
The facebook code is copied from another source here, which uses hard-coded formatters imitating termcap (this page gives some background). The original has comments indicating where its information came from.
The formatter uses "%i" for integers. That's a repeat-count for the cursor movement "cursor-up" \033[A
In most languages, \033 (octal) is used for the ASCII escape character. But this source (according to the github analysis) is written in OCaml, and is using the decimal value for the ASCII escape character. According to the OCaml syntax, you could use an octal value like this: \o033
Once you understand the formatting parts (how the escape character is represented, the use of %i to format a number), the rest of this is documented in several places:
The relevant standard is ECMA-48.
The termcap (or analogous terminfo) information is in the terminal database.
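To make the notation point concrete, here is a short Python sketch (not the OCaml from the commit) showing that decimal 27, octal 033 and hex 1b all name the same Esc character, and how the two sequences from the question are assembled. The helper names `erase_to_end_of_line` and `cursor_up` are made up for illustration.

```python
ESC = chr(27)  # decimal 27, the value the OCaml source writes as \027

# Python's "\033" is an octal escape and "\x1b" a hex escape; same character.
assert ESC == "\033" == "\x1b"

def erase_to_end_of_line():
    """Build ESC [ 0 K: clear from the cursor to the end of the line."""
    return ESC + "[0K"

def cursor_up(count):
    """Build ESC [ <count> A: move the cursor up by count lines."""
    return f"{ESC}[{count}A"
```

Printing `repr(cursor_up(3))` shows the raw bytes without making the terminal act on them.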

Why do xterm's docs call ' ' a control character?

I'm writing a parser for ANSI escape codes using xterm's docs as a guideline. Under the list of single character functions, they include:
SP Space.
Now, for most of the single character functions, I understand the purpose: BEL, for example, is going to require some special help from your terminal emulator to process, and TAB is likely to be involved in autocompletion rather than being printed as a normal character.
I can't imagine any situation where SP would need to be treated as anything other than a literal space character, so I'm considering dropping the SP control code from my parser. Would I risk anything by doing so? Is there a use for SP in the console that I'm not aware of?
Space isn't a "control" character. In ASCII, the control characters are codes 0 to 31 (space is 32), and 127 (DEL). The POSIX locale uses the same data, not coincidentally.
They are called control characters, because they allow the host (computer) to control (tell) the terminal to perform functions rather than simply print text:
A space is actually "printing" in this regard because (like all of the other ASCII characters), it advances the carriage position by one column. In the C language of course, a space is treated as non-graphic, which is a different shade of meaning. "Graphic" characters are visible.
In contrast, a TAB requires the terminal to do something special: move the carriage position by an amount that depends on where it happens to be at the moment.
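For a parser, the ASCII definition above boils down to a one-line test. This is a minimal Python sketch with a made-up function name, covering only the 7-bit ASCII range described here (it says nothing about the C1 controls).

```python
def is_ascii_control(ch):
    """True for ASCII codes 0 to 31 and for 127 (DEL); space (32) is not included."""
    code = ord(ch)
    return code < 32 or code == 127
```

So TAB and BEL are classified as controls, while a space falls through to the ordinary printing path.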
"Carriage position" of course refers to printing terminals (such as those on which Unix was originally developed), or typewriters. The "carriage" (noun) is the mechanism which moved left/right to allow the terminal (or typewriter) to print at different positions along the line. "Carriage controls" in turn refer to the control characters which move the carriage left and right (other than as a side-effect of printing individual characters). It's obvious if you have ever used a typewriter...
In XTerm Control Sequences, SP is shown for clarity (to be able to reuse that name in other places, e.g., where a 32 is actually part of a control sequence). That wording was added in patch #25 to support the description of the group of controls S7C1T, S8C1T, and DECSCL — setting ANSI conformance level, none of which fall within ECMA-48.
A quick check shows 8 control sequences containing a space (which happens to be a valid intermediate byte, per ECMA-48, just like semicolon, which is visually distinct and does not require a name in the control sequences descriptions — you might find the PDF clearer than the HTML). None of those sequences are used in the obscure sense referred to in ECMA-48:
ECMA 48 section 6.1.1 is talking about overstriking one character on another to render a mixture of the two. This is very rare in video terminals, but assumed in most printing devices. The closest to this in a terminfo description might be ul (underline character overstrikes), and reviewing the few possibilities, some of those appear to be incorrect. xterm doesn't do that.
ECMA 48 section 8.3.140 in its comment about "character escapement" is referring to proportional fonts or variable-width character pitch (again, very rare in video terminals, but implemented in some printing devices). There are a few terminfo capabilities referring to pitch, and all of those are marked as "printer support". ncurses has one entry (att5310) using the cpi capability.
So: if you are referring to xterm's documentation, it is unlikely that you intend your parser for any other use than for video terminals. But if you intend it to be more general, then reading about printers would be a good way to improve your application.
ECMA 48 sheds some light on this.
tl;dr:
Some terminals may choose to differentiate between erased characters and space characters.
In terminals with variable width fonts, SP can be considered a control character that introduces a configurable amount of horizontal spacing.
Neither is really relevant today, so you're entirely free to treat it as just another character.
ECMA 48 section 6.1.1:
Depending on the implementation, there may or may not be a distinction between a character position in
the erased state and a character position imaging SPACE
ECMA 48 section 8.3.140:
SSW is used to establish for subsequent text the character escapement associated with the character
SPACE. The established escapement remains in effect until the next occurrence of SSW in the data
stream or until it is reset to the default value by a subsequent occurrence of CARRIAGE RETURN/LINE
FEED (CR/LF), CARRIAGE RETURN/FORM FEED (CR/FF), or of NEXT LINE (NEL) in the data
stream, see annex C.

Terminal overwriting same line when too long

In my terminal, when I'm typing over the end of a line, rather than start a new line, my new characters overwrite the beginning of the same line.
I have seen many StackOverflow questions on this topic, but none of them have helped me. Most have something to do with improperly bracketed colors, but as far as I can tell, my PS1 looks fine.
Here it is below, generated using bash -x:
PS1='\[\033[01;32m\]\w \[\033[1;36m\]☔︎ \[\033[00m\] '
Yes, that is in fact an umbrella with rain; I have my Bash prompt update with the weather using a script I wrote.
EDIT:
My BashWeather script actually can put any one of a few weather characters, so it would be great if we could solve for all of these, or come up with some other solution:
☂☃☽☀︎☔︎
If the umbrella with rain is particularly problematic, I can change that to the regular umbrella without issue.
The symbol being printed ☔︎ consists of two Unicode codepoints: U+2614 (UMBRELLA WITH RAIN DROPS) and U+FE0E (VARIATION SELECTOR-15). The second of these is a zero-length qualifier, which is intended to enforce "text style", as opposed to "emoji style", on the preceding symbol. If you're viewing this with a font that can distinguish the two styles, the following might be the emoji version: ☔︉ Otherwise, you can see a table of text and emoji variants in Working Group document N4182 (the umbrella is near the top of page 3).
In theory, U+FE0E should be recognized as a zero-length codepoint, like any other combining character. However, it will not hurt to surround the variant selector in PS1 with the "non-printing" escape sequence \[…\].
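If you want to check what a pasted symbol really contains, a quick way is to list its code points. The following Python sketch uses the standard unicodedata module; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(text):
    """List (code point, general category, name) for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

symbol = "\u2614\ufe0e"  # the umbrella followed by the variation selector
for entry in describe(symbol):
    print(entry)
```

The selector is reported with category Mn (nonspacing mark), which matches the expectation that it should take up no column of its own.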
It's a bit awkward to paste an isolated variant selector directly into a file, so I'd recommend using bash's unicode-escape feature:
WEATHERCHAR=$'\u2614\[\ufe0e\]'
#...
PS1=...${WEATHERCHAR}...
Note that \[ and \] are interpreted before parameter expansion, so WEATHERCHAR as defined above cannot be dynamically inserted into the prompt. An alternative would be to make the dynamically-inserted character just the $'\u2614' umbrella (or whatever), and insert the $'\[\ufe0e\]' in the prompt template along with the terminal color codes, etc.
Of course, it is entirely possible that the variant indicator isn't needed at all. It certainly makes no useful difference on my Ubuntu system, where the terminal font I use (Deja Vu Sans Mono) renders both variants with a box around the umbrella, which is simply distracting, while the fonts used in my browser seem to render the umbrella identically with and without variants. But YMMV.
This almost works for me, so should probably not be considered a complete solution. This is a stripped down prompt that consists of only an umbrella and a space:
PS1='\342\230\[\224\357\270\] '
I use the octal escapes for the UTF-8 encoding of the umbrella character, putting the last three bytes inside \[...\] so that bash doesn't think they take up space on the screen. I initially put the last four bytes in, but at least in my terminal, there is a display error where the umbrella is followed by an extra character (the question-mark-in-a-diamond glyph for missing characters), so the umbrella really does occupy two spaces.
This could be an issue with bash and 5-byte UTF-8 sequences; using a character with a 4-byte UTF-8 encoding poses no problem:
# U+10400 DESERET CAPITAL LETTER LONG I
# (looks like a lowercase delta)
PS1='\360\220\220\200 '
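When working with octal escapes like these, it helps to see which bytes belong to which code point. Here is a small Python sketch that prints the UTF-8 bytes of a string in the same backslash-octal notation; the helper name `utf8_octal_escapes` is made up for illustration.

```python
def utf8_octal_escapes(text):
    """Return the UTF-8 bytes of text written as backslash-octal escapes."""
    return "".join(f"\\{byte:03o}" for byte in text.encode("utf-8"))

print(utf8_octal_escapes("\u2614"))      # the umbrella
print(utf8_octal_escapes("\ufe0e"))      # the variation selector
print(utf8_octal_escapes("\U00010400"))  # the Deseret letter
```

The umbrella and the variation selector each encode to three bytes, so the pair is two separate 3-byte sequences (six bytes in total); a single code point never takes more than four bytes in UTF-8 as currently defined.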

Is there any character that is illegal in file paths on every OS?

Is there any character that is guaranteed not to appear in any file path on Windows or Unix/Linux/OS X?
I need this because I want to join together a few file paths into a single string, and then split them apart again later.
In the comments, Harry Johnston writes:
The generic solution to this class of problem is to encode the file paths before joining them. For example, if you're dealing with single-byte strings, you could convert them to hex strings; so "hello" becomes "68656c6c6f". (Obviously that isn't the most efficient solution!)
That is absolutely correct. Please don't try to do anything "tricky" with filenames and reserved characters, because it will eventually break in some weird corner case and your successor will have a heck of a time trying to repair the damage.
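A minimal Python sketch of the hex approach, assuming the paths are ordinary strings and none of them is empty. The function names and the choice of ":" as the separator are made up for illustration; any character that is not a hex digit would do.

```python
import os

def join_paths(paths, sep=":"):
    """Hex-encode each path and join with sep, which can never occur in hex."""
    return sep.join(os.fsencode(path).hex() for path in paths)

def split_paths(joined, sep=":"):
    """Inverse of join_paths (assumes no empty paths were joined)."""
    if not joined:
        return []
    return [os.fsdecode(bytes.fromhex(part)) for part in joined.split(sep)]
```

For example, `join_paths(["hello", "a:b"])` yields a string of hex digits and one ":" separator, and `split_paths` recovers both names even though one of them contains the separator character.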
In fact, if you're trying to be portable, I strongly recommend that you never attempt to create any filenames including any characters other than [a-z0-9_]. (Consider that common filesystems on both Windows and OS X can operate in case-insensitive mode, where FooBar.txt and FOOBAR.TXT are the same identifier.)
A decently compact encoding scheme for practical use would be to make a "whitelisted set" such as [a-z0-9_], and encode any character ch outside your "whitelisted set" (and the underscore itself, since it serves as the escape character) as printf("_%02x", ch). So hello.txt becomes hello_2etxt, and hello_world.txt becomes hello_5fworld_2etxt.
Since every _ is escaped, you can use double-_ as a separator: the encoded string hello_2etxt__goodbye___2e_2e uniquely identifies the list of filenames ['hello.txt', 'goodbye', '..'].
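Here is a minimal Python sketch of this scheme. It works on the UTF-8 bytes of each name so that every escape is exactly two hex digits; that choice, and the function names, are assumptions made for illustration.

```python
# Bytes passed through unchanged; "_" is left out because it is the escape character.
SAFE = frozenset(b"abcdefghijklmnopqrstuvwxyz0123456789")

def encode_name(name):
    """Escape every byte outside SAFE as "_" plus two hex digits."""
    return "".join(
        chr(byte) if byte in SAFE else f"_{byte:02x}"
        for byte in name.encode("utf-8")
    )

def encode_list(names):
    """Join encoded names with a double underscore."""
    return "__".join(encode_name(name) for name in names)

def decode_list(joined):
    """Inverse of encode_list; note that "" decodes to one empty name."""
    names, current, i = [], bytearray(), 0
    while i < len(joined):
        if joined[i] != "_":
            current.append(ord(joined[i]))   # plain whitelisted character
            i += 1
        elif joined[i + 1:i + 2] == "_":
            names.append(current.decode("utf-8"))  # "__" ends the current name
            current = bytearray()
            i += 2
        else:
            current.append(int(joined[i + 1:i + 3], 16))  # "_xx" escape
            i += 3
    names.append(current.decode("utf-8"))
    return names
```

Reading left to right is what makes the triple underscore in the example unambiguous: the first two underscores are the separator, and the third starts the `_2e` escape.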
You can use a newline character, or specifically CR (decimal code 13) or LF (decimal code 10) if you like. Whether this is suitable or not depends on what requirements you have with regard to displaying the concatenated string to the user - with this approach, it will print its parts on separate lines - which may be very good or very bad for the purpose (or you may not care...).
If you need the concatenated string to print on a single line, edit your question to specify this additional requirement; and we can go from there then.
