Why are non-printable ASCII characters actually printable? - ascii

Characters that are not alphanumeric or punctuation are termed non-printable:
Codes 20hex to 7Ehex, known as the printable characters
So why is e.g. 005 representable (and represented by clubs)?

Most of the original set of ASCII control characters are no longer useful, so many different vendors have recycled them as additional graphic characters, often dingbats as in your table. However, all such assignments are nonstandard, and usually incompatible with each other. If you can, it's better to use the official Unicode codepoints for these characters. (Similar things have been done with the additional block of control characters in the high half of the ISO 8859.x standards, which were already obsolete at the time they were specified. Again, use the official Unicode codepoints.)
The tiny print at the bottom of your table appears to say "Copyright 1982 Leading Edge Computer Products, Inc." That company was an early maker of IBM PC clones, and this is presumably their custom ASCII extension. You should only pay attention to the assignments for 000-031 and 127 in this table if you're writing software to convert files produced on those specific computers to a more modern format.

How the "non-printable" characters are rendered depends on the character set in use (by the OS, by the browser, or whatever else); see ISO 8859 or code page 1252, for example.
In DOS, for example, there are odd symbols that were used to draw very old-style window frames (ASCII-art-like).
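To make the printable/non-printable split concrete, here is a minimal Python sketch. It assumes that Python's own `str.isprintable()` classification is a fair stand-in for "printable" in the question, and it looks up the standard Unicode code point for the club symbol that the vendor table shows at 005.

```python
import unicodedata

# Code points 0-127 that Python classifies as printable.
printable = [c for c in range(128) if chr(c).isprintable()]

print(len(printable))                          # how many printable ASCII characters
print(hex(printable[0]), hex(printable[-1]))   # first and last printable code

# The club glyph has its own official Unicode code point, independent of
# whatever a vendor table draws at control code 005.
club = "\u2663"
print(f"U+{ord(club):04X}", unicodedata.name(club))
```

Codes 0 to 31 and 127 do not appear in the list; any glyph shown for them comes from a vendor's font or code page, not from ASCII itself.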

Related

What are Unicode codepoint types for?

I recently read the UTF-8 Everywhere manifesto, a document arguing for handling text with UTF-8 by default. The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
However, some modern languages that use the UTF-8 default have built-in codepoint types, such as rune in Go and char in Rust.
What are these types actually useful for? Are they legacy from times before the meaninglessness of codepoints was broadly understood? Or is that an incomplete perspective?
Text has many different meanings and uses, so the question is difficult to answer.
First, about the term codepoint. We use it because it is easy, it implies a number (code), and it is not easily confused with other terms. Unicode tells us that it does not use the terms codepoint and character consistently, but also that this is not a problem: the context is clear, and the two are often interchangeable (except for the few codepoints that are not characters, such as surrogates, and a few reserved codepoints). Note: Unicode is mostly about characters, and ISO 10646 was mostly about codepoints. So the original ISO standard was a table of numbers (codepoints) and names, while Unicode was about the properties of characters. We may therefore say codepoint where Unicode character would be better, but character is easily confused with the C char type and with font glyphs/graphemes.
Codepoints are a basic unit, so they are useful for most programs, e.g. for storing text in databases, exchanging it with other programs, saving files, sorting, etc. For exactly this reason programming languages use the codepoint as a type. UTF-8 code units may be an alternative, but they are more difficult to navigate (think of UTF-8 as a tape that you must read sequentially, and of codepoint text as a hard disk where you can jump into the middle of a text). The analogy is not 100% appropriate, because you may need some context bytes. If you are receiving user text, your program probably does not need to split it into graphemes, form ligatures, etc. if it will just store the data in a database. The codepoint is really low level and so fast for most operations.
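The difference between codepoints and UTF-8 code units can be seen in a small Python sketch. Python is used only for illustration (its `str` type is a sequence of codepoints); the sample string is an assumption, not something from the question.

```python
# Sample text with two precomposed accented letters.
text = "na\u00efve caf\u00e9"

code_points = [ord(ch) for ch in text]   # one integer per codepoint
utf8_bytes = text.encode("utf-8")        # UTF-8 code units (bytes)

print(len(text))         # length counted in codepoints
print(len(utf8_bytes))   # length counted in UTF-8 code units
print(text[2])           # direct indexing lands on a whole codepoint
print(utf8_bytes[2:4])   # the same character occupies two bytes in UTF-8
```

Indexing the codepoint sequence always yields a whole character code, while an arbitrary byte offset into the UTF-8 data may land in the middle of one.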
The other part of text is displaying it (or speaking it). This part is very complex, because we have many different scripts with very different rules, and then different languages with their own special cases. So we need a series of libraries, e.g. text layout (word separation, etc., like Pango), a shaping engine (to find which glyph to use, combine characters, decide where to put the next character, e.g. HarfBuzz), and a font library that renders the font (Cairo plus FreeType). It is complex, but most programmers do not need special handling: they just read text from a database and send it to the screen, so they use the relevant library (which depends on the operating system) and move on. It is too complex for a language specification (and also a moving target; maybe in 30 years things will be more standardized). Because display is complex and involves many operations anyway, we may use complex structures (an array of arrays of codepoints, i.e. an array of graphemes) without much of a slowdown. Note: fonts have codepoint tables to perform various operations before finding the glyph index. Various APIs use Unicode strings (as a codepoint array, UTF-16, UTF-8, etc.).
Naturally, things get more complex, and require a lot of knowledge of different parts of Unicode, if you are trying to program an editor (WYSIWYG, but also in terminals): you mix both worlds, and you need much more information (e.g. for text selection). But in this case you must create your own structures.
And really, things are complex: do you want to just show the first x characters on your blog? (maybe about assessment), or split at words (some languages are not so linear, so the interpretation may be very wrong)? For now only humans can do a good job for all languages, so there is not yet a need for a supporting type in programming languages.
The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
Where? It merely outlines advantages and disadvantages of code points. Two examples are:
Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character Ω, and must be treated identically.
Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
In other words: code points just index which graphemes Unicode supports.
Sometimes they're meant as single characters: one prominent example would be € (EURO SIGN), having only the code point U+20AC.
Sometimes the same character has multiple code-points as per context: the dollar sign exists as:
﹩ = U+FE69 (SMALL DOLLAR SIGN)
＄ = U+FF04 (FULLWIDTH DOLLAR SIGN)
💲 = U+1F4B2 (HEAVY DOLLAR SIGN)
Storage-wise, when searching for one variant you might want to match all 3 variants instead of relying on the exact code point only.
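One way to match such variants is to fold them before comparing. The sketch below uses Python's `unicodedata.normalize` with the NFKC form; the helper name `fold` is hypothetical. As far as I know, only the small and fullwidth forms carry a compatibility mapping to the plain dollar sign, so the heavy dollar sign would need an explicit rule of your own.

```python
import unicodedata

# Small, fullwidth and heavy dollar sign, written as escapes to avoid ambiguity.
variants = ["\uFE69", "\uFF04", "\U0001F4B2"]

def fold(s: str) -> str:
    """Hypothetical search key: apply compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", s)

for v in variants:
    print(f"U+{ord(v):04X} {unicodedata.name(v)} -> {fold(v)!r}")
```

Whether NFKC folding is appropriate depends on the application, since it also erases distinctions (such as superscripts) that some texts rely on.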
Sometimes multiple code points can be combined to form up a single character:
á = U+00E1 (LATIN SMALL LETTER A WITH ACUTE), also termed "precomposed"
á = combination of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) - in a text editor, trying to delete á (from the right side) will mostly result in deleting the acute accent first. Searching for either variant should find both variants.
Storage-wise, you avoid having to search for both variants by performing Unicode normalization, e.g. NFC, which always favors precomposed code points over sequences of combined code points that form one character.
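A minimal Python sketch of that normalization step, using the standard `unicodedata` module. The variable names are only illustrative.

```python
import unicodedata

precomposed = "\u00E1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# The raw code point sequences differ, so a naive comparison fails.
print(precomposed == decomposed)

# After normalizing both to the same form, they compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)

# NFC also maps U+2126 OHM SIGN to U+03A9 GREEK CAPITAL LETTER OMEGA.
ohm_nfc = unicodedata.normalize("NFC", "\u2126")
print(f"U+{ord(ohm_nfc):04X}")
```

Normalizing once on input (before storing) means later searches only have to normalize the query.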
As for homoglyphs code points clearly distinguish the contextual meaning:
A = U+0041 (LATIN CAPITAL LETTER A)
Α = U+0391 (GREEK CAPITAL LETTER ALPHA)
А = U+0410 (CYRILLIC CAPITAL LETTER A)
Copy the greek or cyrillic character, then search this website for that letter - it will never find the other letters, no matter how similar they look. Likewise the latin letter A won't find the greek or cyrillic one.
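The same experiment can be run in code. This short Python sketch prints the code point and official name of each look-alike, using escapes so that the three letters cannot be confused in the source.

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A: visually near-identical, distinct code points.
lookalikes = ["\u0041", "\u0391", "\u0410"]

for ch in lookalikes:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# A plain string comparison treats them as three different characters.
all_distinct = len(set(lookalikes)) == 3
print(all_distinct)
```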
Writing system wise code points can be used by multiple alphabets: the CJK portion is an attempt to use as few code points as possible while supporting as many languages as possible - Chinese (simplified, traditional, Hong Kong), Japanese, Korean, Vietnamese:
今 = U+4ECA
入 = U+5165
才 = U+624D
There are valid reasons for a programmer to deal with code points. Programming languages that support them may (or may not) support correct encodings (UTF-8 vs. UTF-16 vs. ISO-8859-1) and may (or may not) correctly produce surrogates for UTF-16. Users working with text should not have to be concerned with code points, although knowing about them would help in distinguishing homographs.

What is meaning of assume char set is ASCII?

I was working on the problem below, and in the first line of its solution I read this:
Can anyone help explain "assume char set is ASCII"? I don't want any other solution for this problem; I just want to understand the statement.
Implement an algorithm to determine if a string has all unique characters. What if you cannot use additional data structures?
Thanks in advance for the help.
There is no text but encoded text.
Text is a sequence of "characters", members of a character set. A character set is a one-to-one mapping between a notional character and a non-negative integer, called a codepoint.
An encoding is a mapping between a codepoint and a sequence of bytes.
Examples:
ASCII, 128 codepoints, one encoding
OEM437, 256 codepoints, one encoding
Windows-1252, 251 codepoints, one encoding
ISO-8859-1, 256 codepoints, one encoding
Unicode, 1,114,112 codepoints, many encodings: UTF-8, UTF-16, UTF-32,…
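To see the character set / encoding distinction in practice, here is a small Python sketch that encodes one character (é, chosen arbitrarily) with several of the encodings listed above. The codec names are Python's; `cp437` is its name for OEM437.

```python
ch = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

encodings = ["cp437", "cp1252", "latin-1", "utf-8", "utf-16-le", "utf-32-le"]
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")

# ASCII has no codepoint for this character, so encoding fails.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("encodable in ASCII:", ascii_ok)
```

The same character becomes a different byte sequence under each encoding, which is why the receiver has to be told which one was used.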
When you receive a byte stream or read a file that represents text, you have to know the character set and encoding. Conversely, when you send a byte stream or write a file that represents text, you have to let the receiver know the character set and encoding. Otherwise, you have a failed communication.
Note: Program source code is almost always text files. So, this communication requirement also applies between you, your editor/IDE and your compiler.
Note: Program console input and output are text streams. So, this communication requirement also applies between the program, its libraries and your console (shell). Run locale or chcp to find out what the encoding is.
Many character sets are a superset of ASCII and some encodings map the same characters to the same byte sequences. This causes a lot of confusion, limits learning, promotes usage of poor terminology, and the partial interoperability leads to buggy code. A deliberate approach to specifications and coding eliminates that.
Examples:
Some people say "ASCII" when they mean the common subset of characters between ASCII and the character set they are actually using. In Unicode and elsewhere this is called C0 Controls and Basic Latin.
Some people say "ASCII Code" when they just mean codepoint or the codepoint's encoded bytes (or code units).
The context of your question is unclear but the statement is trying to say that the distinct characters in your data are in the ASCII character set and therefore their number is less than or equal to 128. Due to the similarity between character sets, you can assume that the codepoint range you need to be concerned about is 0 to 127. (Put comments, asserts or exceptions as applicable in your code to make that clear to readers and provide some runtime checking.)
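As an illustration of what the assumption buys you (not as a replacement for the solution you are reading), here is a minimal Python sketch. The function name is hypothetical. Because only codepoints 0 to 127 can occur, a fixed table of 128 flags is enough, and the check on the input makes the assumption explicit at runtime.

```python
def has_unique_ascii_chars(s: str) -> bool:
    """Return True if no character repeats, assuming the char set is ASCII."""
    seen = [False] * 128  # one flag per ASCII codepoint (0-127)
    for ch in s:
        code = ord(ch)
        if code > 127:
            # The stated assumption does not hold for this input.
            raise ValueError(f"non-ASCII character U+{code:04X}")
        if seen[code]:
            return False
        seen[code] = True
    return True


print(has_unique_ascii_chars("abc"))
print(has_unique_ascii_chars("hello"))
```

A consequence of the same assumption is that any string longer than 128 characters must contain a repeat.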
What this means in your programming language depends on the programming language and its libraries. Many modern programming languages use UTF-16 to represent strings and UTF-8 for streams and files. Programs are often built with standard libraries that account for the console's encoding (actual or assumed) when reading or writing from the console.
So, if your data comes from a file, you must read it using the correct encoding. If your data comes from a console, your program's standard libraries will possibly change encodings from the console's encoding to the encoding of the language's or standard library's native character and string datatypes. If your data comes from a source code file, you have to save it in one specific encoding and tell the compiler what that is. (Usually, you would use the default source code encoding assumed by the compiler because that generally doesn't change from system to system or person to person.)
The "additional" data structures bit probably refers to what a language's standard libraries provide, such as list, map or dictionary. Use what you've been taught so far, like maybe just an array. Of course, you can just ask.
Basically, assume that character codes will be within the range 0-127. You won't need to deal with crazy accented characters.
More than likely though, they won't use many, if any codes below 32; since those are mostly non-printables.
Characters such as 'a' 'b' '1' or '#' are encoded into a binary number when stored and used by a computer.
e.g.
'a' = 1100001
'b' = 1100010
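You can reproduce these values yourself. A short Python sketch, assuming 7-bit binary output to match the figures above:

```python
# Map each sample character to its code and its 7-bit binary form.
bits = {ch: format(ord(ch), "07b") for ch in "ab1#"}

for ch, b in bits.items():
    print(repr(ch), ord(ch), b)
```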
There are a number of different standards that you could use for this encoding. ASCII is one of those standards. The other most common standard is called UTF-8.
Not all characters can be encoded by all standards. ASCII has a much more limited set of characters than UTF-8. As such, an encoding also defines the set of characters (the "char set") that it supports.
ASCII assigns each character a 7-bit code, which is normally stored in a single byte. It supports the uppercase letters A-Z, the lowercase letters a-z, the digits 0-9, a small number of familiar symbols, and a number of control characters that were used in early communication protocols.
The full set of characters supported by ASCII can be seen here: https://en.wikipedia.org/wiki/ASCII

Extended ASCII code above 255

I was always under the assumption ASCII codes ranged from 0 to 255. Last night I had to deal with a character that I thought was an underscore but turned out to be Chr(8230). Three little dots resembling an underscore. This was in an AutoHotKey script. Problem solved but it left me with questions.
I found a table with Chr(8230) and more.
http://www.cjboco.com/blog.cfm/post/table-of-ascii-characters-and-symbols-for-coldfusion/
There's a vague reference to these codes and Coldfusion which just added to the Confusion.
Out of curiosity, what are the codes above 255 referred to as and are there more tables like this? I know they are not Extended ASCII (128 to 255) but can't find any reference to them other than the above chart.
A simple name will be enough. I'm a retired tech with limited programming and internet searching abilities and really don't care if a question like this is beneath some here. If it ruffles a few feathers then so be it, the voting system here is absolutely meaningless to me. :)
That sounds like a Unicode character. Unicode is a character set, usually stored in a multi-byte encoding, that accommodates many characters from different languages that the standard ASCII character set could not represent. http://www.fileformat.info/info/unicode/char/2026/index.htm
Here is the Wikipedia entry on Unicode: https://en.wikipedia.org/wiki/Unicode
Here's one of many Unicode tables available online: http://unicode-table.com/en/#control-character
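If you have Python at hand, a quick sketch like the following confirms what Chr(8230) is. Python's `chr` is used here on the assumption that it takes the same decimal code as the AutoHotkey function.

```python
import unicodedata

ch = chr(8230)

print(f"U+{ord(ch):04X}")          # the same number written in hexadecimal
print(unicodedata.name(ch))        # official Unicode character name
print(ch.encode("utf-8").hex(" ")) # bytes when stored as UTF-8
print(ch.encode("cp1252").hex())   # byte in the Windows-1252 code page
```

So 8230 is simply the decimal form of the Unicode code point U+2026; the same character also happens to have a one-byte code in Windows-1252, which is where "extended ASCII" tables list it.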

Typing ALT + 251 and ALT + 0251 at the keyboard produce different character entries

In Windows:
when I press Alt + 251, I get a √ character
when I press Alt + 0251, I get a û character!
A leading zero shouldn't change the value.
Actually, I want to get a check mark (√) from the Chr(251) function in a Client Report Definition (RDLC), but it gives me û!
I think it interprets four numbers as hex not decimal.
Using a leading zero forces Windows to interpret the code in the Windows-1252 set. Without the 0, the code is interpreted using the OEM set.
Alt+251:
You'll get √, because you'll use OEM 437, where 251 is for square root.
I'll get ¹, because I'll use OEM 850, where 251 is for superscript 1.
Alt+0251:
Both of us will get û, because we'll use Windows-1252, where 251 is for u-circumflex.
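The three outcomes can be reproduced by decoding the single byte 251 with each code page. A Python sketch, using Python's codec names for the code pages mentioned above:

```python
byte = bytes([251])

# Decode the same byte value under each code page.
decoded = {cp: byte.decode(cp) for cp in ("cp437", "cp850", "cp1252")}

for cp, ch in decoded.items():
    print(f"{cp:7} -> {ch} (U+{ord(ch):04X})")
```

The byte is identical in all three cases; only the table used to interpret it changes.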
This is historical.
From ASCII to Unicode
At the beginning of DOS/Windows, characters were one byte wide and came from the American alphabet; the conversion was done using the ASCII encoding.
Additional characters were needed as soon as the PC was used outside the US (many languages use accents, for instance). So different codepages were designed and different encoding tables were used for conversion.
But a computer in the US wouldn't use the same codepage as one in Spain. This required the user and the programmer to assume the currently active codepage, and this has been a great period in the history of computing...
At the same time it was determined that a single byte was not going to be enough: more than 256 characters needed to be available at the same time. Different encoding systems were designed by a consortium, and are collectively known as Unicode.
In Unicode encodings, "characters" can be one to four bytes wide, and the number of bytes for one character may vary within the same string.
Other notions have been introduced, such as codepoint and glyph to deal with the complexity of written language.
While Unicode was being adopted as a standard, Windows retained the old one-byte codepages for efficiency, simplicity and retro-compatibility. Windows also added codepages to deal with glyphs found only in Unicode.
Windows has:
A default OEM codepage which is usually 437 in the US -- your case -- or 850 in Europe -- my case --, used with the command line ("DOS"),
the Windows-1252 codepage (aka Latin-1 and ISO 8859-1, but this is a misuse) to ease conversion to/from Unicode. The current tendency is to replace all such extended codepages with Unicode. Java's designers made a drastic decision and use only Unicode to represent strings.
When entering a character with the Alt method, you need to tell Windows which codepage you want to use for its interpretation:
No leading zero: You want the OEM codepage to be used.
Leading zero: You want the Windows codepage to be used.
Note on OEM codepages
OEM codepages are so called because for the first PC/PC-Compatible computers the display of characters was hard-wired, not software-done. The computer had a character generator with a fixed encoding and graphical definitions in a ROM. The BIOS would send a byte and a position (line, position in line) to the generator, and the generator would draw the corresponding glyph at this position. This was named "text-mode" at the time.
A computer sold in the US would have a different character ROM than one sold in Germany. This was really dependent on the manufacturer, and the BIOS was able to read the value of the installed codepage(s).
Later the generation of glyphs became software-based, to deal with unlimited fonts, style, and size. It was possible to define a set of glyphs and its corresponding encoding table at the OS level. This combination could be used on any computer, independently of the installed OEM generator.
Software-generated glyphs started with VGA display adapters, the code required for the drawing of glyphs was part of the VGA driver.
As you noticed, 0251 here is a character code, not an ordinary number.
A zero written to the left of an ordinary number does not change its value, but here we are dealing with character codes, not numbers, so the leading zero matters.

What is a multibyte character set?

Does the term multibyte refer to a charset whose characters can - but don't have to be - wider than 1 byte, (e.g. UTF-8) or does it refer to character sets which are in any case wider than 1 byte (e.g. UTF-16) ? In other words: What is meant if anybody talks about multibyte character sets?
The term is ambiguous, but in my internationalization work, we typically avoided the term "multibyte character sets" to refer to Unicode-based encodings. Generally, we used the term only for legacy encoding schemes that had one or more bytes to define each character (excluding encodings that require only one byte per character).
Shift-jis, jis, euc-jp, euc-kr, along with Chinese encodings are typically included.
Most of the legacy encodings, with some exceptions, require a sort of state machine model (or, more simply, a page swapping model) to process, and moving backwards in a text stream is complicated and error-prone. UTF-8 and UTF-16 do not suffer from this problem, as UTF-8 can be tested with a bitmask and UTF-16 can be tested against a range of surrogate pairs, so moving backward and forward in a non-pathological document can be done safely without major complexity.
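The bitmask and range tests mentioned above can be sketched in a few lines of Python. The helper names are hypothetical, and the UTF-8 helper assumes well-formed input.

```python
def codepoint_start(data: bytes, i: int) -> int:
    """Step back from byte index i to the first byte of its UTF-8 sequence."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


def is_low_surrogate(unit: int) -> bool:
    """True if a UTF-16 code unit is the second half of a surrogate pair."""
    return 0xDC00 <= unit <= 0xDFFF


# 'a' (1 byte), Thai 'ท' (3 bytes), an emoji (4 bytes).
data = "a\u0e17\U0001F600".encode("utf-8")
print(codepoint_start(data, 3))  # index inside the Thai character
print(codepoint_start(data, 7))  # last byte of the emoji
```

No state has to be carried along: each byte (or 16-bit unit) says on its own whether it starts a character, which is what makes moving backward safe.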
A few legacy encodings, for languages like Thai and Vietnamese, have some of the complexity of multibyte character sets but are really just built on combining characters, and aren't generally lumped in with the broad term "multibyte."
What is meant if anybody talks about multibyte character sets?
That, as usual, depends on who is doing the talking!
Logically, it should include UTF-8, Shift-JIS, GB etc.: the variable-length encodings. UTF-16 would often not be considered in this group (even though it kind of is, what with the surrogates; and certainly it's multiple bytes when encoded into bytes via UTF-16LE/UTF-16BE).
But in Microsoftland the term would more typically be used to mean a variable-length default system codepage (for legacy non-Unicode applications, of which there are sadly still plenty). In this usage, UTF-8 and UTF-16LE/UTF-16BE cannot be included because the system codepage on Windows cannot be set to either of these encodings.
Indeed, in some cases “mbcs” is no more than a synonym for the system codepage, otherwise known (even more misleadingly) as “ANSI”. In this case a “multibyte” character set could actually be something as trivial as cp1252 Western European, which only uses one byte per character!
My advice: use “variable-length” when you mean that, and avoid the ambiguous term “multibyte”; when someone else uses it you'll need to ask for clarification, but typically someone with a Windows background will be talking about a legacy East Asian codepage like cp932 (Shift-JIS) and not a UTF.
All character sets where you don't have a 1 byte = 1 character mapping. All Unicode variants, but also Asian character sets, are multibyte.
For more information, I suggest reading this Wikipedia article.
A multibyte character is a character whose encoding requires more than 1 byte. This does not imply, however, that all characters in that particular encoding have the same width (in terms of bytes). E.g. UTF-8 and UTF-16 encoded characters may sometimes use multiple bytes, whereas all UTF-32 encoded characters always use 32 bits.
References:
IBM: Multibyte Characters
Unicode and MultiByte Character Set (archived), Unicode and Multibyte Character Set (MBCS) Support | Microsoft Docs
Unicode Consortium Website
A multibyte character set may consist of both one-byte and two-byte
characters. Thus a multibyte-character string may contain a mixture of
single-byte and double-byte characters.
Ref: Single-Byte and Multibyte Character Sets
UTF-8 is multi-byte, which means that each English (ASCII) character is stored in 1 byte, while non-ASCII characters take 2 to 4 bytes; Chinese and Thai characters typically take 3. When you mix Chinese/Thai with English, as in "ทt", the Thai character "ท" uses 3 bytes while the English character "t" uses only 1. The designers of multi-byte encodings realized that storing an English character in 3 bytes when it fits in 1 would waste storage space.
UTF-16 stores most characters, English or not, in a single 2-byte unit, so it is usually called a wide-character encoding rather than a multi-byte one; strictly speaking it is also variable-length, because characters outside the Basic Multilingual Plane (many emoji, for example) take two 2-byte units, a surrogate pair. It suits Chinese/Thai text, where most characters fit in 2 bytes, but printing to a UTF-8 console requires converting from wide characters to multi-byte format, for example with the function wcstombs().
UTF-32 stores each character in a fixed 4 bytes, but it is rarely used for storage because of the wasted space.
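The byte counts are easy to check. The Python sketch below measures a few sample characters (the samples are my own choice, including the "ท" and "t" from above) in the three encodings; the little-endian codec variants are used so that no byte order mark is counted.

```python
# ASCII letter, accented Latin letter, Thai letter, emoji outside the BMP.
samples = ["t", "\u00e9", "\u0e17", "\U0001F600"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Bytes needed per character in each encoding.
widths = {
    ch: tuple(len(ch.encode(enc)) for enc in encodings)
    for ch in samples
}

for ch, w in widths.items():
    print(f"U+{ord(ch):04X}", dict(zip(encodings, w)))
```

Only UTF-32 has the same width for every sample; UTF-8 ranges from 1 to 4 bytes, and UTF-16 needs 4 bytes for the character outside the Basic Multilingual Plane.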
Typically the former, i.e. UTF-8-like. For more info, see Variable-width encoding.
The former - although the term "variable-length encoding" would be more appropriate.
I generally use it to refer to any character that can have more than one byte per character.
