What are Unicode codepoint types for? - go

I recently read the UTF-8 Everywhere manifesto, a document arguing for handling text with UTF-8 by default. The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
However, some modern languages that use the UTF-8 default have built-in codepoint types, such as rune in Go and char in Rust.
What are these types actually useful for? Are they legacy from times before the meaninglessness of codepoints was broadly understood? Or is that an incomplete perspective?

Texts have many different meaning and usages, so the question is difficult to answer.
First: about codepoint. We uses the term codepoint because it is easy, it implies a number (code), and not really confuseable with other terms. Unicode tell us that it doesn't use the term codepoint and character in a consistent way, but also that it is not a problem: context is clear, and they are often interchangeable (but for few codepoints which are not characters, like surrogates, and few reserved codepoints). Note: Unicode is mostly about characters, and ISO 10646 was most about codepoints. So original ISO was about a table with numbers (codepoint) and names, and Unicode about properties of characters. So we may use codepoints where Unicode character should be better, but character is easy confuseable with C char, and with font glyphs/graphemes.
Codepoints are one basic unit, so useful for most of programs, e.g. to store in databases, to exchange to other programs, to save files, for sorting, etc. For this exact reasons program languages uses the codepoint as type. UTF-8 code units may be an alternative, but it would be more difficult to navigate (see a UTF-8 as a tape disk where you should read sequentially, and codepoint text as an hard disk where you can just in middle of a text). Not a 100% appropriate, because you may need some context bytes. If you are getting user text, your program probably do not need to split in graphemes, to do liguatures, etc. if it will just store the data in a database. Codepoint is really low level and so fast for most operations.
The other part of text: displaying (or speech). This part is very complex, because we have many different scripts with very different rules, and then different languages with own special cases. So we needs a series of libraries, e.g. text layout (so word separation, etc. like pango), sharper engine (to find which glyph to use, combining characters, where to put next characters, e.g. HarfBuzz), and a font library which display the font (cairo plus freetype). it is complex, but most programmers do not need special handling: just reading text from database and sent to screen, so we just uses the relevant library (and it depends on operating system), and just going on. It is too complex for a language specification (and also a moving target, maybe in 30 years things are more standardized). So it is complex, and with many operation, so we may use complex structures (array of array of codepoint: so array of graphemes): not much a slow down. Note: fonts have codepoint tables to perform various operation before to find the glyph index. Various API uses Unicode strings (as codepoint array, UTF-16, UTF-8, etc.).
Naturally things are more complex, and it requires a lot of knowledge of different part of Unicode, if you are trying to program an editor (WYSIWYG, but also with terminals): you mix both worlds, and you need much more information (e.g. for selection of text). But in this case you must create your own structures.
And really: things are complex: do you want to just show first x characters on your blog? (maybe about assessment), or split at words (some language are not so linear, so the interpretation may be very wrong). For now just humans can do a good job for all languages, so also not yet need to a supporting type in different languages.

The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
Where? It merely outlines advantages and disadvantages of code points. Two examples are:
Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character Ω, and must be treated identically.
Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
In other words: code points just index which graphemes Unicode supports.
Sometimes they're meant as single characters: one prominent example would be € (EURO SIGN), having only the code point U+20AC.
Sometimes the same character has multiple code-points as per context: the dollar sign exists as:
﹩ = U+FE69 (SMALL DOLLAR SIGN)
$ = U+FF04 (FULLWIDTH DOLLAR SIGN)
💲 = U+1F4B2 (HEAVY DOLLAR SIGN)
Storage wise when searching for one variant you might want to match all 3 variants instead on relying on the exact code point only.
Sometimes multiple code points can be combined to form up a single character:
á = U+00E1 (LATIN SMALL LETTER A WITH ACUTE), also termed "precomposed"
á = combination of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) - in a text editor trying to delete á (from the right side) will mostly result in actually deleting the acute accent first. Searching for either variant should find both variants.
Storage wise you avoid to need searching for both variants by performing Unicode normalization, i.e. NFC to always favor precombined code points over two combined code points to form one character.
As for homoglyphs code points clearly distinguish the contextual meaning:
A = U+0041 (LATIN CAPITAL LETTER A)
Α = U+0391 (GREEK CAPITAL LETTER ALPHA)
А = U+0410 (CYRILLIC CAPITAL LETTER A)
Copy the greek or cyrillic character, then search this website for that letter - it will never find the other letters, no matter how similar they look. Likewise the latin letter A won't find the greek or cyrillic one.
Writing system wise code points can be used by multiple alphabets: the CJK portion is an attempt to use as few code points as possible while supporting as many languages as possible - Chinese (simplified, traditional, Hong Kong), Japanese, Korean, Vietnamese:
今 = U+4ECA
入 = U+5165
才 = U+624D
Dealing as a programmer with code points has valid reasons. Programming languages which support these may (or may not) support correct encodings (UTF-8 vs. UTF-16 vs. ISO-8859-1) and may (or may not) correctly produce surrogates for UTF-16. Text wise users should not be concerned about code points, although it would help them distinguishing homographs.

Related

Does it exist some kind of sorting convention?

Does it exist some established convention of sorting lines (characters)? Some convention which should play the similar role as PCRE for regular expressions.
For example, if you try to sort 0A1b-a2_B (each character on its own line) with Sublime Text (Ctrl-F9) and Vim (:%sort), the result will be the same (see below). However, I'm not sure it will be the same with another editors and IDEs.
-
0
1
2
A
B
_
a
b
Generally, characters are sorted based on their numeric value. While this used to only be applied to ASCII characters, this has also been adopted by unicode encodings as well. http://www.asciitable.com/
If no preference is given to the contrary, this is the de facto standard for sorting characters. Save for the actual alphabetical characters, the ordering is somewhat arbitrary.
There are two main ways of sorting character strings:
Lexicographic: numeric value of either the codepoint values or the code unit values or the serialized code unit values (bytes). For some character encodings, they would all be the same. The algorithm is very simple but this method is not human-friendly.
Culture/Locale-specific: an ordinal database for each supported culture is used. For the Unicode character set, it's called the CLDR. Also, in applying sorting for Unicode, sorting can respect grapheme clusters. A grapheme cluster is a base codepoint followed by a sequence of zero or more non-spacing (applied as extensions of the previous glyph) marks.
For some older character sets with one encoding, designed for only one or two scripts, the two methods might amount to the same thing.
Sometimes, people read a format into strings, such as a sequence of letters followed by a sequence of digits, or one of several date formats. These are very specialized sorts that need to be applied where users expect. Note: The ISO 8601 date format for the Julian calendar sorts correctly regardless of method (for all? character encodings).

Better algorithm for shortening English words

I have some unique codes that are generated from strings (ex: website host names) in various independent components of my application.
These codes are meant to be used by machines only so i would like to keep them as short as possible.
The below algorithm would be applied to every word in the string. The output words would be concatenated with a dash to generate the unique code.
The current algorithm I have used:
- Skip word if length is less than 6
- Leave first character as is
- Remove every wowel in the word from the second character onwards
architectural digest eu => archtctrl-dgst-eu
arizona foothills magazine => arzn-fthlls-mgzn
Is there a better way to shorten an English word leaving it as recognisable as possible to a human reader?
The output should be deterministic and produce the same shortened version whenever it is run on the same input.
A good algorithm should also minimise the number of clashes for similarly spelt words.
I have some unique codes that are generated from strings
I am afraid that is not true. There are many English words that will reduce to the same 'code word' when stripped of their vowels. For example, 'leaving' -> 'living' Given, this is fairly rare, it could still cause issues.
How important is it that these 'code words' remain human-readable if as you say, they are meant to be used by machines only? If its not that important, I'd suggest looking into some simpler compression algorithms like Huffman Coding or LZW Compression. Then if the user needs to see the translation of the code word, just uncompress it.
If you must keep it human-readable, I'm not sure that there is much more you can do to shorten it. You could take a look at specific latin + greek roots, and determine if you can shorten those any more by hand, and then just substitute those out automatically.
Alternatively, you could turn to a phonetic approach. Automatically search the pronunciation of the word, and then see if that is any shorter (or itself can be compressed, taking 'cee' to 'C', or 'kay' to 'K'). This would be much more time and CPU intensive, but its still an option if you really, really need short but yet readable codes.
What you're generating sounds like what's called a "slug". There are many libraries to handle this for blogs or site generators that should suit your purposes. Here's a usage example from a Python library called slugify:
txt = "___This is a test ---"
r = slugify(txt)
self.assertEqual(r, "this-is-a-test")
Slug libraries generally work like this:
replacing non-ascii linguistic characters via a mapping (ex: 影師嗎 -> ying-shi-ma)
replace accented latin letters with ascii equivalents via a mapping (ex: C'est déjà l'été. -> c-est-deja-l-ete)
remove beginning and trailing spaces/punctuation
convert remaining spaces and punctuation to dashes, collapsing multiple dashes in a row to a single dash
If you want to make slugs shorter you could remove vowels or, more simply, use a maximum length.

What is meaning of assume char set is ASCII?

I was solving below problem while reading its solution in first line I read this
can anyone help me in explaining assume char set is ASCII **I Don't want any other solution for this problem I just want to understand the statement **
Implement an algorithm to determine if a string has all unique characters. What if you can not use additional data structures
Thanks in advance for the help.
There is no text but encoded text.
Text is a sequence of "characters", members of a character set. A character set is a one-to-one mapping between a notional character and a non-negative integer, called a codepoint.
An encoding is a mapping between a codepoint and a sequence of bytes.
Examples:
ASCII, 128 codepoints, one encoding
OEM437, 256 codepoints, one encoding
Windows-1252, 251 codepoints, one encoding
ISO-8859-1, 256 codepoints, one encoding
Unicode, 1,114,112 codepoints, many encodings: UTF-8, UTF-16, UTF-32,…
When you receive a byte stream or read a file that represents text, you have to know the character set and encoding. Conversely, when you send a byte stream or write a file that represents text, you have let the receiver know the character set and encoding. Otherwise, you have a failed communication.
Note: Program source code is almost always text files. So, this communication requirement also applies between you, your editor/IDE and your compiler.
Note: Program console input and output are text streams. So, this communication requirement also applies between the program, its libraries and your console (shell). Go locale or chcp to find out what the encoding is.
Many character sets are a superset of ASCII and some encodings map the same characters with the same byte sequences. This causes a lot of confusion, limits learning, promotes usage of poor terminology and the partial interoperablity leads to buggy code. A deliberate approach to specifications and coding eliminates that.
Examples:
Some people say "ASCII" when they mean the common subset of characters between ASCII and the character set they are actually using. In Unicode and elsewhere this is called C0 Controls and Basic Latin.
Some people say "ASCII Code" when they just mean codepoint or the codepoint's encoded bytes (or code units).
The context of your question is unclear but the statement is trying to say that the distinct characters in your data are in the ASCII character set and therefore their number is less than or equal to 128. Due to the similarity between character sets, you can assume that the codepoint range you need to be concerned about is 0 to 127. (Put comments, asserts or exceptions as applicable in your code to make that clear to readers and provide some runtime checking.)
What this means in your programming language depends on the programming language and its libraries. Many modern programming languages use UTF-16 to represent strings and UTF-8 for streams and files. Programs are often built with standard libraries that account for the console's encoding (actual or assumed) when reading or writing from the console.
So, if your data comes from a file, you must read it using the correct encoding. If your data comes from a console, your program's standard libraries will possibly change encodings from the console's encoding to the encoding of the language's or standard library's native character and string datatypes. If your data comes from a source code file, you have to save it in one specific encoding and tell the compiler what that is. (Usually, you would use the default source code encoding assumed by the compiler because that generally doesn't change from system to system or person to person.)
The "additional" data structures bit probably refers to what a language's standard libraries provide, such as list, map or dictionary. Use what you've been taught so far, like maybe just an array. Of course, you can just ask.
Basically, assume that character codes will be within the range 0-127. You won't need to deal with crazy accented characters.
More than likely though, they won't use many, if any codes below 32; since those are mostly non-printables.
Characters such as 'a' 'b' '1' or '#' are encoded into a binary number when stored and used by a computer.
e.g.
'a' = 1100001
'b' = 1100010
There are a number of different standards that you could use for this encoding. ASCII is one of those standards. The other most common standard is called UTF-8.
Not all characters can be encoded by all standards. ASCII has a much more limited set of characters than UTF-8. As such an encoding also defines the set of characters "char set" that are supported by that encoding.
ASCII encodes each character into a single byte. It supports the letters A-Z, and lowercase a-z, the digits 0-9, a small number of familiar symbols, and a number of control characters that were used in early communication protocols.
The full set of characters supported by ASCII can be seen here: https://en.wikipedia.org/wiki/ASCII

Glyph to unicode string translation

Given a glyph index for a specific font, I need to get the unicode translation of the glyph. in order to build a glyph-to-unicode translation I'm using GetGlyphIndices for the whole unicode range and from the result I build the reverse translation (glyph to unicode character map). However, this gives me a translation between a single glyph to a single unicode character, and I can see that in Hindi for example, two unicode characters can be represented by one glyph.
For example, in the word namaste (नमस्ते) there are 6 unicode characters which are represented by 5 glyphs (the middle two unicode characters are represented by one glyph). I can see this by attaching to notepad.exe, inserting a breakpoint in ExtTextOut and printing this word from notepad.
Is there any way I can translate a glyph to a unicode string (in case the glyph represents more than one unicode character)?
1) For all but very simple cases, you should use Uniscribe functions (not GetGlyphIndices) for converting a string (sequence of Unicodes) into glyphs. This is noted in the documentation for GetGlyphIndices: http://msdn.microsoft.com/en-us/library/windows/desktop/dd144890(v=vs.85).aspx
2) There is no way to reliably do what you want to do for all cases. Even for most cases. This is the result of something known as complex script shaping, which translates a sequence of input Unicodes into a sequence of output glyphs. This is done using a number of tables in the font data. The two of most interest are the cmap and the GSUB.
The cmap maps Unicode values to font-specific glyphs. The cmap may specify multiple Unicodes mapping to a single glyph (multi-mapping). This is a commonly-used scheme in many fonts. Also, many glyphs in the font may not even be mapped in the cmap. Thus with this alone, you cannot reliably reverse-map a glyph to a single Unicode.
But it gets even more difficult: the GSUB may specify numerous rules and may convert one input glyph to many output glyphs, or a series of input glyphs into one output glyph. It can even specify contexts under which the conversion will occur (for example, it could say something like "convert 'A' to 'B' but only when the 'A' is preceded by a 'C'", so CA -> CB but DA -> DA). In some cases, specifically with Hindi and other Indic languages, the output glyph sequence may even be in a different order than the logical Unicode input sequence. The net result is that the output sequence of glyphs may map back to a single Unicode, or multiple Unicodes, or none at all. It may be possible to decode the rules of the GSUB + the logic of the script-shaping engine to narrow things down a bit (an adventure not suitable for the weak of spirit!), but the problem is still that multiple input Unicodes could end up resolving to the same output glyph.
Bottom line: it's best to view the process of converting a string -> font-specific glyphs as a one-way trip.
For a better understanding of these concepts, I strongly recommend that you read up on complex script shaping as implemented in Windows: http://www.microsoft.com/typography/otspec/TTOCHAP1.htm . As for coding in an application, the Uniscribe reference is also very informative: http://msdn.microsoft.com/en-us/library/windows/desktop/dd374091(v=vs.85).aspx

Why are non-printable ASCII characters actually printable?

Characters, that are not alphanumeric or punctuation are termed not printable:
Codes 20hex to 7Ehex, known as the printable characters
So why is e.g. 005 representable (and represented by clubs)?
Most of the original set of ASCII control characters are no longer useful, so many different vendors have recycled them as additional graphic characters, often dingbats as in your table. However, all such assignments are nonstandard, and usually incompatible with each other. If you can, it's better to use the official Unicode codepoints for these characters. (Similar things have been done with the additional block of control characters in the high half of the ISO 8859.x standards, which were already obsolete at the time they were specified. Again, use the official Unicode codepoints.)
The tiny print at the bottom of your table appears to say "Copyright 1982 Leading Edge Computer Products, Inc." That company was an early maker of IBM PC clones, and this is presumably their custom ASCII extension. You should only pay attention to the assignments for 000-031 and 127 in this table if you're writing software to convert files produced on those specific computers to a more modern format.
The representation of the "not printable" chars depends on the used charset (of the OS, of the Browser, what ever), see ISO 8859, Code Page 1252 for example.
In dos for example you do have funny Signs that were used for very old style window frames (ascii art like).

Resources