I read this on a blog:
Even with rune slices a single character might span multiple runes, which can happen if you have characters with grave accent, for example. This complicated and ambiguous nature of "characters" is the reason why Go strings are represented as byte sequences.
Is it true? (It seems to be a blog from someone who knows Go.) I tested on my machine and "è" is 1 rune and 2 bytes, and the Go documentation seems to say otherwise.
Have you encountered such characters (in UTF-8)? Can a character span multiple runes in Go?
Yes it can:
s := "é́́"
fmt.Println(s, []rune(s))
Output (try it on the Go Playground):
é́́ [101 769 769 769]
One character, 4 runes. It may be arbitrarily long...
Example taken from The Go Blog: Text Normalization in Go.
What is a character?
As was mentioned in the strings blog post, characters can span multiple runes. For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character. The definition of a character may vary depending on the application. For normalization we will define it as a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune, followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents). The normalization algorithm processes one character at a time.
A character can be followed by any number of modifiers (modifiers can be repeated and stacked):
Theoretically, there is no bound to the number of runes that can make up a Unicode character. In fact, there are no restrictions on the number of modifiers that can follow a character and a modifier may be repeated, or stacked. Ever seen an 'e' with three acutes? Here you go: 'é́́'. That is a perfectly valid 4-rune character according to the standard.
Also see: Combining character.
Edit: "Doesn't this kill the 'concept of runes'?"
Answer: It's not a concept of runes. A rune is not a character; a rune is an integer value identifying a Unicode code point. A character may be a single Unicode code point, in which case 1 character is 1 rune. Most general uses of runes fit this case, so in practice this rarely causes headaches. It's a concept of the Unicode standard.
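For a concrete picture of the difference, here is a minimal, runnable Go sketch (standard library only). The "character" count at the end is a naive approximation that treats every non-combining rune as the start of a new character; full grapheme segmentation is defined by the Unicode standard, not by this loop:

package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

func main() {
	// The same character as above: 'e' followed by three combining acute accents.
	s := "e\u0301\u0301\u0301"

	fmt.Println(len(s))                    // 7 bytes: 'e' is 1 byte, each U+0301 is 2 bytes
	fmt.Println(utf8.RuneCountInString(s)) // 4 runes (code points)

	// Naive character count: a rune that is not a combining mark starts a new character.
	chars := 0
	for _, r := range s {
		if !unicode.Is(unicode.Mn, r) {
			chars++
		}
	}
	fmt.Println(chars) // 1 character in the sense used above
}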
Related
Is there some established convention for sorting lines (characters)? Some convention that would play a role similar to the one PCRE plays for regular expressions.
For example, if you try to sort 0A1b-a2_B (each character on its own line) with Sublime Text (Ctrl-F9) and Vim (:%sort), the result will be the same (see below). However, I'm not sure it will be the same with other editors and IDEs.
-
0
1
2
A
B
_
a
b
Generally, characters are sorted based on their numeric value. While this used to apply only to ASCII characters, it has been adopted by Unicode encodings as well. http://www.asciitable.com/
If no preference is given to the contrary, this is the de facto standard for sorting characters. Save for the actual alphabetical characters, the ordering is somewhat arbitrary.
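For illustration, sorting those same characters by raw byte value, here sketched in Go, reproduces exactly the order shown in the question:

package main

import (
	"fmt"
	"sort"
	"strings"
)

func main() {
	lines := strings.Split("0A1b-a2_B", "") // one character per "line"
	sort.Strings(lines)                     // plain byte-value comparison
	fmt.Println(lines)                      // [- 0 1 2 A B _ a b]
}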
There are two main ways of sorting character strings:
Lexicographic: numeric value of either the codepoint values or the code unit values or the serialized code unit values (bytes). For some character encodings, they would all be the same. The algorithm is very simple but this method is not human-friendly.
Culture/Locale-specific: an ordinal database for each supported culture is used. For the Unicode character set, that database is the CLDR. Also, when sorting Unicode, the sort can respect grapheme clusters; a grapheme cluster is a base codepoint followed by a sequence of zero or more non-spacing marks (applied as extensions of the previous glyph). A sketch contrasting the two methods appears below.
For some older character sets with one encoding, designed for only one or two scripts, the two methods might amount to the same thing.
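As a rough sketch of the difference between the two methods, the Go snippet below sorts the same words both ways. It assumes the golang.org/x/text module is available and picks English collation rules as an arbitrary example locale:

package main

import (
	"fmt"
	"sort"

	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
	words := []string{"Zebra", "apple", "Éclair", "banana"}

	// 1. Lexicographic: compares serialized code unit (byte) values, so all
	//    uppercase ASCII letters sort before lowercase ones, and 'É' sorts
	//    after both because its first UTF-8 byte is 0xC3.
	lex := append([]string(nil), words...)
	sort.Strings(lex)
	fmt.Println(lex)

	// 2. Locale-specific: uses the CLDR collation tables for the chosen locale.
	col := append([]string(nil), words...)
	collate.New(language.English).SortStrings(col)
	fmt.Println(col)
}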
Sometimes strings carry a format, such as a sequence of letters followed by a sequence of digits, or one of several date formats. These call for very specialized sorts that need to be applied where users expect them. Note: the ISO 8601 date format (which uses the Gregorian calendar) sorts correctly regardless of method (for all? character encodings).
I am trying to implement LZ77 compression algorithm and encountered this problem.
I am compressing the input (which could be any binary file, not only text) byte by byte, and I use 3 bytes to represent a pointer/reference to a previous substring. The first byte of the pointer is always an escape character, b"\xCC"; to make things easier, let's call it C.
The "standard" way I know of for working with an escape character is to encode all other characters normally and escape any literal that has the same value as the escape character. So 'ABCDE' is encoded as 'ABCCDE'.
The problem is that the value of the pointer could be 'CCx', where the second byte happens to be 'C', which makes the pointer indistinguishable from the escaped literal 'CC', and this causes problems.
How do I fix that? Or what's the correct/standard way to do LZ77? Thanks!
For LZ77 to be useful, it needs to be followed by an entropy encoder. It is in that step that you encode your symbols to bits that go in the compressed data.
One approach is to have 258 symbols defined, 256 for the literal bytes, one that indicates that a length and distance for a match follows, and one that indicates end of stream.
Or you can do what deflate does, which is to encode lengths and literals together, so that a single symbol decodes to either a literal byte or a length, where a length implies that a distance code follows.
Or you can do what brotli does, which is to define "insert and copy" codes that give the number of literals, followed by that many literal codes and then a copy length and distance.
Or you can invent your own.
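As a minimal sketch of the first approach (the 258-symbol alphabet): the token type and the symbol values below are illustrative choices, not part of any standard format. The point is that matches and the end-of-stream marker get their own symbols, so no in-band escape byte like 0xCC is needed:

package main

import "fmt"

// Symbols 0-255 stand for literal bytes; two extra symbols remove the need
// for an in-band escape byte such as 0xCC.
const (
	symMatch = 256 // a (length, distance) pair follows
	symEOS   = 257 // end of stream
)

// token is what the LZ77 stage hands to the entropy coder.
type token struct {
	sym    int // 0-255 for a literal byte, or symMatch / symEOS
	length int // only meaningful when sym == symMatch
	dist   int // only meaningful when sym == symMatch
}

func main() {
	// "abcabc": three literals followed by a match of length 3 at distance 3.
	stream := []token{
		{sym: 'a'}, {sym: 'b'}, {sym: 'c'},
		{sym: symMatch, length: 3, dist: 3},
		{sym: symEOS},
	}
	for _, t := range stream {
		switch {
		case t.sym == symMatch:
			fmt.Printf("match len=%d dist=%d\n", t.length, t.dist)
		case t.sym == symEOS:
			fmt.Println("end of stream")
		default:
			fmt.Printf("literal %q\n", t.sym)
		}
	}
}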
So the documentation of CharLower says that it can also convert single characters, namely:
If the high-order word of this parameter is zero, the low-order word must contain a single character to be converted.
This is confusing me because if the high-order word should be zero'ed out, this would mean that CharLower() can only convert characters in the range of U+0000 to U+FFFF. But what about characters in higher ranges? Would I have to convert those to an LPTSTR first and pass that to CharLower() then or how is this supposed to work?
The full quote from the documentation is as follows:
A null-terminated string, or specifies a single character. If the high-order word of this parameter is zero, the low-order word must contain a single character to be converted.
This parameter is interpreted either as:
a pointer to a null terminated string, or
a single wchar_t value.
The reason that this is possible is that memory addresses < 65536 are reserved and considered invalid pointers. To use the function in this single character mode, you would call it like this:
WCHAR chr = (WCHAR)(ULONG_PTR)CharLowerW((LPWSTR)L'A');
You then ask:
This is confusing me because if the high-order word should be zero'ed out, this would mean that CharLower() can only convert characters in the range of U+0000 to U+FFFF. But what about characters in higher ranges? Would I have to convert those to an LPTSTR first and pass that to CharLower() then or how is this supposed to work?
This is correct. In the single character mode, surrogate pairs are not supported. You would have to pass those as a null-terminated string instead.
It is reasonable to guess that this interface dates back to the days when Windows supported UCS-2, a precursor to UTF-16. UCS-2 was a fixed-length encoding that only supported codepoints <= U+FFFF, so the problem you describe did not arise. UTF-16 added surrogates for codepoints > U+FFFF. This interface design is comprehensive, albeit somewhat clunky.
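For illustration only, here is a Go sketch (Windows-only, no error handling) that exercises the single-character mode by calling CharLowerW directly from user32.dll:

//go:build windows

package main

import (
	"fmt"
	"syscall"
)

func main() {
	user32 := syscall.NewLazyDLL("user32.dll")
	charLowerW := user32.NewProc("CharLowerW")

	// Single-character mode: the high-order word of the argument is zero,
	// so the low-order word is treated as the character itself.
	ret, _, _ := charLowerW.Call(uintptr('A'))
	fmt.Printf("%c\n", rune(uint16(ret))) // prints "a"; the result is in the low-order word

	// Codepoints above U+FFFF (surrogate pairs) cannot use this mode; they
	// must be passed as a pointer to a null-terminated UTF-16 string instead.
}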
Per Wikipedia, in UTF-8, the first byte in a multi-byte sequence is called a leading byte, and the subsequent bytes in the sequence are called continuation bytes.
I understand these might not be the "official" names (in fact, the UTF-8 RFC does not provide any names for the different octet types), but according to Wikipedia and based on my research so far, these seem to be the names in common use.
Is there a special name in common use for a byte that is neither a leading byte nor a continuation byte (i.e., for code points < 128)?
I'm documenting some fairly complex code that is designed to work with UTF-8-encoded strings, and I'd like to make sure to use standard terminology to avoid confusion.
Everywhere I would expect to see a definition, I cannot find a special term for this (beyond the already mentioned ASCII). The only thing I can add is that a one-byte "sequence" is a legitimate sequence and that the one byte is not excluded from being called a leading byte.
References from the Unicode standard:
§3.9 (PDF, pg. 119)
A code unit sequence may consist of a single code unit.
§2.5 (PDF, pg. 37)
A range of 8-bit code unit values is reserved for the first, or leading, element of a UTF-8 code unit sequence, and a completely disjunct range of 8-bit code unit values is reserved for the subsequent, or trailing, elements of such sequences;
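For what it's worth, the classification is easy to express in code. A minimal Go sketch using the terminology discussed above (the function name classify is just for illustration):

package main

import "fmt"

// classify names a UTF-8 code unit using the terminology discussed above.
func classify(b byte) string {
	switch {
	case b < 0x80:
		return "single-byte (ASCII)" // also a legitimate one-byte "sequence"
	case b < 0xC0:
		return "continuation byte" // 10xxxxxx
	default:
		// 110xxxxx, 1110xxxx, 11110xxx
		// (0xC0, 0xC1 and 0xF5-0xFF never occur in well-formed UTF-8)
		return "leading byte"
	}
}

func main() {
	for _, b := range []byte("aé€") { // 1-, 2-, and 3-byte sequences
		fmt.Printf("0x%02X: %s\n", b, classify(b))
	}
}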
Some would refer to the 7-bit (single-byte) range of UTF-8 as ASCII.
Strings in 2.0 no longer conform to CollectionType. Each character in a String is now an extended grapheme cluster.
Without digging too deep about the Cluster stuff, I tried a few things with Swift Strings:
String now has a characters property that contains what we humans recognize as characters. Each distinct character in the string is considered a character, and the count property gives us the number of distinct characters.
What I don't quite understand is, even though the characters count shows 10, why does the index show emojis occupying 2 indexes?
The index of a String is no longer related to the number of characters (count) in Swift 2.0. It is an "opaque" struct (defined as CharacterView.Index) used only to iterate through the characters of a string. So even if it is printed as an integer, it should not be considered or used as an integer; you cannot, for instance, add 2 to it to get the index two characters ahead. What you can do is apply the two methods predecessor and successor to get the previous or next index in the String. So, for instance, to get the second character after the one at index idx in mixedString you can do:
mixedString[idx.successor().successor()]
Of course you can use more convenient ways of reading the characters of a string, such as the for statement or the global function indices(_:).
Consider that the main benefit of this approach is not treating multi-byte characters in Unicode strings, such as emoji, but rather treating in a uniform way strings that are identical (for us humans!) yet can have multiple representations in Unicode, as different sets of "scalars", or characters. An example is café, which can be represented either with four Unicode scalars (Unicode characters) or with five. And note that this is a completely different thing from Unicode encodings like UTF-8, UTF-16, etc., which are ways of mapping Unicode scalars to memory bytes.
An extended grapheme cluster can still occupy multiple bytes; however, the correct way to determine the index position of a character would be:
import Foundation

let mixed = "MADE IN THE USA 🇺🇸"
let index = mixed.rangeOfString("🇺🇸")
let intIndex: Int = distance(mixed.startIndex, index!.startIndex)
Result:
16
The way you are trying to get the index would normally be meant for an array, and I think Swift cannot properly work that out with your mixedString.