CharLower() and characters wider than 16 bits - winapi

So the documentation of CharLower says that it can also convert single characters, namely:
If the high-order word of this parameter is zero, the low-order word must contain a single character to be converted.
This is confusing me because if the high-order word should be zeroed out, it would mean that CharLower() can only convert characters in the range U+0000 to U+FFFF. But what about characters in higher ranges? Would I have to convert those to an LPTSTR first and pass that to CharLower(), or how is this supposed to work?

The full quote from the documentation is as follows:
A null-terminated string, or specifies a single character. If the high-order word of this parameter is zero, the low-order word must contain a single character to be converted.
This parameter is interpreted either as:
a pointer to a null-terminated string, or
a single wchar_t value.
The reason this is possible is that memory addresses below 65536 are reserved by Windows and can never be valid pointers. To use the function in single-character mode, you would call it like this:
WCHAR chr = (WCHAR)(ULONG_PTR)CharLowerW((LPWSTR)L'A'); // chr == L'a'; the result comes back in the low-order word of the returned "pointer"
You then ask:
This is confusing me because if the high-order word should be zeroed out, it would mean that CharLower() can only convert characters in the range U+0000 to U+FFFF. But what about characters in higher ranges? Would I have to convert those to an LPTSTR first and pass that to CharLower(), or how is this supposed to work?
That is correct: in single-character mode, surrogate pairs are not supported. You would have to pass such characters as a null-terminated string instead.
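To make that concrete, here is a minimal sketch in Go (calling user32.dll through the raw syscall interface, so Windows-only; Go is used here purely for illustration). It shows both modes; whether the OS actually lowercases a given supplementary-plane character such as U+10400 DESERET CAPITAL LETTER LONG I depends on the Windows version:

package main

import (
    "fmt"
    "syscall"
    "unicode/utf16"
    "unsafe"
)

func main() {
    charLowerW := syscall.NewLazyDLL("user32.dll").NewProc("CharLowerW")

    // Single-character mode: the "pointer" is really a code unit, so its
    // high-order word is zero.
    r, _, _ := charLowerW.Call(uintptr('A'))
    fmt.Printf("%c\n", rune(uint16(r))) // a

    // String mode: U+10400 is a surrogate pair in UTF-16, so it has to be
    // passed as a null-terminated string; CharLowerW modifies it in place.
    buf := append(utf16.Encode([]rune("\U00010400")), 0)
    charLowerW.Call(uintptr(unsafe.Pointer(&buf[0])))
    fmt.Println(string(utf16.Decode(buf[:len(buf)-1]))) // ideally U+10428
}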
It is reasonable to guess that this interface dates back to the days when Windows supported UCS-2, a precursor to UTF-16. UCS-2 was a fixed-length encoding that only supported codepoints up to U+FFFF, so the problem you describe did not arise. UTF-16 added surrogates for codepoints above U+FFFF. This interface design is comprehensive, albeit somewhat clunky.

Related

LZ77 and escaping character

I am trying to implement LZ77 compression algorithm and encountered this problem.
I am compressing the input (which could be any binary file, not only text) byte by byte, and I use 3 bytes to represent a pointer/reference to a previous substring. The first byte of the pointer is always an escape character, b"\xCC"; to make things easier, let's call it C.
The "standard" way I know when working with escape character is that, you encode all other chars normally, and escape the literal which has the same value as escape char. So 'ABCDE' encoded to 'ABCCDE'.
The problem is that the value of the pointer could be 'CCx': the second byte could itself be 'C', which makes the pointer indistinguishable from the escaped literal 'CC', and this causes problems.
How do I fix that? Or what's the correct/standard way to do LZ77? Thanks!
For LZ77 to be useful, it needs to be followed by an entropy encoder. It is in that step that you encode your symbols to bits that go in the compressed data.
One approach is to define 258 symbols: 256 for the literal bytes, one that indicates that a length and distance for a match follow, and one that indicates end of stream (see the sketch after this list).
Or you can do what deflate does, which is encode the lengths and literals together, so that a single symbol decodes to either a literal byte or a length, where a length implies that a distance code follows.
Or you can do what brotli does, which is define "insert and copy" codes that give a number of literals, followed by that many literal codes and then a copy length and distance.
Or you can invent your own.
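As a sketch of the first approach, here is what the token stream could look like before entropy coding (the names are mine, not a standard; Go is used just for illustration). Because literals and matches are distinct symbols handed to the entropy coder, there is no in-band escape byte, so the ambiguity in the question cannot arise:

package lz77

// Symbols 0-255 stand for the literal byte with that value.
const (
    symMatch = 256 // a length and a distance follow
    symEnd   = 257 // end of the compressed stream
)

// Token is one LZ77 output symbol before entropy coding.
type Token struct {
    Symbol   int // 0-257
    Length   int // valid only when Symbol == symMatch
    Distance int // valid only when Symbol == symMatch
}

// emitLiteral appends a literal token; the byte value is the symbol itself.
func emitLiteral(out []Token, b byte) []Token {
    return append(out, Token{Symbol: int(b)})
}

// emitMatch appends a match token; its kind is carried by the symbol,
// not by a reserved byte value.
func emitMatch(out []Token, length, distance int) []Token {
    return append(out, Token{Symbol: symMatch, Length: length, Distance: distance})
}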

What are surrogate characters in UTF-8?

I have a strange validation program that checks whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain with sets of characters defined by their hex byte representations. Two such sets are D800-DB7F and DC00-DFFF. PHP's regexp matching function, preg_match, fails during these comparisons and says that DC00-DFFF characters are not allowed. From Wikipedia I learned these values are called surrogate characters in UTF-8. What are they, and which characters do they actually correspond to? I have read about them in several places but still don't understand what they are.
What are surrogate characters in UTF-8?
This is almost like a trick question.
Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).
Approximate answer #2: Invalid (if not paired).
Approximate answer #3: It's not UTF-8; it's Modified UTF-8.
Synopsis: The term doesn't apply to UTF-8.
Unicode codepoints have a range that needs 21 bits of data.
UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as pairs of two code units, the first from a "high" range (U+D800 to U+DBFF), the second from a "low" range (U+DC00 to U+DFFF). Unicode reserves the codepoints that match those high and low ranges as invalid. They are sometimes called surrogates, but they are not characters; they don't mean anything by themselves.
UTF-8 code units are 8 bits. UTF-8 encodes successive ranges of codepoints in one, two, three, or four code units.
#1 It happens that the codepoints UTF-16 encodes with two 16-bit code units are exactly the ones UTF-8 encodes with four 8-bit code units, and vice versa (see the sketch after these three points).
#2 You can apply the UTF-8 encoding algorithm to those reserved codepoints, but the result is invalid: the bytes can't be decoded back to a valid codepoint. A compliant reader either throws an exception or throws out the bytes and inserts a replacement character (�).
#3 Java provides a way of implementing functions in external code with a system called JNI. The Java String API provides access to String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values use Modified UTF-8. Modified UTF-8 is the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints.
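Here is the sketch promised under #1: a small Go program (Go chosen purely for illustration) showing the same supplementary-plane codepoint as two UTF-16 code units and four UTF-8 code units:

package main

import (
    "fmt"
    "unicode/utf16"
)

func main() {
    s := "\U0001F600" // U+1F600 GRINNING FACE, outside the Basic Multilingual Plane
    fmt.Printf("UTF-8:  % X\n", []byte(s))              // F0 9F 98 80 (four code units)
    fmt.Printf("UTF-16: %X\n", utf16.Encode([]rune(s))) // [D83D DE00] (a surrogate pair)
}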
Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be considered text, you must know the encoding; otherwise, you have data loss.

bits and bytes and what form they take

I'm still confused about bits and bytes although I've been searching the internet. Is one ASCII character = 1 byte = 8 bits? So 8 bits have 256 unique patterns, which covers all the ASCII codes; in what form is this stored in our computer?
And if I type "Hello", does that mean it consists of 5 bytes?
Yes to everything you wrote. "Bit" is a binary digit: a 0 or a 1. Historically there existed bytes of smaller sizes; now "byte" only ever means "8 bits of information", or a number between 0 and 255.
No. ASCII is a character set with 128 codepoints stored as the values 0-127. Modern computers predominantly address 8-bit memory and disk locations so a 7-bit ASCII value takes up 8 bits.
There is no text but encoded text. An encoding maps a member of a character set to one or more bytes. Unless you absolutely know you are using ASCII, you probably aren't. There are quite a few character sets with encodings that cover all 256 byte values and use any combination of byte values to encode a string.
There are several character sets that are similar but have slightly fewer than 256 characters, and others that use more than one byte to encode a codepoint and don't use every combination of byte values.
Just so you know, Unicode is the predominant character set except in very specialized situations. It has several encodings. UTF-8 is often used for storage and streams. UTF-16 is often used in memory, particularly in Java, .NET, JavaScript, XML, …. When text is communicated between systems, there has to be an agreement, specification, standard, or indication about which character set and encoding it uses so a sequence of bytes can be interpreted as characters.
To add to the confusion, programming languages have data types called char, Character, etc. You have to look at the specific language's reference manual to see what they mean. For example, in C, char is simply an integer whose size is the code unit size of the character encoding used by that C implementation. (C also calls this a "byte", and it is not necessarily 8 bits. In all other contexts, people mean 8 bits when they say "byte". If they want to be exceedingly unambiguous they might say "octet".)
"Hello" is five characters. In a specific character set, it is five codepoints. In a specific encoding for that character set, it could be 5, 10 or 20, or ??? bytes.
Also, in the source code of a specific language, a literal string like that might be "null-terminated", in which case you could say it is 6 "characters". Other languages might store a string as a counted sequence of code units. Again, you have to look at the language reference to know the underlying data structure of strings. Or, if the language and the libraries used with it are sufficiently high-level, you might never need to know such internals.

Can a character span multiple runes in Go?

I read this on a blog:
Even with rune slices a single character might span multiple runes, which can happen if you have characters with grave accent, for example. This complicated and ambiguous nature of "characters" is the reason why Go strings are represented as byte sequences.
Is it true? (It seems like a blog from someone who knows Go.) I tested on my machine and "è" is 1 rune and 2 bytes. And the Go docs seem to say otherwise.
Have you encountered such characters (in UTF-8)? Can a character span multiple runes in Go?
Yes it can:
s := "é́́"
fmt.Println(s, []rune(s))
Output (try it on the Go Playground):
é́́ [101 769 769 769]
One character, 4 runes. It may be arbitrarily long...
Example taken from The Go Blog: Text Normalization in Go.
What is a character?
As was mentioned in the strings blog post, characters can span multiple runes. For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character. The definition of a character may vary depending on the application. For normalization we will define it as a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune, followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents). The normalization algorithm processes one character at a time.
A character can be followed by any number of modifiers (modifiers can be repeated and stacked):
Theoretically, there is no bound to the number of runes that can make up a Unicode character. In fact, there are no restrictions on the number of modifiers that can follow a character and a modifier may be repeated, or stacked. Ever seen an 'e' with three acutes? Here you go: 'é́́'. That is a perfectly valid 4-rune character according to the standard.
Also see: Combining character.
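A short sketch of that composition behavior, using the golang.org/x/text/unicode/norm package: NFC composes 'e' plus one combining acute into a single rune, while the decomposed form stays at two runes:

package main

import (
    "fmt"

    "golang.org/x/text/unicode/norm"
)

func main() {
    s := "e\u0301" // 'e' followed by COMBINING ACUTE ACCENT: one character, two runes
    fmt.Println(len([]rune(s)))                  // 2
    fmt.Println(len([]rune(norm.NFC.String(s)))) // 1 (composed to U+00E9 'é')
}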
Edit: "Doesn't this kill the 'concept of runes'?"
Answer: It's not a concept of runes. A rune is not a character. A rune is an integer value identifying a Unicode code point. A character may be one Unicode code point in which case 1 character is 1 rune. Most of the general use of runes fits into this case, so in practice this hardly gives any headaches. It's a concept of the Unicode standard.

Does a 1-byte UTF-8 "sequence" have a special name?

Per Wikipedia, in UTF-8, the first byte in a multi-byte sequence is called a leading byte, and the subsequent bytes in the sequence are called continuation bytes.
I understand these might not be the "official" names (in fact, the UTF-8 RFC does not provide any names for the different octet types), but according to Wikipedia and based on my research so far, these seem to be the names in common use.
Is there a special name in common use for a byte that is neither a leading byte nor a continuation byte (i.e., for code points < 128)?
I'm documenting some fairly complex code that is designed to work with UTF-8-encoded strings, and I'd like to make sure to use standard terminology to avoid confusion.
Everywhere I would expect to see a definition, I cannot find a special term for this (beyond the already mentioned ASCII). The only thing I can add is that a one-byte "sequence" is a legitimate sequence and that the one byte is not excluded from being called a leading byte.
References from the Unicode standard:
§3.9 (PDF, pg. 119)
A code unit sequence may consist of a single code unit.
§2.5 (PDF, pg. 37)
A range of 8-bit code unit values is reserved for the first, or leading, element of a UTF-8 code unit sequence, and a completely disjunct range of 8-bit code unit values is reserved for the subsequent, or trailing, elements of such sequences;
Some would refer to the 7-bit (single-byte) subset of UTF-8 as ASCII.
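For what it's worth, a byte's role can be read off its top bits. Here is a small Go sketch (the category names are mine, not standard terminology) that classifies a single UTF-8 code unit:

package main

import "fmt"

// classify reports the role of one UTF-8 code unit based on its top bits.
// (0xC0 and 0xC1 match the leading-byte pattern but only occur in overlong,
// hence invalid, encodings.)
func classify(b byte) string {
    switch {
    case b&0x80 == 0x00:
        return "single-byte sequence (the ASCII range)"
    case b&0xC0 == 0x80:
        return "continuation byte"
    case b&0xE0 == 0xC0, b&0xF0 == 0xE0, b&0xF8 == 0xF0:
        return "leading byte of a 2-, 3-, or 4-byte sequence"
    default:
        return "never valid in UTF-8 (0xF8-0xFF)"
    }
}

func main() {
    for _, b := range []byte("é") { // encoded as C3 A9
        fmt.Printf("%02X: %s\n", b, classify(b))
    }
}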
