What are surrogate characters in UTF-8? - utf-8

I have a strange validation program that validates wheather a utf-8 string is a valid host name(Zend Framework Hostname valdiator in PHP). It allows IDNs(internationalized domain names). It will compare each subdomain with sets of characters defined by their HEX bytes representation. Two such sets are D800-DB7F and DC00-DFFF. Php regexp comparing function called preg_match fails during these comparsions and it says that DC00-DFFF characters are not allowed in this function. From wikipedia I learned these bytes are called surrogate characters in UTF-8. What are thay and which characters they actually correspond to? I read in several places I still don't understand what they are.

What are surrogate characters in UTF-8?
This is almost like a trick question.
Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).
Approximate answer #2: Invalid (if not paired).
Approximate answer #3: It's not UTF-8; It's Modified UTF-8.
Synopsis: The term doesn't apply to UTF-8.
Unicode codepoints have a range that needs 21 bits of data.
UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as pairs of two code units, the first from a "high" range, the second from a "low" range. Unicode reserves the codepoints that match the ranges of the high and low pairs as invalid. They are sometimes called surrogates but they are not characters. They don't mean anything by themselves.
UTF-8 code units are 8 bits. UTF-8 encodes several distinct ranges of codepoints in one to four code units, respectively.
#1 It happens that the codepoints that UTF-16 encodes with two 16-bit code units, UTF-8 encodes with 4 8-bit code units, and vice versa.
#2 You can apply the UTF-8 encoding algorithm to the invalid codepoints, which is invalid. They can't be decoded to a valid codepoint. A compliant reader would throw an exception or throw out the bytes and insert a replacement character (�).
#3 Java provides a way of implementing functions in external code with a system called JNI. The Java String API provides access to String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values are modified UTF-8. Modified UTF-8 is the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints.
Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be considered text, you must know the encoding; Otherwise, you have data loss.

Related

Converting an UTF-16 index to a UTF-8 compatible one

I am currently working with the Telegram API, and in one of its methods it returns the following information:
a piece of text
offset in UTF-16 Code Units
length in UTF-16 Code Units
In my programming language, Rust, all strings are valid UTF-8. This means that UTF-16 offsets are not immediately useful as they can be off by a variable amount (due to 1 or 3 byte code points). A one-byte code-point in UTF-8 corresponds to a two-byte one in UTF-16, so I cannot simply index the UTF-8 string as I may lay outside of code-point boundaries.
I am wondering now: Is there a way to convert it to valid UTF-8 without iterating through the UTF-8 string or if the information is useless once in UTF-8?

How to Decode UTF-8 Text Sequence \ud83e\udd14

I'm reading UTF-8 text that contains "\ud83e\udd14". Reading the specification, it says that U+D800 to U+DFFF are not used. Yet if I run this through a decoder such as Microsoft's System.Web.Helpers.Json.Decode, it yields the correct result of an emoticon of a face with a tongue hanging out. The text originates through Twitter's search api.
My question: how should this sequence be decoded? I'm looking for what the final hex sequence would be and how it is obtained. Thanks for any guidance. If my question isn't clear, please let me know and I will try to improve it.
You are coming at this from an interesting perspective. The first thing to note is that you're dealing with two levels of text: a JSON document and a string within it.
Synopsis: You don't need to write code to decode it. Use a library that deserializes JSON into objects, such as Newtonsoft's JSON.Net.
But, first, Unicode. Unicode is a character set with a bit of a history. Unlike almost every character set, 1) it has more than one encoding, and 2) it is still growing. A couple of decades ago, it had <65636 codepoints and that was thought to be enough. So, encoding each codepoint with as 2-byte integer was the plan. It was called UCS-2 or, simply, the Unicode encoding. (Microsoft has stuck with Encoding.Unicode in .NET, which causes some confusion.)
Aside: Codepoints are identified for discussion using the U+ABCD (hexadecimal) format.
Then the Unicode consortium decided to add more codepoints: all the way to U+10FFFF. For that, encodings need at least 21 bits. UTF-32, integers with 32 bits, is an obvious solution but not very dense. So, encodings that use a variable number of code units where invented. UTF-8 uses one to four 8-bit code units, depending on the codepoint.
But a lot of languages were adopting UCS-2 in the 1990s. Documents, of course, can be transformed at will but code that processes UCS-2 would break without a compatible encoding for the expanded character set. Since U+D800 to U+DFFF where unassigned, UCS-2 could stay the same and those "surrogate codepoints" could be used to encode new codepoints. The result is UTF-16. Each codepoint is encoded in one or two 16-bit code units. So, programs that processed UCS-2 could automatically process UTF-16 as long as they didn't need to understand it. Programs written in the same system could be considered to be processing UTF-16, especially with libraries that do understand it. There is still the hazard of things like string length giving the number of UTF-16 code units rather than the number of codepoints, but it has otherwise worked out well.
As for the \ud83e\udd14 notation, languages use Unicode in their syntax or literal strings desired a way to accept source files in a non-Unicode encoding and still support all the Unicode codepoints. Being designed in the 1990s, they simply wrote the UCS-2 code units in hexadecimal. Of course, that too is extended to UTF-16. This UTF-16 code unit escaped syntax allows intermediary systems to handle source code files with a non-Unicode encoding.
Now, JSON is based on JavaScript and JavaScript's strings are sequences of UTF-16 code units. So JSON has adopted th UTF-16 code unit escaped syntax from JavaScript. However, it's not very useful (unless you have to deal with intermediary systems that can't be made to use UTF-8 or treat files they don't understand as binary). The old JSON standard requires JSON documents exchanged between systems to be encoded with UTF-8, UTF-16 or UTF-32. The new RFC8259 requires UTF-8.
So, you don't have "UTF-8 text", you have Unicode text encoding with UTF-8. The text itself is a JSON document. JSON documents have names and values that are Unicode text as sequences of UTF-16 code units with escapes allowed. Your document has the codepoint U+1F914 written, not as "🤔" but as "\ud83e\udd14".
There are plenty of libraries that transform JSON to objects so you shouldn't need to decode the names or values in a JSON document. To do it manually, you'd recognize the escape prefix and take the next 4 characters as the bits of a surrogate, extracting the data bits, then combine them with the bits from the paired surrogate that should follow.
Thought I'd read up on UTF-16 to see if it gave me any clues, and it turns out this is what it calls a surrogate pair. The hex formula for decoding is:
(H - D800) * 400 + (L - DC00) + 10000
where H is the first (high) codepoint and L is the second (low) codepoint.
So \ud83e\udd14 becomes 1f914
Apparently UTF-8 decoders must anticipate UTF-16 surrogate pairs.

bits and bytes and what form are them

I'm still confused about the bits and bytes although I've been searching through the internet. Is that one character of ASCII = 1 bytes = 8 bits? So 8 bits have 256 unique pattern that covered all the ASCII code, what form is it stored in our computer?
And if I typed "Hello" does that mean this consists of 5 bytes?
Yes to everything you wrote. "Bit" is a binary digit: a 0 or a 1. Historically there existed bytes of smaller sizes; now "byte" only ever means "8 bits of information", or a number between 0 and 255.
No. ASCII is a character set with 128 codepoints stored as the values 0-127. Modern computers predominantly address 8-bit memory and disk locations so a 7-bit ASCII value takes up 8 bits.
There is no text but encoded text. An encoding maps a member of a character set to one or more bytes. Unless you absolutely know you are using ASCII, you probably aren't. There are quite a few character sets with encodings that cover all 256 byte values and use any combination of byte values to encode a string.
There are several character sets that are similar but have a few less than 256 characters. And others that use more than one byte to encode a codepoint and don't use every combination of byte values.
Just so you know, Unicode is the predominant character set except in very specialized situations. It has several encodings. UTF-8 is often used for storage and streams. UTF-16 is often used in memory, particularly in Java, .NET, JavaScript, XML, …. When text is communicated between systems, there has to be an agreement, specification, standard, or indication about which character set and encoding it uses so a sequence of bytes can be interpreted as characters.
To add to the confusion, programming languages have data types called char, Character, etc. You have to look at the specific language's reference manual to see what they mean. For example in C, char is simply an integer that is defined as the size of the encoding of character used by that C implementation. (C also calls this a "byte" and it is not necessarily 8 bits. In all other contexts, people mean 8 bits when they say "byte". If they want to be exceedingly unambiguous they might say "octet".)
"Hello" is five characters. In a specific character set, it is five codepoints. In a specific encoding for that character set, it could be 5, 10 or 20, or ??? bytes.
Also, in the source code of a specific language, a literal string like that might be "null-terminated". This means that you could say it is 6 "characters". Other languages might store a string as a counted sequence of code units. Again, you have to look at the language reference to know the underlying data structure of strings. Of, if the language and the libraries used with it are sufficiently high-level, you might never need to know such internals.

ASCII characters set

I am reading a (.txt) file, the contents of first line are just the four alphabet letters: "abcd".
when I display the ASCII code of these letters, I expect I to find 97,98,99 and 100 respectively for a,b,c and b. But I found tow special characters which their ASCII code are 255 and 254 for ÿ and þ.
Therefore the length of read line is 6 not 4 because of "ÿþabcd". Are these special character must-to-insert at the start of any sequential text file or is there any way to avoid both of them?
ASCII is only used in niche or archaic systems. Your data proved to you that your file is not ASCII. You must find out which character set and encoding the file was stored in.
Character Sets
All text is an encoding of elements of a character set. Elements of a character set are called codepoints. A character set consists of a list of codepoints and their descriptions. The description states how the codepoint is used semantically in text, such as LATIN CAPITAL LETTER A (A) or N-ARY PRODUCT (∏). (The style in which the codepoint is rendered is the purview of typefaces/fonts.)
Encodings
Codepoints are numbered with non-negative integers. The number is encoded into bytes. Most character sets have only one encoding, which is the number as an unsigned integer in the smallest size that can represent all of the codepoints. For example, Windows-1252 has 251 codepoints, with numbers between 0 and 255. A byte is big enough to represent any of them. The Unicode character set has about 1 million codepoints, numbered from 0 to 1,114,112. A 32-bit integer is big enough to represent all of them. That's the UTF-32 encoding.
Byte order
Computer memory is usually byte-addressable and file are byte sequences so for a large integer the question becomes, in which order the bytes are stored: The most significant byte first (big endian) or least significant byte first (little endian). Software adapts to or assumes one way or the other. So, UTF-32 actually identifies one of two encodings: UTF32BE or UTF-32LE. UTF-32 is shorthand for the endianness that the software assumes. Typically, the OS assumes the endianness of the hardware it is running on and programs follow suit.
Unicode Encodings
UTF-32 takes a lot of space. The most commonly used codepoints are numbered below 65,536. So there can be savings if codepoints are represented variable number of smaller integers. The size of the integer is called a code unit. The value of a code unit contains some of the bits of a code unit and indicates if there are more code units to follow that have more of the bits. So, there are UTF-16LE and UTF-16BE and UTF-8 (and more) encodings for Unicode. UTF-16 uses one or two 16-bit code units for a codepoint and UTF-8 uses one to four 8-bit code units for a codepoint.
Files are data outside of programs. So for a program to read text, it has to know the character set and encoding. Often this metadata is not stored with the file (within or beside). That's how you made the mistake of believing your file is ASCII. If you don't know the encoding of a file, you've lost data. You might be able to recover it through guessing. It is notable that the CP437 character set has 256 codepoints, numbered 0 to 255 and encoded in one byte. So every file can be read as CP437; The question is, is that right? Even if it looks right, it's probably not right unless it's from a Western culture circa 1990.
Unicode Byte-order Mark
A strong clue about which character set and encoding to guess is called the byte-order mark (BOM). Recall encodings with code units larger than one byte have an endianness. Endianness is a hardware concern. So, although a file can be passed between systems with agreement on which character set and encoding scheme is used, the endianness attribute of encoding is critical to each system. It has become standard to indicate byte order within the file, as the first bytes. Unicode specifies a codepoint to use for this purpose, as long as it is at the beginning of the file. (That means that programs reading Unicode from a file must separate this metadata from the data.) Many file writing libraries write the BOM codepoint regardless of the code unit size. So, you'll see it at the beginning of UTF-8 files. Since the Unicode BOM looks different in each of the Unicode encodings, it completely identifies which Unicode encoding is being used.
Guessing
Your file begins with the UTF-16LE BOM. Read it as UTF-16LE (and discard the BOM codepoint if your library doesn't already.)
Given the specificness of the Unicode BOM, its presence is a strong indicator that the file is encoded in Unicode and the actual bytes tell which Unicode encoding. However, as noted above, it's possible that this guess is wrong.
As #Lưu Vĩnh Phúc points out, it is unclear how you are reading "ÿþabcd" from what you say is a 6-byte file. Open the file in a hex editor. UTF-16LE should be FF FE 61 00 62 00 63 00 64 00.

What is the maximum number of bytes for a UTF-8 encoded character?

What is the maximum number of bytes for a single UTF-8 encoded character?
I'll be encrypting the bytes of a String encoded in UTF-8 and therefore need to be able to work out the maximum number of bytes for a UTF-8 encoded String.
Could someone confirm the maximum number of bytes for a single UTF-8 encoded character please
The maximum number of bytes per character is 4 according to RFC3629 which limited the character table to U+10FFFF:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets.
(The original specification allowed for up to six byte character codes for code points past U+10FFFF.)
Characters with a code less than 128 will require 1 byte only, and the next 1920 character codes require 2 bytes only. Unless you are working with an esoteric language, multiplying the character count by 4 will be a significant overestimation.
Without further context, I would say that the maximum number of bytes for a character in UTF-8 is
answer: 6 bytes
The author of the accepted answer correctly pointed this out as the "original specification". That was valid through RFC-2279 1. As J. Cocoe pointed out in the comments below, this changed in 2003 with RFC-3629 2, which limits UTF-8 to encoding for 21 bits, which can be handled with the encoding scheme using four bytes.
answer if covering all unicode: 4 bytes
But, in Java <= v7, they talk about a 3-byte maximum for representing unicode with UTF-8? That's because the original unicode specification only defined the basic multi-lingual plane (BMP), i.e. it is an older version of unicode, or subset of modern unicode. So
answer if representing only original unicode, the BMP: 3 bytes
But, the OP talks about going the other way. Not from characters to UTF-8 bytes, but from UTF-8 bytes to a "String" of bytes representation. Perhaps the author of the accepted answer got that from the context of the question, but this is not necessarily obvious, so may confuse the casual reader of this question.
Going from UTF-8 to native encoding, we have to look at how the "String" is implemented. Some languages, like Python >= 3 will represent each character with integer code points, which allows for 4 bytes per character = 32 bits to cover the 21 we need for unicode, with some waste. Why not exactly 21 bits? Because things are faster when they are byte-aligned. Some languages like Python <= 2 and Java represent characters using a UTF-16 encoding, which means that they have to use surrogate pairs to represent extended unicode (not BMP). Either way that's still 4 bytes maximum.
answer if going UTF-8 -> native encoding: 4 bytes
So, final conclusion, 4 is the most common right answer, so we got it right. But, mileage could vary.
The maximum number of bytes to support US-ASCII, a standard English alphabet encoding, is 1. But limiting text to English is becoming less desirable or practical as time goes by.
Unicode was designed to represent the glyphs of all human languages, as well as many kinds of symbols, with a variety of rendering characteristics. UTF-8 is an efficient encoding for Unicode, although still biased toward English. UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction.
While the maximum number of bytes per UTF-8 character is 3 for supporting just the 2-byte address space of Plane 0, the Basic Multilingual Plane (BMP), which can be accepted as minimal support in some applications, it is 4 for supporting all 17 current planes of Unicode (as of 2019). It should be noted that many popular "emoji" characters are likely to be located in Plane 16, which requires 4 bytes.
However, this is just for basic character glyphs. There are also various modifiers, such as making accents appear over the previous character, and it is also possible to link together an arbitrary number of code points to construct one complex "grapheme". In real world programming, therefore, the use or assumption of a fixed maximum number of bytes per character will likely eventually result in a problem for your application.
These considerations imply that UTF-8 character strings should not "expanded" into arrays of fixed length prior to processing, as has sometimes been done. Instead, programming should be done directly, using string functions specifically designed for UTF-8.
Condidering just technical limitations - it's possible to have up to 7 bytes following current UTF8 encoding scheme. According to it - if first byte is not self-sufficient ASCII character, than it should have pattern: 1(n)0X(7-n), where n is <= 7.
Also theoretically it could be 8 but then first byte would have no zero bit at all. While other aspects, like continuation byte differing from leading, are still there (allowing error detection), I heared, that byte 11111111 could be invalid, but I can't be sure about that.
Limitatation for max 4 bytes is most likely for compatibility with UTF-16, which I tend to consider a legacy, because the only quality where it excels, is processing speed, but only if string byte order matches (i.e. we read 0xFEFF in the BOM).

Resources