A two-byte UTF-8 character provides 11 bits for the payload:
110x xxxx 10xx xxxx
I know this isn't efficient, but why can't that payload be in the lower range, such that 'A'=65?
65 decimal = 1000001 binary
two-byte template: 110_ ____ 10__ ____
payload: 1 00 0001
result: 1100 0001 1000 0001
According to Wikipedia, the standard specifically disallows this. While the UTF-8 algorithm allows you to use more bytes than necessary through zero padding, the standard dictates that there is a one-to-one mapping between code points and encoding byte sequences.
An overlong encoding like the one you present is treated as an invalid byte sequence.
Likewise, you aren't allowed to encode code points above U+10FFFF, even though the algorithm could get you up to U+7FFFFFFF.
You are, however, allowed to encode code points that haven't been assigned (i.e. which have no meaning yet).
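Python's built-in UTF-8 codec happens to be strict about this, so the rule is easy to see in action (a minimal sketch; the unassigned code point U+0378 is just an arbitrary example):

    # 'A' (U+0041) in its shortest form: a single byte.
    print(b"\x41".decode("utf-8"))      # -> 'A'

    # The overlong two-byte form 0xC1 0x81 from the question is rejected:
    try:
        b"\xC1\x81".decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)                        # 0xC1 is never a valid start byte

    # Encoding an unassigned (but in-range) code point is perfectly legal:
    print(chr(0x0378).encode("utf-8"))  # -> b'\xcd\xb8'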
This is a security risk. Allowing software to interpret "non-shortest form" UTF-8 prevents accurate and efficient string checks: "DROP T[41]BLE" should not compare equal to "DROP T[C1 81]BLE". The problem in such an implementation would stay hidden if the bad UTF-8 were converted to well-formed UTF-16 before being used.
http://www.unicode.org/versions/corrigendum1.html
(Special thanks to lenz for pointing me to useful search terms.)
I hope this is not a silly question at this time of night, but I can't seem to wrap my mind around it.
UTF-8 is a variable length encoding with a minimum of 8 bits per character. Characters with higher code points will take up to 32 bits.
So UTF-8 can encode Unicode characters using anywhere from 1 to 4 bytes.
Does this mean that in a single UTF-8 encoded string, that one character may be 1 byte and another character may be 3 bytes?
If so, how, in that case, does a computer decoding UTF-8 avoid treating those two separate characters as one 4-byte character?
If the data is held in memory as UTF-8 then, yes, it will be a variable width encoding.
However, the encoding lets a parser know whether the byte it is looking at is the start of a code point or a continuation byte.
From the Wikipedia page for UTF-8:
Bytes  Bits  First code point  Last code point  Byte sequence
1      7     U+0000            U+007F           0xxxxxxx
2      11    U+0080            U+07FF           110xxxxx 10xxxxxx
3      16    U+0800            U+FFFF           1110xxxx 10xxxxxx 10xxxxxx
4      21    U+10000           U+10FFFF         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
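Concretely, a decoder can tell from any single byte which role it plays, just by looking at its high bits. A minimal sketch of that classification (my own illustrative helper, not part of any standard API):

    def utf8_byte_role(b: int) -> str:
        """Classify a byte by its leading bits, per the table above."""
        if b < 0x80:           # 0xxxxxxx
            return "single-byte character (ASCII range)"
        if b < 0xC0:           # 10xxxxxx
            return "continuation byte"
        if b < 0xE0:           # 110xxxxx
            return "lead byte of a 2-byte sequence"
        if b < 0xF0:           # 1110xxxx
            return "lead byte of a 3-byte sequence"
        if b < 0xF8:           # 11110xxx
            return "lead byte of a 4-byte sequence"
        return "not valid anywhere in UTF-8"

    # "a€" encodes as 61 E2 82 AC: one ASCII byte, then a 3-byte sequence.
    for byte in "a€".encode("utf-8"):
        print(hex(byte), utf8_byte_role(byte))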
Why are there no 5-byte or 6-byte sequences? I know they existed until 2003, when they were removed, but I cannot find out why they were removed.
The Wikipedia page on UTF-8 says
In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
but I don't understand why it's important.
Because there are no Unicode characters which would require them. And these cannot be added either because they'd be impossible to encode with UTF-16 surrogates.
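If it helps, the percentages in that quote are easy to reproduce (my own back-of-the-envelope arithmetic, not taken from the article):

    # Three-byte sequences cover U+0800..U+FFFF.
    three_byte_total = 0x10000 - 0x0800      # 63,488 code points
    surrogates = 0xE000 - 0xD800             # 2,048 (U+D800..U+DFFF)
    print(surrogates / three_byte_total)     # ~0.032 -> "more than 3%"

    # Four-byte sequences originally covered U+10000..U+1FFFFF.
    four_byte_total = 0x200000 - 0x10000     # 2,031,616 code points
    kept = 0x110000 - 0x10000                # 1,048,576 (ending at U+10FFFF)
    print(1 - kept / four_byte_total)        # ~0.484 -> "more than 48%"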
I’ve heard some reasons, but didn’t find any of them convincing. Basically, the stupid reason is: UTF-16 was specified before UTF-8, and at that time, 20 bits of storage for characters (yielding 2²⁰+2¹⁶ characters, minus a little like non-characters and surrogates for management) were deemed enough.
UTF-8 and UTF-16 are already variable-length encodings that, as you said for UTF-8, could be extended without big hassle (use 5- and 6-byte words). Extending UTF-32 to cover 21 to 31 bits is trivial (32 could be a problem due to signedness), but making it variable-length defeats the use case of UTF-32 completely.
Extending UTF-16 is hard, but I’ll try. Look at what UTF-8 does in a 2-byte sequence: The initial 110yyyyy acts like a high surrogate and 10zzzzzz like a low surrogate. For UTF-16, flip it around and re-use high surrogates as “initial surrogates” and low surrogates as “continue surrogates”. So, basically, you can have multiple low surrogates.
There’s a problem, though: Unicode streams are supposed to resist misinterpretation when you’re “tuning in” or the sender is “tuning out”.
In UTF-8, if you read a stream of bytes and it ends with 11100010 10000010, you know for sure the stream is incomplete. 1110 tells you: This is a 3-byte word, but one is still missing. In the suggested “extended UTF-16”, there’s nothing like that.
In UTF-16, if you read a stream of bytes and it ends with a high surrogate, you know for sure the stream is incomplete.
The “tuning out” can be solved by using U+10FFFE as an announcement for a single UTF-32 encoding. If the stream stops after U+10FFFE, you know you’re missing something; the same goes for an incomplete UTF-32. And if it stops in the middle of the U+10FFFE, it’s lacking a low surrogate. But that does not work because “tuning in” to the UTF-32 encoding can mislead you.
What could be utilized are the so-called non-characters (the most well-known would be the reverse of the byte order mark) at the end of plane 16: encode U+10FFFE and U+10FFFF using existing surrogates to announce a 3- or 4-byte sequence, respectively. This is very wasteful: 32 bits are used for the announcement alone, and 48 or 64 additional bits are used for the actual encoding. However, it is still better than, say, using U+10FFFE and U+10FFFF around a single UTF-32 encoding.
Maybe there’s something flawed in this reasoning. This is an argument of the sort: This is hard and I’ll prove it by trying and showing where the traps are.
right now the space is allocated for 4^8 + 4^10 code points (CP), i.e. 1,114,112, but barely 1/4 to 1/3rd of that is assigned to anything.
so unless there's a sudden need to add in another 750k CPs in a very short duration, up to 4 bytes for UTF-8 should be more than enough for years to come.
** just personal preference for
4^8 + 4^10
on top of clarity and simplicity, it also clearly delineates the CPs by UTF-8 byte count :
4 ^ 8 = 65,536 = all CPs for 1-, 2-, or 3-bytes UTF-8
4 ^ 10 = 1,048,576 = all CPs for 4-bytes UTF-8
instead of something unseemly like
2^16 * 17
or worse,
32^4 + 16^4
*** unrelated sidenote: the cleanest formula-triplet I managed to conjure up for the starting points of the UTF-16 surrogates is:
4^5 * 54 = 55,296 = 0xD800 = start of the high surrogates
4^5 * 55 = 56,320 = 0xDC00 = start of the low surrogates
4^5 * 56 = 57,344 = 0xE000 = just beyond the upper boundary of 0xDFFF
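A quick sanity check of those formulas (plain arithmetic, here written as Python):

    print(4**8)            # 65,536    -> all CPs reachable with 1-3 UTF-8 bytes
    print(4**10)           # 1,048,576 -> all CPs that need 4 UTF-8 bytes
    print(4**8 + 4**10)    # 1,114,112 -> same as 2**16 * 17 and 32**4 + 16**4

    print(hex(4**5 * 54))  # 0xd800 -> start of the high surrogates
    print(hex(4**5 * 55))  # 0xdc00 -> start of the low surrogates
    print(hex(4**5 * 56))  # 0xe000 -> first code point past 0xDFFF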
I'm trying to understand what encodings in Ruby are - there are lots of articles about encodings, such as this one and this one. However, none of them explain the basic question a newbie might have - what is an encoding in the first place?
What is meant by character encoding here is a system of describing how computers represent characters in binary.
In UTF-8 encoding, the character ä is represented as 1100 0011 1010 0100, or 0xC3 0xA4 in hexadecimal.
In Windows-1252 encoding, the same character is represented as 1110 0100 or 0xE4 in hexadecimal.
So let's say you tell a computer to read a file as Windows-1252, but the file is actually encoded as UTF-8. The file contains just one character, say ä. Since the file is in UTF-8, it actually contains the bytes 0xC3 0xA4. Now, because you told the computer (implicitly or explicitly) to read the file as Windows-1252, you will actually see Ã¤ instead of ä.
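You can reproduce that mismatch directly (a small sketch; in Python the Windows-1252 codec is called "cp1252"):

    text = "ä"

    # Encode as UTF-8: two bytes, 0xC3 0xA4.
    data = text.encode("utf-8")
    print(data)                    # b'\xc3\xa4'

    # Decode those same bytes as Windows-1252: each byte becomes its own character.
    print(data.decode("cp1252"))   # 'Ã¤'  <- the classic mojibake

    # Decode them as what they really are, UTF-8:
    print(data.decode("utf-8"))    # 'ä'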
An encoding is a means of transforming some sequence of bytes into text. ASCII is one where there's a single byte per character. UTF-8 is another common one that uses a variable number of bytes to encode a somewhat larger set of characters. And, of course, Character Encoding on Wikipedia is probably a helpful read.
If UTF-8 is 8 bits, does it not mean that there can only be a maximum of 256 different characters?
The first 128 code points are the same as in ASCII. But it says UTF-8 can support up to a million characters?
How does this work?
UTF-8 does not use one byte all the time; it uses 1 to 4 bytes.
The first 128 characters (US-ASCII) need one byte.
The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.
Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use[12] including most Chinese, Japanese and Korean [CJK] characters.
Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
source: Wikipedia
UTF-8 uses 1-4 bytes per character: one byte for ASCII characters (the first 128 Unicode values are the same as ASCII), which only need 7 bits. If the highest bit is set, this indicates the start of a multi-byte sequence; the number of consecutive high bits set indicates the number of bytes, then comes a 0, and the remaining bits contribute to the value. For the other bytes, the highest two bits will be 1 and 0, and the remaining 6 bits are for the value.
So a four-byte sequence would begin with 11110... (where ... = three bits for the value), followed by three bytes with 6 bits each for the value, yielding a 21-bit value. 2^21 exceeds the number of Unicode characters, so all of Unicode can be expressed in UTF-8.
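To see both points in practice, here is a small sketch (the sample characters are arbitrary):

    # One string, four characters, four different byte lengths in UTF-8.
    for ch in ["A", "ß", "€", "😀"]:
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

    # A 4-byte sequence carries 3 + 6 + 6 + 6 = 21 payload bits:
    print(2**21)   # 2,097,152, comfortably above Unicode's 1,114,112 code points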
Unicode vs UTF-8
Unicode resolves code points to characters. UTF-8 is a storage mechanism for Unicode. Unicode has a spec. UTF-8 has a spec. They have different limits; UTF-8 has a different upper bound.
Unicode
Unicode is organized into "planes." Each plane carries 2^16 code points. There are 17 planes in Unicode, for a total of 17 * 2^16 code points. The first plane, plane 0 or the BMP, is special in the weight of what it carries.
Rather than explain all the nuances, let me just quote the above article on planes.
The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment.
UTF-8
Now let's go back to the article linked above,
The encoding scheme used by UTF-8 was designed with a much larger limit of 2^31 code points (32,768 planes), and can encode 2^21 code points (32 planes) even if limited to 4 bytes.[3] Since Unicode limits the code points to the 17 planes that can be encoded by UTF-16, code points above 0x10FFFF are invalid in UTF-8 and UTF-32.
So you can see that you can put stuff into UTF-8 that isn't valid Unicode. Why? Because UTF-8 accommodates code points that Unicode doesn't even support.
UTF-8, even with a four byte limitation, supports 2^21 code points, which is far more than 17 * 2^16.
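A small illustration of that boundary, relying on the fact that Python itself enforces the Unicode range (so this is a sketch of the limit, not of raw UTF-8 capacity):

    # The last valid code point, U+10FFFF, still fits in 4 bytes:
    print(chr(0x10FFFF).encode("utf-8"))   # b'\xf4\x8f\xbf\xbf'

    # Anything beyond the 17 planes is rejected before UTF-8 is even involved:
    try:
        chr(0x110000)
    except ValueError as e:
        print(e)                           # chr() arg not in range(0x110000)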
According to this table* UTF-8 should support:
2^31 = 2,147,483,648 characters
However, RFC 3629 restricted the possible values, so now we're capped at 4 bytes, which gives us
2^21 = 2,097,152 characters
Note that a good chunk of those characters are "reserved" for custom use, which is actually pretty handy for icon-fonts.
* Wikipedia used to show a table with 6 bytes -- they've since updated the article.
2017-07-11: Corrected for double-counting the same code point encoded with multiple bytes
2,164,864 “characters” can be potentially coded by UTF-8.
This number is 2^7 + 2^11 + 2^16 + 2^21, which comes from the way the encoding works:
1-byte chars have 7 bits for encoding
0xxxxxxx (0x00-0x7F)
2-byte chars have 11 bits for encoding
110xxxxx 10xxxxxx (0xC0-0xDF for the first byte; 0x80-0xBF for the second)
3-byte chars have 16 bits for encoding
1110xxxx 10xxxxxx 10xxxxxx (0xE0-0xEF for the first byte; 0x80-0xBF for continuation bytes)
4-byte chars have 21 bits for encoding
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (0xF0-0xF7 for the first byte; 0x80-0xBF for continuation bytes)
As you can see this is significantly larger than current Unicode (1,112,064 characters).
UPDATE
My initial calculation is wrong because it doesn't consider additional rules. See comments to this answer for more details.
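For reference, here is the arithmetic behind both the original figure and the correction (my own sketch of the calculation, not part of the answer):

    # Raw bit patterns available to 1-, 2-, 3- and 4-byte sequences:
    patterns = 2**7 + 2**11 + 2**16 + 2**21
    print(patterns)     # 2,164,864

    # Overlong forms re-encode code points already reachable with fewer bytes,
    # so the number of distinct code points expressible in at most 4 bytes is:
    print(2**21)        # 2,097,152 (shortest forms cover 0x0 .. 0x1FFFFF)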
UTF-8 is a variable length encoding with a minimum of 8 bits per character.
Characters with higher code points will take up to 32 bits.
Quote from Wikipedia: "UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed "octets" in the Unicode Standard)."
Some links:
http://www.utf-8.com/
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
http://en.wikipedia.org/wiki/UTF-8
Check out the Unicode Standard and related information, such as their FAQ entry, UTF-8 UTF-16, UTF-32 & BOM. It’s not that smooth sailing, but it’s authoritative information, and much of what you might read about UTF-8 elsewhere is questionable.
The “8” in “UTF-8” relates to the length of code units in bits. Code units are entities used to encode characters, not necessarily as a simple one-to-one mapping. UTF-8 uses a variable number of code units to encode a character.
The collection of characters that can be encoded in UTF-8 is exactly the same as for UTF-16 or UTF-32, namely all Unicode characters. They all encode the entire Unicode coding space, which even includes noncharacters and unassigned code points.
While I agree with mpen on the current maximum number of UTF-8 codes (2,164,864) (listed above; I couldn't comment on his answer), he is off by 2 levels if you remove the 2 major restrictions of UTF-8: the 4-byte limit and the rule that codes 254 and 255 cannot be used (he only removed the 4-byte limit).
Starting code 254 follows the basic arrangement of starting bits (multi-bit flag set to 1, a count of 6 1's, and terminal 0, no spare bits) giving you 6 additional bytes to work with (6 10xxxxxx groups, an additional 2^36 codes).
Starting code 255 doesn't exactly follow the basic setup, no terminal 0 but all bits are used, giving you 7 additional bytes (multi-bit flag set to 1, a count of 7 1's, and no terminal 0 because all bits are used; 7 10xxxxxx groups, an additional 2^42 codes).
Adding these in gives a final maximum representable character set of 4,468,982,745,216. This is more than all characters in current use, old or dead languages, and any believed lost languages. Angelic or Celestial script, anyone?
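The arithmetic behind that figure, if you want to check it (my own sketch, using the payload bit counts implied by the hypothetical 254 and 255 lead bytes):

    # Payload bits per sequence length: 7, 11, 16, 21, 26, 31 (the pre-2003 scheme),
    # plus 36 bits (lead byte 254) and 42 bits (lead byte 255) as described above.
    payload_bits = [7, 11, 16, 21, 26, 31, 36, 42]
    print(sum(2**bits for bits in payload_bits))   # 4,468,982,745,216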
Also there are single byte codes that are overlooked/ignored in the UTF-8 standard in addition to 254 and 255: 128-191, and a few others. Some are used locally by the keyboard, example code 128 is usually a deleting backspace. The other starting codes (and associated ranges) are invalid for one or more reasons (https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences).
Unicode and UTF-8 are now firmly married to each other. Unicode is limited to 1,114,112 code points (U+0000..U+10FFFF), and RFC 3629 restricts UTF-8 to exactly the same range. Both systems reserve the same 'dead' space and restricted zones for code points, etc. As of June 2018, the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters.
From the Unicode Standard. Unicode FAQ
The Unicode Standard encodes characters in the range U+0000..U+10FFFF,
which amounts to a 21-bit code space.
From the UTF-8 Wikipedia page. UTF-8 Description
Since the restriction of the Unicode code-space to 21-bit values in
2003, UTF-8 is defined to encode code points in one to four bytes, ...