Which Languages Does UTF-8 Not Support?

I'm working on internationalizing one of my programs for work. I'm trying to use foresight to avoid possible issues or redoing the process down the road.
I see references for UTF-8, UTF-16 and UTF-32. My question is two parts:
What languages does UTF-8 not support?
What advantages do UTF-16 and UTF-32 have over UTF-8?
If UTF-8 works for everything, then I'm curious what the advantages of UTF-16 and UTF-32 are (e.g. special search features in a database, etc.). Having that understanding should help me finish designing my program (and database connections) properly. Thanks!

All three are just different ways to represent the same thing, so there are no languages supported by one and not another.
Sometimes UTF-16 is used by a system that you need to interoperate with - for instance, the Windows API uses UTF-16 natively.
In theory, UTF-32 can represent any "character" in a single 32-bit integer without ever needing to use more than one, whereas UTF-8 and UTF-16 need to use more than one 8-bit or 16-bit integer to do that. But in practice, with combining and non-combining variants of some codepoints, that's not really true.
One advantage of UTF-8 over the others is that if you have a bug whereby you're assuming that the number of 8-, 16- or 32-bit integers respectively is the same as the number of codepoints, it becomes obvious more quickly with UTF-8: something will fail as soon as you have any non-ASCII codepoint in there, whereas with UTF-16 the bug can go unnoticed until a codepoint outside the Basic Multilingual Plane shows up.
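For example (a minimal Python sketch with an arbitrary sample string), the code-unit count and the codepoint count drift apart as soon as UTF-8 meets any non-ASCII character, but only at the surrogate-pair stage for UTF-16:

# Codepoint count vs. code-unit count under each encoding.
s = "Aé€🙂"                              # U+0041, U+00E9, U+20AC, U+1F642 -- four codepoints
print(len(s))                            # 4 codepoints
print(len(s.encode("utf-8")))            # 10 UTF-8 code units (bytes): 1 + 2 + 3 + 4
print(len(s.encode("utf-16-le")) // 2)   # 5 UTF-16 code units: only the emoji needs a surrogate pair
print(len(s.encode("utf-32-le")) // 4)   # 4 UTF-32 code units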
To answer your first question, here's a list of scripts currently unsupported by Unicode: http://www.unicode.org/standard/unsupported.html

UTF-8 is variable-length, 1 to 4 bytes per codepoint; UTF-16 uses 2 or 4 bytes; UTF-32 is a fixed 4 bytes.
That is why UTF-8 has an advantage where ASCII characters are the most prevalent, UTF-16 can be more compact where ASCII is not predominant, and UTF-32 covers every possible codepoint in 4 bytes.
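As a rough illustration (a Python sketch with arbitrary sample strings), compare the encoded sizes of ASCII-heavy and CJK-heavy text:

# Encoded size in bytes of an ASCII-heavy sample vs. a Japanese sample.
samples = {
    "english":  "The quick brown fox jumps over the lazy dog. " * 10,
    "japanese": "いろはにほへとちりぬるを" * 10,
}
for name, text in samples.items():
    print(name,
          "utf-8:",  len(text.encode("utf-8")),
          "utf-16:", len(text.encode("utf-16-le")),
          "utf-32:", len(text.encode("utf-32-le")))
# The ASCII-heavy sample is smallest in UTF-8; the Japanese sample is smaller in UTF-16 than in UTF-8.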

Related

Can I represent arbitrary binary data in WTF-8 (or any extension of UTF-8)?

A long time ago, there was a two-byte Unicode encoding UCS-2, but then it was determined that two bytes are sometimes not enough. In order to cram more codepoints into 16 bit, surrogate pairs were introduced in UTF-16. Since Windows started out with UCS-2, it doesn't enforce rules around surrogate pairs in some places, most notably file systems.
Programs that want to use UTF-8 internally now have a problem dealing with these invalid UTF-16 sequences. For this, WTF-8 was developed. It is mostly relaxed UTF-8, but it is able to round-trip ill-formed UTF-16 (unpaired surrogates).
Now it seems like it should be possible to relax UTF-8 a bit further, and allow it to represent arbitrary binary data in a round-trip-safe way. The strings I am thinking about are originally 99.9% either valid UTF-8, or almost-valid UTF-16 of the kind WTF-8 can stomach. But occasionally there will be invalid byte sequences thrown in.
WTF-8 defines generalized UTF-8 as:
an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII).
Would generalized UTF-8 allow me to store arbitrary 32-bit sequences, and thus arbitrary data? Or is there another way, such as a Unicode escape character? Things I don't want to do are base64 encoding or percent-encoding, since I want to leave valid Unicode strings unchanged.
Standard disclaimer: I encountered this problem a couple times before, but right now it is an academic question, and I'm just interested in a straight answer how to do this. There is no XY problem :-)
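For reference, CPython's surrogatepass error handler already produces generalized-UTF-8-style bytes for lone surrogates, which is the kind of relaxation I mean (just a sketch of the existing codec behaviour, not a scheme for arbitrary bytes):

# A lone surrogate is not a Unicode scalar value, so strict UTF-8 rejects it,
# but "surrogatepass" encodes it with the ordinary UTF-8 bit layout (3 bytes).
lone = "\ud83e"
print(lone.encode("utf-8", "surrogatepass").hex(" "))             # ed a0 be
# ...and the same handler round-trips it back, WTF-8 style:
print(b"\xed\xa0\xbe".decode("utf-8", "surrogatepass") == lone)   # True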

How to Decode UTF-8 Text Sequence \ud83e\udd14

I'm reading UTF-8 text that contains "\ud83e\udd14". Reading the specification, it says that U+D800 to U+DFFF are not used. Yet if I run this through a decoder such as Microsoft's System.Web.Helpers.Json.Decode, it yields the correct result of a thinking-face emoji. The text originates through Twitter's search API.
My question: how should this sequence be decoded? I'm looking for what the final hex sequence would be and how it is obtained. Thanks for any guidance. If my question isn't clear, please let me know and I will try to improve it.
You are coming at this from an interesting perspective. The first thing to note is that you're dealing with two levels of text: a JSON document and a string within it.
Synopsis: You don't need to write code to decode it. Use a library that deserializes JSON into objects, such as Newtonsoft's JSON.Net.
But, first, Unicode. Unicode is a character set with a bit of a history. Unlike almost every other character set, 1) it has more than one encoding, and 2) it is still growing. A couple of decades ago, it had fewer than 65,536 codepoints and that was thought to be enough. So, encoding each codepoint as a 2-byte integer was the plan. It was called UCS-2 or, simply, the Unicode encoding. (Microsoft has stuck with Encoding.Unicode in .NET, which causes some confusion.)
Aside: Codepoints are identified for discussion using the U+ABCD (hexadecimal) format.
Then the Unicode consortium decided to add more codepoints: all the way to U+10FFFF. For that, encodings need at least 21 bits. UTF-32, integers with 32 bits, is an obvious solution but not very dense. So, encodings that use a variable number of code units were invented. UTF-8 uses one to four 8-bit code units, depending on the codepoint.
But a lot of languages were adopting UCS-2 in the 1990s. Documents, of course, can be transformed at will, but code that processes UCS-2 would break without a compatible encoding for the expanded character set. Since U+D800 to U+DFFF were unassigned, UCS-2 could stay the same and those "surrogate codepoints" could be used to encode new codepoints. The result is UTF-16. Each codepoint is encoded in one or two 16-bit code units. So, programs that processed UCS-2 could automatically process UTF-16 as long as they didn't need to understand it. Programs written against the same system could be considered to be processing UTF-16, especially with libraries that do understand it. There is still the hazard of things like string length giving the number of UTF-16 code units rather than the number of codepoints, but it has otherwise worked out well.
As for the \ud83e\udd14 notation, languages that use Unicode in their syntax or literal strings wanted a way to accept source files in a non-Unicode encoding and still support all the Unicode codepoints. Being designed in the 1990s, they simply wrote the UCS-2 code units in hexadecimal. Of course, that too was extended to UTF-16. This UTF-16 code-unit escape syntax allows intermediary systems to handle source code files with a non-Unicode encoding.
Now, JSON is based on JavaScript, and JavaScript's strings are sequences of UTF-16 code units. So JSON has adopted the UTF-16 code-unit escape syntax from JavaScript. However, it's not very useful (unless you have to deal with intermediary systems that can't be made to use UTF-8 or treat files they don't understand as binary). The old JSON standard requires JSON documents exchanged between systems to be encoded with UTF-8, UTF-16 or UTF-32. The new RFC 8259 requires UTF-8.
So, you don't have "UTF-8 text", you have Unicode text encoding with UTF-8. The text itself is a JSON document. JSON documents have names and values that are Unicode text as sequences of UTF-16 code units with escapes allowed. Your document has the codepoint U+1F914 written, not as "🤔" but as "\ud83e\udd14".
There are plenty of libraries that transform JSON to objects, so you shouldn't need to decode the names or values in a JSON document yourself. To do it manually, you'd recognize the escape prefix, take the next 4 hex digits as a surrogate code unit, extract its data bits, and then combine them with the bits from the paired surrogate that should follow.
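Purely as an illustration in a different stack (Python's standard json module rather than .NET), the library does the escape handling and the surrogate pairing for you:

import json
decoded = json.loads('"\\ud83e\\udd14"')   # the JSON string value "\ud83e\udd14"
print(decoded)                             # 🤔
print(hex(ord(decoded)))                   # 0x1f914
print(len(decoded))                        # 1 -- a single codepoint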
Thought I'd read up on UTF-16 to see if it gave me any clues, and it turns out this is what it calls a surrogate pair. The hex formula for decoding is:
(H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000
where H is the first (high) surrogate and L is the second (low) surrogate.
So \ud83e\udd14 becomes U+1F914.
Apparently JSON string decoders must anticipate UTF-16 surrogate pairs in the \u escapes, even when the document itself is encoded in UTF-8.
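A small sketch of that formula in Python (the function name is mine, purely illustrative):

def combine_surrogates(high: int, low: int) -> int:
    """(H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000"""
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

print(hex(combine_surrogates(0xD83E, 0xDD14)))   # 0x1f914
print(chr(combine_surrogates(0xD83E, 0xDD14)))   # 🤔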

Why does USB use UTF-16 for strings (why not UTF-8)?

UTF-16 requires 2 bytes and UTF-8 requires 1 byte.
And USB is 8-bit oriented, so UTF-8 seems more natural.
UTF-8 is backward compatible with ASCII, UTF-16 isn't.
UTF-16 requires 2 bytes, so it could have endianness problems.
(An endianness problem did occur; it was later clarified by the USB-IF as little-endian.)
UTF-16 and UTF-8 are functionally equivalent,
so why UTF-16? Why not UTF-8?
Comparison of UTF-16 and UTF-8:
https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16
UTF-16 requires 2 bytes and UTF-8 requires 1 byte.
This is wrong on both counts. Both UTF-8 and UTF-16 are variable-length encodings. You might be thinking of UCS-2 instead (UTF-16's predecessor), which did indeed use only 2 bytes (and as such was limited to codepoints up to U+FFFF only).
UTF-8 uses 1 byte for codepoints U+0000 - U+007F, 2 bytes for codepoints U+0080 - U+07FF, 3 bytes for U+0800 - U+FFFF, and 4 bytes for codepoints U+10000 - U+10FFFF.
UTF-16 uses 2 bytes for codepoints U+0000 - U+FFFF, and 4 bytes for codepoints U+10000 - U+10FFFF.
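Those boundaries are easy to check (a quick Python sketch, purely illustrative):

# Encoded length in bytes for codepoints at the edges of the ranges listed above.
for cp in (0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF, 0x10000, 0x10FFFF):
    ch = chr(cp)
    print(f"U+{cp:06X}: UTF-8 = {len(ch.encode('utf-8'))} bytes, "
          f"UTF-16 = {len(ch.encode('utf-16-le'))} bytes")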
And USB is 8-bit oriented, so UTF-8 seems more natural.
Not really. If you take into account the byte sizes mentioned above, UTF-16 actually handles more codepoints with fewer code units than UTF-8 does. But in any case, USB cares more about binary data than human-readable text data. Even the Unicode strings in USB descriptors are prefixed with a byte count, not a character count. So the designers of USB could have used any encoding they wanted, as long as they standardized it. They chose UTF-16LE.
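For example, a USB string descriptor is little more than a byte count, a descriptor type and UTF-16LE code units; here is a rough Python sketch of that layout (the helper name is mine, not from the spec):

def string_descriptor(s: str) -> bytes:
    """bLength (total size in bytes), bDescriptorType (0x03 = STRING), then UTF-16LE data."""
    data = s.encode("utf-16-le")
    return bytes([2 + len(data), 0x03]) + data

print(string_descriptor("USB").hex(" "))   # 08 03 55 00 53 00 42 00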
Why? Ask the designers. My guess (and this is just a guess) is because Microsoft co-authored the USB 1.0 specification, and UCS-2 (now UTF-16LE) was Microsoft's encoding of choice for Windows, so they probably wanted to maintain compatibility without involving a lot of runtime conversions. Back then, Windows had almost 90% of the PC market, whereas other OSes, particularly *Nix, only had like 5%. Windows 98 was the first Windows version to have USB baked directly in the OS (USB was an optional add-on in Windows 95), but even then, USB was already becoming popular in PCs before Apple eventually added USB support to iMacs a few years later.
Besides, and probably more important, back then UTF-8 was still relatively new (it was only a few years old when USB 1.0 was authored), while UCS-2 had been around for a while and was the primary Unicode encoding at the time (Unicode would not exceed 65,536 codepoints for a few more years). So it probably made sense at the time to have USB support international text by using UCS-2 (later UTF-16LE) instead of UTF-8. If they had decided on an 8-bit encoding instead, ISO-8859-1 probably would have made more sense than UTF-8 (but by today's standards, ISO-8859-1 doesn't cut it anymore). And by the time Unicode did finally break the 65,536-codepoint limit of UCS-2, it was too late to change the encoding to something else without breaking backwards compatibility. At least UTF-16 is backwards compatible with UCS-2 (which is the same reason why Windows still uses UTF-16 and has not switched to UTF-8 like some other OSes have).
UTF-8 is backward compatible with ASCII, UTF-16 isn't.
True.
UTF-16 requires 2 bytes, so it could have endianness problems.
True. Same with UTF-32, for that matter.

What Character Encoding is best for multinational companies

If you had a website that was to be translated into every language in the world, and therefore had a database with all these translations, what character encoding would be best? UTF-128?
If so do all browsers understand the chosen encoding?
Is character encoding straightforward to implement, or are there hidden factors?
Thanks in advance.
If you want to support a variety of languages for web content, you should use an encoding that covers the entire Unicode range. The best choice for this purpose is UTF-8. UTF-8 is the preferred encoding for the web; from the HTML5 draft standard:
Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]
Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]
UTF-8 and Windows-1252 are the only encodings required to be supported by browsers, and UTF-8 and UTF-16 are the only encodings required to be supported by XML parsers. UTF-8 is thus the only common encoding that everything is required to support.
The following is more of an expanded response to Liv's answer than an answer on its own; it's a description of why UTF-8 is preferable to UTF-16 even for CJK content.
For characters in the ASCII range, UTF-8 is more compact (1 byte vs 2) than UTF-16. For characters between the ASCII range and U+07FF (which includes Latin Extended, Cyrillic, Greek, Arabic, and Hebrew), UTF-8 also uses two bytes per character, so it's a wash. For characters outside the Basic Multilingual Plane, both UTF-8 and UTF-16 use 4 bytes per character, so it's a wash there.
The only range in which UTF-16 is more efficient than UTF-8 is for characters from U+07FF to U+FFFF, which includes Indic alphabets and CJK. Even for a lot of text in that range, UTF-8 winds up being comparable, because the markup of that text (HTML, XML, RTF, or what have you) is all in the ASCII range, for which UTF-8 is half the size of UTF-16.
For example, if I pick a random web page in Japanese, the home page of nhk.or.jp, it is encoded in UTF-8. If I transcode it to UTF-16, it grows to almost twice its original size:
$ curl -o nhk.html 'http://www.nhk.or.jp/'
$ iconv -f UTF-8 -t UTF-16 nhk.html > nhk.16.html
$ ls -al nhk*
-rw-r--r-- 1 lambda lambda 32416 Mar 13 13:06 nhk.16.html
-rw-r--r-- 1 lambda lambda 18337 Mar 13 13:04 nhk.html
UTF-8 is better in almost every way than UTF-16. Both of them are variable width encodings, and so have the complexity that entails. In UTF-16, however, 4 byte characters are fairly uncommon, so it's a lot easier to make fixed width assumptions and have everything work until you run into a corner case that you didn't catch. An example of this confusion can be seen in the encoding CESU-8, which is what you get if you convert UTF-16 text into UTF-8 by just encoding each half of a surrogate pair as a separate character (using 6 bytes per character; three bytes to encode each half of the surrogate pair in UTF-8), instead of decoding the pair to its codepoint and encoding that into UTF-8. This confusion is common enough that the mistaken encoding has actually been standardized so that at least broken programs can be made to interoperate.
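To make the byte-level difference concrete, here is a Python sketch that reproduces the CESU-8 pattern by hand (Python has no CESU-8 codec, so this abuses the surrogatepass error handler purely for illustration):

s = "🤔"                                    # U+1F914, outside the Basic Multilingual Plane
print(s.encode("utf-8").hex(" "))           # f0 9f a4 94        -- correct UTF-8, 4 bytes
# The CESU-8 mistake: UTF-8-encode each UTF-16 surrogate half separately (6 bytes).
cesu8 = ("\ud83e".encode("utf-8", "surrogatepass")
         + "\udd14".encode("utf-8", "surrogatepass"))
print(cesu8.hex(" "))                       # ed a0 be ed b4 94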
UTF-8 is much smaller than UTF-16 for the vast majority of content, and if you're concerned about size, compressing your text will always do better than just picking a different encoding. UTF-8 is compatible with APIs and data structures that use a null-terminated sequence of bytes to represent strings, so as long as your APIs and data structures either don't care about encoding or can already handle different encodings in their strings (such as most C and POSIX string handling APIs), UTF-8 can work just fine without having to have a whole new set of APIs and data structures for wide characters. UTF-16 doesn't specify endianness, so it makes you deal with endianness issues; actually there are three different related encodings, UTF-16, UTF-16BE, and UTF-16LE. UTF-16 could be either big endian or little endian, and so requires a BOM to specify. UTF-16BE and LE are big and little endian versions, with no BOM, so you need to use an out-of-band method (such as a Content-Type HTTP header) to signal which one you're using, but out-of-band headers are notorious for being wrong or missing.
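A quick Python sketch of those last two points, the embedded zero bytes and the BOM/endianness variants (output shown as on a little-endian machine):

s = "A"
print(s.encode("utf-8").hex(" "))      # 41           -- no BOM, no zero bytes
print(s.encode("utf-16-le").hex(" "))  # 41 00        -- the 00 byte breaks NUL-terminated string APIs
print(s.encode("utf-16-be").hex(" "))  # 00 41
print(s.encode("utf-16").hex(" "))     # ff fe 41 00  -- plain "UTF-16" prepends a BOM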
UTF-16 is basically an accident that happened because people at first thought 16 bits would be enough to encode all of Unicode, and so started changing their representations and APIs to use wide (16-bit) characters. When they realized they would need more codepoints, they came up with a scheme for encoding the new codepoints as pairs of reserved 16-bit code units, so they could still use the same data structures for the new encoding. This brought all of the disadvantages of a variable-width encoding like UTF-8, without most of the advantages.
UTF-8 is the de facto standard character encoding for Unicode.
UTF-8 is like UTF-16 and UTF-32 in that it can represent every character in the Unicode character set. But unlike UTF-16 and UTF-32, it has the advantages of being backward-compatible with ASCII and of avoiding the complications of endianness and the resulting need for byte order marks (BOM). For these and other reasons, UTF-8 has become the dominant character encoding for the World-Wide Web, accounting for more than half of all Web pages.
There is no such thing as UTF-128.
You need to take more into consideration when dealing with this.
For instance, you can represent Chinese, Japanese and pretty much everything in UTF-8 -- but each such "foreign" character becomes a multi-byte sequence rather than a single byte -- and as such your data representation may take more storage due to these extra lead and continuation bytes. You could look at UTF-16 as well, which doesn't need those extra bytes for the likes of Chinese, Japanese and so on -- however, each character now takes 2 bytes to represent; so if you're dealing mainly with Latin charsets you've just doubled the size of your data storage with no benefit. There's also Shift-JIS, dedicated to Japanese, which represents that character set more compactly than UTF-8 or UTF-16, but then you lose coverage of scripts beyond Japanese.
I would say: if you know upfront you will have a lot of non-Latin characters, consider UTF-16; if you're mainly dealing with accents and Latin chars, use UTF-8; if you won't be using any Latin characters at all, then consider Shift-JIS and the like.

What is a multibyte character set?

Does the term multibyte refer to a character set whose characters can be - but don't have to be - wider than 1 byte (e.g. UTF-8), or does it refer to character sets whose characters are always wider than 1 byte (e.g. UTF-16)? In other words: what is meant when anybody talks about multibyte character sets?
The term is ambiguous, but in my internationalization work we typically avoided the term "multibyte character set" for Unicode-based encodings. Generally, we used the term only for legacy encoding schemes in which a character is defined by one or more bytes (excluding encodings that only ever use one byte per character).
Shift-JIS, JIS, EUC-JP and EUC-KR, along with the various Chinese encodings, are typically included.
Most of the legacy encodings, with some exceptions, require a sort of state machine model (or, more simply, a page swapping model) to process, and moving backwards in a text stream is complicated and error-prone. UTF-8 and UTF-16 do not suffer from this problem, as UTF-8 can be tested with a bitmask and UTF-16 can be tested against a range of surrogate pairs, so moving backward and forward in a non-pathological document can be done safely without major complexity.
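For example, stepping backwards over UTF-8 needs nothing more than the bitmask test mentioned above; a minimal Python sketch over raw bytes (the helper name is made up):

def previous_char_start(buf: bytes, i: int) -> int:
    """Step back from byte offset i to the start of the previous UTF-8 sequence."""
    i -= 1
    while i > 0 and (buf[i] & 0xC0) == 0x80:   # 0b10xxxxxx marks a continuation byte
        i -= 1
    return i

data = "aé🙂".encode("utf-8")                  # 1-byte, 2-byte and 4-byte sequences
print(previous_char_start(data, len(data)))    # 3 -> start of the 4-byte emoji sequence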
A few legacy encodings, for languages like Thai and Vietnamese, have some of the complexity of multibyte character sets but are really just built on combining characters, and aren't generally lumped in with the broad term "multibyte."
What is meant if anybody talks about multibyte character sets?
That, as usual, depends on who is doing the talking!
Logically, it should include UTF-8, Shift-JIS, GB etc.: the variable-length encodings. UTF-16 would often not be considered in this group (even though it kind of is, what with the surrogates; and certainly it's multiple bytes when encoded into bytes via UTF-16LE/UTF-16BE).
But in Microsoftland the term would more typically be used to mean a variable-length default system codepage (for legacy non-Unicode applications, of which there are sadly still plenty). In this usage, UTF-8 and UTF-16LE/UTF-16BE cannot be included because the system codepage on Windows cannot be set to either of these encodings.
Indeed, in some cases “mbcs” is no more than a synonym for the system codepage, otherwise known (even more misleadingly) as “ANSI”. In this case a “multibyte” character set could actually be something as trivial as cp1252 Western European, which only uses one byte per character!
My advice: use “variable-length” when you mean that, and avoid the ambiguous term “multibyte”; when someone else uses it you'll need to ask for clarification, but typically someone with a Windows background will be talking about a legacy East Asian codepage like cp932 (Shift-JIS) and not a UTF.
All character sets where you don't have a 1 byte = 1 character mapping. All Unicode variants, but also Asian character sets, are multibyte.
For more information, I suggest reading this Wikipedia article.
A multibyte character means a character whose encoding requires more than 1 byte. This does not imply, however, that all characters in that particular encoding have the same width (in terms of bytes). E.g. UTF-8- and UTF-16-encoded characters may use multiple bytes, whereas all UTF-32-encoded characters always use 32 bits.
References:
IBM: Multibyte Characters
Unicode and MultiByte Character Set (archived), Unicode and Multibyte Character Set (MBCS) Support | Microsoft Docs
Unicode Consortium Website
A multibyte character set may consist of both one-byte and two-byte characters. Thus a multibyte-character string may contain a mixture of single-byte and double-byte characters.
Ref: Single-Byte and Multibyte Character Sets
UTF-8 is multi-byte, which means that each English (ASCII) character is stored in 1 byte while non-English characters like Chinese or Thai are stored in 3 bytes. When you mix Chinese or Thai with English, as in "ทt", the first Thai character "ท" uses 3 bytes while the second, English character "t" uses only 1 byte. The designers of multi-byte encodings realized that English characters shouldn't take 3 bytes when they fit in 1 byte, because that would waste storage space.
UTF-16 stores each character, English or not, in 16-bit code units (2 bytes for everything in the BMP, 4 bytes as a surrogate pair for anything beyond it), so it is usually treated not as multi-byte but as a "wide character" encoding. It is convenient for Chinese or Thai text, where each character fits in 2 bytes, but printing to UTF-8 console output needs a conversion from the wide-character to the multi-byte format, e.g. with wcstombs().
UTF-32 stores each character in a fixed 4-byte length, but it is rarely used for storage because of the wasted space.
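The byte counts for the "ทt" example above are easy to verify (a quick Python sketch):

print(len("ท".encode("utf-8")))        # 3 bytes for U+0E17
print(len("t".encode("utf-8")))        # 1 byte for ASCII
print(len("ทt".encode("utf-8")))       # 4 bytes total in UTF-8
print(len("ทt".encode("utf-16-le")))   # 4 bytes in UTF-16 (2 per character)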
Typically the former, i.e. UTF-8-like. For more info, see Variable-width encoding.
The former - although the term "variable-length encoding" would be more appropriate.
I generally use it to refer to any character set in which a character can take more than one byte.
