What Character Encoding is best for multinational companies - utf-8

If you had a website that was to be translated into every language in the world and therefore had a database with all these translations what character encoding would be best? UTF-128?
If so do all browsers understand the chosen encoding?
Is character encoding straightforward to implement or are there hidden factors?
Thanks in advance.

If you want to support a variety of languages for web content, you should use an encoding that covers the entire Unicode range. The best choice for this purpose is UTF-8. UTF-8 is the preferred encoding for the web; from the HTML5 draft standard:
Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]
Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]
UTF-8 and Windows-1252 are the only encodings required to be supported by browsers, and UTF-8 and UTF-16 are the only encodings required to be supported by XML parsers. UTF-8 is thus the only common encoding that everything is required to support.
The following is more of an expanded response to Liv's answer than an answer on its own; it's a description of why UTF-8 is preferable to UTF-16 even for CJK content.
For characters in the ASCII range, UTF-8 is more compact (1 byte vs 2) than UTF-16. For characters between the ASCII range and U+07FF (which includes Latin Extended, Cyrillic, Greek, Arabic, and Hebrew), UTF-8 also uses two bytes per character, so it's a wash. For characters outside the Basic Multilingual Plane, both UTF-8 and UTF-16 use 4 bytes per character, so it's a wash there.
The only range in which UTF-16 is more efficient than UTF-8 is for characters from U+0800 to U+FFFF, which includes Indic alphabets and CJK. Even for a lot of text in that range, UTF-8 winds up being comparable, because the markup of that text (HTML, XML, RTF, or what have you) is all in the ASCII range, for which UTF-8 is half the size of UTF-16.
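The per-range byte counts above are easy to check. Here is a small Python sketch; the sample characters, the markup snippet, and the helper name `encoded_sizes` are my own choices for illustration, not part of the original answer.

```python
# Sample characters, one per range discussed above (chosen for illustration).
SAMPLES = {
    "ASCII (U+0041)": "A",
    "Cyrillic (U+0416)": "\u0416",
    "CJK (U+65E5)": "\u65e5",
    "Outside the BMP (U+1F914)": "\U0001F914",
}

def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for label, ch in SAMPLES.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")

# Markup is ASCII, so even a short tag around CJK text tips the total
# back toward UTF-8.
snippet = '<p class="lead">\u65e5\u672c\u8a9e</p>'
print("markup + CJK:", encoded_sizes(snippet))
```

Only the CJK sample comes out smaller in UTF-16, and the advantage disappears once a little markup is wrapped around it.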
For example, if I pick a random web page in Japanese, the home page of nhk.or.jp, it is encoded in UTF-8. If I transcode it to UTF-16, it grows to almost twice its original size:
$ curl -o nhk.html 'http://www.nhk.or.jp/'
$ iconv -f UTF-8 -t UTF-16 nhk.html > nhk.16.html
$ ls -al nhk*
-rw-r--r-- 1 lambda lambda 32416 Mar 13 13:06 nhk.16.html
-rw-r--r-- 1 lambda lambda 18337 Mar 13 13:04 nhk.html
UTF-8 is better in almost every way than UTF-16. Both of them are variable width encodings, and so have the complexity that entails. In UTF-16, however, 4 byte characters are fairly uncommon, so it's a lot easier to make fixed width assumptions and have everything work until you run into a corner case that you didn't catch. An example of this confusion can be seen in the encoding CESU-8, which is what you get if you convert UTF-16 text into UTF-8 by just encoding each half of a surrogate pair as a separate character (using 6 bytes per character; three bytes to encode each half of the surrogate pair in UTF-8), instead of decoding the pair to its codepoint and encoding that into UTF-8. This confusion is common enough that the mistaken encoding has actually been standardized so that at least broken programs can be made to interoperate.
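To make the CESU-8 mix-up concrete, here is a rough Python sketch. The function name `cesu8_encode` is hypothetical (Python has no built-in CESU-8 codec); it imitates the mistake by UTF-8-encoding each UTF-16 code unit on its own, using the standard `surrogatepass` error handler to let the surrogate halves through.

```python
import struct

def cesu8_encode(text):
    """Encode text the CESU-8 way: UTF-8-encode each UTF-16 code unit separately."""
    utf16 = text.encode("utf-16-be")
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    # "surrogatepass" lets Python emit the 3-byte form for a lone surrogate,
    # which strict UTF-8 forbids.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

ch = "\U0001F914"  # a character outside the BMP
print("UTF-8 :", ch.encode("utf-8").hex(" "))   # 4 bytes
print("CESU-8:", cesu8_encode(ch).hex(" "))     # 6 bytes, 3 per surrogate half
```

For text that stays inside the BMP the two results are identical, which is exactly why the bug tends to go unnoticed.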
UTF-8 is much smaller than UTF-16 for the vast majority of content, and if you're concerned about size, compressing your text will always do better than just picking a different encoding. UTF-8 is compatible with APIs and data structures that use a null-terminated sequence of bytes to represent strings, so as long as your APIs and data structures either don't care about encoding or can already handle different encodings in their strings (such as most C and POSIX string handling APIs), UTF-8 can work just fine without having to have a whole new set of APIs and data structures for wide characters. UTF-16 doesn't specify endianness, so it makes you deal with endianness issues; actually there are three different related encodings, UTF-16, UTF-16BE, and UTF-16LE. UTF-16 could be either big endian or little endian, and so requires a BOM to specify. UTF-16BE and LE are big and little endian versions, with no BOM, so you need to use an out-of-band method (such as a Content-Type HTTP header) to signal which one you're using, but out-of-band headers are notorious for being wrong or missing.
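The endianness point can be seen with Python's built-in codecs. This is only a sketch; the sample string is arbitrary, and the byte order chosen by the plain "utf-16" codec depends on the machine it runs on.

```python
import codecs

text = "hi"

# The plain "utf-16" codec writes a BOM, then uses the platform's byte order.
with_bom = text.encode("utf-16")
# The BE/LE variants write no BOM; the reader has to know the order already.
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

print("utf-16   :", with_bom.hex(" "))
print("utf-16-be:", big.hex(" "))
print("utf-16-le:", little.hex(" "))

# Guessing the wrong order does not fail here; it silently yields
# different (valid) characters.
print("BE bytes read as LE:", repr(big.decode("utf-16-le")))

# UTF-8 has a single byte order, so there is nothing to signal.
print("utf-8    :", text.encode("utf-8").hex(" "))
```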
UTF-16 is basically an accident, that happened because people thought at first that 16 bits would be enough to encode all of Unicode, and so started changing their representation and APIs to use wide (16 bit) characters. When they realized they would need more characters, they came up with a scheme for using some reserved values to encode the additional codepoints (up to U+10FFFF) using two code units, so they could still use the same data structures for the new encoding. This brought all of the disadvantages of a variable-width encoding like UTF-8, without most of the advantages.

UTF-8 is the de facto standard character encoding for Unicode.
UTF-8 is like UTF-16 and UTF-32, because it can represent every character in the Unicode character set. But unlike UTF-16 and UTF-32, it has the advantage of being backward-compatible with ASCII. It also avoids the complications of endianness and the resulting need to use byte order marks (BOM). For these and other reasons, UTF-8 has become the dominant character encoding for the World-Wide Web, accounting for more than half of all Web pages.
There is no such thing as UTF-128.

You need to take more into consideration when dealing with this.
For instance you can represent Chinese, Japanese and pretty much everything in UTF-8 -- but it will use a set of escape characters for each such "foreign" character -- and as such your data representation might take a lot of storage due to these extra markers. You could look at UTF-16 as well, which doesn't need escapes/markers for the likes of Chinese, Japanese and so on -- however, each character now takes 2 bytes to represent; so if you're dealing mainly with Latin charsets you've just doubled the size of your data storage with no benefit. There's also Shift-JIS, dedicated to Japanese, which represents that charset better than UTF-8 or UTF-16, but then you don't have support for Latin chars.
I would say, if you know upfront you will have a lot of foreign characters, consider UTF-16; if you're mainly dealing with accents and Latin chars, use UTF-8; if you won't be using any Latin characters then consider Shift-JIS and the like.

Related

Can I represent arbitrary binary data in WTF-8 (or any extension of UTF-8)?

A long time ago, there was a two-byte Unicode encoding UCS-2, but then it was determined that two bytes are sometimes not enough. In order to cram more codepoints into 16 bit, surrogate pairs were introduced in UTF-16. Since Windows started out with UCS-2, it doesn't enforce rules around surrogate pairs in some places, most notably file systems.
Programs that want to use UTF-8 internally have a problem now dealing with these invalid UTF-16 sequences. For this, WTF-8 was developed. It is mostly relaxed UTF-8, but it is able to round-trip invalid surrogate pairs.
Now it seems like it should be possible to relax UTF-8 a bit further, and allow it to represent arbitrary binary data, round-tripping safe. The strings I am thinking about are originally 99.9% either valid UTF-8, or almost valid UTF-16 of the kind WTF-8 can stomach. But occasionally there will be invalid byte sequences thrown in.
WTF-8 defines generalized UTF-8 as:
an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII).
Would generalized UTF-8 allow me to store arbitrary 32 bit sequences, and thus arbitrary data? Or is there another way, such as a unicode escape character? Things I don't want to do are base64 encoding or percent-encoding, since I want to leave valid unicode strings unchanged.
Standard disclaimer: I encountered this problem a couple times before, but right now it is an academic question, and I'm just interested in a straight answer how to do this. There is no XY problem :-)

What are surrogate characters in UTF-8?

I have a strange validation program that validates whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It will compare each subdomain with sets of characters defined by their hex byte representation. Two such sets are D800-DB7F and DC00-DFFF. The PHP regexp comparison function preg_match fails during these comparisons and says that DC00-DFFF characters are not allowed in this function. From Wikipedia I learned these bytes are called surrogate characters in UTF-8. What are they, and which characters do they actually correspond to? I have read about them in several places, but I still don't understand what they are.
What are surrogate characters in UTF-8?
This is almost like a trick question.
Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).
Approximate answer #2: Invalid (if not paired).
Approximate answer #3: It's not UTF-8; it's Modified UTF-8.
Synopsis: The term doesn't apply to UTF-8.
Unicode codepoints have a range that needs 21 bits of data.
UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as pairs of two code units, the first from a "high" range, the second from a "low" range. Unicode reserves the codepoints that match the ranges of the high and low pairs as invalid. They are sometimes called surrogates but they are not characters. They don't mean anything by themselves.
UTF-8 code units are 8 bits. UTF-8 encodes several distinct ranges of codepoints in one to four code units, respectively.
#1 It happens that the codepoints that UTF-16 encodes with two 16-bit code units, UTF-8 encodes with 4 8-bit code units, and vice versa.
#2 You can apply the UTF-8 encoding algorithm to the invalid codepoints, which is invalid. They can't be decoded to a valid codepoint. A compliant reader would throw an exception or throw out the bytes and insert a replacement character (�).
#3 Java provides a way of implementing functions in external code with a system called JNI. The Java String API provides access to String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values are modified UTF-8. Modified UTF-8 is the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints.
Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be considered text, you must know the encoding; otherwise, you have data loss.
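Points #1 and #2 can be checked with a strict UTF-8 codec. The sketch below uses Python; the variable names are mine, and the `surrogatepass` error handler is used only to produce the invalid bytes on purpose.

```python
pair = "\U0001F914"   # one codepoint that UTF-16 encodes as a surrogate pair
print("UTF-16 code units:", len(pair.encode("utf-16-le")) // 2)
print("UTF-8 code units :", len(pair.encode("utf-8")))

lone = "\ud83e"       # an unpaired high surrogate: not a character
try:
    lone.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print("strict encode rejected the lone surrogate:", encode_failed)

# Bytes produced by forcing the UTF-8 algorithm onto U+D83E anyway.
forced = lone.encode("utf-8", "surrogatepass")
try:
    forced.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print("strict decode rejected the bytes:", decode_failed)

# A lenient reader substitutes U+FFFD instead of raising.
print(repr(forced.decode("utf-8", errors="replace")))
```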

How to Decode UTF-8 Text Sequence \ud83e\udd14

I'm reading UTF-8 text that contains "\ud83e\udd14". Reading the specification, it says that U+D800 to U+DFFF are not used. Yet if I run this through a decoder such as Microsoft's System.Web.Helpers.Json.Decode, it yields the correct result of an emoticon of a face with a tongue hanging out. The text originates through Twitter's search api.
My question: how should this sequence be decoded? I'm looking for what the final hex sequence would be and how it is obtained. Thanks for any guidance. If my question isn't clear, please let me know and I will try to improve it.
You are coming at this from an interesting perspective. The first thing to note is that you're dealing with two levels of text: a JSON document and a string within it.
Synopsis: You don't need to write code to decode it. Use a library that deserializes JSON into objects, such as Newtonsoft's JSON.Net.
But, first, Unicode. Unicode is a character set with a bit of a history. Unlike almost every character set, 1) it has more than one encoding, and 2) it is still growing. A couple of decades ago, it had <65536 codepoints and that was thought to be enough. So, encoding each codepoint as a 2-byte integer was the plan. It was called UCS-2 or, simply, the Unicode encoding. (Microsoft has stuck with Encoding.Unicode in .NET, which causes some confusion.)
Aside: Codepoints are identified for discussion using the U+ABCD (hexadecimal) format.
Then the Unicode consortium decided to add more codepoints: all the way to U+10FFFF. For that, encodings need at least 21 bits. UTF-32, integers with 32 bits, is an obvious solution but not very dense. So, encodings that use a variable number of code units were invented. UTF-8 uses one to four 8-bit code units, depending on the codepoint.
But a lot of languages were adopting UCS-2 in the 1990s. Documents, of course, can be transformed at will but code that processes UCS-2 would break without a compatible encoding for the expanded character set. Since U+D800 to U+DFFF were unassigned, UCS-2 could stay the same and those "surrogate codepoints" could be used to encode new codepoints. The result is UTF-16. Each codepoint is encoded in one or two 16-bit code units. So, programs that processed UCS-2 could automatically process UTF-16 as long as they didn't need to understand it. Programs written in the same system could be considered to be processing UTF-16, especially with libraries that do understand it. There is still the hazard of things like string length giving the number of UTF-16 code units rather than the number of codepoints, but it has otherwise worked out well.
As for the \ud83e\udd14 notation, languages that use Unicode in their syntax or literal strings needed a way to accept source files in a non-Unicode encoding and still support all the Unicode codepoints. Being designed in the 1990s, they simply wrote the UCS-2 code units in hexadecimal. Of course, that too was extended to UTF-16. This UTF-16 code unit escape syntax allows intermediary systems to handle source code files with a non-Unicode encoding.
Now, JSON is based on JavaScript and JavaScript's strings are sequences of UTF-16 code units. So JSON has adopted the UTF-16 code unit escape syntax from JavaScript. However, it's not very useful (unless you have to deal with intermediary systems that can't be made to use UTF-8 or treat files they don't understand as binary). The old JSON standard requires JSON documents exchanged between systems to be encoded with UTF-8, UTF-16 or UTF-32. The new RFC 8259 requires UTF-8.
So, you don't have "UTF-8 text", you have Unicode text encoded with UTF-8. The text itself is a JSON document. JSON documents have names and values that are Unicode text as sequences of UTF-16 code units with escapes allowed. Your document has the codepoint U+1F914 written, not as "🤔" but as "\ud83e\udd14".
There are plenty of libraries that transform JSON to objects so you shouldn't need to decode the names or values in a JSON document. To do it manually, you'd recognize the escape prefix and take the next 4 hex digits as the value of a surrogate, extract its data bits, then combine them with the bits from the paired surrogate that should follow.
Thought I'd read up on UTF-16 to see if it gave me any clues, and it turns out this is what it calls a surrogate pair. The hex formula for decoding is:
(H - D800) * 400 + (L - DC00) + 10000
where H is the first (high) codepoint and L is the second (low) codepoint.
So \ud83e\udd14 becomes 1f914
Apparently UTF-8 decoders must anticipate UTF-16 surrogate pairs.
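For anyone who wants to see the arithmetic run, here is the surrogate-pair formula as a short Python sketch. The function name `combine_surrogates` is made up for this example; the last lines use Python's standard `json` module to show that a JSON parser does the same pairing when it reads the escapes.

```python
import json

def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

codepoint = combine_surrogates(0xD83E, 0xDD14)
print(f"U+{codepoint:X}")                              # the decoded codepoint
print(chr(codepoint).encode("utf-8").hex(" "))         # its UTF-8 bytes

# The JSON text contains the two escapes; the parser returns one character.
decoded = json.loads('"\\ud83e\\udd14"')
print(decoded == chr(codepoint))
```

So the final hex sequence depends on the target encoding: the codepoint is U+1F914, and its UTF-8 form is the four bytes f0 9f a4 94.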

Which Languages Does UTF-8 Not Support?

I'm working on internationalizing one of my programs for work. I'm trying to use foresight to avoid possible issues or redoing the process down the road.
I see references for UTF-8, UTF-16 and UTF-32. My question is two parts:
What languages does UTF-8 not support?
What advantages do UTF-16 and UTF-32 have over UTF-8?
If UTF-8 works for everything, then I'm curious what the advantage of UTF-16 and UTF-32 are (e.g. special search features in a database, etc) Having the understanding should help me finish designing my program (and database connections) properly. Thanks!
All three are just different ways to represent the same thing, so there are no languages supported by one and not another.
Sometimes UTF-16 is used by a system that you need to interoperate with - for instance, the Windows API uses UTF-16 natively.
In theory, UTF-32 can represent any "character" in a single 32-bit integer without ever needing to use more than one, whereas UTF-8 and UTF-16 need to use more than one 8-bit or 16-bit integer to do that. But in practise, with combining and non-combining variants of some codepoints, that's not really true.
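A quick way to see the combining-character caveat, sketched in Python with the standard `unicodedata` module (the choice of "é" as the sample is mine):

```python
import unicodedata

precomposed = "\u00e9"      # "é" as a single codepoint
combining = "e\u0301"       # "e" followed by COMBINING ACUTE ACCENT

# Both render as the same user-visible character, yet even in UTF-32 the
# second form needs two code units.
for s in (precomposed, combining):
    units = len(s.encode("utf-32-le")) // 4
    print(repr(s), "->", units, "UTF-32 code unit(s)")

# Normalization (NFC here) maps the combining form to the precomposed one.
print(unicodedata.normalize("NFC", combining) == precomposed)
```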
One advantage of UTF-8 over the others is that if you have a bug whereby you're assuming that the number of 8-, 16- or 32-bit integers respectively is the same as the number of codepoints, it becomes obvious more quickly with UTF-8 - something will fail as soon as you have any non-ASCII codepoint in there, whereas with UTF-16 the bug can go unnoticed.
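The "bug shows up sooner" argument can be illustrated by counting code units in each encoding. This is a minimal Python sketch; the helper name `code_unit_counts` and the sample strings are assumptions for the example.

```python
def code_unit_counts(text):
    """Return counts of codepoints and of UTF-8, UTF-16, UTF-32 code units."""
    return {
        "codepoints": len(text),
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")) // 2,
        "utf32": len(text.encode("utf-32-le")) // 4,
    }

# ASCII only, one accented Latin letter, one character outside the BMP.
for sample in ("hello", "h\u00e9llo", "h\U0001F914llo"):
    print(repr(sample), code_unit_counts(sample))
```

With the accented letter, the UTF-8 count already differs from the codepoint count, while the UTF-16 count only diverges once a character outside the BMP appears.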
To answer your first question, here's a list of scripts currently unsupported by Unicode: http://www.unicode.org/standard/unsupported.html
UTF-8 is variable, 1 to 4 bytes; UTF-16 is 2 or 4 bytes; UTF-32 is a fixed 4 bytes.
That is why UTF-8 has an advantage where ASCII characters are most prevalent, UTF-16 is better where ASCII is not predominant, and UTF-32 covers all possible characters in 4 bytes.

What is a multibyte character set?

Does the term multibyte refer to a charset whose characters can - but don't have to be - wider than 1 byte, (e.g. UTF-8) or does it refer to character sets which are in any case wider than 1 byte (e.g. UTF-16) ? In other words: What is meant if anybody talks about multibyte character sets?
The term is ambiguous, but in my internationalization work, we typically avoided the term "multibyte character sets" to refer to Unicode-based encodings. Generally, we used the term only for legacy encoding schemes that had one or more bytes to define each character (excluding encodings that require only one byte per character).
Shift-jis, jis, euc-jp, euc-kr, along with Chinese encodings are typically included.
Most of the legacy encodings, with some exceptions, require a sort of state machine model (or, more simply, a page swapping model) to process, and moving backwards in a text stream is complicated and error-prone. UTF-8 and UTF-16 do not suffer from this problem, as UTF-8 can be tested with a bitmask and UTF-16 can be tested against a range of surrogate pairs, so moving backward and forward in a non-pathological document can be done safely without major complexity.
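As a rough illustration of the bitmask and range tests mentioned above, here is a Python sketch. The function names are hypothetical, and the backward scan assumes well-formed UTF-8 with the index already on a character boundary.

```python
def previous_char_start(data, index):
    """Return the start index of the UTF-8 character that ends just before `index`.

    Assumes `data` is well-formed UTF-8 and 0 < index <= len(data), with
    `index` on a character boundary.
    """
    i = index - 1
    # Continuation bytes look like 10xxxxxx; skip back over them.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

def is_low_surrogate(unit):
    """True if a 16-bit code unit is the second half of a UTF-16 surrogate pair."""
    return 0xDC00 <= unit <= 0xDFFF

# Walk a mixed string backwards, one character at a time.
data = "a\u65e5\U0001F914".encode("utf-8")
end = len(data)
while end > 0:
    start = previous_char_start(data, end)
    print(start, data[start:end].decode("utf-8"))
    end = start
```

No state has to be carried from earlier in the stream: each byte (or each 16-bit unit) says on its own whether it starts a character.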
A few legacy encodings, for languages like Thai and Vietnamese, have some of the complexity of multibyte character sets but are really just built on combining characters, and aren't generally lumped in with the broad term "multibyte."
What is meant if anybody talks about multibyte character sets?
That, as usual, depends on who is doing the talking!
Logically, it should include UTF-8, Shift-JIS, GB etc.: the variable-length encodings. UTF-16 would often not be considered in this group (even though it kind of is, what with the surrogates; and certainly it's multiple bytes when encoded into bytes via UTF-16LE/UTF-16BE).
But in Microsoftland the term would more typically be used to mean a variable-length default system codepage (for legacy non-Unicode applications, of which there are sadly still plenty). In this usage, UTF-8 and UTF-16LE/UTF-16BE cannot be included because the system codepage on Windows cannot be set to either of these encodings.
Indeed, in some cases “mbcs” is no more than a synonym for the system codepage, otherwise known (even more misleadingly) as “ANSI”. In this case a “multibyte” character set could actually be something as trivial as cp1252 Western European, which only uses one byte per character!
My advice: use “variable-length” when you mean that, and avoid the ambiguous term “multibyte”; when someone else uses it you'll need to ask for clarification, but typically someone with a Windows background will be talking about a legacy East Asian codepage like cp932 (Shift-JIS) and not a UTF.
All character sets where you don't have a 1 byte = 1 character mapping. All Unicode variants, but also Asian character sets, are multibyte.
For more information, I suggest reading this Wikipedia article.
A multibyte character is a character whose encoding requires more than 1 byte. This does not imply, however, that all characters using that particular encoding will have the same width (in terms of bytes). E.g., UTF-8 and UTF-16 encoded characters may sometimes use multiple code units, whereas all UTF-32 encoded characters always use 32 bits.
References:
IBM: Multibyte Characters
Unicode and MultiByte Character Set (archived), Unicode and Multibyte Character Set (MBCS) Support | Microsoft Docs
Unicode Consortium Website
A multibyte character set may consist of both one-byte and two-byte
characters. Thus a multibyte-character string may contain a mixture of
single-byte and double-byte characters.
Ref: Single-Byte and Multibyte Character Sets
UTF-8 is multi-byte, which means that each English (ASCII) character is stored in 1 byte while non-English characters such as Chinese or Thai are stored in 3 bytes. When you mix Chinese/Thai with English, like "ทt", the first Thai character "ท" uses 3 bytes while the second English character "t" uses only 1 byte. The people who designed multi-byte encodings realized that an English character shouldn't be stored in 3 bytes when it can fit in 1 byte, because that wastes storage space.
UTF-16 stores each character in the Basic Multilingual Plane, English or non-English, in 2 bytes (characters outside it take 4 bytes as a surrogate pair), so it is usually called a wide-character encoding rather than a multi-byte one. It is well suited to Chinese/Thai text where each character fits entirely in 2 bytes, but printing to a UTF-8 console requires a conversion from wide characters to the multi-byte format, for example with the function wcstombs().
UTF-32 stores each character in a fixed 4 byte length, but it is rarely used for storage because of the wasted space.
Typically the former, i.e. UTF-8-like. For more info, see Variable-width encoding.
The former - although the term "variable-length encoding" would be more appropriate.
I generally use it to refer to any character that can have more than one byte per character.

Resources