Can I represent arbitrary binary data in WTF-8 (or any extension of UTF-8)?

A long time ago, there was a two-byte Unicode encoding, UCS-2, but then it was determined that two bytes are sometimes not enough. In order to cram more codepoints into 16-bit code units, surrogate pairs were introduced in UTF-16. Since Windows started out with UCS-2, it doesn't enforce the rules around surrogate pairs in some places, most notably file systems.
Programs that want to use UTF-8 internally now have a problem dealing with these invalid UTF-16 sequences. WTF-8 was developed for this: it is mostly a relaxed UTF-8, but it is able to round-trip unpaired surrogates.
Now it seems like it should be possible to relax UTF-8 a bit further and allow it to represent arbitrary binary data in a round-trip-safe way. The strings I am thinking about are originally 99.9% either valid UTF-8 or almost-valid UTF-16 of the kind WTF-8 can stomach. But occasionally there will be invalid byte sequences thrown in.
WTF-8 defines generalized UTF-8 as:
an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII).
Would generalized UTF-8 allow me to store arbitrary 32-bit sequences, and thus arbitrary data? Or is there another way, such as a Unicode escape character? Things I don't want to do are Base64 encoding or percent-encoding, since I want to leave valid Unicode strings unchanged.
Standard disclaimer: I have encountered this problem a couple of times before, but right now it is an academic question, and I'm just interested in a straight answer on how to do this. There is no XY problem :-)
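(For comparison, this kind of round trip has been solved for raw byte strings before: Python's PEP 383 "surrogateescape" error handler maps each undecodable byte 0xXY to the lone surrogate U+DCXY and maps it back on encode, leaving valid UTF-8 untouched. It is not WTF-8, and the sketch below is only an illustration of that idea, but it shows the shape of an answer.)

# Sketch: round-tripping arbitrary bytes through a text string with
# Python's PEP 383 "surrogateescape" error handler. Valid UTF-8 decodes
# normally; each invalid byte 0xXY becomes the lone surrogate U+DCXY.
raw = b"valid text \xc3\xa9 then junk: \xff\xfe\x80"

text = raw.decode("utf-8", errors="surrogateescape")
back = text.encode("utf-8", errors="surrogateescape")

assert back == raw       # the arbitrary bytes survive the round trip
print(ascii(text))       # 'valid text \xe9 then junk: \udcff\udcfe\udc80'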

Related

What are surrogate characters in UTF-8?

I have a strange validation program that validates whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain with sets of characters defined by their hex byte representations. Two such sets are D800-DB7F and DC00-DFFF. PHP's regexp comparison function preg_match fails during these comparisons and says that DC00-DFFF characters are not allowed in this function. From Wikipedia I learned these are called surrogate characters in UTF-8. What are they, and which characters do they actually correspond to? I have read about them in several places, but I still don't understand what they are.
This is almost like a trick question.
Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).
Approximate answer #2: Invalid (if not paired).
Approximate answer #3: It's not UTF-8; It's Modified UTF-8.
Synopsis: The term doesn't apply to UTF-8.
Unicode codepoints have a range that needs 21 bits of data.
UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as pairs of two code units, the first from a "high" range, the second from a "low" range. Unicode reserves the codepoints in those high and low ranges, so they are not valid scalar values. They are sometimes called surrogates, but they are not characters. They don't mean anything by themselves.
UTF-8 code units are 8 bits. UTF-8 encodes several distinct ranges of codepoints in one to four code units, respectively.
#1 It happens that the codepoints that UTF-16 encodes with two 16-bit code units, UTF-8 encodes with 4 8-bit code units, and vice versa.
#2 You can apply the UTF-8 encoding algorithm to those invalid codepoints, but the result is not valid UTF-8. The bytes can't be decoded back to a valid codepoint; a compliant reader would throw an exception or throw out the bytes and insert a replacement character (�).
#3 Java provides a way of implementing functions in external code with a system called JNI. The Java String API provides access to String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values are Modified UTF-8. Modified UTF-8 is the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints.
Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be considered text, you must know the encoding; otherwise, you have data loss.
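To make #2 concrete, here is a small sketch (Python used purely for illustration): its strict UTF-8 codec refuses a lone surrogate, the permissive "surrogatepass" handler emits the generalized-UTF-8 bytes, and a compliant decoder turns those bytes into replacement characters.

# Illustration of #2: surrogate codepoints are not valid in UTF-8.
lone = "\ud800"                        # a lone high surrogate

try:
    lone.encode("utf-8")               # strict UTF-8 refuses it
except UnicodeEncodeError as err:
    print("rejected:", err.reason)     # 'surrogates not allowed'

raw = lone.encode("utf-8", errors="surrogatepass")
print(raw)                             # b'\xed\xa0\x80' -- generalized UTF-8, invalid as UTF-8

# A compliant reader discards the bytes and substitutes U+FFFD:
print(raw.decode("utf-8", errors="replace"))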

How to Decode UTF-8 Text Sequence \ud83e\udd14

I'm reading UTF-8 text that contains "\ud83e\udd14". Reading the specification, it says that U+D800 to U+DFFF are not used. Yet if I run this through a decoder such as Microsoft's System.Web.Helpers.Json.Decode, it yields the correct result of a thinking-face emoji. The text originates from Twitter's search API.
My question: how should this sequence be decoded? I'm looking for what the final hex sequence would be and how it is obtained. Thanks for any guidance. If my question isn't clear, please let me know and I will try to improve it.
You are coming at this from an interesting perspective. The first thing to note is that you're dealing with two levels of text: a JSON document and a string within it.
Synopsis: You don't need to write code to decode it. Use a library that deserializes JSON into objects, such as Newtonsoft's JSON.Net.
But, first, Unicode. Unicode is a character set with a bit of a history. Unlike almost every other character set, 1) it has more than one encoding, and 2) it is still growing. A couple of decades ago, it had fewer than 65,536 codepoints and that was thought to be enough. So, encoding each codepoint as a 2-byte integer was the plan. It was called UCS-2 or, simply, the Unicode encoding. (Microsoft has stuck with Encoding.Unicode in .NET, which causes some confusion.)
Aside: Codepoints are identified for discussion using the U+ABCD (hexadecimal) format.
Then the Unicode consortium decided to add more codepoints: all the way to U+10FFFF. For that, encodings need at least 21 bits. UTF-32, integers with 32 bits, is an obvious solution but not very dense. So, encodings that use a variable number of code units were invented. UTF-8 uses one to four 8-bit code units, depending on the codepoint.
But a lot of languages were adopting UCS-2 in the 1990s. Documents, of course, can be transformed at will, but code that processes UCS-2 would break without a compatible encoding for the expanded character set. Since U+D800 to U+DFFF were unassigned, UCS-2 could stay the same and those "surrogate codepoints" could be used to encode new codepoints. The result is UTF-16. Each codepoint is encoded in one or two 16-bit code units. So, programs that processed UCS-2 could automatically process UTF-16 as long as they didn't need to understand it. Programs written in the same system could be considered to be processing UTF-16, especially with libraries that do understand it. There is still the hazard of things like string length giving the number of UTF-16 code units rather than the number of codepoints, but it has otherwise worked out well.
As for the \ud83e\udd14 notation, languages that use Unicode in their syntax or literal strings desired a way to accept source files in a non-Unicode encoding and still support all the Unicode codepoints. Being designed in the 1990s, they simply wrote the UCS-2 code units in hexadecimal. Of course, that too extends to UTF-16. This UTF-16 code-unit escape syntax allows intermediary systems to handle source code files with a non-Unicode encoding.
Now, JSON is based on JavaScript, and JavaScript's strings are sequences of UTF-16 code units. So JSON has adopted the UTF-16 code-unit escape syntax from JavaScript. However, it's not very useful (unless you have to deal with intermediary systems that can't be made to use UTF-8 or treat files they don't understand as binary). The old JSON standard required JSON documents exchanged between systems to be encoded with UTF-8, UTF-16 or UTF-32. The new RFC 8259 requires UTF-8.
So, you don't have "UTF-8 text", you have Unicode text encoded with UTF-8. The text itself is a JSON document. JSON documents have names and values that are Unicode text as sequences of UTF-16 code units with escapes allowed. Your document has the codepoint U+1F914 written, not as "🤔" but as "\ud83e\udd14".
There are plenty of libraries that transform JSON to objects, so you shouldn't need to decode the names or values in a JSON document. To do it manually, you'd recognize the escape prefix, take the next 4 hexadecimal digits as a surrogate code unit, extract its data bits, and then combine them with the bits from the paired surrogate that should follow.
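The same holds in other languages' JSON libraries; as a small illustration (Python here, standing in for whichever library you actually use), the escaped pair comes back as a single codepoint:

import json

s = json.loads('"\\ud83e\\udd14"')     # the two escapes form one surrogate pair
print(s)                               # 🤔
print(hex(ord(s)))                     # 0x1f914
print(s.encode("utf-8"))               # b'\xf0\x9f\xa4\x94' -- 4 bytes of UTF-8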
Thought I'd read up on UTF-16 to see if it gave me any clues, and it turns out this is what it calls a surrogate pair. The hex formula for decoding is:
(H - D800) * 400 + (L - DC00) + 10000
where H is the first (high) surrogate and L is the second (low) surrogate.
So \ud83e\udd14 becomes 1f914
Apparently JSON decoders must anticipate UTF-16 surrogate pairs in \u escapes; the UTF-8 bytes of the document itself never contain surrogates.
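Working that formula through for this pair (a quick check, with the hex values written explicitly):

# The surrogate-pair formula applied to \ud83e\udd14.
H, L = 0xD83E, 0xDD14

codepoint = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000
print(hex(codepoint))          # 0x1f914

# And the reverse direction, from codepoint back to the pair:
cp = 0x1F914
high = 0xD800 + ((cp - 0x10000) >> 10)
low = 0xDC00 + ((cp - 0x10000) & 0x3FF)
print(hex(high), hex(low))     # 0xd83e 0xdd14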

What is the meaning of "assume char set is ASCII"?

I was solving the problem below and, while reading its solution, I read this in the first line.
Can anyone help me by explaining "assume char set is ASCII"? I don't want any other solution for this problem; I just want to understand the statement.
Implement an algorithm to determine if a string has all unique characters. What if you cannot use additional data structures?
Thanks in advance for the help.
There is no text but encoded text.
Text is a sequence of "characters", members of a character set. A character set is a one-to-one mapping between a notional character and a non-negative integer, called a codepoint.
An encoding is a mapping between a codepoint and a sequence of bytes.
Examples:
ASCII, 128 codepoints, one encoding
OEM437, 256 codepoints, one encoding
Windows-1252, 251 codepoints, one encoding
ISO-8859-1, 256 codepoints, one encoding
Unicode, 1,114,112 codepoints, many encodings: UTF-8, UTF-16, UTF-32,…
When you receive a byte stream or read a file that represents text, you have to know the character set and encoding. Conversely, when you send a byte stream or write a file that represents text, you have to let the receiver know the character set and encoding. Otherwise, you have a failed communication.
Note: Program source code is almost always text files. So, this communication requirement also applies between you, your editor/IDE and your compiler.
Note: Program console input and output are text streams. So, this communication requirement also applies between the program, its libraries and your console (shell). Run locale or chcp to find out what the encoding is.
Many character sets are a superset of ASCII, and some encodings map the same characters to the same byte sequences. This causes a lot of confusion, limits learning, promotes poor terminology, and the partial interoperability leads to buggy code. A deliberate approach to specifications and coding eliminates that.
Examples:
Some people say "ASCII" when they mean the common subset of characters between ASCII and the character set they are actually using. In Unicode and elsewhere this is called C0 Controls and Basic Latin.
Some people say "ASCII Code" when they just mean codepoint or the codepoint's encoded bytes (or code units).
The context of your question is unclear but the statement is trying to say that the distinct characters in your data are in the ASCII character set and therefore their number is less than or equal to 128. Due to the similarity between character sets, you can assume that the codepoint range you need to be concerned about is 0 to 127. (Put comments, asserts or exceptions as applicable in your code to make that clear to readers and provide some runtime checking.)
What this means in your programming language depends on the programming language and its libraries. Many modern programming languages use UTF-16 to represent strings and UTF-8 for streams and files. Programs are often built with standard libraries that account for the console's encoding (actual or assumed) when reading or writing from the console.
So, if your data comes from a file, you must read it using the correct encoding. If your data comes from a console, your program's standard libraries will possibly change encodings from the console's encoding to the encoding of the language's or standard library's native character and string datatypes. If your data comes from a source code file, you have to save it in one specific encoding and tell the compiler what that is. (Usually, you would use the default source code encoding assumed by the compiler because that generally doesn't change from system to system or person to person.)
The "additional" data structures bit probably refers to what a language's standard libraries provide, such as list, map or dictionary. Use what you've been taught so far, like maybe just an array. Of course, you can just ask.
Basically, assume that character codes will be within the range 0-127. You won't need to deal with crazy accented characters.
More than likely, though, they won't use many, if any, codes below 32, since those are mostly non-printable.
Characters such as 'a' 'b' '1' or '#' are encoded into a binary number when stored and used by a computer.
e.g.
'a' = 1100001
'b' = 1100010
There are a number of different standards that you could use for this encoding. ASCII is one of those standards. The other most common standard is called UTF-8.
Not all characters can be encoded by all standards. ASCII has a much more limited set of characters than UTF-8. As such, an encoding also defines the set of characters (the "char set") that it supports.
ASCII encodes each character into a single byte. It supports the uppercase letters A-Z, the lowercase letters a-z, the digits 0-9, a small number of familiar symbols, and a number of control characters that were used in early communication protocols.
The full set of characters supported by ASCII can be seen here: https://en.wikipedia.org/wiki/ASCII
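Those bit patterns are easy to see for yourself (a quick Python illustration; the same values apply in any language):

for ch in "ab1#":
    print(ch, ord(ch), format(ord(ch), "07b"), ch.encode("ascii"))
# a 97 1100001 b'a'
# b 98 1100010 b'b'
# 1 49 0110001 b'1'
# # 35 0100011 b'#'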

Which Languages Does UTF-8 Not Support?

I'm working on internationalizing one of my programs for work. I'm trying to use foresight to avoid possible issues or redoing the process down the road.
I see references for UTF-8, UTF-16 and UTF-32. My question is two parts:
What languages does UTF-8 not support?
What advantages do UTF-16 and UTF-32 have over UTF-8?
If UTF-8 works for everything, then I'm curious what the advantages of UTF-16 and UTF-32 are (e.g. special search features in a database, etc.). Having that understanding should help me finish designing my program (and database connections) properly. Thanks!
All three are just different ways to represent the same thing, so there are no languages supported by one and not another.
Sometimes UTF-16 is used by a system that you need to interoperate with - for instance, the Windows API uses UTF-16 natively.
In theory, UTF-32 can represent any "character" in a single 32-bit integer without ever needing to use more than one, whereas UTF-8 and UTF-16 need to use more than one 8-bit or 16-bit integer to do that. But in practice, with combining and non-combining variants of some codepoints, that's not really true.
One advantage of UTF-8 over the others is that if you have a bug whereby you're assuming that the number of 8-, 16- or 32-bit integers respectively is the same as the number of codepoints, it becomes obvious more quickly with UTF-8 - something will fail as soon as you have any non-ASCII codepoint in there, whereas with UTF-16 the bug can go unnoticed.
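As a concrete illustration of that pitfall (a small sketch; the code-unit counts are taken from the encoded forms):

s = "héllo🤔"                              # 6 codepoints

print(len(s.encode("utf-8")))              # 10 UTF-8 code units (bytes)
print(len(s.encode("utf-16-le")) // 2)     # 7 UTF-16 code units
print(len(s.encode("utf-32-le")) // 4)     # 6 UTF-32 code units == codepoints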
To answer your first question, here's a list of scripts currently unsupported by Unicode: http://www.unicode.org/standard/unsupported.html
UTF-8 is variable, 1 to 4 bytes; UTF-16 is 2 or 4 bytes; UTF-32 is fixed at 4 bytes.
That is why UTF-8 has an advantage where ASCII characters are the most prevalent, UTF-16 is better where ASCII is not predominant, and UTF-32 covers all possible characters in 4 bytes.

What Character Encoding is best for multinational companies

If you had a website that was to be translated into every language in the world and therefore had a database with all these translations what character encoding would be best? UTF-128?
If so do all browsers understand the chosen encoding?
Is character encoding straightforward to implement, or are there hidden factors?
Thanks in advance.
If you want to support a variety of languages for web content, you should use an encoding that covers the entire Unicode range. The best choice for this purpose is UTF-8. UTF-8 is the preferred encoding for the web; from the HTML5 draft standard:
Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]
Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]
UTF-8 and Windows-1252 are the only encodings required to be supported by browsers, and UTF-8 and UTF-16 are the only encodings required to be supported by XML parsers. UTF-8 is thus the only common encoding that everything is required to support.
The following is more of an expanded response to Liv's answer than an answer on its own; it's a description of why UTF-8 is preferable to UTF-16 even for CJK content.
For characters in the ASCII range, UTF-8 is more compact (1 byte vs 2) than UTF-16. For characters between the ASCII range and U+07FF (which includes Latin Extended, Cyrillic, Greek, Arabic, and Hebrew), UTF-8 also uses two bytes per character, so it's a wash. For characters outside the Basic Multilingual Plane, both UTF-8 and UTF-16 use 4 bytes per character, so it's a wash there.
The only range in which UTF-16 is more efficient than UTF-8 is for characters from U+07FF to U+FFFF, which includes Indic alphabets and CJK. Even for a lot of text in that range, UTF-8 winds up being comparable, because the markup of that text (HTML, XML, RTF, or what have you) is all in the ASCII range, for which UTF-8 is half the size of UTF-16.
For example, if I pick a random web page in Japanese, the home page of nhk.or.jp, it is encoded in UTF-8. If I transcode it to UTF-16, it grows to almost twice its original size:
$ curl -o nhk.html 'http://www.nhk.or.jp/'
$ iconv -f UTF-8 -t UTF-16 nhk.html > nhk.16.html
$ ls -al nhk*
-rw-r--r-- 1 lambda lambda 32416 Mar 13 13:06 nhk.16.html
-rw-r--r-- 1 lambda lambda 18337 Mar 13 13:04 nhk.html
UTF-8 is better in almost every way than UTF-16. Both of them are variable width encodings, and so have the complexity that entails. In UTF-16, however, 4 byte characters are fairly uncommon, so it's a lot easier to make fixed width assumptions and have everything work until you run into a corner case that you didn't catch. An example of this confusion can be seen in the encoding CESU-8, which is what you get if you convert UTF-16 text into UTF-8 by just encoding each half of a surrogate pair as a separate character (using 6 bytes per character; three bytes to encode each half of the surrogate pair in UTF-8), instead of decoding the pair to its codepoint and encoding that into UTF-8. This confusion is common enough that the mistaken encoding has actually been standardized so that at least broken programs can be made to interoperate.
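The size difference described there is easy to demonstrate (a sketch in Python, with the "surrogatepass" handler standing in for the CESU-8-style per-surrogate encoding):

emoji = "\U0001F914"                     # U+1F914, a single codepoint

proper = emoji.encode("utf-8")           # decode the pair, encode the codepoint
print(proper, len(proper))               # b'\xf0\x9f\xa4\x94' 4

# CESU-8 style: encode each UTF-16 surrogate half separately, 3 bytes each.
high, low = "\ud83e", "\udd14"
cesu = high.encode("utf-8", "surrogatepass") + low.encode("utf-8", "surrogatepass")
print(cesu, len(cesu))                   # b'\xed\xa0\xbe\xed\xb4\x94' 6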
UTF-8 is much smaller than UTF-16 for the vast majority of content, and if you're concerned about size, compressing your text will always do better than just picking a different encoding. UTF-8 is compatible with APIs and data structures that use a null-terminated sequence of bytes to represent strings, so as long as your APIs and data structures either don't care about encoding or can already handle different encodings in their strings (such as most C and POSIX string handling APIs), UTF-8 can work just fine without having to have a whole new set of APIs and data structures for wide characters. UTF-16 doesn't specify endianness, so it makes you deal with endianness issues; actually there are three different related encodings, UTF-16, UTF-16BE, and UTF-16LE. UTF-16 could be either big endian or little endian, and so requires a BOM to specify. UTF-16BE and LE are big and little endian versions, with no BOM, so you need to use an out-of-band method (such as a Content-Type HTTP header) to signal which one you're using, but out-of-band headers are notorious for being wrong or missing.
UTF-16 is basically an accident, that happened because people thought that 16 bits would be enough to encode all of Unicode at first, and so started changing their representation and APIs to use wide (16 bit) characters. When they realized they would need more characters, they came up with a scheme for using some reserved characters for encoding 32 bit values using two code units, so they could still use the same data structures for the new encoding. This brought all of the disadvantages of a variable-width encoding like UTF-8, without most of the advantages.
UTF-8 is the de facto standard character encoding for Unicode.
UTF-8 is like UTF-16 and UTF-32, because it can represent every character in the Unicode character set. But unlike UTF-16 and UTF-32, it possesses the advantages of being backward-compatible with ASCII. And it has the advantage of avoiding the complications of endianness and the resulting need to use byte order marks (BOM). For these and other reasons, UTF-8 has become the dominant character encoding for the World-Wide Web, accounting for more than half of all Web pages.
There is no such thing as UTF-128.
You need to take more into consideration when dealing with this.
For instance, you can represent Chinese, Japanese and pretty much everything in UTF-8 -- but each such "foreign" character takes a multi-byte sequence rather than a single byte -- and as such your data representation might take more storage due to these extra bytes. You could look at UTF-16 as well, which doesn't need longer sequences for the likes of Chinese and Japanese -- however, each character now takes 2 bytes to represent; so if you're dealing mainly with Latin charsets you've just doubled the size of your data storage with no benefit. There's also Shift-JIS, dedicated to Japanese, which represents that character set more compactly than UTF-8, but then you lose support for the rest of Unicode.
I would say: if you know upfront you will have a lot of foreign characters, consider UTF-16; if you're mainly dealing with accents and Latin chars, use UTF-8; if you won't be using any Latin characters, then consider Shift-JIS and the like.
