Character Set Special Characters - utf-8

Is iso-8859-1 a proper subset of utf-8?
What about iso-8859-n?
What about windows-1252?
If the answer is no to any of the above, what are the disjoint characters? I'm testing some logic that detects charsets and want to write tests to verify the detection is working properly.

Is iso-8859-1 a proper subset of utf-8?
The character repertoire of ISO-8859-1 (the first 256 characters of Unicode) is a proper subset of that of UTF-8 (every Unicode character).
However, the characters U+0080 to U+00FF are encoded differently in the two encodings.
ISO-8859-1 assigns each of these characters a single byte from 80 to FF.
UTF-8 encodes the same characters as two-byte sequences C2 80 to C3 BF.
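To make the difference concrete, here is a minimal sketch (Python used for illustration) that encodes one character from this range both ways:

    # One character from the U+0080-U+00FF range, encoded both ways.
    ch = "é"                              # U+00E9
    print(ch.encode("iso-8859-1").hex())  # 'e9'   - a single byte
    print(ch.encode("utf-8").hex())       # 'c3a9' - a two-byte sequence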
What about iso-8859-n?
These are 15 different encodings that contain a total of 614 distinct characters. Some of these characters occur in multiple "parts" of ISO 8859, and some don't. You'll have to be more specific.
I see that your question is tagged ISO-8859-2. The characters that are in -2 that aren't in -1 are:
ĂăĄąĆćČčĎďĐđĘęĚěĹĺĽľŁłŃńŇňŐőŔŕŘřŚśŞşŠšŢţŤťŮůŰűŹźŻżŽžˇ˘˙˛˝
What about windows-1252?
Windows-1252 is just like ISO-8859-1 except that it replaces the rarely used control characters in the 0x80-0x9F range with printable characters. The characters that are in windows-1252 but not in ISO-8859-1 are:
ŒœŠšŸŽžƒˆ˜–—‘’‚“”„†‡•…‰‹›€™
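A quick way to see that disjointness, sketched here in Python, is to decode a few bytes from the 0x80-0x9F range with each encoding:

    # Bytes 0x80-0x9F: printable characters in windows-1252, control characters in iso-8859-1.
    b = bytes([0x80, 0x93, 0x99])
    print(b.decode("windows-1252"))   # €“™  (printable)
    print(b.decode("iso-8859-1"))     # three invisible C1 control characters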

Unicode is a superset of all these character sets, and of pretty much all established character sets out there. You can find a list of mappings of all these character sets to Unicode code points here: http://unicode.org/Public/MAPPINGS/.

Related

ASCII characters set

I am reading a (.txt) file; the contents of the first line are just the four letters "abcd".
When I display the ASCII codes of these letters, I expect to find 97, 98, 99 and 100 for a, b, c and d respectively. But I found two special characters whose codes are 255 and 254, for ÿ and þ.
Therefore the length of the read line is 6, not 4, because of "ÿþabcd". Must these special characters be inserted at the start of every sequential text file, or is there a way to avoid them?
ASCII is only used in niche or archaic systems. Your data proved to you that your file is not ASCII. You must find out which character set and encoding the file was stored in.
Character Sets
All text is an encoding of elements of a character set. Elements of a character set are called codepoints. A character set consists of a list of codepoints and their descriptions. The description states how the codepoint is used semantically in text, such as LATIN CAPITAL LETTER A (A) or N-ARY PRODUCT (∏). (The style in which the codepoint is rendered is the purview of typefaces/fonts.)
Encodings
Codepoints are numbered with non-negative integers. The number is encoded into bytes. Most character sets have only one encoding: the number as an unsigned integer in the smallest size that can represent all of the codepoints. For example, Windows-1252 has 251 codepoints, with numbers between 0 and 255. A byte is big enough to represent any of them. The Unicode character set has about 1.1 million possible codepoints, numbered from 0 to 1,114,111. A 32-bit integer is big enough to represent all of them. That's the UTF-32 encoding.
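As a small sketch (Python used here), the codepoint number and its UTF-32 encoding are literally the same number:

    # A codepoint is just a number; UTF-32 stores that number in four bytes.
    print(ord("∏"))                        # 8719, i.e. U+220F
    print("∏".encode("utf-32-be").hex())   # '0000220f' - the same number as a 32-bit integer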
Byte order
Computer memory is usually byte-addressable and files are byte sequences, so for a large integer the question becomes in which order the bytes are stored: most significant byte first (big endian) or least significant byte first (little endian). Software adapts to or assumes one way or the other. So, UTF-32 actually identifies one of two encodings: UTF-32BE or UTF-32LE. UTF-32 alone is shorthand for whichever endianness the software assumes. Typically, the OS assumes the endianness of the hardware it is running on and programs follow suit.
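A two-line sketch of the two byte orders (Python for illustration):

    # The same codepoint, most significant byte first vs least significant byte first.
    print("A".encode("utf-32-be").hex())   # '00000041'
    print("A".encode("utf-32-le").hex())   # '41000000'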
Unicode Encodings
UTF-32 takes a lot of space. The most commonly used codepoints are numbered below 65,536, so there can be savings if codepoints are represented by a variable number of smaller integers. The size of that integer is called a code unit. The value of a code unit contains some of the bits of the codepoint and indicates whether more code units follow carrying the remaining bits. So, there are UTF-16LE, UTF-16BE and UTF-8 (and more) encodings for Unicode. UTF-16 uses one or two 16-bit code units per codepoint and UTF-8 uses one to four 8-bit code units per codepoint.
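As an illustrative sketch (Python), the number of code units one codepoint takes in each encoding:

    # Code units needed for one codepoint in each Unicode encoding.
    for ch in ("A", "ä", "€", "😀"):
        print(ch,
              len(ch.encode("utf-8")),           # 1-4 eight-bit code units
              len(ch.encode("utf-16-le")) // 2,  # 1 or 2 sixteen-bit code units
              len(ch.encode("utf-32-le")) // 4)  # always 1 thirty-two-bit code unit
    # A 1 1 1 / ä 2 1 1 / € 3 1 1 / 😀 4 2 1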
Files are data outside of programs. So for a program to read text, it has to know the character set and encoding. Often this metadata is not stored with the file (either within it or beside it). That's how you made the mistake of believing your file is ASCII. If you don't know the encoding of a file, you've lost data; you might be able to recover it through guessing. It is notable that the CP437 character set has 256 codepoints, numbered 0 to 255 and encoded in one byte, so every file can be read as CP437. The question is: is that right? Even if it looks right, it's probably not right unless the file is from a Western culture circa 1990.
Unicode Byte-order Mark
A strong clue about which character set and encoding to guess is the byte-order mark (BOM). Recall that encodings with code units larger than one byte have an endianness. Endianness is a hardware concern, so although a file can be passed between systems with agreement on which character set and encoding scheme is used, the endianness aspect of the encoding still matters to each system. It has become standard to indicate the byte order within the file itself, as its first bytes. Unicode specifies a codepoint to use for this purpose, provided it appears at the very beginning of the file. (That means programs reading Unicode from a file must separate this metadata from the data.) Many file-writing libraries write the BOM codepoint regardless of the code-unit size, so you'll see it at the beginning of UTF-8 files too. Since the Unicode BOM looks different in each of the Unicode encodings, it completely identifies which Unicode encoding is being used.
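A sketch (Python 3.8+, for bytes.hex with a separator) of what the BOM codepoint U+FEFF looks like in each encoding:

    # The BOM codepoint U+FEFF as the first bytes of a file in each Unicode encoding.
    for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        print(enc, "\ufeff".encode(enc).hex(" "))
    # utf-8      ef bb bf
    # utf-16-le  ff fe
    # utf-16-be  fe ff
    # utf-32-le  ff fe 00 00
    # utf-32-be  00 00 fe ff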
Guessing
Your file begins with the UTF-16LE BOM. Read it as UTF-16LE (and discard the BOM codepoint if your library doesn't do that already).
Given the specificity of the Unicode BOM, its presence is a strong indicator that the file is encoded in Unicode, and the actual bytes tell you which Unicode encoding. However, as noted above, it's possible that this guess is wrong.
As @Lưu Vĩnh Phúc points out, it is unclear how you are reading "ÿþabcd" from what you say is a 6-byte file. Open the file in a hex editor; "abcd" in UTF-16LE should be FF FE 61 00 62 00 63 00 64 00.
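A minimal BOM-based guesser, sketched in Python (the helper name guess_encoding is just for illustration, not a standard API):

    import codecs

    def guess_encoding(raw):
        """Return an encoding name if the data starts with a known Unicode BOM, else None."""
        # UTF-32 BOMs must be checked before UTF-16, since FF FE is a prefix of FF FE 00 00.
        boms = [
            (codecs.BOM_UTF32_LE, "utf-32-le"),
            (codecs.BOM_UTF32_BE, "utf-32-be"),
            (codecs.BOM_UTF8,     "utf-8-sig"),
            (codecs.BOM_UTF16_LE, "utf-16-le"),
            (codecs.BOM_UTF16_BE, "utf-16-be"),
        ]
        for bom, name in boms:
            if raw.startswith(bom):
                return name
        return None

    print(guess_encoding(bytes.fromhex("FFFE6100620063006400")))  # utf-16-le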

How to check if all letters in UTF-8 string are ASCII?

How to check that all the characters of a string are ASCII? It is said in the documentation:
Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to 7Fh (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. All characters >U+007F are encoded as a sequence of several bytes, each of which has the two most significant bits set.
So I wonder how to check string to be ASCII?
A string is ASCII if all the characters it contains are in the range 0-127.
guava
CharMatcher.ASCII.matchesAllOf(string);
An easy way is to check whether the length of the string in bytes equals the number of Unicode characters (code points). If these values are cached, this might even be the fastest way.
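For comparison with the guava (Java) one-liner above, a sketch of both checks in Python:

    s = "abcd"
    print(all(ord(c) < 128 for c in s))        # every codepoint is in the range 0-127
    print(len(s.encode("utf-8")) == len(s))    # UTF-8 byte length equals the codepoint count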

In Ruby - what are encodings?

I'm trying to understand what encodings in Ruby are - there are lots of articles about encodings, such as this one and this one. However, none of them explain the basic question a newbie might have - what is an encoding in the first place?
What is meant by character encoding here is a system of describing how computers represent characters in binary.
In UTF-8 encoding, the character ä is represented as 1100 0011 1010 0100, or 0xC3 0xA4 in hexadecimal.
In Windows-1252 encoding, the same character is represented as 1110 0100 or 0xE4 in hexadecimal.
So let's say you tell a computer to read a file as Windows-1252, but the file is actually encoded as UTF-8. The file contains just one character, say ä. Since the file is in UTF-8, it actually contains the bytes 0xC3 0xA4. Now, because you told (implicitly or explicitly) the computer to read the file as Windows-1252, you will actually see Ã¤ instead of ä.
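The question is about Ruby, but the effect is easy to reproduce in any language; a sketch in Python:

    # UTF-8 bytes for 'ä' misread as Windows-1252: the classic mojibake.
    data = "ä".encode("utf-8")            # b'\xc3\xa4'
    print(data.decode("windows-1252"))    # Ã¤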
An encoding is a means of transforming some sequence of bytes into text. ASCII is one where there's a single byte per character. UTF-8 is another common one that uses a variable number of bytes to encode a somewhat larger set of characters. And, of course, Character Encoding on Wikipedia is probably a helpful read.

Is there a language (or languages) which will require three or more bytes per character when encoded using UTF-8? Which ones?

Commonly used ones, of course; Klingon doesn't count :-)
Thanks, guys, let me run the willItFit() test cases.
OK, now I've figured out that saving bytes with UTF-8 causes more problems than it solves, thanks again.
Characters requiring 3 bytes start at U+0800, and every codepoint from there up through U+FFFF needs 3 bytes (codepoints above that need 4), so that's a HUGE number of potential characters. This includes East and Southeast Asian scripts such as Japanese, Chinese, Korean, and Thai.
For a complete list of script ranges, you can refer to Unicode's block data. Only these blocks can be represented with 1 or 2 bytes; characters from all other blocks require 3 or 4 bytes (a quick byte-count sketch follows the list):
0000..007F Basic Latin
0080..00FF Latin-1 Supplement
0100..017F Latin Extended-A
0180..024F Latin Extended-B
0250..02AF IPA Extensions
02B0..02FF Spacing Modifier Letters
0300..036F Combining Diacritical Marks
0370..03FF Greek and Coptic
0400..04FF Cyrillic
0500..052F Cyrillic Supplement
0530..058F Armenian
0590..05FF Hebrew
0600..06FF Arabic
0700..074F Syriac
0750..077F Arabic Supplement
0780..07BF Thaana
07C0..07FF NKo
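As a rough check (Python used for illustration), the UTF-8 byte counts for characters from a few blocks inside and outside the list above:

    for ch in ("A", "é", "Ω", "א", "ก", "中", "한", "😀"):
        print(f"U+{ord(ch):04X} {ch}: {len(ch.encode('utf-8'))} byte(s)")
    # A, é, Ω and א need 1-2 bytes; ก, 中 and 한 need 3; 😀 needs 4.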
Here we go:
So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
More details:
http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes , Basic Multilingual Plane, codes from U+0800 upward.
Some examples: Indic scripts, Thai, Philippine scripts, Hiragana, Katakana. So all the East Asian scripts and some others.
You even need three bytes just for English. For example, the typographically correct apostrophe is encoded in UTF-8 as 0xE2 0x80 0x99, opening quote marks are 0xE2 0x80 0x9C and closing quote marks are 0xE2 0x80 0x9D. The ellipsis is 0xE2 0x80 0xA6. And that's not even talking about all the different dashes, spaces or the inch and feet signs.
“It’s kinda hard to write English without the apostrophe’s help …”
There are representations of many Asian languages that use more than 2 bytes. While it's true that they probably don't specifically need to, Japanese and Korean (at least) are often represented in multi-byte form.

What is a multibyte character set?

Does the term multibyte refer to a character set whose characters can be - but don't have to be - wider than 1 byte (e.g. UTF-8), or does it refer to character sets whose characters are always wider than 1 byte (e.g. UTF-16)? In other words: what is meant when anybody talks about multibyte character sets?
The term is ambiguous, but in my internationalization work we typically avoided using "multibyte character sets" to refer to Unicode-based encodings. Generally, we used the term only for legacy encoding schemes that use one or more bytes to define each character (excluding encodings that require only one byte per character).
Shift-JIS, JIS, EUC-JP and EUC-KR, along with the Chinese encodings, are typically included.
Most of the legacy encodings, with some exceptions, require a sort of state machine model (or, more simply, a page swapping model) to process, and moving backwards in a text stream is complicated and error-prone. UTF-8 and UTF-16 do not suffer from this problem, as UTF-8 can be tested with a bitmask and UTF-16 can be tested against a range of surrogate pairs, so moving backward and forward in a non-pathological document can be done safely without major complexity.
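To illustrate the "bitmask" point for UTF-8, a hedged sketch in Python of stepping backwards to the previous character boundary (previous_char_start is a hypothetical helper):

    # UTF-8 continuation bytes always look like 10xxxxxx, so a character's start
    # can be found by stepping back until the byte is not a continuation byte.
    def previous_char_start(data, i):
        i -= 1
        while i > 0 and (data[i] & 0b1100_0000) == 0b1000_0000:
            i -= 1
        return i

    raw = "aä€😀".encode("utf-8")
    i = len(raw)
    while i > 0:
        i = previous_char_start(raw, i)
        print(i, raw[i:].decode("utf-8")[0])   # prints: 6 😀, then 3 €, then 1 ä, then 0 a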
A few legacy encodings, for languages like Thai and Vietnamese, have some of the complexity of multibyte character sets but are really just built on combining characters, and aren't generally lumped in with the broad term "multibyte."
What is meant if anybody talks about multibyte character sets?
That, as usual, depends on who is doing the talking!
Logically, it should include UTF-8, Shift-JIS, GB etc.: the variable-length encodings. UTF-16 would often not be considered in this group (even though it kind of is, what with the surrogates; and certainly it's multiple bytes when encoded into bytes via UTF-16LE/UTF-16BE).
But in Microsoftland the term would more typically be used to mean a variable-length default system codepage (for legacy non-Unicode applications, of which there are sadly still plenty). In this usage, UTF-8 and UTF-16LE/UTF-16BE cannot be included because the system codepage on Windows cannot be set to either of these encodings.
Indeed, in some cases “mbcs” is no more than a synonym for the system codepage, otherwise known (even more misleadingly) as “ANSI”. In this case a “multibyte” character set could actually be something as trivial as cp1252 Western European, which only uses one byte per character!
My advice: use “variable-length” when you mean that, and avoid the ambiguous term “multibyte”; when someone else uses it you'll need to ask for clarification, but typically someone with a Windows background will be talking about a legacy East Asian codepage like cp932 (Shift-JIS) and not a UTF.
All character sets where you don't have a 1 byte = 1 character mapping. All Unicode variants, but also Asian character sets, are multibyte.
For more information, I suggest reading this Wikipedia article.
A multibyte character means a character whose encoding requires more than 1 byte. This does not imply, however, that all characters in that particular encoding have the same width (in terms of bytes). E.g.: UTF-8 and UTF-16 encoded characters may sometimes use multiple bytes, whereas all UTF-32 encoded characters always use 32 bits.
References:
IBM: Multibyte Characters
Unicode and MultiByte Character Set (archived), Unicode and Multibyte Character Set (MBCS) Support | Microsoft Docs
Unicode Consortium Website
A multibyte character set may consist of both one-byte and two-byte characters. Thus a multibyte-character string may contain a mixture of single-byte and double-byte characters.
Ref: Single-Byte and Multibyte Character Sets
UTF-8 is multi-byte, which means that each English (ASCII) character is stored in 1 byte while a non-English character, such as Chinese or Thai, is stored in 3 bytes. When you mix Chinese/Thai with English, as in "ทt", the first (Thai) character "ท" uses 3 bytes while the second (English) character "t" uses only 1 byte. The people who designed multi-byte encodings realized that an English character shouldn't take 3 bytes when it fits in 1, because of the wasted storage space.
UTF-16 stores each character, English or not, in 2-byte code units (strictly, characters outside the Basic Multilingual Plane need two such units, a surrogate pair), so it is usually described as a wide-character rather than a multi-byte format. It is well suited to Chinese/Thai text, where each character fits in 2 bytes, but printing to a UTF-8 console output requires a conversion from the wide-character to the multi-byte format, for example with the function wcstombs().
UTF-32 stores each character in a fixed 4-byte length, but hardly anyone uses it for storage because of the wasted space.
Typically the former, i.e. UTF-8-like. For more info, see Variable-width encoding.
The former - although the term "variable-length encoding" would be more appropriate.
I generally use it to refer to any character set whose characters can take more than one byte each.

Resources