I have a simple question.
As we know, a Char takes two bytes (16 bits) and a Byte takes one byte (8 bits).
But many programming languages have a function that converts a Char to a Byte. How is it possible to convert a Char to a Byte without losing anything?
In C# and Java, a char is a 16-bit Unicode character (more precisely, a UTF-16 code unit). In other (older?) languages (C, C++, etc.), chars are 8-bit values that typically hold ASCII characters. In those languages it makes sense to convert between the types without losing anything.
In C# you can convert chars to twice as many bytes, or assume (be sure really) that the chars that you are trying to convert are 8 bit chars (look at the ASCII table) written as Unicode chars.
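The two options above can be sketched as follows. This is an illustration in Python rather than C#, so the function names are hypothetical and the encodings only stand in for the C# conversions: "utf-16-le" plays the role of "twice as many bytes", and "latin-1" plays the role of "assume every char fits in 8 bits".

```python
# Illustration only: a Python str holds code points, not 16-bit chars,
# so these encodings stand in for the C# conversions discussed above.

def to_two_bytes_each(text: str) -> bytes:
    """Option 1: keep every 16-bit code unit, two bytes per unit."""
    return text.encode("utf-16-le")


def to_one_byte_each(text: str) -> bytes:
    """Option 2: one byte per char; raises if any char is above U+00FF."""
    return text.encode("latin-1")


wide = to_two_bytes_each("Hi")    # two bytes per character
narrow = to_one_byte_each("Hi")   # one byte per character

try:
    to_one_byte_each("\u20ac")    # EURO SIGN does not fit in 8 bits
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False
```

The second option is lossless only because it refuses input that would not fit; silently truncating to the low byte would lose information.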
It's not possible. I don't think any language ships with such a function. For example, Java has String.getBytes and C# has Encoding.GetBytes; what they return is bytes (plural), not a single byte. With a 16-bit encoding such as UTF-16, this kind of conversion is like casting from short[] to char[] in C: very simple, no manipulation, just a cast. The size of the whole object (the total number of bytes) remains the same, so nothing is lost.
From the Char library documentation, I see that chars are able to represent at least the ISO/IEC 8859-1 character set, a character set that uses 8 bits per character. Do OCaml chars represent exactly 8 bits, no more and no less? Where is this documented?
The document says this:
Character values are represented as 8-bit integers between 0 and 255. Character codes between 0 and 127 are interpreted following the ASCII standard. The current implementation interprets character codes between 128 and 255 following the ISO 8859-1 standard.
So yes, an OCaml char represents exactly 8 bits.
The documentation for base values is here: OCaml Manual, Chapter 9.2. Values.
Update
It might be worth noting that although a char value in OCaml can take on only values from 0 to 255, in the mainline OCaml version (from INRIA) the actual space occupied in memory by a char value is the same as for int. On a 32-bit implementation this will be 32 bits and on a 64-bit implementation it will be 64 bits. So (for example) a char array is not a space-efficient way to store more than a few chars. You can use string or bytes to get compact storage of char values (as 8 bits each).
The documentation for representation of OCaml values is here: OCaml Manual, Chapter 20.3, Representation of OCaml Data Types.
The representation of the char type could be different depending on the implementation of the OCaml language and runtime. While all chars shall fit into 8 bits, an implementation may use a bigger type to represent it. The Char abstraction guarantees that it is impossible to create a character that uses more than 8 bits. And even though the INRIA implementation of OCaml represents Char.t the same as Int.t, it still relies on the assumption that char will fit into 8 bits. For example, a bigarray of n chars will take n bytes. And String.t will have a size in bytes proportional to the number of characters that comprise the string. Last but not least, various external (i.e., implemented in C) functions and the optimized compiler itself will assume that a character fits into 8 bits.
I have a strange validation program that checks whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain against sets of characters defined by their hex byte representation. Two such sets are D800-DB7F and DC00-DFFF. The PHP regexp function preg_match fails during these comparisons and says that DC00-DFFF characters are not allowed in this function. From Wikipedia I learned that these bytes are called surrogate characters in UTF-8. What are they, and which characters do they actually correspond to? I have read about them in several places and still don't understand what they are.
What are surrogate characters in UTF-8?
This is almost like a trick question.
Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).
Approximate answer #2: Invalid (if not paired).
Approximate answer #3: It's not UTF-8; It's Modified UTF-8.
Synopsis: The term doesn't apply to UTF-8.
Unicode codepoints have a range that needs 21 bits of data.
UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as pairs of two code units, the first from a "high" range, the second from a "low" range. Unicode reserves the codepoints that match the ranges of the high and low pairs as invalid. They are sometimes called surrogates but they are not characters. They don't mean anything by themselves.
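The high/low split can be worked through by hand. The following is a minimal sketch in Python; the helper name `to_surrogate_pair` is made up for illustration, and the result is checked against Python's own "utf-16-be" codec.

```python
def to_surrogate_pair(code_point: int):
    """Split a code point above U+FFFF into UTF-16 high and low code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as a pair")
    offset = code_point - 0x10000      # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high unit
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low unit
    return high, low


high, low = to_surrogate_pair(0x1F600)           # U+1F600 GRINNING FACE
encoded = chr(0x1F600).encode("utf-16-be")       # what the codec produces
```

Here `high` is 0xD83D and `low` is 0xDE00, which are exactly the two big-endian code units in `encoded`. Neither value means anything on its own.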
UTF-8 code units are 8 bits. UTF-8 encodes several distinct ranges of codepoints in one to four code units, respectively.
#1 It happens that the codepoints that UTF-16 encodes with two 16-bit code units, UTF-8 encodes with 4 8-bit code units, and vice versa.
#2 You can apply the UTF-8 encoding algorithm to the invalid codepoints, which is invalid. They can't be decoded to a valid codepoint. A compliant reader would throw an exception or throw out the bytes and insert a replacement character (�).
#3 Java provides a way of implementing functions in external code with a system called JNI. The Java String API provides access to String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values are modified UTF-8. Modified UTF-8 is the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints.
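The three cases can be observed in Python, used here only as a convenient stand-in. Note the assumption in the last step: Python has no Modified UTF-8 codec, so the "surrogatepass" error handler is used to imitate the one aspect described above (encoding each UTF-16 code unit separately).

```python
# 1: a code point above U+FFFF takes four bytes in UTF-8.
paired = "\U0001F600"
utf8_paired = paired.encode("utf-8")

# 2: an unpaired surrogate is refused by a strict encoder.
lone = "\udc00"
try:
    lone.encode("utf-8")
    lone_ok = True
except UnicodeEncodeError:
    lone_ok = False

# The bytes the algorithm would have produced are also rejected on input;
# with errors="replace" the decoder substitutes U+FFFD instead of raising.
replaced = b"\xed\xb0\x80".decode("utf-8", errors="replace")

# 3: encoding the two UTF-16 code units one by one gives six bytes.
# "surrogatepass" only imitates this; it is not a Modified UTF-8 codec.
units = "\ud83d\ude00"
six_bytes = units.encode("utf-8", errors="surrogatepass")
```

The same character therefore comes out as four bytes in standard UTF-8 and six bytes when the surrogates are encoded individually, which is why the two forms must not be mixed.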
Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be considered text, you must know the encoding; otherwise, you have data loss.
I'm still confused about bits and bytes, although I've been searching the internet. Is one ASCII character = 1 byte = 8 bits? So 8 bits have 256 unique patterns, which cover all the ASCII codes; in what form are they stored in our computer?
And if I typed "Hello" does that mean this consists of 5 bytes?
Yes to everything you wrote. "Bit" is a binary digit: a 0 or a 1. Historically there existed bytes of smaller sizes; now "byte" only ever means "8 bits of information", or a number between 0 and 255.
No. ASCII is a character set with 128 codepoints stored as the values 0-127. Modern computers predominantly address 8-bit memory and disk locations so a 7-bit ASCII value takes up 8 bits.
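A quick check of the 128-codepoint claim, sketched in Python with its built-in "ascii" codec:

```python
# Every byte value 0..127 decodes as ASCII.
all_ascii = bytes(range(128)).decode("ascii")

# Byte value 128 is outside ASCII, so a strict decoder rejects it.
try:
    bytes([128]).decode("ascii")
    high_ok = True
except UnicodeDecodeError:
    high_ok = False

# A 7-bit ASCII value still occupies one full 8-bit byte when stored.
stored = "A".encode("ascii")
```

So the upper half of the byte range (128-255) is simply unused by ASCII; other character sets assign it different meanings.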
There is no text but encoded text. An encoding maps a member of a character set to one or more bytes. Unless you absolutely know you are using ASCII, you probably aren't. There are quite a few character sets with encodings that cover all 256 byte values and use any combination of byte values to encode a string.
There are several character sets that are similar but have a few less than 256 characters. And others that use more than one byte to encode a codepoint and don't use every combination of byte values.
Just so you know, Unicode is the predominant character set except in very specialized situations. It has several encodings. UTF-8 is often used for storage and streams. UTF-16 is often used in memory, particularly in Java, .NET, JavaScript, XML, …. When text is communicated between systems, there has to be an agreement, specification, standard, or indication about which character set and encoding it uses so a sequence of bytes can be interpreted as characters.
To add to the confusion, programming languages have data types called char, Character, etc. You have to look at the specific language's reference manual to see what they mean. For example in C, char is simply an integer that is defined as the size of the encoding of character used by that C implementation. (C also calls this a "byte" and it is not necessarily 8 bits. In all other contexts, people mean 8 bits when they say "byte". If they want to be exceedingly unambiguous they might say "octet".)
"Hello" is five characters. In a specific character set, it is five codepoints. In a specific encoding for that character set, it could be 5, 10 or 20, or ??? bytes.
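To make the 5, 10 or 20 concrete, here is a small Python sketch that encodes the same five characters with a few of the standard codecs. The codec names are Python's; the plain "utf-16" codec prepends a 2-byte byte order mark, which is one source of the "???" case.

```python
text = "Hello"

# Number of bytes produced by each encoding for the same five characters.
sizes = {
    name: len(text.encode(name))
    for name in ("ascii", "utf-8", "utf-16-le", "utf-32-le", "utf-16")
}
```

With these codecs the sizes are 5, 5, 10, 20 and 12 bytes respectively.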
Also, in the source code of a specific language, a literal string like that might be "null-terminated". This means that you could say it is 6 "characters". Other languages might store a string as a counted sequence of code units. Again, you have to look at the language reference to know the underlying data structure of strings. Or, if the language and the libraries used with it are sufficiently high-level, you might never need to know such internals.
What is the maximum number of bytes for a single UTF-8 encoded character?
I'll be encrypting the bytes of a String encoded in UTF-8 and therefore need to be able to work out the maximum number of bytes for a UTF-8 encoded String.
Could someone confirm the maximum number of bytes for a single UTF-8 encoded character, please?
The maximum number of bytes per character is 4 according to RFC3629 which limited the character table to U+10FFFF:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets.
(The original specification allowed for up to six byte character codes for code points past U+10FFFF.)
Characters with a code less than 128 will require 1 byte only, and the next 1920 character codes require 2 bytes only. Unless you are working with an esoteric language, multiplying the character count by 4 will be a significant overestimation.
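The ranges can be written down directly. This is a minimal Python sketch; `utf8_length` and `max_utf8_bytes` are hypothetical helper names, and "character" here means a Unicode code point.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 (RFC 3629) uses for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point < 0x80:        # U+0000..U+007F
        return 1
    if code_point < 0x800:       # U+0080..U+07FF, the "next 1920" codes
        return 2
    if code_point < 0x10000:     # U+0800..U+FFFF
        return 3
    return 4                     # U+10000..U+10FFFF


def max_utf8_bytes(code_point_count: int) -> int:
    """Worst-case buffer size for a string of that many code points."""
    return 4 * code_point_count


two_byte_count = 0x800 - 0x80    # how many code points need exactly 2 bytes
```

For sizing a buffer before encryption, `max_utf8_bytes` gives a safe upper bound, but encoding the string and taking the actual length is usually simpler and tighter.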
Without further context, I would say that the maximum number of bytes for a character in UTF-8 is
answer: 6 bytes
The author of the accepted answer correctly pointed this out as the "original specification". That was valid through RFC 2279. As J. Cocoe pointed out in the comments below, this changed in 2003 with RFC 3629, which limits UTF-8 to encoding 21 bits, which can be handled with the encoding scheme using four bytes.
answer if covering all unicode: 4 bytes
But, in Java <= v7, they talk about a 3-byte maximum for representing unicode with UTF-8? That's because the original unicode specification only defined the basic multi-lingual plane (BMP), i.e. it is an older version of unicode, or subset of modern unicode. So
answer if representing only original unicode, the BMP: 3 bytes
But, the OP talks about going the other way. Not from characters to UTF-8 bytes, but from UTF-8 bytes to a "String" of bytes representation. Perhaps the author of the accepted answer got that from the context of the question, but this is not necessarily obvious, so may confuse the casual reader of this question.
Going from UTF-8 to native encoding, we have to look at how the "String" is implemented. Some languages, like Python >= 3 will represent each character with integer code points, which allows for 4 bytes per character = 32 bits to cover the 21 we need for unicode, with some waste. Why not exactly 21 bits? Because things are faster when they are byte-aligned. Some languages like Python <= 2 and Java represent characters using a UTF-16 encoding, which means that they have to use surrogate pairs to represent extended unicode (not BMP). Either way that's still 4 bytes maximum.
answer if going UTF-8 -> native encoding: 4 bytes
So, final conclusion, 4 is the most common right answer, so we got it right. But, mileage could vary.
The maximum number of bytes to support US-ASCII, a standard English alphabet encoding, is 1. But limiting text to English is becoming less desirable or practical as time goes by.
Unicode was designed to represent the glyphs of all human languages, as well as many kinds of symbols, with a variety of rendering characteristics. UTF-8 is an efficient encoding for Unicode, although still biased toward English. UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction.
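The self-synchronizing property can be illustrated with a short Python sketch. The helper names are made up for this example; the only fact relied on is that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, and lead bytes never do.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which look like 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def char_start(data: bytes, index: int) -> int:
    """Scan left from index to the first byte of the character covering it."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


data = "a\u20acb".encode("utf-8")   # 'a', EURO SIGN (3 bytes), 'b'

# Offsets at which a character begins, found without decoding anything.
starts = [i for i, b in enumerate(data) if not is_continuation(b)]
```

Landing in the middle of the euro sign (offset 2 or 3) and scanning left reaches offset 1, the start of that character, after at most three steps.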
While the maximum number of bytes per UTF-8 character is 3 for supporting just the 2-byte address space of Plane 0, the Basic Multilingual Plane (BMP), which can be accepted as minimal support in some applications, it is 4 for supporting all 17 current planes of Unicode (as of 2019). It should be noted that many popular "emoji" characters are located in Plane 1, the Supplementary Multilingual Plane, which requires 4 bytes.
However, this is just for basic character glyphs. There are also various modifiers, such as making accents appear over the previous character, and it is also possible to link together an arbitrary number of code points to construct one complex "grapheme". In real world programming, therefore, the use or assumption of a fixed maximum number of bytes per character will likely eventually result in a problem for your application.
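A small Python sketch shows how far "one visible character" can drift from "one code point". The sample strings are chosen for illustration; whether the joined sequence renders as a single glyph depends on the font and platform.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, one grapheme.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)   # single code point

# WOMAN + ZERO WIDTH JOINER + PERSONAL COMPUTER, often drawn as one glyph.
joined = "\U0001F469\u200d\U0001F4BB"

# (code points, UTF-8 bytes) for each string.
counts = {
    "decomposed": (len(decomposed), len(decomposed.encode("utf-8"))),
    "composed": (len(composed), len(composed.encode("utf-8"))),
    "joined": (len(joined), len(joined.encode("utf-8"))),
}
```

The joined sequence alone takes 11 bytes for what a user would call one character, so a 4-byte-per-character budget does not hold at the grapheme level.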
These considerations imply that UTF-8 character strings should not be "expanded" into arrays of fixed length prior to processing, as has sometimes been done. Instead, programming should be done directly, using string functions specifically designed for UTF-8.
Considering just the technical limitations, it's possible to have up to 7 bytes following the current UTF-8 encoding scheme. According to it, if the first byte is not a self-sufficient ASCII character, then it should have the pattern 1(n)0X(7-n), where n is <= 7.
Also, theoretically it could be 8, but then the first byte would have no zero bit at all. While other aspects, like continuation bytes differing from leading bytes, are still there (allowing error detection), I have heard that the byte 11111111 could be invalid, but I can't be sure about that.
The limitation to a maximum of 4 bytes is most likely for compatibility with UTF-16, which I tend to consider legacy, because the only quality where it excels is processing speed, and only if the string byte order matches (i.e. we read 0xFEFF in the BOM).
Is it true that, whether the architecture is big or little endian, only the memory layout of numbers differs and that of strings is the same?
If you have a simple 8-bit character representation (e.g. extended ASCII), then no, endianness does not affect the layout, because each character is one byte.
If you have a multi-byte representation, such as UTF-16, then yes, endianness is still important (see e.g. http://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes).
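Both cases can be seen side by side in a short Python sketch. It uses the standard `struct` module to lay out a 16-bit number in each byte order and compares that with the explicit-endianness UTF-16 codecs.

```python
import struct

# The 16-bit number 0x0041 laid out in little-endian and big-endian order.
number_le = struct.pack("<H", 0x0041)
number_be = struct.pack(">H", 0x0041)

# UTF-8 is a sequence of single bytes, so there is only one layout.
utf8 = "A\u00e9".encode("utf-8")

# UTF-16 code units are 16-bit numbers, so their bytes follow the byte order.
utf16_le = "A".encode("utf-16-le")
utf16_be = "A".encode("utf-16-be")
```

The UTF-16 bytes for "A" match the corresponding layout of the number 0x0041, while the UTF-8 bytes are the same on every architecture.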
For strings of 1-byte characters that is correct. For Unicode strings stored as UTF-16 (2 bytes per code unit) there will be a difference.
That's generally not true. Depending on the circumstances, more than one byte might be used for characters, in which case there is a difference between little endian encoding of characters and big endian encoding of characters.
For the most part, but you should understand why. Big vs little endian refers to the ordering of bytes in multi-byte data types like integers. ASCII characters are just a single byte.
Note, however, that Unicode characters in a multi-byte encoding such as UTF-16 or UTF-32 are made of multi-byte units, so the byte order matters. The entire point of Unicode is that a single byte can only encode 256 different values (and ASCII uses only 128 of them), which is not enough for all the languages in the world.
Refer here for more information about what endianness means:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Data/endian.html