Not quite understanding endianness

I understand that 0x12345678 in big endian is 0x12 0x34 0x56 0x78 and 0x78 0x56 0x34 0x12 in little endian.
But what is this needed for? I don't fully understand how it works: it seems deceptively simple.
Is it really as simple as byte order; no other difference?

Your understanding of endianness appears to be correct.
I would like to additionally point out the implicit, conventional nature of endianness and its role in interpreting a byte sequence as some intended value.
0x12345678 in big endian is 0x12 0x34 0x56 0x78 and 0x78 0x56 0x34 0x12 in little endian.
Interestingly, you did not explicitly state what these 0x… entities above are supposed to mean. Most programmers who are familiar with a C-style language are likely to interpret 0x12345678 as a numeric value presented in hexadecimal form, and both 0x12 0x34 0x56 0x78 and 0x78 0x56 0x34 0x12 as byte sequences (where each byte is presented in hexadecimal form, and the left-most byte is located at the lowest memory address). And that is probably exactly what you meant.
Perhaps without even thinking, you have relied on a well-known convention (i.e. the assumption that your target audience will apply the same common knowledge as you would) to convey the meaning of these 0x… entities.
Endianness is very similar to this: a rule that defines, for a given computer architecture, data transmission protocol, file format, etc., how to convert between a value and its representation as a byte sequence. Endianness is usually implied: just as you did not have to explicitly tell us what you meant by 0x12345678, it is usually not necessary to accompany each byte sequence such as 0x12 0x34 0x56 0x78 with explicit instructions on how to convert it back to a multi-byte value, because that knowledge (the endianness) is built into, or defined by, a specific computer architecture, file format, data transmission protocol, etc.
As to when endianness is necessary: Basically for all data types whose values don't fit in a single byte. That's because computer memory is conceptually a linear array of slots, each of which has a capacity of 8 bits (an octet, or byte). Values of data types whose representation requires more than 8 bits must therefore be spread out over several slots; and that's where the importance of the byte order comes in.
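To see this in a few lines of C (a minimal sketch of my own, not part of the original answer): the program below stores one 32-bit value and prints the bytes as they actually sit in consecutive memory slots. A little-endian machine prints 78 56 34 12, a big-endian one 12 34 56 78.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        uint32_t value = 0x12345678;          /* one multi-byte value           */
        unsigned char bytes[sizeof value];
        memcpy(bytes, &value, sizeof value);  /* copy its in-memory byte layout */

        for (size_t i = 0; i < sizeof bytes; i++)
            printf("%02X ", bytes[i]);        /* "78 56 34 12" on little-endian,
                                                 "12 34 56 78" on big-endian    */
        printf("\n");
        return 0;
    }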
P.S.: Studying the Unicode character encodings UTF-16 and UTF-8 helped me build a deeper understanding of endianness.
While both encodings are for the exact same kind of data, endianness only plays a role in UTF-16, but not in UTF-8. How can that be?
UTF-16 requires a byte order mark (BOM), while UTF-8 doesn't. Why?
Once you understand the reasons, chances are you'll have a very good understanding of endianness issues.
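As a hint for that exercise (the byte values below are simply the standard serializations of U+FEFF, shown as a sketch): a UTF-8 code unit is a single byte, so there is no byte order to signal, whereas a UTF-16 code unit is 16 bits wide and can be serialized two ways, which is exactly what the BOM announces.

    /* The byte order mark U+FEFF as it appears at the start of a stream. */
    const unsigned char bom_utf16be[] = { 0xFE, 0xFF };
    const unsigned char bom_utf16le[] = { 0xFF, 0xFE };
    const unsigned char bom_utf8[]    = { 0xEF, 0xBB, 0xBF };  /* optional signature;
                                                                  carries no byte-order
                                                                  information */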

It appears that your understanding of endianness is just fine.
Since there is more than one possible byte ordering for representing multi-byte data types' values in a linear address space, different CPU / computer manufacturers apparently chose different byte orderings in the past. Thus we have Big and Little Endian today (and perhaps other byte orderings that haven't got their own name).
Wikipedia has a good article on the matter, btw.

Related

What are 7 bit / 8 bit environments for control functions according to ISO/IEC 6429:1992?

I am learning ECMA-48 and I see a lot of notes about 7 bit and 8 bit environments for control functions. For example:
NOTE LS0 is used in 8-bit environments only; in 7-bit environments
SHIFT-IN (SI) is used instead.
As I understand it, today all environments are 8-bit. If I am wrong, could anyone give real examples of where 7-bit environments are used?
Character encodings provide an example.
The standard uses the values 0x00 to 0x1F as C0 control codes and 0x80 to 0x9F as C1 control codes, and it defines control functions, control sequences, etc. which start with either ESC (0x1B) or CSI (0x9B).
In an 8-bit environment some character encoding must be defined, specifying which character is represented by which value. The first 128 values will follow ASCII (or some other compatible standard that doesn't use 0x00 to 0x1F for printable characters but reserves them for C0 control codes), but what about the next 128 values?
Here we enter the world of code pages, which define the upper 128 values. Some code pages (like ISO 8859-2) reserve the values 0x80 - 0x9F for C1 control codes, but others (like CP1250) do not and use them for printable characters.
When such an encoding is used it is not possible to use the values 0x80 - 0x9F for both purposes (printable characters and control codes) at the same time. So even though there are 8 bits, they are not all available for the purposes defined by the standard.
So from the point of view of this standard we treat this as a 7-bit environment, and, for example, CSI (0x9B) becomes the sequence 0x1B 0x5B (that is, ESC followed by '[').
"Ok, forget the code pages, we live in the future now. unicode rules".
Ok, with utf-8, the 8 bit encoding for unicode, the story is the same.
Values 0x80 - 0xBF (which includes 0x80 - 0x9F) are in utf-8 treated as the last byte of a character (actually, a code point, but that's irrelevant) encoded by multiple bytes. Again, a conflict.
So if the control functions from the standard have to coexist with utf-8, again 7 bit environment has to be assumed for the purposes of this standard.
(Actually, unicode (so also utf-8) does allow to encode the C1 control codes as valid unicode code points but then they will only work if interpreted by a program which is aware of unicode. Assuming 7 bits removes that requirement)
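For illustration (a sketch of my own, not part of the original answer): the same control function spelled both ways as raw bytes. The 7-bit form uses only ASCII bytes and therefore survives being embedded in UTF-8 or code-page text, while a lone 0x9B byte does not.

    /* "Select Graphic Rendition: bold", i.e. CSI 1 m, spelled two ways. */
    const unsigned char sgr_8bit[] = { 0x9B, '1', 'm' };       /* 8-bit CSI (0x9B)  */
    const unsigned char sgr_7bit[] = { 0x1B, '[', '1', 'm' };  /* 7-bit form: ESC [ */
    /* 0x9B is 10011011, a continuation byte in UTF-8, so a lone 0x9B is not
       well-formed UTF-8; the ESC [ form avoids the conflict entirely. */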
Your quote mentions LS0 and SHIFT-IN (SI).
These are things defined in the ECMA-35 (ISO 2022) standard and are a way of making it possible to encode more characters within the 7 or 8 available bits.
You probably don't have to deal with this part unless you actually want to support those kinds of character encodings.

What's the rationale for UTF-8 to store the code point directly?

UTF-8 stores the significant bits of the code point in the low bits of the code units
U+0000-U+007F 0xxxxxxx
U+0080-U+07FF 110xxxxx 10xxxxxx
U+0800-U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
U+10000-U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
That requires the decoder to check for overlong sequences (like C0 80 instead of 00) and also reduces the number of code points that can be encoded within a given number of bytes. If it used the same bit layout but mapped the code points like this
First 128 code points (U+0000—U+007F): 1 byte
Next 2048 code points (U+0080—U+087F): 2 bytes. E.g. C0 81: U+0081
Next 65536 code points (U+0880—U+1087F): 3 bytes. E.g. E0 B0 B1: U+0881
Next 131072 code points (U+10880—U+10FFFF, up to U+20880): 4 bytes. E.g. F0 B0 B0 B1: U+10881
(i.e. the value encodes the offset to the start of the range)
then many more characters could be encoded using shorter sequences. Decoding would likely also be faster, since it needs just the addition of a constant, which is often less costly than a branch that checks for an overlong encoding. In fact 2048 more characters could be squeezed into 3 bytes if the surrogate range were removed from the mapping.
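For illustration, here is a rough C sketch (my own, hypothetical code; the bias values are the range starts listed above) of the decode step this proposal implies: the payload bits of an n-byte sequence are simply added to the first code point of that length class, so no overlong check is needed.

    #include <stdint.h>

    /* Hypothetical "biased" decode as proposed above. `bits` holds the payload
       bits already extracted from the sequence, `len` is its length in bytes. */
    static uint32_t biased_decode(uint32_t bits, int len)
    {
        /* First code point of each length class: 1, 2, 3, 4 bytes. */
        static const uint32_t bias[5] = { 0, 0x0000, 0x0080, 0x0880, 0x10880 };
        return bits + bias[len];   /* e.g. C0 81: bits = 0x01, len = 2 -> U+0081 */
    }

    /* Standard UTF-8 instead has to range-check the result per length,
       e.g. reject cp < 0x80 for a 2-byte sequence (an overlong encoding). */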
So why does UTF-8 store the code points that way?
The rationale is well documented in the "placemat" anecdote, which tells how Ken Thompson and Rob Pike whipped up the spec on a placemat in a restaurant when the Unicode guys (actually somebody from X/Open) contacted them for a review of a draft specification.
http://doc.cat-v.org/bell_labs/utf-8_history contains a narrative by Rob Pike himself, with correspondence between him, Ken Thompson, and the X/Open people. It calls out this desideratum as one of the missing key pieces in the earlier draft:
the ability to
synchronize a byte stream picked up mid-run, with less that one
character being consumed before synchronization
In other words, when you are looking at a byte whose high bit is set, you can tell from that byte value alone whether you are in the middle of a UTF-8 sequence, and if so, how far you need to rewind to get to the start of the multi-byte encoded character.
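To make that concrete, here is a minimal sketch (my own addition, not from the linked history) of the resynchronization step: every continuation byte has the form 10xxxxxx, so stepping backwards over such bytes lands you on the lead byte of the current character.

    #include <stddef.h>

    /* Given an index that may point into the middle of a UTF-8 sequence,
       back up over continuation bytes (10xxxxxx) to the lead byte. For valid
       UTF-8 this moves back at most 3 positions. */
    static size_t utf8_sync_back(const unsigned char *buf, size_t i)
    {
        while (i > 0 && (buf[i] & 0xC0) == 0x80)
            i--;
        return i;
    }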
The full story is well worth a read, so I will only briefly summarize it here. The following is an abridged version of the history section of Wikipedia's UTF-8 article.
By early 1992, the search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, among other problems, and the biggest problem was probably that it did not have a clear separation between ASCII and non-ASCII ...
In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multi-byte sequences would include only bytes where the high bit was set. ...
In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it somewhat less bit-efficient than the previous proposal but crucially allowed it to be self-synchronizing, letting a reader start anywhere and immediately detect byte sequence boundaries. It also abandoned the use of biases and instead added the rule that only the shortest possible encoding is allowed; the additional loss in compactness is relatively insignificant, but readers now have to look out for invalid encodings to avoid reliability and especially security issues. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.

How should a UTF-8 decoder handle invalid codepoints (surrogates, larger than 0x10ffff) correctly?

I'm writing a UTF-8 decoder, and I don't know how to handle invalid codepoints correctly:
surrogates
codepoints larger than 0x10ffff
Suppose, that I'd like to replace invalid codepoints with 0xfffd. Now, how should I replace them? Immediately after I know that the codepoint cannot be valid, or should I decode/consume all bytes that the first byte mandates?
For example, suppose, that the UTF-8 stream contains: 0xf4 0x90 0x80 0x80
These bytes decode to 0x110000, an invalid codepoint.
But, at the second byte, when 0xf4 0x90 is processed, I know, that it cannot be a valid codepoint, no matter what the last two bytes are.
So, should this stream generate one error (and one replacement), or should it generate 3 errors (because 0xf4 0x90 is invalid, and then 0x80 and the other 0x80 are each invalid as well)?
Is there a standard which mandates this? If not, what could be the best practice?
I've found an answer in the Unicode standard, chapter 3, pages 126-129:
The Unicode standard mandates that a well-formed subsequence must not be consumed as part of an ill-formed sequence (my example doesn't contain such a case, though).
There is a recommendation to follow the W3C approach: one error should be generated per maximal subpart of an ill-formed subsequence (see the definition in the linked document).
The second byte of 0xf4 0x90 0x80 0x80 is invalid, so if the recommendation is followed I should generate 4 errors (because the 2nd byte is invalid, the maximal subpart at the beginning is just 0xf4).
If my example had been 0xf4 0x8f 0x41, then I should generate only 1 error, as 0xf4 0x8f is a maximal subpart and 0x41 is a well-formed subsequence.
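To make that concrete, here is a rough C sketch of the "one replacement per maximal subpart" policy (my own illustration, not code from the Unicode standard or the W3C document; the lead-byte ranges follow the Unicode standard's well-formedness table). Fed 0xF4 0x90 0x80 0x80 it reports four replacements; fed 0xF4 0x8F 0x41 it reports one replacement followed by U+0041.

    #include <stdio.h>
    #include <stddef.h>

    /* Emit one U+FFFD per maximal subpart of an ill-formed sequence.
       `lo`/`hi` bound the first continuation byte for the given lead byte;
       later continuation bytes must be 0x80..0xBF. */
    static void report(const unsigned char *s, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            unsigned char b = s[i];
            unsigned char lo = 0x80, hi = 0xBF;
            size_t len;

            if (b <= 0x7F) { printf("U+%04X\n", (unsigned)b); i++; continue; }
            else if (b >= 0xC2 && b <= 0xDF)   len = 2;
            else if (b == 0xE0)              { len = 3; lo = 0xA0; }
            else if (b >= 0xE1 && b <= 0xEC)   len = 3;
            else if (b == 0xED)              { len = 3; hi = 0x9F; } /* exclude surrogates */
            else if (b == 0xEE || b == 0xEF)   len = 3;
            else if (b == 0xF0)              { len = 4; lo = 0x90; }
            else if (b >= 0xF1 && b <= 0xF3)   len = 4;
            else if (b == 0xF4)              { len = 4; hi = 0x8F; } /* cap at U+10FFFF */
            else { printf("U+FFFD\n"); i++; continue; } /* 0x80..0xC1, 0xF5..0xFF */

            size_t j = i + 1, k = 1;
            while (k < len && j < n) {
                unsigned char clo = (k == 1) ? lo : 0x80;
                unsigned char chi = (k == 1) ? hi : 0xBF;
                if (s[j] < clo || s[j] > chi) break;
                j++; k++;
            }
            if (k == len) printf("well-formed %zu-byte sequence\n", len);
            else          printf("U+FFFD\n");  /* maximal subpart s[i..j-1] */
            i = j;                             /* resume at the offending byte */
        }
    }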
The Unicode Consortium seems to be concerned only with accuracy (not dropping good bytes) and security (not joining two pieces of good text across removed bad bytes, because a security scanner might have passed the text when the bad bytes were present but would have blocked the cleaned-up version). It leaves any specific practice for others to define. (It seems it had proposed best practices itself, but is backing away from them now that the W3C has formalized its own.)
The W3C is concerned with security, of course, but also with the consistency you ask for. It says to signal an error (e.g. insert the replacement character) for every ill-formed subsequence, per its very detailed reference UTF-8 decoder algorithm.

Why are there no 5-byte and 6-byte sequences in UTF-8?

Why are there no 5-byte or 6-byte sequences? I know they existed until 2003, when they were removed, but I cannot find out why they were removed.
The Wikipedia page on UTF-8 says
In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
but I don't understand why it's important.
Because there are no Unicode characters which would require them. And such characters cannot be added either, because they would be impossible to encode with UTF-16 surrogate pairs.
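To put numbers on "impossible to encode with UTF-16": a surrogate pair carries 10 + 10 payload bits on top of an offset of 0x10000, so the highest code point UTF-16 can reach works out as follows (a throwaway sketch of the arithmetic, nothing more).

    #include <stdio.h>

    int main(void)
    {
        /* High surrogate contributes 10 bits, low surrogate 10 bits,
           and the pair is offset by 0x10000. */
        unsigned long max_cp = 0x10000UL + ((0x3FFUL << 10) | 0x3FFUL);
        printf("U+%lX\n", max_cp);   /* prints U+10FFFF */
        return 0;
    }

Anything a 5- or 6-byte UTF-8 sequence would encode lies above U+10FFFF, i.e. outside what UTF-16 can represent.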
I’ve heard some reasons, but didn’t find any of them convincing. Basically, the stupid reason is: UTF-16 was specified before UTF-8, and at that time 20 bits of storage for characters (yielding 2²⁰+2¹⁶ characters, minus a few, such as non-characters and surrogates, for management) were deemed enough.
UTF-8 and UTF-16 are already variable-length encodings that, as you said for UTF-8, could be extended without big hassle (use 5- and 6-byte words). Extending UTF-32 to cover 21 to 31 bits is trivial (32 could be a problem due to signedness), but making it variable-length defeats the use case of UTF-32 completely.
Extending UTF-16 is hard, but I’ll try. Look at what UTF-8 does in a 2-byte sequence: The initial 110yyyyy acts like a high surrogate and 10zzzzzz like a low surrogate. For UTF-16, flip it around and re-use high surrogates as “initial surrogates” and low surrogates as “continue surrogates”. So, basically, you can have multiple low surrogates.
There’s a problem, though: Unicode streams are supposed to resist misinterpretation when you’re “tuning in” or the sender is “tuning out”.
In UTF-8, if you read a stream of bytes and it ends with 11100010 10000010, you know for sure the stream is incomplete. 1110 tells you: This is a 3-byte word, but one is still missing. In the suggested “extended UTF-16”, there’s nothing like that.
In UTF-16, if you read a stream of bytes and it ends with a high surrogate, you know for sure the stream is incomplete.
The “tuning out” can be solved by using U+10FFFE as an announcement of a single UTF-32 encoding. If the stream stops after U+10FFFE, you know you’re missing something; the same goes for an incomplete UTF-32 unit. And if it stops in the middle of the U+10FFFE, it’s lacking a low surrogate. But that does not work because “tuning in” to the UTF-32 encoding can mislead you.
What could be utilized are the so-called non-characters (the most well-known would be the reverse of the byte order mark) at the end of plane 16: encode U+10FFFE and U+10FFFF using existing surrogates to announce a 3- or 4-byte sequence, respectively. This is very wasteful: 32 bits are used for the announcement alone, and 48 or 64 additional bits for the actual encoding. However, it is still better than, say, using U+10FFFE and U+10FFFF around a single UTF-32 encoding.
Maybe there’s something flawed in this reasoning. This is an argument of the sort: This is hard and I’ll prove it by trying and showing where the traps are.
Right now the space is allocated for 4^8 + 4^10 code points (CPs), i.e. 1,114,112, but barely 1/4 to 1/3 of that is assigned to anything.
So unless there's a sudden need to add another 750k CPs in a very short time, up to 4 bytes for UTF-8 should be more than enough for years to come.
** Writing it as 4^8 + 4^10 is just personal preference: on top of clarity and simplicity, it also clearly delineates the CPs by UTF-8 byte count:
4^8 = 65,536 = all CPs encoded with 1, 2, or 3 UTF-8 bytes
4^10 = 1,048,576 = all CPs encoded with 4 UTF-8 bytes
instead of something unseemly like 2^16 * 17 or, worse, 32^4 + 16^4.
*** Unrelated side note: the cleanest formula triplet I managed to conjure up for the starting points of the UTF-16 surrogate ranges is:
4^5 * 54 = 55,296 = 0xD800 = start of the high surrogates
4^5 * 55 = 56,320 = 0xDC00 = start of the low surrogates
4^5 * 56 = 57,344 = 0xE000 = just beyond the upper boundary 0xDFFF
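A quick throwaway C check of those identities, in case anyone wants to verify them:

    #include <stdio.h>

    int main(void)
    {
        /* The code point space and the surrogate boundaries, re-derived. */
        printf("%d\n", 65536 + 1048576 == 17 * 65536);   /* 4^8 + 4^10 == 2^16 * 17 */
        printf("%d\n", 1024 * 54 == 0xD800);             /* high surrogates start   */
        printf("%d\n", 1024 * 55 == 0xDC00);             /* low surrogates start    */
        printf("%d\n", 1024 * 56 == 0xE000);             /* first CP past 0xDFFF    */
        return 0;                                        /* every line prints 1     */
    }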

Is it true that endianness only affects the memory layout of numbers, but not strings?

Is it true that whether the architecture is big or little endian, only the memory layout of numbers differs, while that of strings is the same?
If you have a simple 8-bit character representation (e.g. extended ASCII), then no, endianness does not affect the layout, because each character is one byte.
If you have a multi-byte representation, such as UTF-16, then yes, endianness is still important (see e.g. http://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes).
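A small illustration (my own addition, not from the linked page): the same two-character string "Aé" has a single UTF-8 byte sequence on every machine, but two possible UTF-16 byte sequences depending on byte order.

    /* "Aé" (U+0041, U+00E9) as raw bytes under different encodings. */
    const unsigned char utf8[]    = { 0x41, 0xC3, 0xA9 };        /* same on any host */
    const unsigned char utf16le[] = { 0x41, 0x00, 0xE9, 0x00 };  /* little-endian    */
    const unsigned char utf16be[] = { 0x00, 0x41, 0x00, 0xE9 };  /* big-endian       */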
For strings of 1-byte characters that is correct. For Unicode strings stored as UTF-16 (2 bytes per code unit) there will be a difference.
That's generally not true. Depending on the circumstances, more than one byte might be used for characters, in which case there is a difference between little endian encoding of characters and big endian encoding of characters.
For the most part, but you should understand why. Big vs little endian refers to the ordering of bytes in multi-byte data types like integers. ASCII characters are just a single byte.
Note however that Unicode characters may be encoded as multiple bytes, so the byte order matters. The whole point of Unicode is that a single byte can only encode 256 different values, which is not enough for all the languages in the world.
Refer here for more information about what endianness means:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Data/endian.html

Resources