The reason behind endianness?

I was wondering why some architectures use little-endian and others big-endian. I remember reading somewhere that it has to do with performance; however, I don't understand how endianness can influence it. I also know that:
The little-endian system has the property that the same value can be read from memory at different lengths without using different addresses.
Which seems a nice feature, but, even so, many systems use big-endian, which probably means big-endian has some advantages too (if so, which?).
I'm sure there's more to it, most probably digging down to the hardware level. Would love to know the details.

I've looked around the net a bit for more information on this question, and there is quite a range of answers and reasoning to explain why big- or little-endian ordering may be preferable. I'll do my best to explain here what I found:
Little-endian
The obvious advantage of little-endianness is the one you already mentioned in your question: a given number can be read at varying widths (8, 16, 32 bits, and so on) from the same memory address. As the Wikipedia article on the topic states:
Although this little-endian property is rarely used directly by high-level programmers, it is often employed by code optimizers as well as by assembly language programmers.
Because of this, mathematical functions involving multiple precisions are easier to write because the byte significance will always correspond to the memory address, whereas with big-endian numbers this is not the case. This seems to be the argument for little-endianness that is quoted over and over again... because of its prevalence I would have to assume that the benefits of this ordering are relatively significant.
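The property described above can be illustrated with a minimal sketch. It uses Python's struct module to stand in for memory reads at different widths; the value 0x2A and the helper name read_at_offset_zero are just illustrative choices, not anything from the Wikipedia article.

```python
import struct

# The same small value stored as a 32-bit integer in each byte order.
value = 0x2A
little = struct.pack("<I", value)  # bytes: 2a 00 00 00
big = struct.pack(">I", value)     # bytes: 00 00 00 2a


def read_at_offset_zero(buf, fmt):
    """Interpret the first bytes of buf according to a struct format string."""
    return struct.unpack_from(fmt, buf, 0)[0]


# Little-endian: 8-, 16- and 32-bit reads from offset 0 all give the same value.
little_reads = [read_at_offset_zero(little, f) for f in ("<B", "<H", "<I")]

# Big-endian: the narrower reads from offset 0 only see the high-order zero bytes.
big_reads = [read_at_offset_zero(big, f) for f in (">B", ">H", ">I")]

print(little_reads)  # [42, 42, 42]
print(big_reads)     # [0, 0, 42]
```

This only holds when the value fits in the narrower width; a larger value would be truncated to its low-order bytes by the narrower little-endian reads.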
Another interesting explanation that I found concerns addition and subtraction. When adding or subtracting multi-byte numbers, the least significant byte must be fetched first to see if there is a carryover to more significant bytes. Because the least-significant byte is read first in little-endian numbers, the system can parallelize and begin calculation on this byte while fetching the following byte(s).
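As a rough sketch of the carry argument, the function below adds two little-endian numbers one byte at a time, the way an 8-bit machine would have to. The function name and operands are made up for illustration; the point is only that the loop walks memory in increasing address order and the carry flows in the same direction.

```python
def add_little_endian(a: bytes, b: bytes) -> bytes:
    """Add two equal-length little-endian unsigned integers byte by byte."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result = bytearray(len(a))
    carry = 0
    # Index 0 is the least significant byte, so the first byte fetched
    # is also the first byte the adder needs.
    for i in range(len(a)):
        total = a[i] + b[i] + carry
        result[i] = total & 0xFF  # low 8 bits stay in this position
        carry = total >> 8        # overflow moves to the next, more significant byte
    return bytes(result)          # a carry out of the top byte is dropped


x = (0x12FF).to_bytes(4, "little")
y = (0x0001).to_bytes(4, "little")
total = int.from_bytes(add_little_endian(x, y), "little")
print(hex(total))  # 0x1300
```

With big-endian storage the same loop would have to start at the highest address and walk backwards.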
Big-endian
Going back to the Wikipedia article, the stated advantage of big-endian numbers is that the size of the number can be more easily estimated because the most significant digit comes first. Related to this is that it is simple to tell whether a number is positive or negative by examining the most significant bit of the byte at offset 0, since in big-endian order that byte holds the sign bit.
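A small sketch of the sign check, assuming two's-complement integers. The helper names are hypothetical; the comparison just shows that the sign bit lives in the first byte for big-endian storage and in the last byte for little-endian storage.

```python
def is_negative_big_endian(buf: bytes) -> bool:
    """Sign bit is the top bit of the first byte (lowest address)."""
    return bool(buf[0] & 0x80)


def is_negative_little_endian(buf: bytes) -> bool:
    """Sign bit is the top bit of the last byte, so the width must be known."""
    return bool(buf[-1] & 0x80)


minus_five_be = (-5).to_bytes(4, "big", signed=True)     # ff ff ff fb
minus_five_le = (-5).to_bytes(4, "little", signed=True)  # fb ff ff ff
plus_five_be = (5).to_bytes(4, "big", signed=True)       # 00 00 00 05

print(is_negative_big_endian(minus_five_be))     # True
print(is_negative_big_endian(plus_five_be))      # False
print(is_negative_little_endian(minus_five_le))  # True
```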
What is also stated when discussing the benefits of big-endianness is that the binary digits are ordered as most people order base-10 digits. This is advantageous performance-wise when converting from binary to decimal.
While all these arguments are interesting (at least I think so), their applicability to modern processors is another matter. In particular, the addition/subtraction argument was most valid on 8 bit systems...
For my money, little-endianness seems to make the most sense and is by far the most common when looking at all the devices which use it. I think that big-endianness is still used more for reasons of legacy than performance. Perhaps at one time the designers of a given architecture decided that big-endianness was preferable to little-endianness, and as the architecture evolved over the years the endianness stayed the same.
The parallel I draw here is with JPEG. JPEG is a big-endian format, despite the fact that virtually all the machines that consume it are little-endian. While one can ask what the benefits of JPEG being big-endian are, I would venture to say that for all intents and purposes the performance arguments mentioned above don't make a shred of difference. The fact is that JPEG was designed that way, and so long as it remains in use, that way it shall stay.

I would assume that it was the hardware designers of the first processors who decided which endianness would best integrate with their preferred/existing/planned micro-architecture for the chips they were developing from scratch.
Once established, and for compatibility reasons, the endianness was more or less carried on to later generations of hardware; which would support the 'legacy' argument for why still both kinds exist today.

Related

Go: pointer bit stealing technique

In the well-known book The Art of Multiprocessor Programming by Herlihy and Shavit, some of the lock-free and wait-free algorithms use Java's generic AtomicMarkableReference<T> type. It allows a single atomic CAS operation to be performed on a pair consisting of a T reference and a boolean mark.
There is no similar type in the C/C++/Go standard libraries, but at least in C++ it's possible to model it using the bit stealing approach (see C++ example). On the x86_64 architecture only 48 of the 64 bits are actually used, so one can store arbitrary data in the remaining 16 bits and work with the whole pointer and the data atomically.
As far as I understand, there are two requirements to implement this approach:
Pointers must be aligned.
Pointer's low bits must be clear (if you want to store something like bool in this area, it must not be already occupied).
Are these requirements met in Go? Are there any working examples of bit stealing technique in Go?
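To make the packing arithmetic concrete, here is a toy model using plain integers in place of pointers. It is only a sketch of the bit manipulation: Python has no raw pointers or atomic CAS on machine words, the 8-byte alignment is an assumption, and the names pack, unpack and the sample address are invented for illustration. It says nothing about whether Go's runtime tolerates such tagged values.

```python
MARK_BIT = 0x1   # lowest bit holds the boolean mark
ALIGNMENT = 8    # assumed alignment; guarantees the low 3 bits are zero


def pack(address: int, mark: bool) -> int:
    """Combine an aligned address and a mark into one machine-word-sized integer."""
    if address % ALIGNMENT != 0:
        raise ValueError("address must be aligned so its low bits are clear")
    return address | (MARK_BIT if mark else 0)


def unpack(word: int):
    """Split a packed word back into (address, mark)."""
    return word & ~MARK_BIT, bool(word & MARK_BIT)


address = 0x00007F00DEADBEE8  # made-up, 8-byte-aligned address
word = pack(address, True)
print(hex(word))     # 0x7f00deadbee9
print(unpack(word))  # (139637713419496, True)
```

Because the pair lives in one word, a single compare-and-swap on that word would update the reference and the mark together, which is the property AtomicMarkableReference provides.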

Are there any advantages to network byte order in a new protocol?

(I know many people are going to be tempted to close this question; please don't; I'm asking for concrete technical answers, if any exist.)
"Network byte order" is big-endian for reasons that cannot be asked on stackoverflow. Lots of old protocols use that order and can't be changed but I wonder if there are any technical reasons to choose big endian for a new protocol.
I would think little endian is better, because 99.99% of processors in use are little endian (ARM can technically do both, but in reality it is always set to little endian). So I was surprised to see that CBOR, a relatively recent protocol, uses big endian. Is there an advantage that I haven't thought of?
It boils down to human factors: It is easier to read a multi-byte integer in a hex dump if it is encoded with the most significant byte(s) first. For example, the CBOR representation of 0x1234 (4,660) is the byte sequence 19 12 34. If you are looking for the value 0x1234, it is easier to spot it that way.
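The hex dump above can be reproduced with a short sketch. The function name cbor_encode_uint16 is hypothetical and it always emits the 16-bit form; a real CBOR encoder would pick the shortest form for smaller values.

```python
import struct


def cbor_encode_uint16(value: int) -> bytes:
    """Encode value as a CBOR unsigned integer followed by a 16-bit argument."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    # 0x19 = major type 0 (unsigned integer), additional info 25 (uint16 follows).
    # ">" selects big-endian, i.e. network byte order.
    return struct.pack(">BH", 0x19, value)


def hex_dump(data: bytes) -> str:
    """Format bytes the way a hex dump would show them."""
    return " ".join(f"{b:02x}" for b in data)


print(hex_dump(cbor_encode_uint16(0x1234)))    # 19 12 34
print(hex_dump(struct.pack("<H", 0x1234)))     # 34 12 (little-endian, for comparison)
```

In the big-endian dump the digits 12 34 appear in reading order; in the little-endian one they are swapped, which is the human-factors point being made.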
TLDR;
I've been in the field for over 40 years now, so there's a lot of history behind this. Even the definition of a "byte" has changed over that many years, so this may take a bit of an open mind to understand how this evolved.
Dumps of binary information weren't always in bytes, nor in hexadecimal. For example, on the PDP-11 (with 16-bit words and 8-bit bytes), word-wide dumps in octal notation were common. This was useful because of the machine architecture, which included 8 registers and 8 addressing modes, so machine language dumps in octal were easier to decode than hex.
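As an illustration of why octal suited the PDP-11, the sketch below splits a double-operand instruction word into its 3-bit fields. The example word 010102 (MOV R1, R2) and the helper name are my own choices; the takeaway is that each register and mode field lines up with exactly one octal digit, while the hex form hides them.

```python
def decode_double_operand(word: int) -> dict:
    """Split a 16-bit PDP-11 double-operand instruction into its fields."""
    return {
        "opcode": (word >> 12) & 0o17,  # top 4 bits (byte flag plus 3-bit opcode)
        "src_mode": (word >> 9) & 0o7,
        "src_reg": (word >> 6) & 0o7,
        "dst_mode": (word >> 3) & 0o7,
        "dst_reg": word & 0o7,
    }


word = 0o010102  # MOV R1, R2
octal_text = format(word, "06o")
hex_text = format(word, "04x")

print(octal_text)  # 010102: opcode 01, source 01 (mode 0, R1), destination 02 (mode 0, R2)
print(hex_text)    # 1042: the same bits, but the fields straddle hex digits
print(decode_double_operand(word))
```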

how does ECC for conventional [cyclic] burst error correction work?

How does ECC for burst error correction work?
By "burst error detection", I mean a technique that can detect (for example) any combination of bit errors within any one [or two] sequences of 64 consecutive bits.
I need a conceptual explanation, not math.
I studied several descriptions of the technique that are formulated in terms of endless math symbols, but I do not understand what they are saying (because I am not fluent in those advanced math formulations).
I ask this question because I dreamed up a technique to detect one burst of 64-bits in a 4096-byte (32768-bit) data-stream (disk-sector/transmission/etc), and want someone to explain the following:
#1: whether my approach is different or equivalent to "cyclic error codes".
#2: how much less efficient is my technique (640-bits corrects any 64-bit burst in 32768-bit stream).
#3: whether anyone can see a way to make my approach detect two bursts instead of only one.
#4: whether my approach is substantially simpler for software implementation.
I posted a conceptual explanation of my technique recently, but several folks were annoyed by my detailed explanation and closed the question. However, it demonstrates that [at least] my technique can be described in conceptual terms, as hopefully someone can for conventional techniques (cyclic codes). You will also need to read my explanation (to compare it with conventional techniques) at this page:
how does ECC for burst error correction work?
I don't have a full answer to your question, but years ago I implemented cyclic ECC in a hard disk controller, and that design was clearly not as simple, regular and structured as your technique.
Though my cyclic ECC was fairly simple to implement in hardware, I would be hard pressed to figure out how to reformulate it in software! In hardware it was a long shift register (roughly 32 to 64 bits, if memory serves) with XOR gates inserted at roughly 15 locations to conditionally flip bits based upon bits elsewhere in the stream.
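For what it's worth, a shift register with XOR taps translates to software fairly directly. The sketch below is not the disk controller design described above: it assumes a much shorter 16-bit register with the CRC-16-CCITT polynomial 0x1021 (taps chosen purely as a well-known example), and it only detects a burst rather than locating and correcting it. The helper flip_burst is hypothetical test scaffolding.

```python
import binascii

POLY = 0x1021  # assumed tap positions: x^16 + x^12 + x^5 + 1 (CRC-16-CCITT)


def crc16_shift_register(data: bytes, register: int = 0) -> int:
    """Clock data through a 16-bit shift register, one bit at a time, MSB first."""
    for byte in data:
        for bit_index in range(7, -1, -1):
            incoming = (byte >> bit_index) & 1
            # The bit leaving the register, XORed with the incoming bit,
            # decides whether the tapped positions get flipped.
            feedback = ((register >> 15) & 1) ^ incoming
            register = (register << 1) & 0xFFFF
            if feedback:
                register ^= POLY
    return register


def flip_burst(data: bytes, start_bit: int, length: int) -> bytes:
    """Return a copy of data with `length` consecutive bits inverted."""
    corrupted = bytearray(data)
    for bit in range(start_bit, start_bit + length):
        corrupted[bit // 8] ^= 0x80 >> (bit % 8)
    return bytes(corrupted)


# Cross-check against the standard library's CRC-CCITT routine.
matches_stdlib = crc16_shift_register(b"123456789") == binascii.crc_hqx(b"123456789", 0)

sector = bytes(range(256)) * 2                         # 512 bytes of sample data
good = crc16_shift_register(sector)
bad = crc16_shift_register(flip_burst(sector, 1000, 11))
print(matches_stdlib, good != bad)  # True True
```

A 16-bit register of this kind detects any single burst up to 16 bits long; correcting a burst needs extra work to deduce its position from the remainder, which is where the longer registers come in.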
While my implementation could not detect bursts as long as 64-bits (only about 11 or 13-bits as I recall), my impression is that your technique requires two or three times as many bits as optimal cyclic ECC techniques would for similar burst-length and data-stream length.
However, the overhead of your technique is probably small enough to be insignificant. Furthermore, looking at your scheme makes me think (but not be certain) you can correct far more errors than just the "one burst" you designed it for. So your technique may be more robust than conventional cyclic ECC, but more complex software processing would be necessary to "locate and correct" errors not within the one 64-bit burst.
Also on the positive side, your technique can clearly be implemented in hardware easily and efficiently.

Safe mixing of entropy sources

Let us assume we're generating very large (e.g. 128 or 256bit) numbers to serve as keys for a block cipher.
Let us further assume that we wear tinfoil hats (at least when outside).
Being so paranoid, we want to be sure of our available entropy, but we don't entirely trust any particular source. Maybe the government is rigging our coins. Maybe these dice are ever so subtly weighted. What if the hardware interrupts feeding into /dev/random are just a little too consistent? (Besides being paranoid, we're lazy enough that we don't want to generate it all by hand...)
So, let's mix them all up.
What are the secure method(s) for doing this? Presumably just concatenating a few bytes from each source isn't entirely secure -- if one of the sources is biased, it might, in theory, lend itself to such things as a related-key attack, for example.
Is running SHA-256 over the concatenated bytes sufficient?
(And yes, at some point soon I am going to pick up a copy of Cryptography Engineering. :))
Since you mention /dev/random -- on Linux at least, /dev/random is fed by an algorithm that does very much what you're describing. It takes several variously-trusted entropy sources and mixes them into an "entropy pool" using a polynomial function -- for each new byte of entropy that comes in, it's xor'd into the pool, and then the entire pool is stirred with the mixing function. When it's desired to get some randomness out of the pool, the entire pool is hashed with SHA-1 to get the output, then the pool is mixed again (and actually there's some more hashing, folding, and mutilating going on to make sure that reversing the process is about as hard as reversing SHA-1). At the same time, there's a bunch of accounting going on -- each time some entropy is added to the pool, an estimate of the number of bits of entropy it's worth is added to the account, and each time some bytes are extracted from the pool, that number is subtracted, and the random device will block (waiting on more external entropy) if the account would go below zero. Of course, if you use the "urandom" device, the blocking doesn't happen and the pool simply keeps getting hashed and mixed to produce more bytes, which turns it into a PRNG instead of an RNG.
Anyway... it's actually pretty interesting and pretty well commented -- you might want to study it. drivers/char/random.c in the linux-2.6 tree.
Using a hash function is a good approach - just make sure you underestimate the amount of entropy each source contributes, so that if you are right about one or more of them being less than totally random, you haven't weakened your key unduly.
This isn't dissimilar to the approach used in key stretching (though you have no need for multiple iterations here).
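A minimal sketch of the hash-based mixing, assuming SHA-256 from hashlib. The function name mix_sources and the sample inputs are placeholders, and the length prefix is my own addition so that the boundaries between sources stay unambiguous; treat it as an illustration rather than vetted key-derivation code.

```python
import hashlib
import os


def mix_sources(*sources: bytes) -> bytes:
    """Hash several entropy samples together into one 256-bit value."""
    h = hashlib.sha256()
    for sample in sources:
        # Length prefix: (b"ab", b"c") and (b"a", b"bc") must not hash alike.
        h.update(len(sample).to_bytes(8, "big"))
        h.update(sample)
    return h.digest()


coin_flips = bytes([0b10110010, 0b01101001])  # placeholder for hand-collected bits
dice_rolls = bytes([3, 6, 1, 4, 2, 5, 5, 1])  # placeholder for dice results
os_random = os.urandom(32)                    # whatever the OS provides

key = mix_sources(coin_flips, dice_rolls, os_random)
print(len(key))  # 32
```

Following the advice above, you would feed in more raw material from each source than its nominal entropy suggests, so that one weak source does not leave the 256-bit output underfilled.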
I've done this before, and my approach was just to XOR them, byte-by-byte, against each other.
Running them through some other algorithm, like SHA-256, is terribly inefficient, so it's not practical, and I think it would be not really useful and possibly harmful.
If you do happen to be incredibly paranoid, and have a tiny bit of money, it might be fun to buy a "true" (depending on how convinced you are by quantum mechanics) quantum random number generator.
-- Edit:
FWIW, I think the method I describe above (or something similar) is effectively a one-time pad from the point of view of either source, assuming one of them is random, and therefore unattackable, assuming they are independent and out to get you. I'm happy to be corrected on this if someone takes issue with it, and I encourage anyone not taking issue with it to question it anyway, and find out for yourself.
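The XOR approach might look like the sketch below. The function name xor_mix and the sample inputs are made up. The second half shows the independence caveat: two correlated (here, identical) sources cancel each other out completely.

```python
import os


def xor_mix(*sources: bytes) -> bytes:
    """XOR equal-length byte strings together, position by position."""
    if not sources:
        raise ValueError("need at least one source")
    length = len(sources[0])
    if any(len(s) != length for s in sources):
        raise ValueError("all sources must have the same length")
    mixed = bytearray(length)
    for sample in sources:
        for i, byte in enumerate(sample):
            mixed[i] ^= byte
    return bytes(mixed)


good = os.urandom(16)
rigged = bytes(16)  # a completely predictable source: all zeros

# A predictable but independent source cannot hurt a good one.
print(xor_mix(good, rigged) == good)        # True

# Sources that are not independent can cancel: identical inputs give all zeros.
print(xor_mix(good, good) == bytes(16))     # True
```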
If you have a source of randomness but you're not sure whether it is biased or not, then there are a lot of different algorithms. Depending on how much work you want to do, the amount of entropy you waste from the original source differs.
The easiest algorithm is the (improved) von Neumann algorithm. You can find the details in this PDF:
http://security1.win.tue.nl/~bskoric/physsec/files/PhysSec_LectureNotes.pdf
on page 27.
I also recommend reading this document if you're interested in how to produce uniform randomness from a given source, how true random number generators work, etc.!
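For reference, here is a sketch of the basic von Neumann extractor, not the improved variant from the lecture notes. It assumes the input bits are independent with a fixed (possibly unknown) bias; the function name and the sample bit list are illustrative only.

```python
def von_neumann_extract(bits):
    """Turn independent biased bits into unbiased ones by looking at pairs.

    Pairs 01 and 10 are equally likely whatever the bias, so each yields
    one output bit (the first bit of the pair); pairs 00 and 11 are thrown away.
    """
    output = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            output.append(first)
    return output


raw = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
print(von_neumann_extract(raw))  # [0, 1, 1]
```

The discarded pairs are the wasted entropy mentioned above: even a perfectly fair source loses at least half of its input on average, and a heavily biased one loses far more.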

Is there really such a thing as a char or short in modern programming?

I've been learning to program for the Mac over the past few months (I have experience in other languages). Obviously that has meant learning the Objective-C language and thus the plainer C it is predicated on. So I have stumbled on this quote, which refers to the C/C++ language in general, not just the Mac platform.
With C and C++ prefer use of int over char and short. The main reason behind this is that C and C++ perform arithmetic operations and parameter passing at integer level. If you have an integer value that can fit in a byte, you should still consider using an int to hold the number. If you use a char, the compiler will first convert the values into integer, perform the operations and then convert back the result to char.
So my question: is this the case in the Mac desktop and iPhone OS environments? I understand that when talking about these environments we're actually talking about 3-4 different architectures (PPC, i386, ARM and the A4 ARM variant), so there may not be a single answer.
Nevertheless, does the general principle hold that in modern 32-bit / 64-bit systems, using 1-2 byte variables that don't align with the machine's natural 4-byte words doesn't provide much of the efficiency we may expect?
For instance, a plain old C-Array of 100,000 chars is smaller than the same 100,000 ints by a factor of four, but if during an enumeration, reading out each index involves a cast/boxing/unboxing of sorts, will we see overall lower 'performance' despite the saved memory overhead?
The processor is very very fast compared to the memory speed. It will always pay to store values in memory as chars or shorts (though to avoid porting problems you should use int8_t and int16_t). Less cache will be used, and there will be fewer memory accesses.
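The memory side of this argument is easy to check. The sketch below uses Python's ctypes module only as a convenient way to allocate C-style arrays with fixed-width element types; the 64-byte cache line is an assumption that is typical but not universal.

```python
import ctypes

COUNT = 100_000
CACHE_LINE = 64  # assumed cache line size in bytes

chars = (ctypes.c_int8 * COUNT)()   # like int8_t[100000]
ints = (ctypes.c_int32 * COUNT)()   # like int32_t[100000]

char_bytes = ctypes.sizeof(chars)
int_bytes = ctypes.sizeof(ints)

print(char_bytes, int_bytes)  # 100000 400000
print(CACHE_LINE // ctypes.sizeof(ctypes.c_int8),
      CACHE_LINE // ctypes.sizeof(ctypes.c_int32))  # 64 16 elements per cache line
```

So a sequential scan over the byte array touches a quarter as many cache lines, which is the saving that usually outweighs any widening done in registers.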
Can't speak for PPC/Arm/A4Arm, but x86 has the ability to operate on data as if it was 8bit, 16bit, or 32bit (64bit if an x86_64 in 64bit mode), although I'm not sure if the compiler would take advantage of those instructions. Even when using 32bit load, the compiler could AND the data with a mask that'd clear the upper 16/24bits, which would be relatively fast.
Likely, the ability to fit far more data into the cache would at least cancel out the speed difference... although the only way to know for sure would be to actually profile the code.
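The masking step mentioned above amounts to a single AND. This sketch just shows the arithmetic on a Python integer standing in for a 32-bit register; the helper name low_bits and the sample value are illustrative.

```python
def low_bits(word: int, width: int) -> int:
    """Keep only the lowest `width` bits of word, clearing everything above."""
    return word & ((1 << width) - 1)


loaded = 0x12345678              # pretend this came from a 32-bit load
as_byte = low_bits(loaded, 8)    # upper 24 bits cleared
as_short = low_bits(loaded, 16)  # upper 16 bits cleared

print(hex(as_byte), hex(as_short))  # 0x78 0x5678
```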
Of course there is a need to use data structures smaller than the register size of the target machine. Imagine you are storing text data encoded as UTF-8 or ASCII in memory, where each character is most likely a byte in size: do you want to store the characters as 64-bit quantities?
The advice you are looking at is a warning not to over-optimize.
You have to balance the savings in space against the computational performance of your choice.
I wouldn't worry too much about it; today's modern CPUs are complicated enough that it's hard to make this kind of judgement on your own. Choose the obvious datatype and let the compiler worry about the rest.
The addressing model of the x86 architecture is that the basic unit of memory is 8 bit bytes.
This is to simplify operation with character strings and decimal arithmetic.
Then, in order to have useful sizes of integers, the instruction set allows using these in units of 1, 2, 4, and (recently) 8 bytes.
A fact to remember is that most software development takes place writing for different processors than most of us here deal with on a day-to-day basis.
C and assembler are common languages for these.
About ten billion CPUs were manufactured in 2008. About 98% of new CPUs produced each year are embedded.

Resources