CISC versus RISC - CPU

If you were writing a textbook, and you needed to decide on a CPU design to discuss certain issues, would you choose RISC or CISC? Pros for RISC, well, you know: cleaner, easier, and so on. Cons for RISC: I would have to use an emulator, and there are few RISC CPUs on the market compared to x86, and therefore less impact, even if I chose a pure RISC CPU like ARM or SPARC. I'm torn, though I am leaning towards x86 and CISC... many pros; cons: it's not RISC, and all those books like Hennessy et al. and others love RISC! So, in summary: would you choose RISC or CISC (x86)?
Thanks.

Are you talking specifically about x86 processor design, or a specific design feature which is more prevalent in x86 than anything else? No? Then go with a RISC architecture.
All that using x86 will achieve is that the x86 mechanics will obscure whatever point it is you're actually trying to make.

For general (or programmer) audiences I would suggest x86. No matter what language they are familiar with, the same building blocks, like for and if, or the concept of functions, can be easily demonstrated with x86 instructions.
Asm is already abstract enough; it is priceless if they can connect the new knowledge to their existing understanding of programming. (It may even help them write better-quality code.)
I don't think choosing x86 means that you have to include all the instructions. The next level can be to introduce the machine-code representation of some asm constructs, but only once they are comfortable with the mnemonics first.
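To sketch what that mapping might look like (this is my own toy example, not from any particular course or compiler, and the exact instruction sequence will vary), a simple counting loop in C lines up almost one-to-one with a handful of x86 instructions:

    #include <stdio.h>

    /* Sum the integers 1..n: the kind of building block that maps cleanly
       onto x86. The comments show one plausible (hypothetical) translation,
       assuming the System V convention with n arriving in edi. */
    int sum_to(int n)
    {
        int total = 0;                  /* xor  eax, eax                    */
        for (int i = 1; i <= n; i++) {  /* mov  ecx, 1                      */
            total += i;                 /* .loop: add  eax, ecx             */
        }                               /*        inc  ecx                  */
                                        /*        cmp  ecx, edi             */
                                        /*        jle  .loop                */
        return total;                   /* ret  (result already in eax)     */
    }

    int main(void)
    {
        printf("%d\n", sum_to(10));     /* prints 55 */
        return 0;
    }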

I prefer RISC, because RISC CPUs have fast and small instructions. They're cheaper than CISC ones and have many more registers. And some instructions of CISC processors, for example Intel's xchg, or scas, stos, movs and cmps used with rep, are so specialized that many compilers never use them, so having them is only a waste of silicon and money.
And, talking about registers, even if RISC processors need more instructions to accomplish what a CISC would do with only one, they have so many registers that (I think) memory is accessed less frequently on a RISC.
I'm talking as a CISC "programmer", since I learned on Intel's x86, and sometimes having only 6 general-purpose registers available, which are often reduced to 3 or 4 in loops (ecx is gone, along with either edi or esi or both), can be very tricky.
Although assembly itself tends to hide the true nature of the code, I think (remember, I grew up with the x86) that RISC assembly is more difficult to understand due to the larger number of instructions needed.

If the main requirement for a project is security rather than capacity (security is often a tradeoff against capacity), then I propose using a RISC architecture. For instance, a firewall written in x86 assembler was known to have a security issue that a RISC implementation, such as one in SPARC assembler, didn't. So the spec for the most secure project could be to use SPARC assembly, since it is known to be a secure environment. And even if it weren't known to be secure, there are so few people who can program SPARC assembly compared to x86 assembly that the number of attackers able to program a security breach would be smaller if the firewall is running on a RISC architecture.
So I say that if you are talking about a firewall application, then probably it should be a RISC architecture.

Related

Can I compile Go programs on Xeon Phi (Knight's Landing) processors?

I'm a hobbyist who likes to run my own programs in Go, and as Xeon Phi processors become older they're also becoming extremely cheap. So cheap that I can build a dual-socket machine from 2015/16 for under $1000.
I'm trying to find out if I can run Go programs on these. From what I've seen, this thread says they won't run (and suggests trying gccgo), because the chip partially runs on an x87 ISA. Confusingly, the Go release notes say x87 support is being dropped in 1.16, implying it was supported in the past. I've seen in other threads that all programs will run on the compatibility layer, but that's an extremely slow layer which only has access to a small portion of the CPU's cache.
I feel like I'm moving farther and farther out of my element; I was wondering if someone who's used Xeon Phi knows whether it will run Go code. Or just in general, after booting up Ubuntu (or FreeBSD, something I've seen done and that is listed in the motherboard specs), what sort of things aren't going to work and what will?
I appreciate any and all help!
You're basing your Knight's Landing worries on this quote about Knight's Corner:
The Knight's Corner processor is based on an x86-64 foundation, yes, but it in fact has its own floating-point instruction set—no x87, no AVX, no SSE, no MMX... Oh, and then you can throw all that away when Knight's Landing (KNL) comes out.
By "throw all that away", they mean all the worries and incompatibilities. KNL is based on Silvermont and is fully x86-64 compatible (including x87, SSE, and SSE2 for both standard ways of doing FP math). It also supports AVX-512F, AVX-512ER, and a few other AVX-512 extensions, along with AVX and AVX2 and SSE up to SSE4.2. A lot like a Skylake-server CPU, except a different set of AVX-512 extensions.
The point of this is exactly to solve the problem you're worried about: any legacy binary can run on KNL. To get good performance out of it, you want to be running code vectorized with AVX-512 vectors in the loops that do the heavy lifting, but all the surrounding code and other programs in the rest of the Linux distro or whatever can be ordinary bog-standard code that uses x87 and/or SSE as it pleases.
Knight's Corner (first-gen commercial Xeon Phi) has its own variant / precursor of AVX-512 in a core based on P5-Pentium, and no other FP hardware.
Knight's Landing (second-gen commercial Xeon Phi) is based on Silvermont, with AVX-512, and is the first that can act as a "host" processor (bootable) instead of just a coprocessor.
This "host" mode is another reason for including enough hardware to decode and execute x87 and SSE: if you're running a whole system on KNL, you're much more likely to want to execute some legacy binaries for non-perf-sensitive tasks, not only binaries compiled specifically for it.
Its x87 performance is not great, though: something like one scalar fmul per 2 clocks (https://agner.org/optimize), vs. 2-per-clock SSE mulsd (0.5c reciprocal throughput). It's the same 0.5c throughput for other SSE/AVX math, including AVX-512 vfmadd132ps zmm, which does 16 single-precision fused multiply-add operations in one instruction.
So hopefully Go's compiler doesn't use x87 much. The normal way to do scalar math in 64-bit mode (that C compilers and their math libraries use) is SSE, in XMM registers. x86-64 C compilers only use x87 for types like long double.
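To illustrate that split (a minimal sketch of my own, assuming a typical x86-64 C compiler and ABI; nothing here is specific to Go or Xeon Phi):

    #include <stdio.h>

    /* double math is normally compiled to SSE2 (mulsd/addsd on xmm registers). */
    double scale(double x, double factor)
    {
        return x * factor + 1.0;
    }

    /* long double is 80-bit x87 extended precision in the usual x86-64 ABIs,
       so this is typically the only place a compiler emits x87 code. */
    long double scale_ld(long double x, long double factor)
    {
        return x * factor + 1.0L;
    }

    int main(void)
    {
        printf("%f %Lf\n", scale(2.0, 3.0), scale_ld(2.0L, 3.0L));
        return 0;
    }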
Yes:
Xeon Phi is a series of x86 manycore processors designed and made by Intel. It is intended for use in supercomputers, servers, and high-end workstations. Its architecture allows use of standard programming languages and application programming interfaces (APIs) such as...
See also https://en.wikipedia.org/wiki/Xeon_Phi
If you can compile Go on an x86 processor, then you will be able to compile it on this specific x86 processor, which is manufactured by Intel.
Xeon is not Itanium :)
On such systems you would also be able to compile Go; you would just need to provide a suitable C compiler...
What makes you think you would otherwise not be able to compile Go on, say... an Atari, or perhaps an Arduino?
If you can elaborate on that perhaps I can improve my terrible answer further.

How can a compiler version built years ago, say gcc, still compile for a processor released recently?

Let's say I take a compiler: gcc 4.8. And a processor from Intel, let's say Skylake or some other fancy new family.
Checking this question: How to see which flags -march=native will activate? If I do gcc -march=native -E -v - </dev/null 2>&1 | grep cc1, it will spit out some flags for the host machine, which is the above processor, Skylake.
How does gcc know what flags to enable or disable... when 4.8 was released before Skylake processors were out? What about other, newer families of processors?
Consequently, the next question: is upgrading to the latest compiler necessary for it to accurately and optimally compile for a new target processor?
The question isn't really specific to gcc/intel, I would like to know how others maintain synchronicity between processor and compiler too.
Old compilers don't know how to tune for new microarchitectures. (And are also missing out on better optimization in general: New versions of gcc/clang usually add new optimizations that help across the board, e.g. gcc8 can coalesce loads/stores of multiple adjacent small variables or array elements into a single 4 or 8-byte load or store. This helps on everything.)
They can also only use ISA extensions they know about.
They can make correct code because new x86 CPUs are still x86, and are backwards compatible with code for older CPUs (see footnote 1). Same with ARM. The ARMv8 ISA is backwards compatible with ARMv7, ARMv6, and so on, so new ARM CPUs can run existing ARM binaries. (There are some AArch64 CPUs that dropped support for 32-bit mode, but never mind that.)
Consequently, the next question: is upgrading to the latest compiler necessary for it to accurately and optimally compile for a new target processor?
Yes, you want your compiler to at least know about your CPU for tuning options.
But yes, upgrading always helps, even when your CPU isn't new. New compiler versions often benefit old CPUs too, and a new set of SIMD extensions to auto-vectorize with can lead to potentially large speedups for code that spends a lot of time in one hot loop (assuming that loop auto-vectorizes well).
e.g. Phoronix recently posted GCC 5 Through GCC 10 Compiler Benchmarks - Five Years Worth Of C/C++ Compiler Performance where they benchmarked on an i7 5960X (Haswell-E) CPU. I think GCC5 knows about -march=haswell. GCC9.2 makes measurably faster code than even gcc8 on some benchmarks.
But I can pretty much guarantee it's not optimal!! Compilers are good over large scales but there's usually something a human can find in a single hot loop, if they know the low level details of optimizing for a given microarchitecture. It's merely as good as you're going to get from any compiler. (Actually performance regressions exist, so even that's not always true. File a missed-optimization bug if you find one).
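To make the auto-vectorization point concrete (a trivial sketch of my own; the function is hypothetical): built with -O3 -march=native, a loop like this is exactly the kind of code where a newer compiler plus newer SIMD extensions pays off, since GCC can emit SSE2, AVX2, or AVX-512 code for it depending on what it knows about and what the host supports.

    #include <stddef.h>

    /* A trivially vectorizable loop. The 'restrict' qualifiers promise the
       arrays don't overlap, which makes it easier for the compiler to
       auto-vectorize with whatever SIMD width -march allows. */
    void add_arrays(float *restrict dst,
                    const float *restrict a,
                    const float *restrict b,
                    size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }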
-march=native does two separate things
CPU feature detection to enable stuff like -mfma and -mbmi2. This is easy on x86 with the CPUID instruction. GCC will enable all extensions it knows about that are supported by the actual CPU. e.g. I think GCC4.8 was the first GCC to know about any AVX512 extensions, so you might even get some AVX512 auto-vectorization on an Ice Lake or Skylake-avx512. Whether it does a good job or not is another matter, for anything non-trivial. But no AVX512 with GCC4.7.
CPU type detection to set -mtune=skylake. This depends on GCC actually recognizing your specific CPU as something it knows about. If not, it falls back to -mtune=generic. It might detect (with CPUID) your L1/L2/L3 cache sizes and use that to influence some tuning decisions like inlining / unrolling, instead of using a known size for -mtune=haswell. I don't think that's a big deal; current compilers don't AFAIK introduce cache-blocking optimizations to matmul loops or things like that, and that's where knowing cache sizes really matters.
CPU type detection can also use CPUID on x86; the vendor string and model / family / stepping numbers uniquely identify the microarchitecture. (wikipedia, sandpile, InstLatx64, https://agner.org/optimize/)
x86 is very much designed to support single binaries that run on multiple microarchitectures and might want to do runtime feature detection / dispatching. So an efficient / portable / extensible CPU detection mechanism exists in the form of the CPUID instruction, introduced in Pentium and some late 486 CPUs. (And thus baseline for x86-64.)
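As a small sketch of what that looks like from C (assuming a reasonably recent GCC or Clang, which expose the CPUID machinery through builtins; the program itself is just my example):

    #include <stdio.h>

    int main(void)
    {
        /* Initialise the CPU feature data (wraps the CPUID instruction). */
        __builtin_cpu_init();

        /* Feature detection: which -m flags would be safe on this machine. */
        if (__builtin_cpu_supports("avx2"))
            puts("AVX2 available");
        if (__builtin_cpu_supports("avx512f"))
            puts("AVX-512F available");

        /* Model detection: the same kind of information -mtune relies on. */
        if (__builtin_cpu_is("skylake"))
            puts("Looks like a Skylake CPU");

        return 0;
    }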
Other ISAs are more often used in embedded uses where code gets recompiled for the specific CPU. They mostly don't have as good support for runtime detection. GCC might have to install a handler for SIGILL and just try running some instructions. Or query the OS which knows what's supported, e.g. Linux's /proc/cpuinfo.
Footnote 1:
For x86 specifically, its main claim to fame / reason for popularity is strict backwards compatibility. A new CPU that fails to run some existing programs would be a lot harder to sell, so vendors don't do that. They'll even bend over backwards to go beyond the on-paper ISA docs to make sure existing code keeps working. As former Intel architect Andy Glew said: All or almost all modern Intel processors are stricter than the manual. (For self-modifying code, and in general).
Modern PC motherboard firmware still emulates the legacy hardware of an IBM PC/XT when you boot in legacy BIOS mode, as well as implementing a software ABI for disk, keyboard, and screen access. So even bootloaders and stuff like GRUB have a consistent backwards-compatible interface to use, before they load a kernel which has actual drivers for the real hardware that's actually present.
A modern PC can I think still run real MS-DOS (the operating system) binaries in 16-bit real mode.
Adding new instruction opcodes without breaking backwards compat makes variable-length x86 machine code instructions ever more complex, and careless / anti-competitive developments in x86's history haven't helped, leading to more bloated instruction encodings for SSSE3 and later, for example. See Agner Fog's article Stop the instruction set war.
Code that depended on rep foo to decode as foo can break, though: Intel's manuals are pretty clear that random prefixes can cause code to misbehave in future. This makes it safe for Intel or AMD to introduce new instructions that decode in a known way on old CPUs, but do something new on newer CPUs. Like pause = rep nop. Or transactional memory HLE uses prefixes on locked instructions that old CPUs will ignore.
And prefixes like VEX (AVX) and EVEX (AVX512) are carefully chosen not to overlap with valid encodings of instructions, especially in 32-bit mode. See How does the instruction decoder differentiate between EVEX prefix and BOUND opcode in 32-bit mode?. This is one reason why 32-bit mode can still only use 8 vector registers (zmm0..7) even with VEX or EVEX, which allow ymm0..15 or zmm0..31 respectively in 64-bit mode. (In 32-bit mode, a VEX prefix is an invalid encoding of some other opcode. In 64-bit mode, that opcode isn't valid in the first place, so the later bytes are more flexible. But to simplify decoder HW they aren't fundamentally different.)
MIPS32r6 / MIPS64r6 in 2014 is one notable example that's not backwards compatible. It rearranged a few opcodes for instructions that stayed the same, and removed some instructions to reuse their opcode for other new instructions, e.g. branches without a delay slot. This is highly unusual and only makes sense for CPUs that are used for embedded systems (like current MIPS). Recompiling everything for MIPS32r6 is not a problem for an embedded system.
Some compilers can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but still of course only for extensions that the compiler knows about when it compiles. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code.
And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.
GCC has some support for this, via its ifunc mechanism, but IIRC you can't do that without source changes.
Intel's compiler (ICC) I think does support multi-versioning some hot functions when auto-vectorizing, with just command-line options.
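GCC's nearest equivalent that I'm aware of is the target_clones attribute, built on the ifunc mechanism mentioned above, so it does require a source change. A sketch, assuming GCC 6 or later on an x86 target, with a made-up function:

    #include <stddef.h>

    /* GCC emits one clone of this function per listed target, plus a
       resolver that uses CPUID at load time to pick the best clone. */
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    double dot(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }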
It can only happen if the new processor is specifically designed to be backwards compatible with older models.
Forget gcc for a moment. You have a compiled x86 binary from the year 2000, say, an executable built for the original Windows NT. Will a Skylake CPU run it? You betcha. Will an Itanium CPU run it? Nope, it is not designed to do that. It is a completely different architecture.
Now that executable most probably wouldn't use the Skylake efficiently, but that's the whole point of evolving architectures and introducing new instructions.
Returning to gcc, -march=native is not magic. It cannot possibly divine the new instructions and new timings. It simply selects the "best" instruction set it knows of that is supported by the CPU it runs on. How it's done is architecture-specific. x86 CPUs can be queried about their capabilities with the CPUID instruction. Other architectures may do it differently.
To put it another way, -O3 -march=native optimizes for the machine you compiled on, so it's good when you're compiling code to run on the build host. A binary built with -march=native on a Nehalem system is essentially the same as one built with -march=nehalem on any system. -march=native might detect your specific L3 cache size instead of using a default for that, if any GCC tuning decisions (like inlining or unrolling) depend on L3 size. The exception is running an old compiler on a new CPU it doesn't recognize: you get feature detection for stuff like -mavx, but for tuning only -mtune=generic.
None of this can take advantage of new features like AVX2 or BMI2 when running on a Skylake or Ice Lake system. And some specific tuning decisions that were good on Nehalem might be sub-optimal on a different CPU. (Although this is less likely; Intel mostly maintains backwards compatibility for performance as well as correctness. Getting everyone to recompile everything for P4 didn't work out so they usually try to make existing binaries run well on new CPUs.)

Why isn't the distinction between CPUs more ubiquitous?

I know that every program one writes has to eventually boil down to machine code - that's what compilers produce, that's what executable files consist of, and that's the only language that processors understand. I also know that different processors may have different instruction sets (I know 65c816 assembly, and I imagine it's vastly different from today's computers).
Here's what I'm not getting, though: If there exist different instruction sets, then why do we not seem to have to care about that every time we use software?
If a program was compiled for one particular CPU, it might not run on another - and yet, I never see notices like "Intel users download this version, AMD users download this one". I never have to even be aware of what CPU I'm on, every executable just seems to... work. The same goes for compilers, apparently - there isn't a separate version of, say, GCC, for every processor there is, right?
I'm aware that the differences in instruction sets are much more subtle than they used to be, but even then there should at least be a bit of a distinction. I'm wondering why there doesn't seem to be any.
What is it that I'm not understanding?
There actually are sometimes different versions for Intel/AMD, even for different versions of Intel and/or AMD. That's not common (especially in the kind of software people usually use), because it's not user-friendly (most people don't even know what a CPU is or does, let alone what kind they have exactly), so what often happens is that either the multiple versions are all in the same executable and selected at runtime, or a "least common denominator" subset of instructions is used (when performance is not really a concern). There are significant differences between AMD and Intel though; the most significant one is in which instruction sets they support. AMD always lags behind Intel in implementing Intel's new instructions (new sets come out regularly), and Intel usually does not implement AMD's extensions (AMD64 is the big exception: 99% accepted by Intel, with small changes made; a couple of instructions here and there were also borrowed, but XOP will never happen even though it's awesome, and 3DNow! was never adopted by Intel). As an example of software that does not package the code for different "extended instruction sets" in the same executable, see y-cruncher.
To get back to the beginning though, for some (I can't name any off the top of my head, but I've seen it before) high performance software you may see different versions each specifically tailored to get maximum performance on one specific microarchitecture. For example, P4 (netburst) and Core2 are two very different beasts (that's mostly P4's fault for being crazy), even though Core2 is backwards compatible and could run the same code unmodified, that is not necessarily a good idea from a performance perspective.
There are no separate Intel/AMD versions, because they use the same ISA family: x86.
However, there are applications where you have to look out for different versions when you download them. There are still instruction sets that are quite different and might make a program act differently. For example, if you have a PowerPC architecture and code a network-based application on it, you can get away with forgetting the little-to-big endian conversion (big-endian already matches network byte order), but if you compile the same code on x86, which is little-endian, the application will most likely produce garbage on the network side.
There is also the difference in how many instructions there are, e.g. RISC vs CISC.
In the end there are a lot of differences to look for, but in most programming languages you don't have to worry too much about them, as the compiler/interpreter will handle most things for you. If you work lower level, then you have to know what you're doing on each architecture.
Also if you compile for ARM, you won't be able to run the program on any other machine, like your PC with x86. It will not work at all.
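As a small illustration of the endianness point above (a sketch assuming a POSIX system for the htonl/ntohl helpers): always converting to network byte order makes the same source behave identically on big-endian PowerPC and little-endian x86.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <arpa/inet.h>   /* htonl / ntohl (POSIX) */

    int main(void)
    {
        uint32_t host_value = 0x12345678;

        /* On big-endian PowerPC htonl is a no-op; on little-endian x86 it
           byte-swaps. Either way, the bytes sent on the wire are the same. */
        uint32_t wire_value = htonl(host_value);

        printf("host order: 0x%08" PRIx32 "  network order (as stored): 0x%08" PRIx32 "\n",
               host_value, wire_value);
        return 0;
    }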
Because the opcodes may/do differ. Take the mov instruction: on x86 the opcode is 0x88, while on ARM it might be 0x13, etc.
The distinction is in fact quite dramatic. Except in the case of Intel vs. AMD. AMD makes their processors compatible with Intel machine code. On purpose of course.
Today there is a move to JIT compiling (Java, .NET, etc.). In this case, the executable file doesn't contain machine code. It contains a simple intermediate language that is compiled just before it is executed, in the machine code of the running processor. This allows the processor architecture to be completely opaque.
AMD is an Intel clone, or vice versa, depending on your view of the situation. Either way, there is so much in common that programs can be compiled to run on either (within reason; you can't go back 10 years, for example, and you can't make a 32-bit processor understand 64-bit-specific instructions). The next step is that the motherboards have to do similar things, and they do; there may be some Intel- or AMD-specific chip support items, but then you get into generic peripherals that can be found on either platform, or are widespread enough on one or the other platform that the operating systems and/or applications support them.

Is there a performance advantage to ARM64?

Recently 64-bit ARM mobiles started appearing. But is there any practical advantage to building an application 64-bit? Specifically considering an application that does not have much use for the increased virtual address space¹, but would waste some space due to increased pointer size.
So does ARM64 have any advantages other than the larger address space that would actually warrant building such an application as 64-bit?
Note: I've seen 64-bit Performance Advantages, but it only mentions x86-64 which does have other improvements besides extended virtual address space. I also recall that the situation is indeed specific to x86 and on some other platforms that went 64-bit like Sparc the usual approach was to only compile kernel and the applications that actually did use lot of memory as 64-bit and everything else as 32-bit.
¹The application is multi-platform and it still needs to be built for and run on devices with as little as 48MiB of memory. Does have some large data that it reads from external storage, but it never needs more than some megabytes of it at once.
I am not sure a general response can be given, but I can provide some examples of differences. There are of course additional differences added in version 8 of the ARM architecture, which apply regardless of target instruction set.
Performance-positive additions in AArch64
32 general-purpose registers give compilers more wiggle room.
I/D cache synchronization mechanisms accessible from user mode (no system call needed).
Load/Store-Pair instructions make it possible to load 128 bits of data with one instruction, and still remain RISC-like.
The removal of near-universal conditional execution makes more out-of-ordering possible.
The change in layout of NEON registers (D0 is still lower half of Q0, but D1 is now lower half of Q1 rather than upper half of Q0) makes more out-of-ordering possible.
64-bit pointers make pointer tagging possible.
CSEL enables all kinds of crazy optimizations (see the sketch after this list).
Performance-negative changes in AArch64
More registers may also mean higher pressure on the stack.
Larger pointers mean larger memory footprint.
Removal of near-universal conditional execution may cause higher pressure on branch predictor.
Removal of load/store-multiple means more instructions needed for function entry/exit.
Performance-relevant changes in ARMv8-A
Load-Acquire/Store-Release semantics remove the need for explicit memory barriers for basic synchronization operations.
I probably forgot lots of things, but those are some of the more obvious changes.
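To give a concrete example of the CSEL point (a hypothetical function of my own; the exact instructions depend on the compiler), a conditional expression can compile to a single conditional select instead of a branch:

    #include <stdio.h>

    /* On AArch64 this typically becomes: cmp w0, w1 ; csel w0, w0, w1, lt
       -- no branch at all, so nothing for the branch predictor to miss. */
    int min_int(int a, int b)
    {
        return (a < b) ? a : b;
    }

    int main(void)
    {
        printf("%d\n", min_int(3, 7));   /* prints 3 */
        return 0;
    }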

Why does the Mac ABI require 16-byte stack alignment for x86-32?

I can understand this requirement for the old PPC RISC systems and even for x86-64, but for the old tried-and-true x86? In this case, the stack needs to be aligned on 4-byte boundaries only. Yes, some of the MMX/SSE instructions require 16-byte alignment, but if that is a requirement of the callee, then it should ensure the alignments are correct. Why burden every caller with this extra requirement? This can actually cause some drops in performance because every call site must manage this requirement. Am I missing something?
Update: After some more investigation into this and some consultation with some internal colleagues, I have some theories about this:
Consistency between the PPC, x86, and x64 versions of the OS
It seems that the GCC codegen now consistently does a sub esp,xxx and then "mov"s the data onto the stack rather than simply doing a "push" instruction. This could actually be faster on some hardware.
While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention where the caller cleans up the stack.
The issue I have with the last item is that for calling conventions that rely on the callee cleaning the stack, the above requirements really "uglify" the codegen. For instance, what if some compiler decided to implement a faster register-based calling style for its own internal use (i.e. any code that isn't intended to be called from other languages or sources)? This stack-alignment thing could negate some of the performance gains achieved by passing some parameters in registers.
Update: So far the only real answer has been consistency, but to me that's a bit too easy of an answer. I have well over 20 years of experience with the x86 architecture, and if consistency, rather than performance or something else concrete, is really the reason, then I respectfully suggest that it is a bit naive for the developers to require it. They're ignoring nearly three decades of tools and support. Especially if they're expecting tool vendors to quickly and easily adapt their tools for their platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.
I'll give this topic another day or so then close it...
Related
It’s my stack frame, I don’t care about your stack frame!
From "Intel®64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:
"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."
From Appendix D:
"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."
http://www.intel.com/Assets/PDF/manual/248966.pdf
I am not sure, as I don't have first-hand proof, but I believe the reason is SSE. SSE is much faster if your buffers are already aligned on a 16-byte boundary (movaps vs movups), and any x86 running Mac OS X has at least SSE2. It could be taken care of by the application, but the cost is pretty significant. If the overall cost of making it mandatory in the ABI is not too significant, it may be worth it. SSE is used quite pervasively in Mac OS X: the Accelerate framework, etc...
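A rough sketch of the aligned vs. unaligned distinction at the source level (using SSE intrinsics and C11's _Alignas; the buffer and arithmetic are made up for the example):

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        /* 16-byte aligned buffer: _mm_load_ps/_mm_store_ps (movaps) are legal.
           A misaligned address here would fault, which is why unaligned data
           must go through _mm_loadu_ps/_mm_storeu_ps (movups) instead. */
        _Alignas(16) float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};

        __m128 v = _mm_load_ps(data);           /* aligned 16-byte load   */
        v = _mm_add_ps(v, _mm_set1_ps(1.0f));   /* add 1.0 to each lane   */
        _mm_store_ps(data, v);                  /* aligned 16-byte store  */

        printf("%f %f %f %f\n", data[0], data[1], data[2], data[3]);
        return 0;
    }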
I believe it's to keep it inline with the x86-64 ABI.
First, note that the 16-byte alignment is an exception introduced by Apple to the System V IA-32 ABI.
The stack alignment is only needed when calling system functions, because many system libraries are using SSE or Altivec extensions which require 16-byte alignment. I found an explicit reference in the libgmalloc man page.
You can perfectly handle your stack frame the way you want, but if you try to call a system function with a misaligned stack, you will end up with a misaligned_stack_error message.
Edit:
For the record, you can get rid of alignment problems when compiling with GCC by using the -mstackrealign option.
This is an efficiency issue.
Making sure the stack is 16-byte aligned in every function that uses the new SSE instructions adds a lot of overhead for using those instructions, effectively reducing performance.
On the other hand, keeping the stack 16-byte aligned at all times ensures that you can use SSE instructions freely with no performance penalty. There is no cost to this (cost measured in instructions at least). It only involves changing a constant in the prologue of the function.
Wasting stack space is cheap; it is probably the hottest part of the cache.
My guess is that Apple believes everyone just uses XCode (gcc) which aligns the stack for you. So requiring the stack to be aligned so the kernel doesn't have to is just a micro-optimization.
While I cannot really answer your question of WHY, you may find the manuals at the following site useful:
http://www.agner.org/optimize/
Regarding the ABI, have a look especially at:
http://www.agner.org/optimize/calling_conventions.pdf
Hope that's useful.
Hmm, didn't the OS X ABI also do funny RISC-like things, like passing small structs in registers?
So that points to the consistency with other platforms theory.
Come to think of it, the FreeBSD syscall api also aligns 64-bit values. (like e.g. lseek and mmap)
In order to maintain consistency in the kernel. This allows the same kernel to be booted on multiple architectures without modification.
Not sure why no one has considered the possibility of easy portability from the legacy PowerPC-based platform?
Read this:
http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20
Then zoom in on "32-bit PowerPC Function Calling Conventions" and finally this:
"These are the embedding alignment modes available in the 32-bit
PowerPC environment:
Power alignment mode is derived from the alignment rules used by the
IBM XLC compiler for the AIX operating system. It is the default
alignment mode for the PowerPC-architecture version of GCC used on AIX
and Mac OS X. Because this mode is most likely to be compatible
between PowerPC-architecture compilers from different vendors, it’s
typically used with data structures that are shared between different
programs."
In view of the legacy PowerPC-based background of OSX, portability is a major consideration - it dictates following the convention all the way back to AIX's XLC compiler. When you think in terms of the need to make sure all the tools and applications will work together with minimal rework, I think it is important to stick to the same legacy ABI as far as possible.
That gives the philosophy, and reading further, the rule is explicitly mentioned ("Prolog and Epilog"):
The called function is responsible for allocating its own stack frame, making sure to preserve 16-byte alignment in the stack. This operation is accomplished by a section of code called the prolog, which the compiler places before the body of the subroutine. After the body of the subroutine, the compiler places an epilog to restore the processor to the state it was in prior to the subroutine call.
