Why isn't the distinction between CPUs more ubiquitous?

I know that every program one writes has to eventually boil down to machine code - that's what compilers produce, that's what executable files consist of, and that's the only language that processors understand. I also know that different processors may have different instruction sets (I know 65c816 assembly, and I imagine it's vastly different from today's computers).
Here's what I'm not getting, though: If there exist different instruction sets, then why do we not seem to have to care about that every time we use software?
If a program was compiled for one particular CPU, it might not run on another - and yet, I never see notices like "Intel users download this version, AMD users download this one". I never have to even be aware of what CPU I'm on, every executable just seems to... work. The same goes for compilers, apparently - there isn't a separate version of, say, GCC, for every processor there is, right?
I'm aware that the differences in instruction sets are much more subtle than they used to be, but even then there should at least be a bit of a distinction. I'm wondering why there doesn't seem to be any.
What is it that I'm not understanding?

There actually are sometimes different versions for Intel/AMD, and even for different generations of Intel and/or AMD CPUs. That's not common (especially in the kind of software people usually use) because it's not user friendly (most people don't even know what a CPU is or does, let alone exactly which kind they have), so what usually happens is that either the multiple versions are all in the same executable and selected at runtime, or a "least common denominator" subset of instructions is used (when performance is not really a concern). There are significant differences between AMD and Intel, though; the most significant one is which instruction set extensions they support. AMD generally lags behind Intel in implementing Intel's new instructions (new sets come out regularly), and Intel usually does not implement AMD's extensions. AMD64 is the big exception (Intel accepted 99% of it, with small changes), and a couple of instructions here and there were borrowed, but XOP will never happen even though it's awesome, and 3DNow! was never adopted by Intel. As an example of software that does not package the code for different "extended instruction sets" in the same executable, see y-cruncher.
To get back to the beginning, though: for some high-performance software (I can't name any off the top of my head, but I've seen it before) you may see different versions, each specifically tailored to get maximum performance on one specific microarchitecture. For example, P4 (NetBurst) and Core2 are two very different beasts (that's mostly P4's fault for being crazy). Even though Core2 is backwards compatible and could run the same code unmodified, that is not necessarily a good idea from a performance perspective.

There are no separate Intel/AMD versions because both use the same instruction set family: x86.
However, there are applications where you have to look out for different versions when you download them. There are still instruction sets that are quite different and might make a program act differently. For example, if you write a network-based application on a big-endian PowerPC system, you can get away with forgetting the host-to-network byte order conversion, because network byte order is also big-endian. If you compile the same code on x86, which is little-endian, the application will most likely produce garbage on the network side.
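The byte-order point can be made concrete with a small sketch. It uses Python's standard struct module and spells out each byte order explicitly, so the result does not depend on the machine running it. The helper names are made up for illustration.

```python
import struct

VALUE = 0x12345678

# Explicit layouts of the same 32-bit integer.
LITTLE_ENDIAN = struct.pack("<I", VALUE)  # in-memory layout on x86
BIG_ENDIAN = struct.pack(">I", VALUE)     # in-memory layout on big-endian PowerPC
NETWORK_ORDER = struct.pack("!I", VALUE)  # network byte order (big-endian)


def decode_from_network(payload: bytes) -> int:
    """Decode a 32-bit unsigned integer that arrived in network byte order."""
    (decoded,) = struct.unpack("!I", payload)
    return decoded


def decode_without_conversion_on_x86(payload: bytes) -> int:
    """What a little-endian host reads if it skips the conversion step."""
    (decoded,) = struct.unpack("<I", payload)
    return decoded
```

On a big-endian host the raw bytes already match network order, which is why the missing conversion goes unnoticed there. On a little-endian host, reading the same bytes without converting yields 0x78563412 instead of 0x12345678.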
There is also the difference in how many instructions there are, e.g. RISC vs CISC.
In the end there are a lot of differences to look out for, but in most programming languages you don't have to worry too much about them, as the compiler/interpreter will handle most things for you. If you work at a lower level, then you have to know what you're doing on each architecture.
Also, if you compile for ARM, you won't be able to run the program on a machine with a different architecture, like your x86 PC. It will not work at all.
That's because the opcodes differ. Take the mov instruction: on x86 one of its opcodes is 0x88, while on ARM it might be 0x13, etc.
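To illustrate how different the encodings are, here is a sketch that picks apart two hand-assembled byte sequences that both mean "put the constant 1 into the first general-purpose register". The x86 bytes use the B8+r "mov r32, imm32" form and the ARM bytes are a 32-bit A32 data-processing instruction stored little-endian. The function names are made up, and the decoders only handle these two forms.

```python
X86_MOV_EAX_1 = bytes.fromhex("b801000000")  # x86: mov eax, 1
ARM_MOV_R0_1 = bytes.fromhex("0100a0e3")     # ARM A32: mov r0, #1


def x86_mov_imm32_fields(code: bytes):
    """Split the x86 'B8+r imm32' form into (register number, immediate)."""
    register = code[0] - 0xB8
    immediate = int.from_bytes(code[1:5], "little")
    return register, immediate


def arm_data_processing_fields(code: bytes):
    """Split an A32 data-processing instruction with an immediate operand
    into (operation field, destination register, 8-bit immediate)."""
    word = int.from_bytes(code, "little")
    operation = (word >> 21) & 0xF  # 0b1101 selects MOV
    destination = (word >> 12) & 0xF
    immediate = word & 0xFF
    return operation, destination, immediate
```

The two sequences have different lengths (5 bytes vs. 4) and no bytes in common at the same positions, so a CPU for one architecture cannot make sense of code for the other.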

The distinction is in fact quite dramatic, except in the case of Intel vs. AMD. AMD makes their processors compatible with Intel machine code, on purpose of course.
Today there is a move to JIT compiling (Java, .NET, etc.). In this case, the executable file doesn't contain machine code. It contains a simple intermediate language that is compiled, just before it is executed, into the machine code of the running processor. This allows the processor architecture to be completely opaque.

AMD is an Intel clone, or vice versa, depending on your view of the situation. Either way, there is so much in common that programs can be compiled to run on either (within reason: you can't go back 10 years, for example, and you can't make a 32-bit processor understand 64-bit-specific instructions). The next step is that the motherboards have to do similar things, and they do. There may be some Intel- or AMD-specific chip support items, but then you get into generic peripherals that can be found on either platform, or that are widespread enough on one or the other that operating systems and/or applications support them.

Related

How can a compiler version built years ago, say gcc, still compile for a processor released recently?

Let's say I take a compiler, gcc 4.8, and a processor from Intel, let's say Skylake or some other fancy new family.
checking this question: How to see which flags -march=native will activate?; if I do gcc -march=native -E -v - </dev/null 2>&1 | grep cc1, this will spout out some flags for the host machine, which is the above processor, skylake.
How does gcc know which flags to enable or disable when 4.8 was released before Skylake processors were out? What about other newer families of processors?
Consequently, the next question: is upgrading the compiler to the latest version necessary for it to compile accurately and optimally for a new target processor?
The question isn't really specific to gcc/Intel; I would like to know how others maintain synchronicity between processor and compiler too.

Old compilers don't know how to tune for new microarchitectures. (And are also missing out on better optimization in general: New versions of gcc/clang usually add new optimizations that help across the board, e.g. gcc8 can coalesce loads/stores of multiple adjacent small variables or array elements into a single 4 or 8-byte load or store. This helps on everything.)
They can also only use ISA extensions they know about.
They can make correct code because new x86 CPUs are still x86, and are backwards compatible with code for older CPUs (see Footnote 1). Same with ARM: the ARMv8 ISA is backwards compatible with ARMv7, ARMv6, and so on, so new ARM CPUs can run existing ARM binaries. (There are some AArch64 CPUs that dropped support for 32-bit mode, but never mind that.)
Consequently, the next question: is upgrading the compiler to the latest version necessary for it to compile accurately and optimally for a new target processor?
Yes, you want your compiler to at least know about your CPU for tuning options.
But also yes in general, even when your CPU isn't new. New compiler versions often benefit old CPUs too, and a new set of SIMD extensions to auto-vectorize with can lead to potentially large speedups for code that spends a lot of time in one hot loop, assuming that loop auto-vectorizes well.
e.g. Phoronix recently posted GCC 5 Through GCC 10 Compiler Benchmarks - Five Years Worth Of C/C++ Compiler Performance where they benchmarked on an i7 5960X (Haswell-E) CPU. I think GCC5 knows about -march=haswell. GCC9.2 makes measurably faster code than even gcc8 on some benchmarks.
But I can pretty much guarantee it's not optimal!! Compilers are good over large scales but there's usually something a human can find in a single hot loop, if they know the low level details of optimizing for a given microarchitecture. It's merely as good as you're going to get from any compiler. (Actually performance regressions exist, so even that's not always true. File a missed-optimization bug if you find one).
-march=native does two separate things:
CPU feature detection to enable stuff like -mfma and -mbmi2. This is easy on x86 with the CPUID instruction. GCC will enable all extensions it knows about that are supported by the actual CPU. e.g. I think GCC4.9 was the first GCC to know about any AVX512 extensions, so you might even get some AVX512 auto-vectorization on an Ice Lake or Skylake-avx512. Whether it does a good job or not is another matter, for anything non-trivial. But no AVX512 with GCC4.8.
CPU type detection to set -mtune=skylake. This depends on GCC actually recognizing your specific CPU as something it knows about. If not, it falls back to -mtune=generic. It might detect (with CPUID) your L1/L2/L3 cache sizes and use that to influence some tuning decisions like inlining / unrolling, instead of using a known size for -mtune=haswell. I don't think that's a big deal; current compilers don't AFAIK introduce cache-blocking optimizations to matmul loops or things like that, and that's where knowing cache sizes really matters.
CPU type detection can also use CPUID on x86; the vendor string and the model / family / stepping numbers uniquely identify the microarchitecture. (See Wikipedia, sandpile, InstLatx64, https://agner.org/optimize/)
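As a sketch of what "model / family / stepping" means in practice: CPUID leaf 1 returns a packed signature in EAX, and the fields can be unpacked with a few shifts and masks. The function name below is made up, and the rule for combining the base and extended fields follows Intel's documented convention.

```python
def decode_cpuid_signature(eax: int) -> dict:
    """Unpack the family / model / stepping fields of a CPUID leaf-1 EAX value."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    extended_model = (eax >> 16) & 0xF
    extended_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + extended_family if base_family == 0xF else base_family
    # The extended model supplies the high nibble for families 6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (extended_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}
```

For example, the signature 0x000506E3 decodes to family 6, model 0x5E, stepping 3, which is a Skylake client part. A compiler released before that model number existed has no table entry for it, which is why it falls back to generic tuning.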
x86 is very much designed to support single binaries that run on multiple microarchitectures and might want to do runtime feature detection / dispatching. So an efficient / portable / extensible CPU detection mechanism exists in the form of the CPUID instruction, introduced in Pentium and some late 486 CPUs. (And thus baseline for x86-64.)
Other ISAs are more often used in embedded uses where code gets recompiled for the specific CPU. They mostly don't have as good support for runtime detection. GCC might have to install a handler for SIGILL and just try running some instructions. Or query the OS which knows what's supported, e.g. Linux's /proc/cpuinfo.
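Querying the OS can be as simple as reading a text file. The following is a minimal sketch, assuming a Linux system where /proc/cpuinfo lists the supported extensions on a "flags" line (x86) or a "Features" line (ARM). The function names are made up for illustration.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature names from the first 'flags' or 'Features' line."""
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() in ("flags", "Features"):
            return set(value.split())
    return set()


def read_host_cpu_flags(path: str = "/proc/cpuinfo") -> set:
    """Read the host's feature flags, or an empty set if the file is unavailable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return parse_cpu_flags(handle.read())
    except OSError:
        return set()
```

A caller would then test for something like "avx2" in read_host_cpu_flags() before choosing a code path.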
Footnote 1:
For x86 specifically, its main claim to fame / reason for popularity is strict backwards compatibility. A new CPU that fails to run some existing programs would be a lot harder to sell, so vendors don't do that. They'll even bend over backwards to go beyond the on-paper ISA docs to make sure existing code keeps working. As former Intel architect Andy Glew said: All or almost all modern Intel processors are stricter than the manual. (For self-modifying code, and in general).
Modern PC motherboard firmwares even still emulate the legacy hardware of an IBM PC/XT when you boot in legacy BIOS mode, as well as implementing a software ABI for disk, keyboard, and screen access. So even bootloaders and stuff like GRUB have a consistent backwards-compatible interface to use, before they load a kernel which has actual drivers for the real hardware that's actually present.
A modern PC can I think still run real MS-DOS (the operating system) binaries in 16-bit real mode.
Adding new instruction opcodes without breaking backwards compat makes variable-length x86 machine code instructions ever more complex, and careless / anti-competitive developments in x86's history haven't helped, leading to more bloated instruction encodings for SSSE3 and later, for example. See Agner Fog's article Stop the instruction set war.
Code that depended on rep foo to decode as foo can break, though: Intel's manuals are pretty clear that random prefixes can cause code to misbehave in future. This makes it safe for Intel or AMD to introduce new instructions that decode in a known way on old CPUs, but do something new on newer CPUs. Like pause = rep nop. Or transactional memory HLE uses prefixes on locked instructions that old CPUs will ignore.
And prefixes like VEX (AVX) and EVEX (AVX512) are carefully chosen to not overlap with valid encodings of instructions, especially in 32-bit mode. See How does the instruction decoder differentiate between EVEX prefix and BOUND opcode in 32-bit mode?. This is one reason why 32-bit mode can still only use 8 vector registers (zmm0..7) even with VEX or EVEX, which allow ymm0..15 or zmm0..31 respectively in 64-bit mode. (In 32-bit mode, a VEX prefix is an invalid encoding of some opcode. In 64-bit mode, that opcode isn't valid in the first place, so the later bytes are more flexible. But to simplify decoder HW they aren't fundamentally different.)
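A rough sketch of that 32-bit-mode disambiguation, as I understand it: the bytes 0xC4, 0xC5 and 0x62 are the legacy opcodes LES, LDS and BOUND, which all require a memory operand, so a following byte whose top two bits are both set (the register-operand form) is not a valid encoding for them. That otherwise invalid pattern is what marks a VEX or EVEX prefix. The function name is made up, and this looks only at the first two bytes rather than decoding a whole instruction.

```python
LEGACY_OPCODES_REUSED_AS_PREFIXES = {
    0xC4: "LES / 3-byte VEX",
    0xC5: "LDS / 2-byte VEX",
    0x62: "BOUND / EVEX",
}


def is_vex_or_evex_in_32bit_mode(code: bytes) -> bool:
    """Return True if the first two bytes start a VEX/EVEX-encoded instruction
    in 32-bit mode, False if they are a legacy instruction or something else."""
    if len(code) < 2 or code[0] not in LEGACY_OPCODES_REUSED_AS_PREFIXES:
        return False
    # Top two bits 11 would mean "register operand", which LES/LDS/BOUND forbid.
    return (code[1] & 0xC0) == 0xC0
```

Those two bits are where VEX and EVEX keep (inverted) register-number extension bits, so forcing them to 1 in 32-bit mode is consistent with only registers 0..7 being reachable there.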
MIPS32r6 / MIPS64r6 in 2014 is one notable example that's not backwards compatible. It rearranged a few opcodes for instructions that stayed the same, and removed some instructions to reuse their opcode for other new instructions, e.g. branches without a delay slot. This is highly unusual and only makes sense for CPUs that are used for embedded systems (like current MIPS). Recompiling everything for MIPS32r6 is not a problem for an embedded system.
Some compilers can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but still of course only for extensions that the compiler knows about when it compiles. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code.
And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.
GCC has some support for this, via its ifunc mechanism, but IIRC you can't do that without source changes.
Intel's compiler (ICC) I think does support multi-versioning some hot functions when auto-vectorizing, with just command-line options.

It can only happen if the new processor is specifically designed to be backwards compatible with older models.
Forget gcc for a moment. You have a compiled x86 binary from the year 2000, say, an executable built for the original Windows NT. Will a Skylake CPU run it? You betcha. Will an Itanium CPU run it? Nope, it is not designed to do that. It is a completely different architecture.
Now that executable most probably wouldn't use the Skylake efficiently, but that's the whole point of evolving architectures and introducing new instructions.
Returning to gcc, -march=native is not magic. It cannot possibly divine the new instructions and new timings. It simply selects the "best" instruction set it knows that is supported by the CPU it runs on. How it's done is architecture specific. x86 CPUs can be queried about their capabilities with the CPUID instruction. Other architectures may do it differently.
To put it another way, -O3 -march=native optimizes for the machine you compiled on, so it's good when you're compiling code to run on the build host. A binary built with -march=native on a Nehalem system is essentially the same as one built with -march=nehalem on any system. -march=native might detect your specific L3 cache size instead of using a default for that, if any GCC tuning decisions (like inlining or unrolling) depend on L3 size. Except if you run an old compiler on a new CPU it doesn't recognize, you get feature detection for stuff like -mavx but for tuning only tune=generic.
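The two halves of that behaviour can be modelled in a few lines. This is a toy sketch, not GCC's actual driver logic: the function name, the feature names and the sets describing what a compiler "knows" are all made up for illustration.

```python
GENERIC_TUNING = "generic"


def resolve_march_native(cpu_features, cpu_name, known_features, known_cpus):
    """Model -march=native as two independent decisions.

    1. Feature detection: enable every extension the CPU reports that this
       compiler version also knows about.
    2. Tuning: use the CPU's own tuning model if the compiler recognizes it,
       otherwise fall back to generic tuning.
    """
    enabled = sorted(set(cpu_features) & set(known_features))
    tune = cpu_name if cpu_name in known_cpus else GENERIC_TUNING
    return enabled, tune
```

With an old compiler that knows {"sse4.2", "avx", "avx2"} and the CPUs {"nehalem", "haswell"}, a newer CPU reporting those features plus "avx512f" still gets avx/avx2/sse4.2 enabled, but its unknown extension is dropped and tuning falls back to generic.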
None of this can take advantage of new features like AVX2 or BMI2 when running on a Skylake or Ice Lake system. And some specific tuning decisions that were good on Nehalem might be sub-optimal on a different CPU. (Although this is less likely; Intel mostly maintains backwards compatibility for performance as well as correctness. Getting everyone to recompile everything for P4 didn't work out so they usually try to make existing binaries run well on new CPUs.)
Some compilers can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but only for extensions that the compiler knows about when it compiled. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code. And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.

Is it possible to compare ARM and x86 performance via benchmarks?

Judging by the latest news, the new Apple A11 Bionic processor scores more points than the mobile Intel Core i7 in the Geekbench benchmark.
As I understand it, there are a lot of different tests in this benchmark. These tests simulate different loads, including loads that can occur in everyday use.
Some people state that these results cannot be compared to x86 results. They say that x86 is able to perform "more complex tasks". As examples, they cite Photoshop, video conversion, and scientific calculations. I agree that software for ARM is often only a "lightweight" version of desktop software. But it seems to me that this limitation is caused by the format of mobile operating systems (do your work on the go, no mouse, etc.), and not by the performance of ARM.
As an opposite example, let's look at Safari. A browser is a complex program, and on the iPad Safari works just as well as on the Mac. Moreover, if we take the results of SunSpider (a JS benchmark), it turns out that Safari on the iPad scores more points.
I think that in everyday tasks (Web, Office, Music/Films) ARM (A10X, A11) and x86 (dual-core mobile Intel i7) performance is comparable.
Are there any kinds of tasks where ARM really lags far behind x86? If so, what is the reason for this? What's stopping Apple from releasing a laptop on ARM? They already did the same thing with the migration from PowerPC to x86. Is this a technical restriction, or just marketing?

(Intended this as a comment since this question is off topic, but it got long..).
Of course you can compare, you just need to be very careful, which most people aren't. The fact that companies publishing (or "leaking") results are biased also doesn't help much.
The common misconception is that you can compare a benchmark across two systems and get a single score for each. That ignores the fact that different systems have different optimization points, most often with regards to power (or "TDP"). What you need to look at is the power/performance curve - this graph shows how the system reacts to more power (raising the frequency, enabling more performance features, etc), and how much it contributes to its performance.
One system can win over the low power range, but lose when the available power increases since it doesn't scale that well (or even stops scaling at some point). This is usually the case with Arm, as most of these CPUs are tuned for low power, while x86 covers a larger domain and scales much better.
If you are forced to observe a single point along the graph (which is a legitimate scenario, for example if you're looking for a CPU for a low-power device), at least make sure the comparison is fair and uses the same power envelope.
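Comparing at the same power envelope can be sketched as interpolating each system's measured power/performance curve at a common wattage. Everything here is an assumption for illustration: the function name is made up, and the two curves are invented numbers, not measurements of any real chip.

```python
def performance_at_power(curve, watts):
    """Linearly interpolate a list of (watts, score) points at `watts`.

    Returns None when `watts` lies outside the measured range, since
    extrapolating a power/performance curve would be guesswork.
    """
    points = sorted(curve)
    if not points or watts < points[0][0] or watts > points[-1][0]:
        return None
    for (w0, s0), (w1, s1) in zip(points, points[1:]):
        if w0 <= watts <= w1:
            if w1 == w0:
                return s0
            return s0 + (s1 - s0) * (watts - w0) / (w1 - w0)
    return points[0][1]  # curve with a single measured point


# Invented example curves: (power in watts, benchmark score).
LOW_POWER_CHIP = [(2, 100), (5, 180), (10, 200)]
WIDE_RANGE_CHIP = [(5, 150), (15, 300), (45, 500)]
```

With these made-up curves, the low-power chip is ahead at 5 W (180 vs. 150) but behind at 10 W (200 vs. 225), and at 3 W only one of them has a data point at all. A single headline score hides all of that.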
There are of course other factors that must be aligned (and sometimes aren't, due to negligence or an intention to cheat). The workload should be the same (I've seen different versions compared), and the compiler should be as close as possible: generating ARM vs. x86 code is already a difference, but the compiler's intermediate optimizations should be similar. When comparing two x86 CPUs, like Intel and AMD, you should prefer the same binary, unless you also want to allow machine-specific optimizations.
Finally, the system should also be similar, which is not the case when comparing a smartphone against a pc/macbook. The memory could differ, the core count, etc. This could be legitimate difference, but it's not really related to one architecture being better than the other.

The topic is bogus: from the ISA to an application or its source code there are many abstraction levels, and the only metric that we have (execution time, or throughput) depends on many factors that could favor one or the other: the choice of algorithm, the optimizations written into the source code, the compiler/interpreter implementation and optimizations, and the operating system's behaviour. So they are not exactly (mathematically) comparable.
However, looking at the numbers and at the usefulness of mobile applications (I'm speaking as a management engineer), ARM chips seem capable of running quite well.
I think the only reason is the inertia of the established standard (note that Microsoft offers a variant of Windows running on ARM processors, and Debian ARM variants are ready: https://www.debian.org/distrib/netinst).
The ARMv8 cores seem close to the x86-64 ones, looking at the raw numbers.
Note the i7-3770k results: https://en.wikipedia.org/wiki/Instructions_per_second#MIPS
A summary of recent ARMv8 CPU characteristics; note the decode and dispatch widths and the caches, and compare the last column for the Cortex-A73 to the i7-3770k:
https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores
Intel Ivy Bridge characteristics:
https://en.wikichip.org/wiki/intel/microarchitectures/ivy_bridge_(client)
A75 details: https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55
The topic of power consumption is complex too. The basic rule underlying all the frequency/voltage rules (used and abused) across the web is transistor rise time: https://en.wikipedia.org/wiki/Rise_time
There is a fixed time delay in the switching of a transistor, which determines the maximum frequency at which a transistor can switch. With more of them linked in cascade, this time adds up in a nonlinear way (it takes some integration to demonstrate this). As a result, about 10 years ago, to increase the GHz, companies tried splitting the execution of an operation into more stages and running the operations in a pipelined way, even inside a logical pipeline stage. https://en.wikipedia.org/wiki/Instruction_pipelining
The rise time depends on physical characteristics (the materials and shape of the transistors). It can be reduced by increasing the voltage, so that the transistor switches faster, since the switching is associated (if you'll allow the term) with the charge/discharge of a capacitor that triggers the opening/closing of the transistor channel.
These ARM chips are designed for low-power applications. By changing the design they could easily gain MHz, but they would use much more power. How much? Again, that's not comparable unless you work inside a foundry and have the numbers.
Examples of ARM server processors that could be closer to desktop/workstation CPUs in power consumption are the Cavium and Qualcomm Falkor CPUs, and some benchmarks report that they are not bad.

CISC versus RISC

If you were writing a textbook, and you needed to decide on a CPU design to talk about certain issues, would you choose RISC or CISC? Pros for RISC, well, you know: cleaner, easier, and so on. Cons for RISC: I would have to use an emulator, and there are few such CPUs on the market compared to x86, so a pure RISC CPU like ARM or SPARC would have less impact. I'm torn, though I am leaning towards x86 and CISC. It has many pros; the cons: it's not RISC, and all those books like Hennessy et al. love RISC! So, in summary: would you choose RISC or CISC (x86)?
Thanks.

Are you talking specifically about x86 processor design, or a specific design feature which is more prevalent in x86 than anything else? No? Then go with a RISC architecture.
All that using x86 will achieve is that the x86 mechanics will obscure whatever point it is you're actually trying to make.

For general (or programmer) audiences I would suggest x86. No matter what language they are familiar with, the same building blocks like for and if, or the concept of functions, can be easily demonstrated with x86 instructions.
Asm is abstract enough as it is; it is priceless if they can connect the new knowledge to their existing understanding of programming. (It may even help them write better-quality code.)
I don't think choosing x86 means that you have to include all the instructions. The next level can be to introduce the machine code representation of some asm structures, but only once they are comfortable with the mnemonics.

I prefer RISC processors, because they have fast and small instructions. They're cheaper than CISC ones and have many more registers. And some instructions of CISC processors, for example Intel's xchg, or scas, stos, movs and cmps used with rep, are so specialized that many compilers never use them, so having them is only a waste of silicon and money.
And, talking about registers, even if RISC processors need more instructions to accomplish what a CISC would do with only one, they have so many registers that (I think) memory is accessed less frequently on a RISC.
I'm talking as a CISC "programmer", since I learned on Intel's x86, and having only 6 general-purpose registers available, which are often reduced to 3 or 4 in loops (ecx is gone, along with edi or esi or both), can be very tricky.
Although assembly itself tends to hide the true nature of the code, I think (remember I grew up with x86) that RISC assembly is more difficult to understand due to the larger number of instructions needed.

If the main requirement for a project is security rather than capacity (security is often a tradeoff against capacity), then I propose using a RISC architecture. For instance, a firewall written in x86 assembler was known to have a security issue that a RISC implementation in SPARC assembler didn't. So the spec for the most secure project could be to use SPARC assembly, since it is known to be a secure environment. Even if it weren't, there are so few people who can program in SPARC assembly compared to x86 assembly that fewer attackers could program a security breach if the firewall runs on a RISC architecture.
So I say that if you are talking about a firewall application, then probably it should be a RISC architecture.

Why does the Mac ABI require 16-byte stack alignment for x86-32?

I can understand this requirement for the old PPC RISC systems and even for x86-64, but for the old tried-and-true x86? In this case, the stack needs to be aligned on 4-byte boundaries only. Yes, some of the MMX/SSE instructions require 16-byte alignment, but if that is a requirement of the callee, then the callee should ensure the alignment is correct. Why burden every caller with this extra requirement? This can actually cause some drop in performance because every call site must manage this requirement. Am I missing something?
Update: After some more investigation into this and some consultation with some internal colleagues, I have some theories about this:
Consistency between the PPC, x86, and x64 version of the OS
It seems that the GCC codegen now consistently does a sub esp,xxx and then "mov"s the data onto the stack rather than simply doing a "push" instruction. This could actually be faster on some hardware.
While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention where the caller cleans up the stack.
The issue I have with the last item is that, for calling conventions that rely on the callee cleaning up the stack, the above requirement really "uglifies" the codegen. For instance, what if some compiler decided to implement a faster register-based calling style for its own internal use (i.e. any code that isn't intended to be called from other languages or sources)? This stack-alignment thing could negate some of the performance gains achieved by passing some parameters in registers.
Update: So far the only real answers have been consistency, but to me that's a bit too easy of an answer. I have well over 20 years experience with the x86 architecture and if consistency, not performance, or something else concrete, is really the reason then I respectfully suggest that is a bit naive for the developers to require it. They're ignoring nearly three decades of tools and support. Especially if they're expecting tools vendors to quickly and easily adapt their tools for their platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.
I'll give this topic another day or so then close it...
Related
It’s my stack frame, I don’t care about your stack frame!

From "Intel®64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:
"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."
From Appendix D:
"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."
http://www.intel.com/Assets/PDF/manual/248966.pdf

I am not sure, as I don't have first-hand proof, but I believe the reason is SSE. SSE is much faster if your buffers are already aligned on a 16-byte boundary (movaps vs movups), and any x86 running Mac OS X has at least SSE2. Alignment could be taken care of by the application itself, but the cost is pretty significant. If the overall cost of making it mandatory in the ABI is not too significant, it may be worth it. SSE is used quite pervasively in Mac OS X: the Accelerate framework, etc.

I believe it's to keep it in line with the x86-64 ABI.

First, note that the 16-byte alignment is an exception introduced by Apple to the System V IA-32 ABI.
The stack alignment is only needed when calling system functions, because many system libraries use SSE or Altivec extensions, which require the 16-byte alignment. I found an explicit reference in the libgmalloc man page.
You can perfectly well handle your stack frame the way you want, but if you try to call a system function with a misaligned stack, you will end up with a misaligned_stack_error message.
Edit:
For the record, you can get rid of alignment problems when compiling with GCC by using the -mstackrealign option.

This is an efficiency issue.
Making sure the stack is 16-byte aligned in every function that uses the new SSE instructions adds a lot of overhead for using those instructions, effectively reducing performance.
On the other hand, keeping the stack 16-byte aligned at all times ensures that you can use SSE instructions freely with no performance penalty. There is no cost to this (cost measured in instructions at least). It only involves changing a constant in the prologue of the function.
Wasting stack space is cheap, it is probably the hottest part of the cache.
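The "changing a constant in the prologue" point comes down to rounding arithmetic. Here is a sketch of that arithmetic, assuming a 32-bit frame where the return address and saved ebp (4 bytes each) have already been pushed since the last 16-byte-aligned point. The function names are made up for illustration.

```python
STACK_ALIGNMENT = 16


def align_down(address: int, alignment: int = STACK_ALIGNMENT) -> int:
    """Round an address down to a multiple of `alignment` (a power of two)."""
    return address & ~(alignment - 1)


def round_up(size: int, alignment: int = STACK_ALIGNMENT) -> int:
    """Round a size up to a multiple of `alignment` (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)


def padded_frame_size(locals_size: int, already_pushed: int,
                      alignment: int = STACK_ALIGNMENT) -> int:
    """Constant to subtract from the stack pointer in the prologue so that the
    stack pointer ends up aligned again, given the bytes already pushed."""
    return round_up(locals_size + already_pushed, alignment) - already_pushed
```

For 20 bytes of locals with 8 bytes already pushed, the prologue subtracts 24 instead of 20. The instruction count is the same and only the constant changes, at the price of 4 bytes of padding.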

My guess is that Apple believes everyone just uses Xcode (gcc), which aligns the stack for you. So requiring the stack to be aligned so the kernel doesn't have to do it is just a micro-optimization.

While I cannot really answer your question of WHY, you may find the manuals at the following site useful:
http://www.agner.org/optimize/
Regarding the ABI, have a look especially at:
http://www.agner.org/optimize/calling_conventions.pdf
Hope that's useful.

Hmm, didn't OS X ABI also do funny RISC like things like passing small structs in registers?
So that points to the consistency with other platforms theory.
Come to think of it, the FreeBSD syscall api also aligns 64-bit values. (like e.g. lseek and mmap)

In order to maintain consistency in the kernel. This allows the same kernel to be booted on multiple architectures without modification.

Not sure why no one has considered the possibility of easy portability from the legacy PowerPC-based platform.
Read this:
http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20
Then zoom in on "32-bit PowerPC Function Calling Conventions", and finally this:
"These are the embedding alignment modes available in the 32-bit
PowerPC environment:
Power alignment mode is derived from the alignment rules used by the
IBM XLC compiler for the AIX operating system. It is the default
alignment mode for the PowerPC-architecture version of GCC used on AIX
and Mac OS X. Because this mode is most likely to be compatible
between PowerPC-architecture compilers from different vendors, it’s
typically used with data structures that are shared between different
programs."
In view of the legacy PowerPC-based background of OSX, portability is a major consideration - it dictates following the convention all the way back to AIX's XLC compiler. When you think in terms of the need to make sure all the tools and applications will work together with minimal rework, I think it is important to stick to the same legacy ABI as far as possible.
That gives the philosophy; reading further, the rule is stated explicitly (under "Prolog and Epilog"):
The called function is responsible for allocating
its own stack frame, making sure to preserve 16-byte alignment in the
stack. This operation is accomplished by a section of code called the
prolog, which the compiler places before the body of the subroutine.
After the body of the subroutine, the compiler places an epilog to
restore the processor to the state it was prior to the subroutine
call.

What standard techniques are there for using cpu specific features in DLLs?

Short version: I'm wondering if it's possible, and how best, to utilise CPU specific
instructions within a DLL?
Slightly longer version:
When downloading (32bit) DLLs from, say, Microsoft it seems that one size fits all processors.
Does this mean that they are strictly built for the lowest common denominator (ie. the
minimum platform supported by the OS)?
Or is there some technique that is used to export a single interface within the DLL but utilise
CPU specific code behind the scenes to get optimal performance? And if so, how is it done?

I don't know of any standard technique but if I had to make such a thing, I would write some code in the DllMain() function to detect the CPU type and populate a jump table with function pointers to CPU-optimized versions of each function.
There would also need to be a lowest common denominator function for when the CPU type is unknown.
You can find current CPU info in the registry here:
HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor
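A rough sketch of that jump-table idea, for what it's worth. All names here (sum, sum_generic, sum_tuned, init_dispatch, cpu_has_sse2) are invented for the example. The feature check assumes GCC or Clang on x86 and uses their __builtin_cpu_supports builtin; on any other compiler or CPU it just reports "unknown" and the generic version is used. In a real DLL you would call init_dispatch() from DllMain() on DLL_PROCESS_ATTACH, and sum_tuned would contain actual SSE2 code rather than the plain-C stand-in shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by every implementation of the exported operation. */
typedef long (*sum_fn)(const int *values, size_t count);

/* Lowest-common-denominator version: safe on any CPU. */
long sum_generic(const int *values, size_t count)
{
    assert(values != NULL || count == 0);
    long total = 0;
    for (size_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}

/* Stand-in for a CPU-tuned version (hypothetical: a real one would use
   SSE2 intrinsics or assembler). Unrolled by four, same result. */
long sum_tuned(const int *values, size_t count)
{
    assert(values != NULL || count == 0);
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= count; i += 4)
        total += (long)values[i] + values[i + 1] + values[i + 2] + values[i + 3];
    for (; i < count; ++i)
        total += values[i];
    return total;
}

/* The jump table: here a single slot, defaulting to the safe version. */
static sum_fn sum_impl = sum_generic;

/* Nonzero if SSE2 is known to be available, 0 if absent or unknown. */
int cpu_has_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0; /* unknown CPU or compiler: stay on the generic path */
#endif
}

/* Fill the table once, at load time. */
void init_dispatch(void)
{
    sum_impl = cpu_has_sse2() ? sum_tuned : sum_generic;
}

/* The one exported entry point; callers never see which version runs. */
long sum(const int *values, size_t count)
{
    return sum_impl(values, count);
}
```

Because the slot defaults to sum_generic, calling sum() before init_dispatch() still works; it just never takes the tuned path.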

The DLL is expected to work on every computer WIN32 runs on, so you are stuck with the i386 instruction set in general. There is no official method of exposing functionality/code for specific instruction sets. You have to do it by hand and transparently.
The technique used basically is as follows:
- determine CPU features like MMX and SSE at runtime
- if they are present, use them; if not, have fallback code ready
Because you cannot let your compiler optimise for anything other than i386, you will have to write the code that uses the specific instruction sets in inline assembler. I don't know if there are higher-level toolkits for this. Determining the CPU features is straightforward, but may also need to be done in assembler.
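As a sketch of the detection step: on GCC and Clang the CPUID query can be done without hand-written assembler, through the __get_cpuid helper in their <cpuid.h> header. The struct and function names below (cpu_features, detect_cpu_features) are made up for the example; the bit positions are the standard CPUID leaf 1 EDX flags (bit 23 = MMX, bit 25 = SSE, bit 26 = SSE2). With other compilers, or on non-x86 targets, the sketch simply reports no features, so the caller ends up on the fallback code.

```c
#include <assert.h>
#include <stdbool.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_CPUID_H 1
#endif

/* Hypothetical result type for this example. */
struct cpu_features {
    bool mmx;
    bool sse;
    bool sse2;
};

struct cpu_features detect_cpu_features(void)
{
    struct cpu_features f = { false, false, false };
#ifdef HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;
    /* Leaf 1 holds the basic feature flags; __get_cpuid returns 0
       when the CPU does not support the requested leaf. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.mmx  = (edx >> 23) & 1u;
        f.sse  = (edx >> 25) & 1u;
        f.sse2 = (edx >> 26) & 1u;
    }
#endif
    /* Sanity check: no real CPU offers SSE2 without SSE. */
    assert(!f.sse2 || f.sse);
    return f;
}
```

The caller would then pick the MMX/SSE code path or the fallback based on the returned flags.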

An easy way to get the SSE/SSE2 optimizations is to just use the /arch argument for MSVC. I wouldn't worry about fallback--there is no reason to support anything below that unless you have a very niche application.
http://msdn.microsoft.com/en-us/library/7t5yh4fd.aspx
I believe gcc/g++ have equivalent flags.

Intel's ICC can compile code twice, for different architectures. That way, you can have your cake and eat it. (OK, you get two cakes - your DLL will be bigger). And even MSVC2005 can do it for very specific cases (E.g. memcpy() can use SSE4)
There are many ways to switch between different versions. A DLL is loaded because the loading process needs functions from it, and function names are converted into addresses. One solution is to let this lookup depend not just on the function name, but also on processor features. Another method uses the fact that the name-to-address lookup goes through a table of pointers as an intermediate step; you can switch out the entire table. Or you could even have a branch inside critical functions, so foo() calls foo__sse4 when that's faster.
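The last option (a branch inside the critical function) could look roughly like this. The names count_zero, count_zero_plain, count_zero_sse4 and has_sse41 are invented for the example, the "sse4" version is only a plain-C stand-in for real SSE4.1 code, and the probe assumes GCC or Clang on x86 (anything else stays on the plain path). Unlike a load-time table, the CPU is probed lazily on the first call and the answer is cached.

```c
#include <assert.h>
#include <stddef.h>

/* Plain version, valid on every CPU. */
size_t count_zero_plain(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] == 0)
            ++zeros;
    }
    return zeros;
}

/* Stand-in for the tuned variant (hypothetical: a real one would use
   SSE4.1 intrinsics). Branchless form, same result. */
size_t count_zero_sse4(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; ++i)
        zeros += (buf[i] == 0);
    return zeros;
}

/* Nonzero if SSE4.1 is known to be available, 0 if absent or unknown. */
int has_sse41(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.1") != 0;
#else
    return 0;
#endif
}

/* Public entry point: one cached branch per call picks the version. */
size_t count_zero(const unsigned char *buf, size_t len)
{
    static int fast = -1; /* -1 = not probed yet */
    assert(buf != NULL || len == 0);
    if (fast < 0)
        fast = has_sse41();
    return fast ? count_zero_sse4(buf, len) : count_zero_plain(buf, len);
}
```

The cost is one well-predicted branch per call, which only matters if the function body itself is tiny. Note the lazy probe as written is not thread-safe in the strict sense; both threads would compute the same value, but a real DLL would want to do the probe at load time or use an atomic.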

DLLs you download from Microsoft are targeted for the generic x86 architecture for the simple reason that it has to work across all the multitude of machines out there.
Up to the Visual Studio 6.0 time frame (I do not know if it has changed since), Microsoft optimized its DLLs for size rather than speed, because reducing the overall size of the DLL gave a bigger performance boost than any other optimization the compiler could generate. Speed-ups from micro-optimization are small compared to speed-ups from not having the CPU wait for memory. True improvements in speed come from reducing I/O or from improving the underlying algorithm.
Only a few critical loops that run at the heart of the program could benefit from micro optimizations simply because of the huge number of times they are invoked. Only about 5-10% of your code might fall in this category. You could rest assured that such critical loops would already be optimized in assembler by the Microsoft software engineers to some level and not leave much behind for the compiler to find. (I know it's expecting too much but I hope they do this)
As you can see, there would be only drawbacks from enlarging the DLL with additional versions of code tuned for different architectures, when most of this code is rarely used or is never part of the critical code that consumes most of your CPU cycles.
