Why does the Mac ABI require 16-byte stack alignment for x86-32?

Why does the Mac ABI require 16-byte stack alignment for x86-32? - macos

I can understand this requirement for the old PPC RISC systems and even for x86-64, but for the old tried-and-true x86? In this case, the stack needs to be aligned on 4 byte boundaries only. Yes, some of the MMX/SSE instructions require 16byte alignments, but if that is a requirement of the callee, then it should ensure the alignments are correct. Why burden every caller with this extra requirement? This can actually cause some drops in performance because every call-site must manage this requirement. Am I missing something?
Update: After some more investigation into this and some consultation with some internal colleagues, I have some theories about this:
Consistency between the PPC, x86, and x64 version of the OS
It seems that the GCC codegen now consistently does a sub esp,xxx and then "mov"s the data onto the stack rather than simply doing a "push" instruction. This could actually be faster on some hardware.
While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention where the caller cleans up the stack.
The issue I have with the last item, is that for calling conventions that rely on the callee cleaning the stack, the above requirements really "uglifies" the codegen. For instance, what some compiler decided to implement a faster register-based calling style for its own internal use (ie any code that isn't intended to be called from other languages or sources)? This stack-alignment thing could negate some of the performance gains achieved by passing some parameters in registers.
Update: So far the only real answers have been consistency, but to me that's a bit too easy of an answer. I have well over 20 years experience with the x86 architecture and if consistency, not performance, or something else concrete, is really the reason then I respectfully suggest that is a bit naive for the developers to require it. They're ignoring nearly three decades of tools and support. Especially if they're expecting tools vendors to quickly and easily adapt their tools for their platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.
I'll give this topic another day or so then close it...
Related
It’s my stack frame, I don’t care about your stack frame!

From "Intel®64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:
"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."
From Appendix D:
"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."
http://www.intel.com/Assets/PDF/manual/248966.pdf

I am not sure as I don't have first hand proof, but I believe the reason is SSE. SSE is much faster if your buffers are already aligned on a 16 bytes boundary (movps vs movups), and any x86 has at least sse2 for mac os x. It can be taken care of by the application user, but the cost is pretty significant. If the overall cost for making it mandatory in the ABI is not too significant, it may worth it. SSE is used quite pervasively in mac os X: accelerate framework, etc...

I believe it's to keep it inline with the x86-64 ABI.

First, note that the 16 bytes alignment is an exception introduced by Apple to the System V IA-32 ABI.
The stack alignment is only needed when calling system functions, because many system libraries are using SSE or Altivec extensions which require the 16 bytes alignment. I found an explicit reference in the libgmalloc MAN page.
You can perfectly handle your stack frame the way you want, but if you try to call a system function with a misaligned stack, you will end up with a misaligned_stack_error message.
Edit:
For the record, you can get rid of alignment problems when compiling with GCC by using the mstack-realign option.

This is an efficiency issue.
Making sure the stack is 16-byte aligned in every function that uses the new SSE instructions adds a lot of overhead for using those instructions, effectively reducing performance.
On the other hand, keeping the stack 16-byte aligned at all times ensures that you can use SSE instructions freely with no performance penalty. There is no cost to this (cost measured in instructions at least). It only involves changing a constant in the prologue of the function.
Wasting stack space is cheap, it is probably the hottest part of the cache.

My guess is that Apple believes everyone just uses XCode (gcc) which aligns the stack for you. So requiring the stack to be aligned so the kernel doesn't have to is just a micro-optimization.

While I cannot really answer your question of WHY, you may find the manuals at the following site useful:
http://www.agner.org/optimize/
Regarding the ABI, have a look especially at:
http://www.agner.org/optimize/calling_conventions.pdf
Hope that's useful.

Hmm, didn't OS X ABI also do funny RISC like things like passing small structs in registers?
So that points to the consistency with other platforms theory.
Come to think of it, the FreeBSD syscall api also aligns 64-bit values. (like e.g. lseek and mmap)

In order to maintain consistency in kernel. This allows the same kernel to be booted on multiple architectures without modicfication.

Not sure why no one has considered the possibility of easy portability from legacy PowerPC-based platform?
Read this:
http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20
And then zoomed into "32-bit PowerPC Function Calling Conventions" and finally this:
"These are the embedding alignment modes available in the 32-bit
PowerPC environment:
Power alignment mode is derived from the alignment rules used by the
IBM XLC compiler for the AIX operating system. It is the default
alignment mode for the PowerPC-architecture version of GCC used on AIX
and Mac OS X. Because this mode is most likely to be compatible
between PowerPC-architecture compilers from different vendors, it’s
typically used with data structures that are shared between different
programs."
In view of the legacy PowerPC-based background of OSX, portability is a major consideration - it dictates following the convention all the way back to AIX's XLC compiler. When you think in terms of the need to make sure all the tools and applications will work together with minimal rework, I think it is important to stick to the same legacy ABI as far as possible.
That gives the philosophy, and reading further is the rule explicitly mentioned ("Prolog and Epilog"):
The called function is responsible for allocating
its own stack frame, making sure to preserve 16-byte alignment in the
stack. This operation is accomplished by a section of code called the
prolog, which the compiler places before the body of the subroutine.
After the body of the subroutine, the compiler places an epilog to
restore the processor to the state it was prior to the subroutine
call.

Related

How can an Operating System be coded in high level languages?

I just started diving into the world of operating systems and I've learned that processes have a certain memory space they can address which is handled by the operating system. I don't quite understand how can an Operating System written in high level languages like c and c++ obtain this kind of memory management functionality.

You have caught the bug and there is no cure for it :-)
The language you use to write your OS has very little to do with the way your OS operates. Yes, most people use C/C++, but there are others. As for the language, you do need a language that will let you directly communicate with the hardware you plan to manage, assembly being the main choice for this part. However, this is less than 5% of the whole project.
The code that you write must not rely upon any existing operating system. i.e.: you must code all of the function yourself, or call existing libraries. However, these existing libraries must be written so that they don't rely upon anything else.
Once you have a base, you can write your OS in any language you choose, with the minor part in assembly, something a high level language won't allow. In fact, in 64-bit code, some compilers no longer allow inline assembly, so this makes that 5% I mentioned above more like 15%.
Find out what you would like to do and then find out if that can be done in the language of choice. For example, the main operating system components can be written in C, while the actual processor management (interrupts, etc) must be done in assembly. Your boot code must be in assembly as well, at least most of it.
As mentioned in a different post, I have some early example code that you might want to look at. The boot is done in assembly, while the loader code, both Legacy BIOS and EFI, are mostly C code.

To clarify fysnet's answer, the reason you have to use at least a bit of assembly is that you can only explicitly access addressable memory in C/C++ (through pointers), while hardware registers (such as the program counter or stack pointer) often don't have memory addresses. Not only that, but some registers have to be manipulated with CPU architecture-dependent special instructions, and that, too, is only possible in machine language.
I don't quite understand how can an Operating System written in high level languages like c and c++ obtain this kind of memory management functionality.
As described above, depending on the architecture, this could be achieved by having special instructions to manage the MMU, TLB etc. INVLPG is one example of such an instruction in the x86 architecture. Note that having a special instruction requiring kernel privileges is probably the simplest way to implement such a feature in hardware in a secure manner, because then it is simply sufficient to check if the CPU is in kernel mode in order to determine whether the instruction can be executed or not.

Compilers turn high-level languages into asm / machine code for you, so you don't have to write asm yourself. You pick a compiler that handles memory the way you want your OS to; e.g. using the callstack for automatic storage, and not implicitly calling malloc / free (because those won't exist in your kernel).
To link your compiled C/C++ into a kernel, you typically have to know more about the ABI it targets, and the toolchain especially the linker.
The ISO C standard treats implementation details very much as a black box. But real compilers that people use for low level stuff work in well-known ways (i.e. make the expected/useful implementation choices) that kernel programmers depend on, in terms of compiling code and static data into contiguous blocks that can be linked into a single kernel executable that can be loaded all as one chunk.
As for actually managing the system's memory, you write code yourself to do that, with a bit of inline asm where necessary for special instructions like invlpg as other answers mention.
The entry point (where execution starts) will normally be written in pure asm, to set up a callstack with the stack pointer register pointing to it.
And set up virtual memory and so on so code is executable, data is read/write, and read-only data is readable. All of this before jumping to any compiled C code. The first C you jump to is probably more kernel init code, e.g. initializing data structures for an allocator to manage all the memory that isn't already in use by static code/data.
Creating a stack and mapping code/data into memory is the kind of setup that's normally done by an OS when starting a user-space program. The asm emitted by a compiler will assume that code, static data, and the stack are all there already.

How does a compiler, say gcc, version built years ago can still compile for a processor released recently?

Let's say I take a compiler: gcc 4.8. And processor from intel, let's say skylake or some other fancy new family.
checking this question: How to see which flags -march=native will activate?; if I do gcc -march=native -E -v - </dev/null 2>&1 | grep cc1, this will spout out some flags for the host machine, which is the above processor, skylake.
How does gcc know what flags to enable disable... when 4.8 was released before skylake processors were out? What about other newer family of processors?
Consequently, next question is upgrading the compiler to latest necessary for it accurately and optimally compile for target processor which is new?
The question isn't really specific to gcc/intel, I would like to know how others maintain synchronicity between processor and compiler too.

Old compilers don't know how to tune for new microarchitectures. (And are also missing out on better optimization in general: New versions of gcc/clang usually add new optimizations that help across the board, e.g. gcc8 can coalesce loads/stores of multiple adjacent small variables or array elements into a single 4 or 8-byte load or store. This helps on everything.)
They can also only use ISA extensions they know about.
They can make correct code because new x86 CPUs are still x86, and are backwards compatible with code for older CPUs1. Same with ARM. The ARMv8 ISA is backwards compatible with ARMv7, ARMv6, and so on, so new ARM CPUs can run existing ARM binaries. (There are some AArch64 CPUs that dropped support for 32-bit mode, but nevermind that.)
Consequently, next question is upgrading the compiler to latest necessary for it accurately and optimally compile for target processor which is new?
Yes, you want your compiler to at least know about your CPU for tuning options.
But yes, always, even when your CPU isn't new. New compiler versions often benefit old CPUs, too, but yes a new set of SIMD extensions to auto-vectorize with can lead to potentially large speedups for code that spends a lot of time in one hot loop. Assuming that loop auto-vectorizes well.
e.g. Phoronix recently posted GCC 5 Through GCC 10 Compiler Benchmarks - Five Years Worth Of C/C++ Compiler Performance where they benchmarked on an i7 5960X (Haswell-E) CPU. I think GCC5 knows about -march=haswell. GCC9.2 makes measurably faster code than even gcc8 on some benchmarks.
But I can pretty much guarantee it's not optimal!! Compilers are good over large scales but there's usually something a human can find in a single hot loop, if they know the low level details of optimizing for a given microarchitecture. It's merely as good as you're going to get from any compiler. (Actually performance regressions exist, so even that's not always true. File a missed-optimization bug if you find one).
-march=native does two separate things
CPU feature detection to enable stuff like -mfma and -mbmi2. This is easy on x86 with the CPUID instruction. GCC will enable all extensions it knows about that are supported by the actual CPU. e.g. I think GCC4.8 was the first GCC to know about any AVX512 extensions, so you might even get some AVX512 auto-vectorization on an Ice Lake or Skylake-avx512. Whether it does a good job or not is another matter, for anything non-trivial. But no AVX512 with GCC4.7.
CPU type detection to set -mtune=skylake. This depends on GCC actually recognizing your specific CPU as something it knows about. If not, it falls back to -mtune=generic. It might detect (with CPUID) your L1/L2/L3 cache sizes and use that to influence some tuning decisions like inlining / unrolling, instead of using a known size for -mtune=haswell. I don't think that's a big deal; current compilers don't AFAIK introduce cache-blocking optimizations to matmul loops or things like that, and that's where knowing cache sizes really matters.
CPU type detection can also use CPUID on x86; the vendor-string and model / family / stepping numbers uniquely identify the microarchitecture. ((wikipedia), sandpile, InstLatx64, https://agner.org/optimize/)
x86 is very much designed to support single binaries that run on multiple microarchitectures and might want do to runtime feature detection / dispatching. So an efficient / portable / extensible CPU detection mechanism exists in the form of the CPUID instruction, introduced in Pentium and some late 486 CPUs. (And thus baseline for x86-64.)
Other ISAs are more often used in embedded uses where code gets recompiled for the specific CPU. They mostly don't have as good support for runtime detection. GCC might have to install a handler for SIGILL and just try running some instructions. Or query the OS which knows what's supported, e.g. Linux's /proc/cpuinfo.
Footnote 1:
For x86 specifically, its main claim to fame / reason for popularity is strict backwards compatibility. A new CPU that fails to run some existing programs would be a lot harder to sell, so vendors don't do that. They'll even bend over backwards to go beyond the on-paper ISA docs to make sure existing code keeps working. As former Intel architect Andy Glew said: All or almost all modern Intel processors are stricter than the manual. (For self-modifying code, and in general).
Modern PC motherboard firmwares even still emulate the legacy hardware of an IBM PC/XT when you boot in legacy BIOS mode, as well as implementing a software ABI for disk, keyboard, and screen access. So even bootloaders and stuff like GRUB have a consistent backwards-compatible interface to use, before they load a kernel which has actual drivers for the real hardware that's actually present.
A modern PC can I think still run real MS-DOS (the operating system) binaries in 16-bit real mode.
Adding new instruction opcodes without breaking backwards compat makes variable-length x86 machine code instructions ever more complex, and careless / anti-competitive developments in x86's history haven't helped, leading to more bloated instruction encodings for SSSE3 and later, for example. See Agner Fog's article Stop the instruction set war.
Code that depended on rep foo to decode as foo can break, though: Intel's manuals are pretty clear that random prefixes can cause code to misbehave in future. This makes it safe for Intel or AMD to introduce new instructions that decode in a known way on old CPUs, but do something new on newer CPUs. Like pause = rep nop. Or transactional memory HLE uses prefixes on locked instructions that old CPUs will ignore.
And prefixes like VEX (AVX) and EVEX (AVX512) are carefully chosen to not overlap with valid encodings of instructions, especially in 32-bit mode. See How does the instruction decoder differentiate between EVEX prefix and BOUND opcode in 32-bit mode?. This is one reason why 32-bit mode can still only use 8 vector registers (zmm0..7) even with VEX or EVEX which allow ymm0..15 or zmm0..31 respectively in 64-bit mode. (In 32-bit mode, a VEX prefix is invalid encodings of some opcode. In 64-bit mode, that opcode isn't valid in the first place to the later bytes are more flexible. But to simplify decoder HW they aren't fundamentally different.)
MIPS32r6 / MIPS64r6 in 2014 is one notable example that's not backwards compatible. It rearranged a few opcodes for instructions that stayed the same, and removed some instructions to reuse their opcode for other new instructions, e.g. branches without a delay slot. This is highly unusual and only makes sense for CPUs that are used for embedded systems (like current MIPS). Recompiling everything for MIPS32r6 is not a problem for an embedded system.
Some compiles can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but still of course only for extensions that the compiler knows about when it compiles. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code.
And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.
GCC has some support for this, via its ifunc mechanism, but IIRC you can't do that without source changes.
Intel's compiler (ICC) I think does support multi-versioning some hot functions when auto-vectorizing, with just command-line options.

It can only happen if the new processor is specifically designed to be backwards compatible with older models.
Forget gcc for a moment. You have a compiled X86 binary from year 2000, say, an executable built for the original Windows NT. Will a Skylake CPU run it? You betcha. Will an Itanium CPU run iit? Nope, it is not designed to do that. It is a completely different architecture
Now that executable most probably wouldn't use the Skylake efficiently, but that's the whole point of evolving architectures and introducing new instructions.
Returning to gcc, -march=native is not magic. It cannot possibly divine out the new instructions and new timings. It simply selects the "best" instruction set it knows that is supported by the CPU it runs on. How it's done is architecture specific. X86 CPUs can be queried about their capabilities with the the CPUID instruction. Other architectures may do it differently.
To put it another way, -O3 -march=native optimizes for the machine you compiled on, so it's good when you're compiling code to run on the build host. A binary built with -march=native on a Nehalem system is essentially the same as one built with -march=nehalem on any system. -march=native might detect your specific L3 cache size instead of using a default for that, if any GCC tuning decisions (like inlining or unrolling) depend on L3 size. Except if you run an old compiler on a new CPU it doesn't recognize, you get feature detection for stuff like -mavx but for tuning only tune=generic.
None of this can take advantage of new features like AVX2 or BMI2 when running on a Skylake or Ice Lake system. And some specific tuning decisions that were good on Nehalem might be sub-optimal on a different CPU. (Although this is less likely; Intel mostly maintains backwards compatibility for performance as well as correctness. Getting everyone to recompile everything for P4 didn't work out so they usually try to make existing binaries run well on new CPUs.)
Some compiles can make binaries that do runtime CPU detection and dispatching so they can take advantage of whatever a CPU supports, but only for extensions that the compiler knows about when it compiled. The AVX+FMA machine-code version of a function has to be there in the executable, so a compiler from before those were even announced wouldn't have been able to create such machine code. And before real CPUs with the features were available, compiler devs hadn't had a chance to tune code-gen for those features yet, so a newer compiler might make better code for the same CPU features.

Why isn't the distinction between CPUs more ubiquitous?

I know that every program one writes has to eventually boil down to machine code - that's what compilers produce, that's what executable files consist of, and that's the only language that processors understand. I also know that different processors may have different instruction sets (I know 65c816 assembly, and I imagine it's vastly different from today's computers).
Here's what I'm not getting, though: If there exist different instruction sets, then why do we not seem to have to care about that every time we use software?
If a program was compiled for one particular CPU, it might not run on another - and yet, I never see notices like "Intel users download this version, AMD users download this one". I never have to even be aware of what CPU I'm on, every executable just seems to... work. The same goes for compilers, apparently - there isn't a separate version of, say, GCC, for every processor there is, right?
I'm aware that the differences in instruction sets are much more subtle than they used to be, but even then there should at least be a bit of a distinction. I'm wondering why there doesn't seem to be any.
What is it that I'm not understanding?

There actually are sometimes different versions for Intel/AMD. Even for different versions of Intel and/or AMD. That's not common (especially in the kind of software people usually use) because it's not user friendly (most people don't even know what a CPU is or does, let alone what kind they have exactly), so what often is that either the multiple versions are all in the same executable and selected at runtime, or a "least common denominator" sub-set of instructions is used (when performance is not really a concern). There are significant differences between AMD and Intel though, the most significant one is in which instruction sets they support. AMD always lags behind Intel in implementing Intels new instructions (new sets come out regularly), and Intel usually does not implement AMDs extensions (AMD64 is the big exception (99% accepted by Intel, small changes made), also a couple of instructions here and there were borrowed, but XOP will never happen even though it's awesome, 3DNow! was never adopted by Intel). As an example of software that does not package the code for different "extended instruction sets" in the same executable, see y-cruncher.
To get back to the beginning though, for some (I can't name any off the top of my head, but I've seen it before) high performance software you may see different versions each specifically tailored to get maximum performance on one specific microarchitecture. For example, P4 (netburst) and Core2 are two very different beasts (that's mostly P4's fault for being crazy), even though Core2 is backwards compatible and could run the same code unmodified, that is not necessarily a good idea from a performance perspective.

There is no Intel/AMD versions, because they use the same IS family: x86.
However, there are applications where you have to look out for different versions when you download them. There are still instruction sets that are quite different and might make a program act differently. For example if you have a PowerPC architecture and code a network based application on it, you can forget the little to big endian conversion, but if you compile the same code on x86, which is little endian, the application most likely will produce garbage on the network side.
There is also the difference in how many instructions there are, e.g. RISC vs CISC.
In the end there are a lot of differences to look for and in most programming languages you don't have to worry too much about them though, as the compiler/interpreter will handle most things for you. If you work lower lever then you have to know what you're doing on each architecture.
Also if you compile for ARM, you won't be able to run the program on any other machine, like your PC with x86. It will not work at all.
Because the op codes may/do differ, take the mov instruction, on x86 the op code is 0x88, on ARM it might be 0x13 etc.

The distinction is in fact quite dramatic. Except in the case of Intel vs. AMD. AMD makes their processors compatible with Intel machine code. On purpose of course.
Today there is a move to JIT compiling (Java, .NET, etc.). In this case, the executable file doesn't contain machine code. It contains a simple intermediate language that is compiled just before it is executed, in the machine code of the running processor. This allows the processor architecture to be completely opaque.

AMD is an intel clone. or vice versa depending on your view of the situation. Either way, there is so much in common that programs can be compiled as to run on either (within reason, cant go back 10 years for example, cant make a 32bit processor understand 64 bit specific instructions). Next step is the motherboards have to do similar things and they do, there maybe some intel or amd specific chip support items but then you get into generic peripherals that can be found on either platform or are widespread enough on one or the other platform that the operating systems and/or applications support them.

Is there performance advantage to ARM64

Recently 64-bit ARM mobiles started appearing. But is there any practical advantage to building an application 64-bit? Specifically considering application that does not have much use for the increased virtual address space¹, but would waste some space due to increased pointer size.
So does ARM64 have any other advantages than the larger address that would actually warrant building such application 64bit?
Note: I've seen 64-bit Performance Advantages, but it only mentions x86-64 which does have other improvements besides extended virtual address space. I also recall that the situation is indeed specific to x86 and on some other platforms that went 64-bit like Sparc the usual approach was to only compile kernel and the applications that actually did use lot of memory as 64-bit and everything else as 32-bit.
¹The application is multi-platform and it still needs to be built for and run on devices with as little as 48MiB of memory. Does have some large data that it reads from external storage, but it never needs more than some megabytes of it at once.

I am not sure a general response can be given, but I can provide some examples of differences. There are of course additional differences added in version 8 of the ARM architecture, which apply regardless of target instruction set.
Performance-positive additions in AArch64
32 General-purpose registers gives compilers more wiggle room.
I/D cache synchronization mechanisms accessible from user mode (no system call needed).
Load/Store-Pair instructions makes it possible to load 128-bits of data with one instruction, and still remain RISC-like.
The removal of near-universal conditional execution makes more out-of-ordering possible.
The change in layout of NEON registers (D0 is still lower half of Q0, but D1 is now lower half of Q1 rather than upper half of Q0) makes more out-of-ordering possible.
64-bit pointers make pointer tagging possible.
CSEL enables all kind of crazy optimizations.
Performance-negative changes in AArch64
More registers may also mean higher pressure on the stack.
Larger pointers mean larger memory footprint.
Removal of near-universal conditional execution may cause higher pressure on branch predictor.
Removal of load/store-multiple means more instructions needed for function entry/exit.
Performance-relevant changes in ARMv8-A
Load-Aquire/Store-Release semantics remove need for explicit memory barriers for basic synchronization operations.
I probably forgot lots of things, but those are some of the more obvious changes.

What standard techniques are there for using cpu specific features in DLLs?

Short version: I'm wondering if it's possible, and how best, to utilise CPU specific
instructions within a DLL?
Slightly longer version:
When downloading (32bit) DLLs from, say, Microsoft it seems that one size fits all processors.
Does this mean that they are strictly built for the lowest common denominator (ie. the
minimum platform supported by the OS)?
Or is there some technique that is used to export a single interface within the DLL but utilise
CPU specific code behind the scenes to get optimal performance? And if so, how is it done?

I don't know of any standard technique but if I had to make such a thing, I would write some code in the DllMain() function to detect the CPU type and populate a jump table with function pointers to CPU-optimized versions of each function.
There would also need to be a lowest common denominator function for when the CPU type is unknown.
You can find current CPU info in the registry here:
HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor

The DLL is expected to work on every computer WIN32 runs on, so you are stuck to the i386 instruction set in general. There is no official method of exposing functionality/code for specific instruction sets. You have to do it by hand and transparently.
The technique used basically is as follows:
- determine CPU features like MMX, SSE in runtime
- if they are present, use them, if not, have fallback code ready
Because you cannot let your compiler optimise for anything else than i386, you will have to write the code using the specific instruction sets in inline assembler. I don't know if there are higher-language toolkits for this. Determining the CPU features is straight forward, but could also need to be done in assembler.

An easy way to get the SSE/SSE2 optimizations is to just use the /arch argument for MSVC. I wouldn't worry about fallback--there is no reason to support anything below that unless you have a very niche application.
http://msdn.microsoft.com/en-us/library/7t5yh4fd.aspx
I believe gcc/g++ have equivalent flags.

Intel's ICC can compile code twice, for different architectures. That way, you can have your cake and eat it. (OK, you get two cakes - your DLL will be bigger). And even MSVC2005 can do it for very specific cases (E.g. memcpy() can use SSE4)
There are many ways to switch between different versions. A DLL is loaded, because the loading process needs functions from it. Function names are converted into addresses. One solution is to let this lookup depend on not just function name, but also processor features. Another method uses the fact that the name to address function uses a table of pointers in an interim step; you can switch out the entire table. Or you could even have a branch inside critical functions; so foo() calls foo__sse4 when that's faster.

DLLs you download from Microsoft are targeted for the generic x86 architecture for the simple reason that it has to work across all the multitude of machines out there.
Until the Visual Studio 6.0 time frame (I do not know if it has changed) Microsoft used to optimize its DLLs for size rather than speed. This is because the reduction in the overall size of the DLL gave a higher performance boost than any other optimization that the compiler could generate. This is because speed ups from micro optimization would be decidedly low compared to speed ups from not having the CPU wait for the memory. True improvements in speed come from reducing I/O or from improving the base algorithm.
Only a few critical loops that run at the heart of the program could benefit from micro optimizations simply because of the huge number of times they are invoked. Only about 5-10% of your code might fall in this category. You could rest assured that such critical loops would already be optimized in assembler by the Microsoft software engineers to some level and not leave much behind for the compiler to find. (I know it's expecting too much but I hope they do this)
As you can see, there would be only drawbacks from the increased DLL code that includes additional versions of code that are tuned for different architectures when most of this code is rarely used / are never part of the critical code that consumes most of your CPU cycles.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio