Related
I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use intrinsic instruction _mm256_loadu_pd; the code I've written is:
__m256d d1 = _mm256_loadu_pd(vInOut + i*4);
I've then compiled with options -O3 -mavx -g and subsequently used objdump to get the assembler code plus annotated code and line (objdump -S -M intel -l avx.obj).When I look into the underlying assembler code, I find the following:
vmovupd xmm0,XMMWORD PTR [rsi+rax*1]
vinsertf128 ymm0,ymm0,XMMWORD PTR [rsi+rax*1+0x10],0x1
I was expecting to see this:
vmovupd ymm0,XMMWORD PTR [rsi+rax*1]
and fully use the 256 bit register (ymm0), instead it looks like gcc has decided to fill in the 128 bit part (xmm0) and then load again the other half with vinsertf128.
Is someone able to explain this?
Equivalent code is getting compiled with a single vmovupd in MSVC VS 2012.
I'm running gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0 on Ubuntu 18.04 x86-64.
GCC's default tuning (-mtune=generic) includes -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, because that gives a minor speedup on some CPUs (e.g. first-gen Sandybridge, and some AMD CPUs) in some cases when memory is actually misaligned at runtime.
Use -O3 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store if you don't want this, or better, use -mtune=haswell. Or use -march=native to optimize for your own computer. There's no "generic-avx2" tuning. (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html).
Intel Sandybridge runs 256-bit loads as a single uop that takes 2 cycles in a load port. (Unlike AMD which decodes all 256-bit vector instructions as 2 separate uops.) Sandybridge has a problem with unaligned 256-bit loads (if the address is actually misaligned at runtime). I don't know the details, and haven't found much specific info on exactly what the slowdown is. Perhaps because it uses a banked cache, with 16-byte banks? But IvyBridge handles 256-bit loads better and still has banked cache.
According to the GCC mailing list message about the code that implements the option (https://gcc.gnu.org/ml/gcc-patches/2011-03/msg01847.html), "It speeds up some SPEC CPU 2006 benchmarks by up to 6%." (I think that's for Sandybridge, the only Intel AVX CPU that existed at the time.)
But if memory is actually 32-byte aligned at runtime, this is pure downside even on Sandybridge and most AMD CPUs1. So with this tuning option, you potentially lose just from failing to tell your compiler about alignment guarantees. And if your loop runs on aligned memory most of the time, you'd better compile at least that compilation unit with -mno-avx256-split-unaligned-load or tuning options that imply that.
Splitting in software imposes the cost all the time. Letting hardware handle it makes the aligned case perfectly efficient (except stores on Piledriver1), with the misaligned case possibly slower than with software splitting on some CPUs. So it's the pessimistic approach, and makes sense if it's really likely that the data really is misaligned at runtime, rather than just not guaranteed to always be aligned at compile time. e.g. maybe you have a function that's called most of the time with aligned buffers, but you still want it to work for rare / small cases where it's called with misaligned buffers. In that case, a split-load/store strategy is inappropriate even on Sandybridge.
It's common for buffers to be 16-byte aligned but not 32-byte aligned because malloc on x86-64 glibc (and new in libstdc++) returns 16-byte aligned buffers (because alignof(maxalign_t) == 16). For large buffers, the pointer is normally 16 bytes after the start of a page, so it's always misaligned for alignments larger than 16. Use aligned_alloc instead.
Note that -mavx and -mavx2 don't change tuning options at all: gcc -O3 -mavx2 still tunes for all CPUs, including ones that can't actually run AVX2 instructions. This is pretty dumb, because you should use a single unaligned 256-bit load if tuning for "the average AVX2 CPU". Unfortunately gcc has no option to do that, and -mavx2 doesn't imply -mno-avx256-split-unaligned-load or anything. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 for feature requests to have instruction-set selection influence tuning.
This is why you should use -march=native to make binaries for local use, or maybe -march=sandybridge -mtune=haswell to make binaries that can run on a wide range of machines, but will probably mostly run on newer hardware that has AVX. (Note that even Skylake Pentium/Celeron CPUs don't have AVX or BMI2; probably on CPUs with any defects in the upper half of 256-bit execution units or register files, they disable decoding of VEX prefixes and sell them as low-end Pentium.)
gcc8.2's tuning options are as follows. (-march=x implies -mtune=x). https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html.
I checked on the Godbolt compiler explorer by compiling with -O3 -fverbose-asm and looking at the comments which include a full dump of all implied options. I included _mm256_loadu/storeu_ps functions, and a simple float loop that can auto-vectorize, so we can also look at what the compiler does.
Use -mprefer-vector-width=256 (gcc8) or -mno-prefer-avx128 (gcc7 and earlier) to override tuning options like -mtune=bdver3 and get 256-bit auto-vectorization if you want, instead of only with manual vectorization.
default / -mtune=generic: both -mavx256-split-unaligned-load and -store. Arguably less and less appropriate as Intel Haswell and later become more common, and the downside on recent AMD CPUs is I think still small. Especially splitting unaligned loads, which AMD tuning options don't enable.
-march=sandybridge and -march=ivybridge: split both. (I think I've read that IvyBridge improved handling of unaligned 256-bit loads or stores, so it's less appropriate for cases where the data might be aligned at runtime.)
-march=haswell and later: neither splitting option enabled.
-march=knl: neither splitting option enabled. (Silvermont/Atom don't have AVX)
-mtune=intel: neither splitting option enabled. Even with gcc8, auto-vectorization with -mtune=intel -mavx chooses to reach an alignment boundary for the read/write destination array, unlike gcc8's normal strategy of just using unaligned. (Again, another case of software handling that always has a cost vs. letting the hardware deal with the exceptional case.)
-march=bdver1 (Bulldozer): -mavx256-split-unaligned-store, but not loads.
It also sets the gcc8 equivalent gcc7 and earlier -mprefer-avx128 (auto-vectorization will only use 128-bit AVX, but of course intrinsics can still use 256-bit vectors).
-march=bdver2 (Piledriver), bdver3 (Steamroller), bdver4 (Excavator). same as Bulldozer. They auto-vectorize an FP a[i] += b[i] loop with software prefetch and enough unrolling to only prefetch once per cache line!
-march=znver1 (Zen): -mavx256-split-unaligned-store but not loads, still auto-vectorizing with only 128-bit, but this time without SW prefetch.
-march=btver2 (AMD Fam16h, aka Jaguar): neither splitting option enabled, auto-vectorizing like Bulldozer-family with only 128-bit vectors + SW prefetch.
-march=eden-x4 (Via Eden with AVX2): neither splitting option enabled, but the -march option doesn't even enable -mavx, and auto-vectorization uses movlps / movhps 8-byte loads, which is really dumb. At least use movsd instead of movlps to break the false dependency. But if you enable -mavx, it uses 128-bit unaligned loads. Really weird / inconsistent behaviour here, unless there's some strange front-end for this.
options (enabled as part of -march=sandybridge for example, presumably also for Bulldozer-family (-march=bdver2 is piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.
Footnote 1: AMD Piledriver has a performance bug that makes 256-bit store throughput terrible: even vmovaps [mem], ymm aligned stores running one per 17 to 20 clocks according to Agner Fog's microarch pdf (https://agner.org/optimize/). This effect isn't present in Bulldozer or Steamroller/Excavator.
Agner Fog says 256-bit AVX throughput in general (not loads/stores specifically) on Bulldozer/Piledriver is typically worse than 128-bit AVX, partly because it can't decode instructions in a 2-2 uop pattern. Steamroller makes 256-bit close to break-even (if it doesn't cost extra shuffles). But register-register vmovaps ymm instructions still only benefit from mov-elimination for the low 128 bits on Bulldozer-family.
But closed-source software or binary distributions typically don't have the luxury of building with -march=native on every target architecture, so there's a tradeoff when making a binary that can run on any AVX-supporting CPU. Gaining big speedup with 256-bit code on some CPUs is typically worth it as long as there aren't catastrophic downsides on other CPUs.
Splitting unaligned loads/stores is an attempt to avoid big problems on some CPUs. It costs extra uop throughput, and extra ALU uops, on recent CPUs. But at least vinsertf128 ymm, [mem], 1 doesn't need the shuffle unit on port 5 on Haswell/Skylake: it can run on any vector ALU port. (And it doesn't micro-fuse, so it costs 2 uops of front-end bandwidth.)
PS:
Most code isn't compiled by bleeding edge compilers, so changing the "generic" tuning now will take a while before code compiled with an updated tuning will get into use. (Of course, most code is compiled with just -O2 or -O3, and this option only affects AVX code-gen anyway. But many people unfortunately use -O3 -mavx2 instead of -O3 -march=native. So they can miss out on FMA, BMI1/2, popcnt, and other things their CPU supports.
GCC's generic tuning splits unaligned 256-bit loads to help older processors. (Subsequent changes avoid splitting loads in generic tuning, I believe.)
You can tune for more recent Intel CPUs using something like -mtune=intel or -mtune=skylake, and you will get a single instruction, as intended.
I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use intrinsic instruction _mm256_loadu_pd; the code I've written is:
__m256d d1 = _mm256_loadu_pd(vInOut + i*4);
I've then compiled with options -O3 -mavx -g and subsequently used objdump to get the assembler code plus annotated code and line (objdump -S -M intel -l avx.obj).When I look into the underlying assembler code, I find the following:
vmovupd xmm0,XMMWORD PTR [rsi+rax*1]
vinsertf128 ymm0,ymm0,XMMWORD PTR [rsi+rax*1+0x10],0x1
I was expecting to see this:
vmovupd ymm0,XMMWORD PTR [rsi+rax*1]
and fully use the 256 bit register (ymm0), instead it looks like gcc has decided to fill in the 128 bit part (xmm0) and then load again the other half with vinsertf128.
Is someone able to explain this?
Equivalent code is getting compiled with a single vmovupd in MSVC VS 2012.
I'm running gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0 on Ubuntu 18.04 x86-64.
GCC's default tuning (-mtune=generic) includes -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, because that gives a minor speedup on some CPUs (e.g. first-gen Sandybridge, and some AMD CPUs) in some cases when memory is actually misaligned at runtime.
Use -O3 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store if you don't want this, or better, use -mtune=haswell. Or use -march=native to optimize for your own computer. There's no "generic-avx2" tuning. (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html).
Intel Sandybridge runs 256-bit loads as a single uop that takes 2 cycles in a load port. (Unlike AMD which decodes all 256-bit vector instructions as 2 separate uops.) Sandybridge has a problem with unaligned 256-bit loads (if the address is actually misaligned at runtime). I don't know the details, and haven't found much specific info on exactly what the slowdown is. Perhaps because it uses a banked cache, with 16-byte banks? But IvyBridge handles 256-bit loads better and still has banked cache.
According to the GCC mailing list message about the code that implements the option (https://gcc.gnu.org/ml/gcc-patches/2011-03/msg01847.html), "It speeds up some SPEC CPU 2006 benchmarks by up to 6%." (I think that's for Sandybridge, the only Intel AVX CPU that existed at the time.)
But if memory is actually 32-byte aligned at runtime, this is pure downside even on Sandybridge and most AMD CPUs1. So with this tuning option, you potentially lose just from failing to tell your compiler about alignment guarantees. And if your loop runs on aligned memory most of the time, you'd better compile at least that compilation unit with -mno-avx256-split-unaligned-load or tuning options that imply that.
Splitting in software imposes the cost all the time. Letting hardware handle it makes the aligned case perfectly efficient (except stores on Piledriver1), with the misaligned case possibly slower than with software splitting on some CPUs. So it's the pessimistic approach, and makes sense if it's really likely that the data really is misaligned at runtime, rather than just not guaranteed to always be aligned at compile time. e.g. maybe you have a function that's called most of the time with aligned buffers, but you still want it to work for rare / small cases where it's called with misaligned buffers. In that case, a split-load/store strategy is inappropriate even on Sandybridge.
It's common for buffers to be 16-byte aligned but not 32-byte aligned because malloc on x86-64 glibc (and new in libstdc++) returns 16-byte aligned buffers (because alignof(maxalign_t) == 16). For large buffers, the pointer is normally 16 bytes after the start of a page, so it's always misaligned for alignments larger than 16. Use aligned_alloc instead.
Note that -mavx and -mavx2 don't change tuning options at all: gcc -O3 -mavx2 still tunes for all CPUs, including ones that can't actually run AVX2 instructions. This is pretty dumb, because you should use a single unaligned 256-bit load if tuning for "the average AVX2 CPU". Unfortunately gcc has no option to do that, and -mavx2 doesn't imply -mno-avx256-split-unaligned-load or anything. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 for feature requests to have instruction-set selection influence tuning.
This is why you should use -march=native to make binaries for local use, or maybe -march=sandybridge -mtune=haswell to make binaries that can run on a wide range of machines, but will probably mostly run on newer hardware that has AVX. (Note that even Skylake Pentium/Celeron CPUs don't have AVX or BMI2; probably on CPUs with any defects in the upper half of 256-bit execution units or register files, they disable decoding of VEX prefixes and sell them as low-end Pentium.)
gcc8.2's tuning options are as follows. (-march=x implies -mtune=x). https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html.
I checked on the Godbolt compiler explorer by compiling with -O3 -fverbose-asm and looking at the comments which include a full dump of all implied options. I included _mm256_loadu/storeu_ps functions, and a simple float loop that can auto-vectorize, so we can also look at what the compiler does.
Use -mprefer-vector-width=256 (gcc8) or -mno-prefer-avx128 (gcc7 and earlier) to override tuning options like -mtune=bdver3 and get 256-bit auto-vectorization if you want, instead of only with manual vectorization.
default / -mtune=generic: both -mavx256-split-unaligned-load and -store. Arguably less and less appropriate as Intel Haswell and later become more common, and the downside on recent AMD CPUs is I think still small. Especially splitting unaligned loads, which AMD tuning options don't enable.
-march=sandybridge and -march=ivybridge: split both. (I think I've read that IvyBridge improved handling of unaligned 256-bit loads or stores, so it's less appropriate for cases where the data might be aligned at runtime.)
-march=haswell and later: neither splitting option enabled.
-march=knl: neither splitting option enabled. (Silvermont/Atom don't have AVX)
-mtune=intel: neither splitting option enabled. Even with gcc8, auto-vectorization with -mtune=intel -mavx chooses to reach an alignment boundary for the read/write destination array, unlike gcc8's normal strategy of just using unaligned. (Again, another case of software handling that always has a cost vs. letting the hardware deal with the exceptional case.)
-march=bdver1 (Bulldozer): -mavx256-split-unaligned-store, but not loads.
It also sets the gcc8 equivalent gcc7 and earlier -mprefer-avx128 (auto-vectorization will only use 128-bit AVX, but of course intrinsics can still use 256-bit vectors).
-march=bdver2 (Piledriver), bdver3 (Steamroller), bdver4 (Excavator). same as Bulldozer. They auto-vectorize an FP a[i] += b[i] loop with software prefetch and enough unrolling to only prefetch once per cache line!
-march=znver1 (Zen): -mavx256-split-unaligned-store but not loads, still auto-vectorizing with only 128-bit, but this time without SW prefetch.
-march=btver2 (AMD Fam16h, aka Jaguar): neither splitting option enabled, auto-vectorizing like Bulldozer-family with only 128-bit vectors + SW prefetch.
-march=eden-x4 (Via Eden with AVX2): neither splitting option enabled, but the -march option doesn't even enable -mavx, and auto-vectorization uses movlps / movhps 8-byte loads, which is really dumb. At least use movsd instead of movlps to break the false dependency. But if you enable -mavx, it uses 128-bit unaligned loads. Really weird / inconsistent behaviour here, unless there's some strange front-end for this.
options (enabled as part of -march=sandybridge for example, presumably also for Bulldozer-family (-march=bdver2 is piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.
Footnote 1: AMD Piledriver has a performance bug that makes 256-bit store throughput terrible: even vmovaps [mem], ymm aligned stores running one per 17 to 20 clocks according to Agner Fog's microarch pdf (https://agner.org/optimize/). This effect isn't present in Bulldozer or Steamroller/Excavator.
Agner Fog says 256-bit AVX throughput in general (not loads/stores specifically) on Bulldozer/Piledriver is typically worse than 128-bit AVX, partly because it can't decode instructions in a 2-2 uop pattern. Steamroller makes 256-bit close to break-even (if it doesn't cost extra shuffles). But register-register vmovaps ymm instructions still only benefit from mov-elimination for the low 128 bits on Bulldozer-family.
But closed-source software or binary distributions typically don't have the luxury of building with -march=native on every target architecture, so there's a tradeoff when making a binary that can run on any AVX-supporting CPU. Gaining big speedup with 256-bit code on some CPUs is typically worth it as long as there aren't catastrophic downsides on other CPUs.
Splitting unaligned loads/stores is an attempt to avoid big problems on some CPUs. It costs extra uop throughput, and extra ALU uops, on recent CPUs. But at least vinsertf128 ymm, [mem], 1 doesn't need the shuffle unit on port 5 on Haswell/Skylake: it can run on any vector ALU port. (And it doesn't micro-fuse, so it costs 2 uops of front-end bandwidth.)
PS:
Most code isn't compiled by bleeding edge compilers, so changing the "generic" tuning now will take a while before code compiled with an updated tuning will get into use. (Of course, most code is compiled with just -O2 or -O3, and this option only affects AVX code-gen anyway. But many people unfortunately use -O3 -mavx2 instead of -O3 -march=native. So they can miss out on FMA, BMI1/2, popcnt, and other things their CPU supports.
GCC's generic tuning splits unaligned 256-bit loads to help older processors. (Subsequent changes avoid splitting loads in generic tuning, I believe.)
You can tune for more recent Intel CPUs using something like -mtune=intel or -mtune=skylake, and you will get a single instruction, as intended.
How do the performance of atomic operations operating on the natural word size of the processor (4-byte or 8-byte) compare to that of operating on other sizes (2-byte or 1-byte)?
If I need to maintain a boolean atomic variable, I'm trying to figure out the what the best practice is: use 1-byte to optimize for space, or 4/8-byte to (potentially) optimize for performance.
http://agner.org/optimize/ for lots of details.
On x86, an array of 1-byte data should be good. It can be loaded with movzx (zero-extend) just as fast as with a plain mov.
x86 has bit ops to support atomic bitfields, if you want to pack your data by another factor of 8. I'm not sure how well compilers will do at making efficient code for that case, though. Even a write-only operation requires a slow atomic RMW cycle for the byte holding the bit you want to write. (On x86, it would a lock OR instruction, which is a full memory barrier. It's 8 uops on Intel Haswell, vs. 1 for a byte store. A factor of 19 in throughput.) This is probably still worth it if it means the difference between lots of cache misses and few cache misses, esp. if most of the access is read-only. (Reading a bit is fast, exactly the same as the non-atomic case.)
2-byte (16bit) operations are potentially slow on x86, esp. on Intel CPUs. Intel instruction decoders slow down a lot when they have to decode an instruction with a 16bit immediate operand. This is the dreaded LCP stall from the operand-size prefix. (8b ops have a whole different opcode, and 32 vs. 64bit is selected by the REX prefix, which doesn't slow down the decoders). So 16b is the odd-one-out, and you should be careful using it. Prefer to load 16b memory into 32b variables to avoid partial-register penalties and 16bit immediates when working with a temporary. (AMD CPUs aren't quite as efficient at handling movzx loads (takes an ALU unit and extra 1 cycle latency), but the savings in memory are still almost always worth it (for cache reasons)).
32b is the "optimal" size to use for local scratch variables. No prefix is needed to select that size (increasing code density), and there won't be partial-register stalls or extra uops when using the full register again after using the low 8b. I believe this is the purpose of the int_fast32_t type, but on x86 Linux that type is unfortunately 64bit.
Microsoft offers the InterlockedCompareExchange function for performing atomic compare-and-swap operations. There is also an _InterlockedCompareExchange intrinsic.
On x86 these are implemented using the lock cmpxchg instruction.
However, reading through the documentation on these three approaches, they don't seem to agree on the alignment requirements.
Intel's reference manual says nothing about alignment (other than that if alignment checking is enabled and an unaligned memory reference is made, an exception is generated)
I also looked up the lock prefix, which specifically states that
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
(emphasis mine)
So Intel seems to say that alignment is irrelevant. The operation will be atomic no matter what.
The _InterlockedCompareExchange intrinsic documentation also says nothing about alignment, however the InterlockedCompareExchange function states that
The parameters for this function must be aligned on a 32-bit boundary; otherwise, the function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
So what gives?
Are the alignment requirements for InterlockedCompareExchange just to make sure the function will work even on pre-486 CPU's where the cmpxchg instruction isn't available?
That seems likely based on the above information, but I'd like to be sure before I rely on it. :)
Or is alignment required by the ISA to guarantee atomicity, and I'm just looking the wrong places in Intel's reference manuals?
x86 does not require alignment for a lock cmpxchg instruction to be atomic. However, alignment is necessary for good performance.
This should be no surprise, backward compatibility means that software written with a manual from 14 years ago will still run on today's processors. Modern CPUs even have a performance counter specifically for split-lock detection because it's so expensive. (The core can't just hold onto exclusive access to a single cache line for the duration of the operation; it does have to do something like a traditional bus lock).
Why exactly Microsoft documents an alignment requirement is not clear. It's certainly necessary for supporting RISC architectures, but the specific claim of unpredictable behaviour on multiprocessor x86 might not even be valid. (Unless they mean unpredictable performance, rather than a correctness problem.)
Your guess of applying only to pre-486 systems without lock cmpxchg might be right; a different mechanism would be needed there which might have required some kind of locking around pure loads or pure stores. (Also note that 486 cmpxchg has a different and currently-undocumented opcode (0f a7) from modern cmpxchg (0f b1) which was new with 586 Pentium; Windows might have only used cmpxchg on P5 Pentium and later, I don't know.) That could maybe explain weirdness on some x86, without implying weirdness on modern x86.
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A): System Programming Guide
January 2013
8.1.2.2 Software Controlled Bus Locking
To explicitly force the LOCK semantics, software can use the LOCK prefix with the following instructions when they are used to modify a memory location. [...]
• The exchange instructions (XADD, CMPXCHG, and CMPXCHG8B).
• The LOCK prefix is automatically assumed for XCHG instruction.
• [...]
[...] The integrity of a bus lock is not affected by the alignment of the
memory field. The LOCK semantics are followed for as many bus cycles
as necessary to update the entire operand. However, it is recommend
that locked accesses be aligned on their natural boundaries for better
system performance:
• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
• 64-bit boundary for locked quadword accesses.
Fun fact: cmpxchg without a lock prefix is still atomic wrt. context switches, so is usable for multi-threading on a single-core system.
Even misaligned it's still atomic wrt. interrupts (either completely before or completely after), and only memory reads by other devices (e.g. DMA) could see tearing. But such accesses could also see the separation between load and store, so even if old Windows did use that for a more efficient InterlockedCompareExchange on single-core systems, it still wouldn't require alignment for correctness, only performance. If this can be used for hardware access, Windows probably wouldn't do that.
If the library function needed to do a pure load separate from the lock cmpxchg this might make sense, but it doesn't need to do that. (If not inlined, the 32-bit version would have to load its args from the stack, but that's private, not access to the shared variable.)
The PDF you are quoting from is from 1999 and CLEARLY outdated.
The up-to-date Intel documentation, specifically Volume-3A tells a different story.
For example, on a Core-i7 processor, you STILL have to make sure your data doesn't not span over cache-lines, or else the operation is NOT guaranteed to be atomic.
On Volume 3A, System Programming, For x86/x64 Intel clearly states:
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following
basic memory operations will always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following
additional memory operations will always be carried out atomically:
Reading or writing a quadword aligned on a 64-bit boundary
16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following
additional memory operation will always be carried out atomically:
Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache
line
Accesses to cacheable memory that are split across cache lines and page boundaries
are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core
Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors.
The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon,
and P6 family processors provide bus control signals that permit external memory
subsystems to make split accesses atomic; however, nonaligned data accesses will
seriously impact the performance of the processor and should be avoided
See this SO question: natural alignment is important for performance, and is required on the x64 architecture (so it's not just PRE-x86 systems, but POST-x86 ones too -- x64 may still be a bit of a niche case but it's growing in popularity after all;-); that may be why Microsoft documents it as required (hard to find docs on whether MS has decided to FORCE the alignment issue by enabling alignment checking -- that may vary by Windows version; by claiming in the docs that alignment is required, MS keeps the freedom to force it in some version of Windows even if they did not force it on others).
Microsoft's Interlocked APIs also applied to ia64 (while it still existed). There was no lock prefix on ia64, only the cmpxchg.acq and cmpxchg.rel instructions (or fetchadd and other similar beasties), and these all required alignment if I recall correctly.
I can understand this requirement for the old PPC RISC systems and even for x86-64, but for the old tried-and-true x86? In this case, the stack needs to be aligned on 4 byte boundaries only. Yes, some of the MMX/SSE instructions require 16byte alignments, but if that is a requirement of the callee, then it should ensure the alignments are correct. Why burden every caller with this extra requirement? This can actually cause some drops in performance because every call-site must manage this requirement. Am I missing something?
Update: After some more investigation into this and some consultation with some internal colleagues, I have some theories about this:
Consistency between the PPC, x86, and x64 version of the OS
It seems that the GCC codegen now consistently does a sub esp,xxx and then "mov"s the data onto the stack rather than simply doing a "push" instruction. This could actually be faster on some hardware.
While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention where the caller cleans up the stack.
The issue I have with the last item, is that for calling conventions that rely on the callee cleaning the stack, the above requirements really "uglifies" the codegen. For instance, what some compiler decided to implement a faster register-based calling style for its own internal use (ie any code that isn't intended to be called from other languages or sources)? This stack-alignment thing could negate some of the performance gains achieved by passing some parameters in registers.
Update: So far the only real answers have been consistency, but to me that's a bit too easy of an answer. I have well over 20 years experience with the x86 architecture and if consistency, not performance, or something else concrete, is really the reason then I respectfully suggest that is a bit naive for the developers to require it. They're ignoring nearly three decades of tools and support. Especially if they're expecting tools vendors to quickly and easily adapt their tools for their platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.
I'll give this topic another day or so then close it...
Related
It’s my stack frame, I don’t care about your stack frame!
From "Intel®64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:
"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."
From Appendix D:
"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."
http://www.intel.com/Assets/PDF/manual/248966.pdf
I am not sure as I don't have first hand proof, but I believe the reason is SSE. SSE is much faster if your buffers are already aligned on a 16 bytes boundary (movps vs movups), and any x86 has at least sse2 for mac os x. It can be taken care of by the application user, but the cost is pretty significant. If the overall cost for making it mandatory in the ABI is not too significant, it may worth it. SSE is used quite pervasively in mac os X: accelerate framework, etc...
I believe it's to keep it inline with the x86-64 ABI.
First, note that the 16 bytes alignment is an exception introduced by Apple to the System V IA-32 ABI.
The stack alignment is only needed when calling system functions, because many system libraries are using SSE or Altivec extensions which require the 16 bytes alignment. I found an explicit reference in the libgmalloc MAN page.
You can perfectly handle your stack frame the way you want, but if you try to call a system function with a misaligned stack, you will end up with a misaligned_stack_error message.
Edit:
For the record, you can get rid of alignment problems when compiling with GCC by using the mstack-realign option.
This is an efficiency issue.
Making sure the stack is 16-byte aligned in every function that uses the new SSE instructions adds a lot of overhead for using those instructions, effectively reducing performance.
On the other hand, keeping the stack 16-byte aligned at all times ensures that you can use SSE instructions freely with no performance penalty. There is no cost to this (cost measured in instructions at least). It only involves changing a constant in the prologue of the function.
Wasting stack space is cheap, it is probably the hottest part of the cache.
My guess is that Apple believes everyone just uses XCode (gcc) which aligns the stack for you. So requiring the stack to be aligned so the kernel doesn't have to is just a micro-optimization.
While I cannot really answer your question of WHY, you may find the manuals at the following site useful:
http://www.agner.org/optimize/
Regarding the ABI, have a look especially at:
http://www.agner.org/optimize/calling_conventions.pdf
Hope that's useful.
Hmm, didn't OS X ABI also do funny RISC like things like passing small structs in registers?
So that points to the consistency with other platforms theory.
Come to think of it, the FreeBSD syscall api also aligns 64-bit values. (like e.g. lseek and mmap)
In order to maintain consistency in kernel. This allows the same kernel to be booted on multiple architectures without modicfication.
Not sure why no one has considered the possibility of easy portability from legacy PowerPC-based platform?
Read this:
http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20
And then zoomed into "32-bit PowerPC Function Calling Conventions" and finally this:
"These are the embedding alignment modes available in the 32-bit
PowerPC environment:
Power alignment mode is derived from the alignment rules used by the
IBM XLC compiler for the AIX operating system. It is the default
alignment mode for the PowerPC-architecture version of GCC used on AIX
and Mac OS X. Because this mode is most likely to be compatible
between PowerPC-architecture compilers from different vendors, it’s
typically used with data structures that are shared between different
programs."
In view of the legacy PowerPC-based background of OSX, portability is a major consideration - it dictates following the convention all the way back to AIX's XLC compiler. When you think in terms of the need to make sure all the tools and applications will work together with minimal rework, I think it is important to stick to the same legacy ABI as far as possible.
That gives the philosophy, and reading further is the rule explicitly mentioned ("Prolog and Epilog"):
The called function is responsible for allocating
its own stack frame, making sure to preserve 16-byte alignment in the
stack. This operation is accomplished by a section of code called the
prolog, which the compiler places before the body of the subroutine.
After the body of the subroutine, the compiler places an epilog to
restore the processor to the state it was prior to the subroutine
call.