Does modern x86-64 cpu still benefit from memory data alignment?

Does modern x86-64 cpu still benefit from memory data alignment? - performance

As titled,
I've searched a lot of old articles on the Internet regarding memory data alignment, but I am not sure whether they are still useful nowadays. So, the question is regarding modern x86-64 CPUs, whether the memory data alignment is still beneficial for efficient data access? Or just an old convention adopted by all compilers for backward compatibility?

Yes for arrays because it means you'll avoid cache-line and page splits. See How can I accurately benchmark unaligned access speed on x86_64? for details on various penalties and how to measure them.
This is especially significant for vectorization with AVX-512, where looping over a misaligned array with 512-bit vectors means every load is a cache-line split. The penalty can be ~20%, vs. a few % with AVX 256-bit vectors on the same CPU even for data that's coming from L3 or DRAM, not L2 or L1d hits.
No for misalignment within a single cache line, for both integer and SIMD loads/stores on modern AMD and Intel microarchitectures. (Except for legacy-SSE where only aligned loads can be folded into a memory source operand like addps xmm0, [rdi] instead of separate movups. Unlike AVX where vaddps xmm0, xmm0, [rdi] doesn't require alignment.)
And yes potentially indirectly in terms of keeping all a struct's members in the same cache line, improving spatial locality.

Related

gcc assembly for stack vs heap access [duplicate]

I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use intrinsic instruction _mm256_loadu_pd; the code I've written is:
__m256d d1 = _mm256_loadu_pd(vInOut + i*4);
I've then compiled with options -O3 -mavx -g and subsequently used objdump to get the assembler code plus annotated code and line (objdump -S -M intel -l avx.obj).When I look into the underlying assembler code, I find the following:
vmovupd xmm0,XMMWORD PTR [rsi+rax*1]
vinsertf128 ymm0,ymm0,XMMWORD PTR [rsi+rax*1+0x10],0x1
I was expecting to see this:
vmovupd ymm0,XMMWORD PTR [rsi+rax*1]
and fully use the 256 bit register (ymm0), instead it looks like gcc has decided to fill in the 128 bit part (xmm0) and then load again the other half with vinsertf128.
Is someone able to explain this?
Equivalent code is getting compiled with a single vmovupd in MSVC VS 2012.
I'm running gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0 on Ubuntu 18.04 x86-64.

GCC's default tuning (-mtune=generic) includes -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, because that gives a minor speedup on some CPUs (e.g. first-gen Sandybridge, and some AMD CPUs) in some cases when memory is actually misaligned at runtime.
Use -O3 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store if you don't want this, or better, use -mtune=haswell. Or use -march=native to optimize for your own computer. There's no "generic-avx2" tuning. (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html).
Intel Sandybridge runs 256-bit loads as a single uop that takes 2 cycles in a load port. (Unlike AMD which decodes all 256-bit vector instructions as 2 separate uops.) Sandybridge has a problem with unaligned 256-bit loads (if the address is actually misaligned at runtime). I don't know the details, and haven't found much specific info on exactly what the slowdown is. Perhaps because it uses a banked cache, with 16-byte banks? But IvyBridge handles 256-bit loads better and still has banked cache.
According to the GCC mailing list message about the code that implements the option (https://gcc.gnu.org/ml/gcc-patches/2011-03/msg01847.html), "It speeds up some SPEC CPU 2006 benchmarks by up to 6%." (I think that's for Sandybridge, the only Intel AVX CPU that existed at the time.)
But if memory is actually 32-byte aligned at runtime, this is pure downside even on Sandybridge and most AMD CPUs1. So with this tuning option, you potentially lose just from failing to tell your compiler about alignment guarantees. And if your loop runs on aligned memory most of the time, you'd better compile at least that compilation unit with -mno-avx256-split-unaligned-load or tuning options that imply that.
Splitting in software imposes the cost all the time. Letting hardware handle it makes the aligned case perfectly efficient (except stores on Piledriver1), with the misaligned case possibly slower than with software splitting on some CPUs. So it's the pessimistic approach, and makes sense if it's really likely that the data really is misaligned at runtime, rather than just not guaranteed to always be aligned at compile time. e.g. maybe you have a function that's called most of the time with aligned buffers, but you still want it to work for rare / small cases where it's called with misaligned buffers. In that case, a split-load/store strategy is inappropriate even on Sandybridge.
It's common for buffers to be 16-byte aligned but not 32-byte aligned because malloc on x86-64 glibc (and new in libstdc++) returns 16-byte aligned buffers (because alignof(maxalign_t) == 16). For large buffers, the pointer is normally 16 bytes after the start of a page, so it's always misaligned for alignments larger than 16. Use aligned_alloc instead.
Note that -mavx and -mavx2 don't change tuning options at all: gcc -O3 -mavx2 still tunes for all CPUs, including ones that can't actually run AVX2 instructions. This is pretty dumb, because you should use a single unaligned 256-bit load if tuning for "the average AVX2 CPU". Unfortunately gcc has no option to do that, and -mavx2 doesn't imply -mno-avx256-split-unaligned-load or anything. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 for feature requests to have instruction-set selection influence tuning.
This is why you should use -march=native to make binaries for local use, or maybe -march=sandybridge -mtune=haswell to make binaries that can run on a wide range of machines, but will probably mostly run on newer hardware that has AVX. (Note that even Skylake Pentium/Celeron CPUs don't have AVX or BMI2; probably on CPUs with any defects in the upper half of 256-bit execution units or register files, they disable decoding of VEX prefixes and sell them as low-end Pentium.)
gcc8.2's tuning options are as follows. (-march=x implies -mtune=x). https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html.
I checked on the Godbolt compiler explorer by compiling with -O3 -fverbose-asm and looking at the comments which include a full dump of all implied options. I included _mm256_loadu/storeu_ps functions, and a simple float loop that can auto-vectorize, so we can also look at what the compiler does.
Use -mprefer-vector-width=256 (gcc8) or -mno-prefer-avx128 (gcc7 and earlier) to override tuning options like -mtune=bdver3 and get 256-bit auto-vectorization if you want, instead of only with manual vectorization.
default / -mtune=generic: both -mavx256-split-unaligned-load and -store. Arguably less and less appropriate as Intel Haswell and later become more common, and the downside on recent AMD CPUs is I think still small. Especially splitting unaligned loads, which AMD tuning options don't enable.
-march=sandybridge and -march=ivybridge: split both. (I think I've read that IvyBridge improved handling of unaligned 256-bit loads or stores, so it's less appropriate for cases where the data might be aligned at runtime.)
-march=haswell and later: neither splitting option enabled.
-march=knl: neither splitting option enabled. (Silvermont/Atom don't have AVX)
-mtune=intel: neither splitting option enabled. Even with gcc8, auto-vectorization with -mtune=intel -mavx chooses to reach an alignment boundary for the read/write destination array, unlike gcc8's normal strategy of just using unaligned. (Again, another case of software handling that always has a cost vs. letting the hardware deal with the exceptional case.)
-march=bdver1 (Bulldozer): -mavx256-split-unaligned-store, but not loads.
It also sets the gcc8 equivalent gcc7 and earlier -mprefer-avx128 (auto-vectorization will only use 128-bit AVX, but of course intrinsics can still use 256-bit vectors).
-march=bdver2 (Piledriver), bdver3 (Steamroller), bdver4 (Excavator). same as Bulldozer. They auto-vectorize an FP a[i] += b[i] loop with software prefetch and enough unrolling to only prefetch once per cache line!
-march=znver1 (Zen): -mavx256-split-unaligned-store but not loads, still auto-vectorizing with only 128-bit, but this time without SW prefetch.
-march=btver2 (AMD Fam16h, aka Jaguar): neither splitting option enabled, auto-vectorizing like Bulldozer-family with only 128-bit vectors + SW prefetch.
-march=eden-x4 (Via Eden with AVX2): neither splitting option enabled, but the -march option doesn't even enable -mavx, and auto-vectorization uses movlps / movhps 8-byte loads, which is really dumb. At least use movsd instead of movlps to break the false dependency. But if you enable -mavx, it uses 128-bit unaligned loads. Really weird / inconsistent behaviour here, unless there's some strange front-end for this.
options (enabled as part of -march=sandybridge for example, presumably also for Bulldozer-family (-march=bdver2 is piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.
Footnote 1: AMD Piledriver has a performance bug that makes 256-bit store throughput terrible: even vmovaps [mem], ymm aligned stores running one per 17 to 20 clocks according to Agner Fog's microarch pdf (https://agner.org/optimize/). This effect isn't present in Bulldozer or Steamroller/Excavator.
Agner Fog says 256-bit AVX throughput in general (not loads/stores specifically) on Bulldozer/Piledriver is typically worse than 128-bit AVX, partly because it can't decode instructions in a 2-2 uop pattern. Steamroller makes 256-bit close to break-even (if it doesn't cost extra shuffles). But register-register vmovaps ymm instructions still only benefit from mov-elimination for the low 128 bits on Bulldozer-family.
But closed-source software or binary distributions typically don't have the luxury of building with -march=native on every target architecture, so there's a tradeoff when making a binary that can run on any AVX-supporting CPU. Gaining big speedup with 256-bit code on some CPUs is typically worth it as long as there aren't catastrophic downsides on other CPUs.
Splitting unaligned loads/stores is an attempt to avoid big problems on some CPUs. It costs extra uop throughput, and extra ALU uops, on recent CPUs. But at least vinsertf128 ymm, [mem], 1 doesn't need the shuffle unit on port 5 on Haswell/Skylake: it can run on any vector ALU port. (And it doesn't micro-fuse, so it costs 2 uops of front-end bandwidth.)
PS:
Most code isn't compiled by bleeding edge compilers, so changing the "generic" tuning now will take a while before code compiled with an updated tuning will get into use. (Of course, most code is compiled with just -O2 or -O3, and this option only affects AVX code-gen anyway. But many people unfortunately use -O3 -mavx2 instead of -O3 -march=native. So they can miss out on FMA, BMI1/2, popcnt, and other things their CPU supports.

GCC's generic tuning splits unaligned 256-bit loads to help older processors. (Subsequent changes avoid splitting loads in generic tuning, I believe.)
You can tune for more recent Intel CPUs using something like -mtune=intel or -mtune=skylake, and you will get a single instruction, as intended.

Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?

I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use intrinsic instruction _mm256_loadu_pd; the code I've written is:
__m256d d1 = _mm256_loadu_pd(vInOut + i*4);
I've then compiled with options -O3 -mavx -g and subsequently used objdump to get the assembler code plus annotated code and line (objdump -S -M intel -l avx.obj).When I look into the underlying assembler code, I find the following:
vmovupd xmm0,XMMWORD PTR [rsi+rax*1]
vinsertf128 ymm0,ymm0,XMMWORD PTR [rsi+rax*1+0x10],0x1
I was expecting to see this:
vmovupd ymm0,XMMWORD PTR [rsi+rax*1]
and fully use the 256 bit register (ymm0), instead it looks like gcc has decided to fill in the 128 bit part (xmm0) and then load again the other half with vinsertf128.
Is someone able to explain this?
Equivalent code is getting compiled with a single vmovupd in MSVC VS 2012.
I'm running gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0 on Ubuntu 18.04 x86-64.

GCC's generic tuning splits unaligned 256-bit loads to help older processors. (Subsequent changes avoid splitting loads in generic tuning, I believe.)
You can tune for more recent Intel CPUs using something like -mtune=intel or -mtune=skylake, and you will get a single instruction, as intended.

What's the cost of moving data between VPU and FPU registers?

On PowerPC (PS3), transferring between the vector registers and the floating point registers also passed through memory, which could lead to expensive cache misses and thus required minimizing unnecessary conversions.
Is this true for other modern architectures? I'm especially curious about mobile, where my understanding is that the memory latency is by far the limiting factor.
NOTE: This is for a low level 3D math library using SSE intrinsics (and eventually others), and I'm trying to optimize for memory latency.

Estimating Cycles Per Instruction

I have disassembled a small C++ program compiled with MSVC v140 and am trying to estimate the cycles per instruction in order to better understand how code design impacts performance. I've been following Mike Acton's CppCon 2014 talk on "Data-Oriented Design and C++", specifically the portion I've linked to.
In it, he points out these lines:
movss 8(%rbx), %xmm1
movss 12(%rbx), %xmm0
He then claims that these 2 x 32-bit reads are probably on the same cache line therefore cost roughly ~200 cycles.
The Intel 64 and IA-32 Architectures Optimization Reference Manual has been a great resource, specifically "Appendix C - Instruction Latency and Throughput". However on page C-15 in "Table C-16. Streaming SIMD Extension Single-precision Floating-point Instructions" it states that movss is only 1 cycle (unless I'm understanding what latency means here wrong... if so, how do I read this thing?)
I know that a theoretical prediction of execution time will never be correct, but nevertheless this is important to learn. How are these two commands 200 cycles, and how can I learn to reason about execution time beyond this snippet?
I've started to read some things on CPU pipelining... maybe the majority of the cycles are being picked up there?
PS: I'm not interested in actually measuring hardware performance counters here. I'm just looking to learn how to reasonable sight read ASM and cycles.

As you already pointed out, the theoretical throughput and latency of a MOVSS instruction is at 1 cycle. You were looking at the right document (Intel Optimization Manual). Agner Fog (mentioned in the comments) measured the same numbers in his Intruction Tables for Intel CPUs (AMD is has a higher latency).
This leads us to the first problem: What specific microarchitecture are you investigating? This can make a big difference, even for the same vendor. Agner Fog reports that MOVSS has a 2-6cy latency on AMD Bulldozer depending on the source and destination (register vs memory). This is important to keep in mind when looking into performance of computer architectures.
The 200cy are most likely cache misses, as already pointed out be dwelch in the comments. The numbers you get from the Optimization Manual for any memory accessing instructions are all under the assumption that the data resides in the first level cache (L1). Now, if you have never touched the data by previous instructions the cache line (64 bytes with Intel and AMD x86) will need to be loaded from memory into the last level cache, form there into the second level cache, then into L1 and finally into the XMM register (within 1 cycle). Transfers between L3-L2 and L2-L1 have a throughput (not latency!) of two cycles per cache line on current Intel microarchitectures. And the memory bandwidth can be used to estimate the throughput between L3 and memory (e.g, a 2 GHz CPU with an achievable memory bandwidth of 40 GB/s will have a throughput of 3.2 cycles per cache line). Cache lines or memory blocks are typically the smallest unit caches and memory can operate on, they differ between microarchitectures and may even be different within the architecture, depending on the cache level (L1, L2 and so on).
Now this is all throughput and not latency, which will not help you estimate what you have described above. To verify this, you would need to execute the instructions over and over (for at least 1/10s) to get cycle accurate measurements. By changing the instructions you can decide if you want to measure for latency (by including dependencies between instructions) or throughput (by having the instructions input independent of the result of previous instructions). To measure for caches and memory accesses you would need to predict if an access is going to a cache or not, this can be done using layer conditions.
A tool to estimate instruction execution (both latency and throughput) for Intel CPUs is the Intel Architecture Code Analyzer, which supports multiple microarchitectures up to Haswell. The latency predictions are to be taken with the grain of salt, since it much harder to estimate latency than throughput.

Performance comparison of atomic operations on different sizes

How do the performance of atomic operations operating on the natural word size of the processor (4-byte or 8-byte) compare to that of operating on other sizes (2-byte or 1-byte)?
If I need to maintain a boolean atomic variable, I'm trying to figure out the what the best practice is: use 1-byte to optimize for space, or 4/8-byte to (potentially) optimize for performance.

http://agner.org/optimize/ for lots of details.
On x86, an array of 1-byte data should be good. It can be loaded with movzx (zero-extend) just as fast as with a plain mov.
x86 has bit ops to support atomic bitfields, if you want to pack your data by another factor of 8. I'm not sure how well compilers will do at making efficient code for that case, though. Even a write-only operation requires a slow atomic RMW cycle for the byte holding the bit you want to write. (On x86, it would a lock OR instruction, which is a full memory barrier. It's 8 uops on Intel Haswell, vs. 1 for a byte store. A factor of 19 in throughput.) This is probably still worth it if it means the difference between lots of cache misses and few cache misses, esp. if most of the access is read-only. (Reading a bit is fast, exactly the same as the non-atomic case.)
2-byte (16bit) operations are potentially slow on x86, esp. on Intel CPUs. Intel instruction decoders slow down a lot when they have to decode an instruction with a 16bit immediate operand. This is the dreaded LCP stall from the operand-size prefix. (8b ops have a whole different opcode, and 32 vs. 64bit is selected by the REX prefix, which doesn't slow down the decoders). So 16b is the odd-one-out, and you should be careful using it. Prefer to load 16b memory into 32b variables to avoid partial-register penalties and 16bit immediates when working with a temporary. (AMD CPUs aren't quite as efficient at handling movzx loads (takes an ALU unit and extra 1 cycle latency), but the savings in memory are still almost always worth it (for cache reasons)).
32b is the "optimal" size to use for local scratch variables. No prefix is needed to select that size (increasing code density), and there won't be partial-register stalls or extra uops when using the full register again after using the low 8b. I believe this is the purpose of the int_fast32_t type, but on x86 Linux that type is unfortunately 64bit.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio