Is there a way to force Visual Studio to generate aligned instructions from SSE intrinsics?

The _mm_load_ps() SSE intrinsic is defined as aligned, throwing an exception if the address is not aligned. However, it seems Visual Studio generates an unaligned load instead.
Since not all compilers are made the same, this hides bugs. It would be nice to be able to turn the actual aligned operations on, even though the performance hit that unaligned instructions used to carry doesn't seem to be there anymore.
In other words, writing code:
__m128 p1 = _mm_load_ps(data);
currently produces:
movups xmm0,xmmword ptr [eax]
expected result:
movaps xmm0,xmmword ptr [eax]
(I was asked by Microsoft to ask here)

MSVC and ICC only use instructions that do alignment checking when they fold a load into a memory source operand without AVX enabled, like addps xmm0, [rax]. SSE memory source operands require alignment, unlike AVX. But you can't reliably control when this happens, and in debug builds it generally doesn't.
As Mysticial points out in "Visual Studio 2017: _mm_load_ps often compiled to movups", another case is NT load/store, because there is no unaligned version.
If your code is compatible with clang-cl, have Visual Studio use it instead of MSVC. It's a modified version of clang that tries to act more like MSVC. But like GCC, clang uses aligned load and store instructions for aligned intrinsics.
Either disable optimization, or make sure AVX is not enabled; otherwise clang could fold a _mm_load_ps into a memory source operand like vaddps xmm0, [rax], which doesn't require alignment because it's the AVX version. This may be a problem if your code also uses AVX intrinsics in the same file, because clang requires that you enable ISA extensions for intrinsics you want to use. Unlike MSVC and ICC, it won't emit asm instructions for an extension that isn't enabled, even with intrinsics.
A debug build should work even with AVX enabled, especially if you _mm_load_ps or _mm256_load_ps into a separate variable in a separate statement, not v=_mm_add_ps(v, _mm_load_ps(ptr));
With MSVC itself, for debugging purposes only (usually very big speed penalty for stores), you could substitute normal loads/stores with NT. Since they're special, the compiler won't fold loads into memory source operands for ALU instructions, so this can maybe work even with AVX with optimization enabled.
// alignment_debug.h (untested)
// #include this *after* immintrin.h
#ifdef DEBUG_SIMD_ALIGNMENT
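// "#warn" is not a valid preprocessor directive; #pragma message is accepted by MSVC, ICC, GCC and clang
#pragma message("using slow alignment-debug SIMD instructions to work around MSVC/ICC limitations")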
// SSE4.1 MOVNTDQA doesn't do anything special on normal WB memory, only WC
// On WB, it's just a slower MOVDQA, wasting an ALU uop.
#define _mm_load_si128 _mm_stream_load_si128
// parenthesize the macro argument so expressions like p+4 are cast as a whole
#define _mm_load_ps(ptr) _mm_castsi128_ps(_mm_stream_load_si128((const __m128i*)(ptr)))
#define _mm_load_pd(ptr) _mm_castsi128_pd(_mm_stream_load_si128((const __m128i*)(ptr)))
// SSE1/2 MOVNTPS / PD / MOVNTDQ evict data from cache if it was hot, and bypass cache
#define _mm_store_ps _mm_stream_ps // SSE1 movntps
#define _mm_store_pd _mm_stream_pd // SSE2 movntpd is a waste of space vs. the ps encoding, but whatever
#define _mm_store_si128 _mm_stream_si128 // SSE2 movntdq
// and repeat for _mm256_... versions with _mm256_castsi256_ps
// and _mm512_... versions
// edit welcome if anyone tests this and adds those versions
#endif
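A compiler-independent alternative is to check the address yourself in debug builds instead of relying on a faulting instruction. The following is a minimal sketch, not part of the header above; is_aligned and ASSERT_ALIGNED16 are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if p is a multiple of `alignment` bytes.
// `alignment` is assumed to be a power of two (16 for SSE, 32 for AVX, ...).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical debug-only check for the 16-byte requirement of
// _mm_load_ps / _mm_store_ps. Compiles to nothing when NDEBUG is defined.
#define ASSERT_ALIGNED16(p) assert(is_aligned((p), 16))
```

Placing ASSERT_ALIGNED16(data) right before a _mm_load_ps(data) call catches the misaligned pointer regardless of whether the compiler emitted movaps or movups.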
Related: for auto-vectorization with MSVC (and gcc/clang), see Alex's answer on Alignment attribute to force aligned load/store in auto-vectorization of GCC/CLang

Related

What exactly do the gcc compiler switches (-mavx -mavx2 -mavx512f) do?

I explicitly use the Intel SIMD extension intrinsics in my C/C++ code. In order to compile the code I need to specify -mavx, or -mavx512, or something similar on the command line. I'm good with all that.
However, from reading the gcc man page, it's not clear if these command-line flags also tell the gcc compiler to try to automatically vectorize the C/C++ code with the Intel SIMD instructions. Does someone know if that is the case? Does the -mavx flag simply allow you to manually insert SIMD intrinsics into your code, or does it also tell the compiler to use the SIMD instructions when compiling your C/C++ code?
-mavx/-mavx2/-mavx512f (and -march= options that imply them with relevant tuning settings) let GCC use AVX / AVX2 / AVX-512 instructions for anything it thinks is a good idea when compiling your code, including but not limited to auto-vectorization of loops, if you also enable that.
Other use-cases for SSE instructions (where GCC will use the AVX encoding if you tell it AVX is enabled) include copying and zero-initializing structs and arrays, and other cases of inlining small constant-size memset and memcpy. And also scalar FP math, even at -O0 in 64-bit code where -mfpmath=sse is the default.
Code built with -mavx usually can't be run on CPUs without AVX, even if auto-vectorization wasn't enabled and you didn't use any AVX intrinsics; it makes GCC use the VEX encoding instead of legacy SSE for every SIMD instruction. AVX2, on the other hand, doesn't usually get used except when actually auto-vectorizing a loop. It's not relevant for just copying data around, or for scalar FP math. GCC will use scalar FMA instructions if -mfma is enabled, though.
Examples on Godbolt
void ext(void *);
void caller(void){
int arr[16] = {0};
ext(arr);
}
double fp(double a, double b){
return b-a;
}
compiles with AVX instructions with gcc -O2 -fno-tree-vectorize -march=haswell, because when AVX is enabled, GCC completely avoids legacy-SSE encodings everywhere.
caller:
sub rsp, 72
vpxor xmm0, xmm0, xmm0
mov rdi, rsp
vmovdqa XMMWORD PTR [rsp], xmm0 # only 16-byte vectors, not using YMM + vzeroupper
vmovdqa XMMWORD PTR [rsp+16], xmm0
vmovdqa XMMWORD PTR [rsp+32], xmm0
vmovdqa XMMWORD PTR [rsp+48], xmm0
call ext
add rsp, 72
ret
fp:
vsubsd xmm0, xmm1, xmm0
ret
-m options do not enable auto-vectorization; -ftree-vectorize does that. It's on at -O3 and higher. (Or at -O2 with GCC12 and later, like with clang.)
If you do want auto-vectorization with enabled extensions, use -O3 as well, and preferably -march=native or -march=znver2 or something instead of just -mavx2. -march sets tuning options as well, and will enable other ISA extensions you probably forgot about, like -mfma and -mbmi2.
The tuning options implied by -march=haswell (or just -mtune=haswell) are especially useful on older GCC, when tune=generic cared more about old CPUs that didn't have AVX2, or where doing unaligned 256-bit loads as two separate parts was a win in some cases: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?
Unfortunately there isn't anything like -mtune=generic-avx2 or -mtune=enabled-extension to still care about both AMD and Intel CPUs, but not about ones too old for all the extensions you enabled.
When manually vectorizing with intrinsics, you can only use intrinsics for instruction-sets you've enabled. (Or ones that are on by default, like SSE2 which is baseline for x86-64, and often enabled even with -m32 in modern GCC configs.)
e.g. if you use _mm256_add_epi32, your code won't compile unless you use -mavx2. (Or better, something like -march=haswell or -march=native that enables AVX2, FMA, BMI2, and other stuff modern x86 has, and sets appropriate tuning options.)
The GCC error message in that case is error: inlining failed in call to 'always_inline' '_mm256_loadu_si256': target specific option mismatch.
In GCC terminology, the "target" is the machine you're compiling for. i.e. -mavx2 tells GCC that the target supports AVX2. Thus GCC will make an executable that might use AVX2 instructions anywhere, e.g. for copying a struct or zero-initializing a local array, or otherwise expanding a small constant-size memcpy or memset.
It will also define the CPP macro __AVX2__, so #ifdef __AVX2__ can test whether AVX2 can be assumed at compile-time.
If that's not what you want for the whole program, you need to make sure not to use -mavx2 to compile any code that gets called without a run-time check of CPU features. e.g. put your AVX2 versions of functions in a separate file to compile with -mavx2, or use __attribute__((target("avx2"))). Have your program set function pointers after checking __builtin_cpu_supports("avx2"), or use GCC's ifunc dispatching mechanism to do multi-versioning.
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5
https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning.html
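The function-pointer approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names sum_scalar, sum_avx2, pick_sum and HAVE_X86_DISPATCH are made up for this example, and the AVX2 path is only compiled with GCC-compatible compilers on x86 (elsewhere the sketch falls back to the scalar version).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

// Baseline version: uses only what the file's own -m options allow.
static std::uint32_t sum_scalar(const std::uint32_t* p, std::size_t n) {
    std::uint32_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

#if HAVE_X86_DISPATCH
// Only this function is allowed to use AVX2. The loop lives inside it,
// so the dispatch cost is paid once per call, not once per element.
__attribute__((target("avx2")))
static std::uint32_t sum_avx2(const std::uint32_t* p, std::size_t n) {
    __m256i acc = _mm256_setzero_si256();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_epi32(
            acc, _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p + i)));
    std::uint32_t lanes[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
    std::uint32_t s = 0;
    for (std::uint32_t lane : lanes) s += lane;  // horizontal sum
    for (; i < n; ++i) s += p[i];                // leftover elements
    return s;
}
#endif

using sum_fn = std::uint32_t (*)(const std::uint32_t*, std::size_t);

// Run-time check of CPU features; the result can be cached in a pointer.
sum_fn pick_sum() {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_scalar;
}
```

Because the file itself is built without -mavx2, the compiler should not use AVX2 anywhere outside sum_avx2, so the rest of the program stays runnable on CPUs without it.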
-m options do not on their own enable auto-vectorization
(Auto-vectorization is not the only way GCC can use SIMD instruction sets.)
-ftree-vectorize (enabled as part of -O3, or even at -O2 in GCC12 and later) is necessary for GCC to auto-vectorize. And/or -fopenmp if the code has some #pragma omp simd. (You definitely always want at least -O2 or -Os if you care about performance; -O3 should be fastest, but may not always be. Sometimes GCC has missed-optimization bugs where -O3 makes things worse, or in large programs it might happen that larger code-size costs more I-cache and I-TLB misses.)
When auto-vectorizing and optimizing in general, GCC will (maybe) use any instruction sets you told it were available (with -m options). So for example, -O3 -march=haswell will auto-vectorize with AVX2 + FMA. -O3 without -m options will just auto-vectorize with SSE2.
e.g. compare on Godbolt GCC -O3 -march=nehalem (SSE4.2) vs. -march=znver2 (AVX2) for summing an integer array. (Compile-time constant size to keep the asm simple).
If you use -O3 -mgeneral-regs-only (the latter option normally only used in kernel code), GCC will still auto-vectorize, but only in cases where it thinks it's profitable to do SWAR (e.g. xor of an array is straightforward using 64-bit integer regs, or even sum of bytes using SWAR bit-hacks to block/correct for carry between bytes)
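To make the SWAR idea concrete, here is a hand-written sketch of one such bit-hack: adding eight packed bytes inside a 64-bit integer register while blocking carries between bytes. It illustrates the technique only; swar_add_bytes is a name invented here, and this is not claimed to be the exact code GCC emits.

```cpp
#include <cassert>
#include <cstdint>

// Adds eight packed 8-bit lanes (with wraparound per lane) without
// letting a carry cross from one byte into the next.
inline std::uint64_t swar_add_bytes(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;  // low 7 bits of each byte
    const std::uint64_t high = 0x8080808080808080ULL;  // top bit of each byte
    // Add the low 7 bits of every lane; carries stop at bit 7 of each byte.
    std::uint64_t sum = (a & low7) + (b & low7);
    // Fold in the top bits with XOR, which is addition without carry-out.
    return sum ^ ((a ^ b) & high);
}
```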
e.g. gcc -O1 -mavx still just uses scalar code.
Normally if you want full optimization but not auto-vectorization, you'd use something like -O3 -march=znver1 -fno-tree-vectorize
Other compilers
All of the above is true for clang as well, except it doesn't understand -mgeneral-regs-only. (I think you'd need -mno-mmx -mno-sse and maybe other options.)
(The Effect of Architecture When Using SSE / AVX Intrinisics repeats some of this info)
For MSVC / ICC, you can use intrinsics for ISA extensions you haven't told the compiler it can use on its own. So for example, MSVC -O2 without -arch:AVX would let it auto-vectorize with SSE2 (because that's baseline for x86-64), and use movaps for copying around 16-byte structs or whatever.
But with MSVC's style of target options, you can still use SSE4 intrinsics like _mm_cvtepi8_epi32 (pmovsxbd), or even AVX intrinsics without telling the compiler it's allowed to use those instructions itself.
Older MSVC used to make really bad asm when you used AVX / AVX2 intrinsics without -arch:AVX, e.g. resulting in mixing VEX and legacy-SSE encodings in the same function (e.g. using the non-VEX encoding for 128-bit intrinsics like _mm_add_ps), and failure to use vzeroupper after 256-bit vectors, both of which were disastrous for performance.
But I think modern MSVC has mostly solved that. Although it still doesn't optimize intrinsics much at all, like not even doing constant-propagation through them.
Not optimizing intrinsics is likely related to MSVC's ability to let you write code like if(avx_supported) { __m256 v = _mm256_load_ps(p); ... and so on. If it was trying to optimize, it would have to keep track of the minimum extension-level already seen along paths of execution that could reach any given intrinsic, so it would know what alternatives would be valid. ICC is like that, too.
For the same reason, GCC can't inline functions with different target options into each other. So you can't use __attribute__((target(""))) to avoid the cost of run-time dispatching; you still want to avoid function-call overhead inside a loop, i.e. make sure there's a loop inside the AVX2 function, otherwise it may not be worth having an AVX2 version, just use the SSE2 version.
I don't know about Intel's new OneAPI compiler, ICX. I think it's based on LLVM, so it might be more like clang.
I currently use GCC 11.3.1 or higher.
I am not a programmer, but I do distinguish between C and C++.
I have been producing the latest codecs from GitHub / the doom9 forum for three years.
On my old Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz I notice the following.
In C you can run codecs with SIMD AVX2 code (e.g. in assembler) on a processor without that SIMD support. Can we use the codecs posted on the forum? Who knows. E.g. libjpeg and dav1d, with SIMD but without -mavx2.
xeve, xevd, uvg266, uavs3e, uavs3d, aom, libavif
With C++ code built for SIMD AVX2, you won't even be able to open the help.
The second thing is threads and compatibility between Unix and Windows.
In C this works faster than in C++. Also, in C++ you have to add some special untested additions like mingw-std-thread to g++ to get everything working.
Another curiosity about C++:
MSYS2 GCC 12.1.0. Codecs built with AVX2/AVX3 open on old processors. How is that done? I don't know, but not with the functions above.
jpegxl, libwebp2, libheif, jvetvvc, vvenc, vvdec, libraw, jpegls, jpegxt, openhtj2k, openjph, grok(C++20 openjpeg)

Xcode Apple Clang enable avx512

In Xcode (Version 10.1 (10B61)), I used the macros below to detect AVX512 support.
#ifdef __SSE4_1__
#error "sse4_1"
#endif
#ifdef __AVX__
#error "avx"
#endif
#ifdef __AVX2__
#error "avx2"
#endif
#ifdef __AVX512__
#error "avx512"
#endif
With the default Build Settings, SSE4_1 is active, but AVX and AVX2 are not. When I add -mavx under Build Settings --> Apple Clang - Custom Compiler Flags --> Other C Flags, that enables AVX, and further adding -mavx2 enables both AVX and AVX2, but -mavx512 gives: unknown argument: '-mavx512'.
How do you enable avx512 and detect it?
It seems like there are a few macros to detect avx512.
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512F__ 1
#define __AVX512VL__ 1
What are the differences between them?
AVX512 isn't a single extension, and doesn't have a specific-enough meaning in this context to be useful. Compilers only deal with specific CPU features, like AVX512F, AVX512DQ, AVX512CD, etc.
All CPUs that support any AVX512 extensions must support AVX512F, the "Foundation". AVX512F is the baseline AVX512 extension that other AVX512 extensions build on.
In code that wants to use AVX512 intrinsics, you should look at https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 and pick a set of extensions that are available together on one CPU you care about, e.g. F + CD and VL, DQ, BW on currently-available Skylake-X.
Then for example use #if defined(__AVX512BW__) && defined(__AVX512VL__) before code that uses vpermt2w on 256-bit vectors or something. __AVX512(anything)__ implies __AVX512F__; that's the one extension you don't have to check for separately.
But if you only used AVX512F instructions, then yeah, just check for that macro.
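As a sketch of that advice, the checks can be collected in one place and the rest of the code can branch on the result. The names kHasAvx512F, kHasAvx512BwVl and simd_level are hypothetical; only the __AVX512F__, __AVX512BW__, __AVX512VL__ and __AVX2__ macros come from the compiler.

```cpp
#include <cassert>

// Foundation: the one macro every other AVX512 macro implies.
#if defined(__AVX512F__)
constexpr bool kHasAvx512F = true;
#else
constexpr bool kHasAvx512F = false;
#endif

// Needed for e.g. 16-bit element instructions on 128/256-bit vectors.
#if defined(__AVX512BW__) && defined(__AVX512VL__)
constexpr bool kHasAvx512BwVl = true;
#else
constexpr bool kHasAvx512BwVl = false;
#endif

// Any other AVX512 extension implies the Foundation, so no separate
// check for __AVX512F__ is needed once BW/VL are known to be enabled.
static_assert(!kHasAvx512BwVl || kHasAvx512F,
              "AVX512BW/VL enabled without AVX512F");

// Reports which code path this translation unit was compiled for.
inline const char* simd_level() {
    if (kHasAvx512BwVl) return "AVX512F+BW+VL";
    if (kHasAvx512F) return "AVX512F";
#if defined(__AVX2__)
    return "AVX2";
#else
    return "baseline";
#endif
}
```

Built with -march=skylake-avx512 this should report the BW+VL level; built with default options it reports the baseline.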
You should pretty much never use -mavx512f directly: use -march=skylake-avx512, -march=knl, or -march=native. Or in future, -march=icelake or whatever.
The compiler knows which CPUs support which sets of extensions (or can detect which extensions the machine you're compiling on supports). There are a lot of them, and leaving out important ones like AVX512VL (support for AVX512 instructions on 128-bit and 256-bit vectors) or Xeon Phi's AVX512ER (fast 1/x and 1/sqrt(x) with twice the precision of the normal AVX512 14-bit versions) could hurt performance significantly. Especially AVX512ER is very important if you do any division or log/exp on Xeon Phi, because full-precision division is very slow on KNL compared to Skylake.
-march=x implies -mtune=x, enabling tuning options relevant for the target as well. KNL is basically Silvermont with AVX512 bolted on, and has significant differences from -mtune=skylake-avx512.
These are the same reasons you should generally not use -mfma -mavx2 directly, except that there are currently no AMD CPUs with AVX512, so there are only 2 main tuning targets (Xeon Phi and mainstream Skylake/CannonLake/Icelake), and they also support different sets of AVX512 extensions. There is unfortunately no -mtune=generic-avx2 tuning setting, but Ryzen supports almost all extensions that Haswell does (and the ones it doesn't GCC / clang won't use automatically, like transactional memory), so -march=haswell might be reasonable to make code tuned for CPUs with FMA, AVX2, popcnt, etc, without suffering too much on Ryzen.
Also relevant (for GCC, maybe not clang currently. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html):
-mprefer-vector-width=256 auto-vectorize with 256-bit vectors by default, in case most of the time is spent in non-vectorized loops. Using 512-bit vectors reduces the max turbo clock speed by a significant amount on Intel Xeon CPUs (maybe not as much on i9 desktop versions of Skylake-X), so it can be a net slowdown to use 512-bit vectors in small scattered bits of your program. So 256 is the default for tune=skylake-avx512 in GCC, but KNL uses 512.
-mprefer-avx-128 the old version of the -mprefer-vector-width= option, before AVX512 existed.
Using AVX512 mask registers, 32 vector registers, and/or its new instructions, can be a significant win even at the same vector width, so it makes sense to enable AVX512 even if you don't want to use 512-bit vector width. (Although sometimes code using intrinsics or auto-vectorization will compile in a worse way, instead of better, if AVX512 compare-into-register versions of comparison are available at all. But hopefully anti-optimization bugs like that will be sorted out as AVX512 becomes more widely used.)

x86 mfence and C++ memory barrier

I'm checking how the compiler emits instructions for multi-core memory barriers on x86_64. The below code is the one I'm testing using gcc_x86_64_8.3.
std::atomic<bool> flag {false};
int any_value {0};
void set()
{
any_value = 10;
flag.store(true, std::memory_order_release);
}
void get()
{
while (!flag.load(std::memory_order_acquire));
assert(any_value == 10);
}
int main()
{
std::thread a {set};
get();
a.join();
}
When I use std::memory_order_seq_cst, I can see the MFENCE instruction is used with any optimization -O1, -O2, -O3. This instruction makes sure the store buffers are flushed, therefore updating their data in L1D cache (and using MESI protocol to make sure other threads can see effect).
However, when I use std::memory_order_release/acquire with no optimizations, the MFENCE instruction is also used, but it is omitted with -O1, -O2 and -O3, and I don't see any other instructions that flush the buffers.
In the case where MFENCE is not used, what makes sure the store buffer data is committed to cache memory to ensure the memory order semantics?
Below is the assembly code for the get/set functions with -O3, like what we get on the Godbolt compiler explorer:
set():
mov DWORD PTR any_value[rip], 10
mov BYTE PTR flag[rip], 1
ret
.LC0:
.string "/tmp/compiler-explorer-compiler119218-62-hw8j86.n2ft/example.cpp"
.LC1:
.string "any_value == 10"
get():
.L8:
movzx eax, BYTE PTR flag[rip]
test al, al
je .L8
cmp DWORD PTR any_value[rip], 10
jne .L15
ret
.L15:
push rax
mov ecx, OFFSET FLAT:get()::__PRETTY_FUNCTION__
mov edx, 17
mov esi, OFFSET FLAT:.LC0
mov edi, OFFSET FLAT:.LC1
call __assert_fail
The x86 memory ordering model provides #StoreStore and #LoadStore barriers for all store instructions (1), which is all that release semantics require. Also, the processor will commit a store as soon as possible: that is, once the store instruction has retired, the store has become the oldest in the store buffer, the core has the target cache line in a writeable coherence state, and a cache port is available to perform the store operation (2). So there is no need for an MFENCE instruction. The flag will become visible to the other thread as soon as possible and when it does, any_value is guaranteed to be 10.
On the other hand, sequential consistency also requires #StoreLoad and #LoadLoad barriers. MFENCE is required to provide both (3) barriers and so it is used at all optimization levels.
Related: Size of store buffers on Intel hardware? What exactly is a store buffer?.
Footnotes:
(1) There are exceptions that don't apply here. In particular, non-temporal stores and stores to the uncacheable write-combining memory types provide only the #LoadStore barrier. Anyway, these barriers are provided for stores to the write-back memory type on both Intel and AMD processors.
(2) This is in contrast to write-combining stores which are made globally-visible under certain conditions. See Section 11.3.1 of the Intel manual Volume 3.
(3) See the discussion under Peter's answer.
x86's TSO memory model is sequential-consistency + a store buffer, so only seq-cst stores need any special fencing. (Stalling after a store until the store buffer drains, before later loads, is all we need to recover sequential consistency). The weaker acq/rel model is compatible with the StoreLoad reordering caused by a store buffer.
(See discussion in comments re: whether "allowing StoreLoad reordering" is an accurate and sufficient description of what x86 allows. A core always sees its own stores in program order because loads snoop the store buffer, so you could say that store-forwarding also reorders loads of recently-stored data. Except you can't always: Globally Invisible load instructions)
(And BTW, compilers other than gcc use xchg to do a seq-cst store. This is actually more efficient on current CPUs. GCC's mov+mfence might have been cheaper in the past, but is currently usually worse even if you don't care about the old value. See Why does a std::atomic store with sequential consistency use XCHG? for a comparison between GCC's mov+mfence vs. xchg. Also my answer on Which is a better write barrier on x86: lock+addl or xchgl?)
Fun fact: you can achieve sequential consistency by fencing seq-cst loads instead of stores. But cheap loads are much more valuable than cheap stores for most use-cases, so everyone uses ABIs where the full barriers go on the stores.
See https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for details of how C++11 atomic ops map to asm instruction sequences for x86, PowerPC, ARMv7, ARMv8, and Itanium. Also When are x86 LFENCE, SFENCE and MFENCE instructions required?
when I use std::memory_order_release/acquire with no optimizations MFENCE instruction is also used
That's because flag.store(true, std::memory_order_release); doesn't inline, because you disabled optimization. That includes inlining of very simple member functions like atomic::store(T, std::memory_order = std::memory_order_seq_cst)
When the ordering parameter to the __atomic_store_n() GCC builtin is a runtime variable (in the atomic::store() header implementation), GCC plays it conservative and promotes it to seq_cst.
It might actually be worth it for gcc to branch over mfence because it's so expensive, but that's not what we get. (But that would make larger code-size for functions with runtime variable order params, and the code path might not be hot. So branching is probably only a good idea in the libatomic implementation, or with profile-guided optimization for rare cases where a function is large enough to not inline but takes a variable order.)
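To see the difference between a compile-time-constant order and a runtime-variable order in compiler output, something like the following sketch can be pasted into a compiler explorer. The function names are invented for illustration; per the explanation above, the expectation (not verified here for every GCC version) is that store_release_const compiles to a plain mov with optimization enabled, while store_with_order gets the conservative seq_cst treatment when it is not inlined.

```cpp
#include <atomic>
#include <cassert>

// Order is a compile-time constant: with optimization enabled the
// compiler can pick the cheapest instruction sequence for "release".
void store_release_const(std::atomic<bool>& f) {
    f.store(true, std::memory_order_release);
}

// Order is only known at run time. Callers must pass an order that is
// valid for a store (relaxed, release or seq_cst).
void store_with_order(std::atomic<bool>& f, std::memory_order order) {
    f.store(true, order);
}
```

Both functions have the same observable effect in a single thread; only the generated fencing differs.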

MinGW Windows GCC can't compile C program with 2GB global data

GCC/G++ of MingW gives Relocation Errors when Building Applications with Large Global or Static Data.
Understanding the x64 code models
References to both code and data on x64 are done with
instruction-relative (RIP-relative in x64 parlance) addressing modes.
The offset from RIP in these instructions is limited to 32 bits.
small code model promises to the compiler that 32-bit relative offsets
should be enough for all code and data references in the compiled
object. The large code model, on the other hand, tells it not to make
any assumptions and use absolute 64-bit addressing modes for code and
data references. To make things more interesting, there's also a
middle road, called the medium code model.
For the example program below, the code fails to compile despite adding the options -mcmodel=medium or -mcmodel=large:
#define SIZE 16384
float a[SIZE][SIZE], b[SIZE][SIZE];
int main(){
return 0;
}
gcc -mcmodel=medium example.c fails to compile on MingW/Cygwin Windows, Intel windows /MSVC
You are limited to 32 bits for an offset, but this is a signed offset, so in practice you are limited to 2GiB. You asked why this is not possible, but your two arrays alone total 2GiB (1GiB each), and there are things in the data segment other than just your arrays. C is a high-level language: you get the ease of just defining a main function, and you get all of these other things for free, such as standard input and output. The C runtime implements this for you, and all of it consumes stack space and room in your data segment. For example, if I build this on x86_64-pc-linux-gnu my .bss is 0x80000020 bytes in size, an additional 32 bytes. (I've erased PE information from my brain, so I don't remember how those are laid out.)
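The size arithmetic can be checked at compile time. This is a small sketch using the SIZE from the question; the constant names are made up here, and it assumes a 4-byte float. Each array is 1GiB, so the two together are exactly 2GiB, one byte more than the largest positive signed 32-bit displacement.

```cpp
#include <cassert>
#include <cstdint>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte float");

constexpr std::uint64_t kSize = 16384;                                    // SIZE in the question
constexpr std::uint64_t kBytesPerArray = kSize * kSize * sizeof(float);  // 2^30 bytes
constexpr std::uint64_t kTotalBytes = 2 * kBytesPerArray;                 // a[] plus b[]
constexpr std::uint64_t kMaxDisp32 = INT32_MAX;                           // 2^31 - 1

static_assert(kBytesPerArray == (1ULL << 30), "each array is 1 GiB");
static_assert(kTotalBytes == 0x80000000ULL, "both arrays together are 2 GiB");
static_assert(kTotalBytes > kMaxDisp32, "does not fit in a signed 32-bit offset");
```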
I don't remember much about the various machine models, but it's probably helpful to note that the x86_64 instruction set doesn't even contain instructions (that I'm aware of, although I'm not an x86 assembly expert) to access any register-relative address beyond a signed 32-bit value. For example, when you want to cram that much stuff on the stack, gcc has to do weird things like this stack pointer allocation:
movabsq $-10000000016, %r11
addq %r11, %rsp
You can't addq $-10000000016, %rsp because it's more than a signed 32-bit offset. The same applies to RIP-relative addressing:
movq $10000000016(%rip), %rax # No such addressing mode

How to enable alignment exceptions for my process on x64?

I'm curious to see if my 64-bit application suffers from alignment faults.
From Windows Data Alignment on IPF, x86, and x64 archive:
In Windows, an application program that generates an alignment fault will raise an exception, EXCEPTION_DATATYPE_MISALIGNMENT.
On the x64 architecture, the alignment exceptions are disabled by default, and the fix-ups are done by the hardware. The application can enable alignment exceptions by setting a couple of register bits, in which case the exceptions will be raised unless the user has the operating system mask the exceptions with SEM_NOALIGNMENTFAULTEXCEPT. (For details, see the AMD Architecture Programmer's Manual Volume 2: System Programming.)
[Ed. emphasis mine]
On the x86 architecture, the operating system does not make the alignment fault visible to the application. On these two platforms, you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
On the Itanium, by default, the operating system (OS) will make this exception visible to the application, and a termination handler might be useful in these cases. If you do not set up a handler, then your program will hang or crash. In Listing 3, we provide an example that shows how to catch the EXCEPTION_DATATYPE_MISALIGNMENT exception.
Ignoring the direction to consult the AMD Architecture Programmer's Manual, i will instead consult the Intel 64 and IA-32 Architectures Software Developer’s Manual
5.10.5 Checking Alignment
When the CPL is 3, alignment of memory references can be checked by setting the
AM flag in the CR0 register and the AC flag in the EFLAGS register. Unaligned memory
references generate alignment exceptions (#AC). The processor does not generate
alignment exceptions when operating at privilege level 0, 1, or 2. See Table 6-7 for a
description of the alignment requirements when alignment checking is enabled.
Excellent. I'm not sure what that means, but excellent.
Then there's also:
2.5 CONTROL REGISTERS
Control registers (CR0, CR1, CR2, CR3, and CR4; see Figure 2-6) determine operating
mode of the processor and the characteristics of the currently executing task.
These registers are 32 bits in all 32-bit modes and compatibility mode.
In 64-bit mode, control registers are expanded to 64 bits. The MOV CRn instructions
are used to manipulate the register bits. Operand-size prefixes for these instructions
are ignored.
The control registers are summarized below, and each architecturally defined control
field in these control registers are described individually. In Figure 2-6, the width of
the register in 64-bit mode is indicated in parenthesis (except for CR0).
CR0 — Contains system control flags that control operating mode and states of
the processor
AM
Alignment Mask (bit 18 of CR0) — Enables automatic alignment checking
when set; disables alignment checking when clear. Alignment checking is
performed only when the AM flag is set, the AC flag in the EFLAGS register is
set, CPL is 3, and the processor is operating in either protected or virtual-
8086 mode.
I tried
The language i am actually using is Delphi, but pretend it's language agnostic pseudocode:
void UnmaskAlignmentExceptions()
{
asm
mov rax, cr0; //copy CR0 flags into RAX
or rax, 0x40000; //set bit 18 (AM); 1 shl 18 = 0x40000
mov cr0, rax; //copy flags back
}
The first instruction
mov rax, cr0;
fails with a Privileged Instruction exception.
How to enable alignment exceptions for my process on x64?
PUSHF
I discovered that the x86 has the instruction:
PUSHF, POPF: Push/pop first 16-bits of EFLAGS on/off the stack
PUSHFD, POPFD: Push/pop all 32-bits of EFLAGS on/off the stack
That then led me to the x64 version:
PUSHFQ, POPFQ: Push/pop the RFLAGS quad on/off the stack
(In 64-bit world the EFLAGS are renamed RFLAGS).
So i wrote:
void EnableAlignmentExceptions;
{
asm
PUSHFQ; //Push RFLAGS quadword onto the stack
POP RAX; //Pop them flags into RAX
OR RAX, $40000; //set bit 18 (AC=Alignment Check) of the flags; 1 shl 18 = $40000
PUSH RAX; //Push the modified flags back onto the stack
POPFQ; //Pop the stack back into RFLAGS;
}
And it didn't crash or trigger a protection exception. I have no idea if it does what i want it to.
Bonus Reading
How to catch data-alignment faults on x86 (aka SIGBUS on Sparc) (unrelated question; x86 not x64, Ubuntu not Windows, gcc vs not)
Applications running on x64 have access to a flag register (sometimes referred to as EFLAGS). Bit 18 in this register allows applications to get exceptions when alignment errors occur. So in theory, all a program has to do to enable exceptions for alignment errors is modify the flags register.
However
In order for that to actually work, the operating system kernel must set cr0's bit 18 to allow it. And the Windows operating system doesn't do that. Why not? Who knows?
Applications can not set values in the control register. Only the kernel can do this. Device drivers run inside the kernel, so they can set this too.
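The combination of conditions can be written down as a small model. This sketch does not touch any real register; it only encodes the rule quoted from the Intel manual (CR0.AM set, EFLAGS.AC set, CPL equal to 3). The names kCr0AM, kEflagsAC and alignment_check_active are invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Both flags happen to be bit 18 of their respective registers.
// Bit 18 is the mask 0x40000; 0x20000 would be bit 17.
constexpr std::uint64_t kCr0AM = 1ULL << 18;     // Alignment Mask, kernel-controlled
constexpr std::uint64_t kEflagsAC = 1ULL << 18;  // Alignment Check, user-settable

static_assert(kEflagsAC == 0x40000, "bit 18 is 0x40000");

// Models when the CPU raises #AC on an unaligned access:
// the kernel must have set CR0.AM, user code must have set EFLAGS.AC,
// and the access must happen at privilege level 3.
inline bool alignment_check_active(std::uint64_t cr0, std::uint64_t rflags,
                                   unsigned cpl) {
    return (cr0 & kCr0AM) != 0 && (rflags & kEflagsAC) != 0 && cpl == 3;
}
```

With CR0.AM clear, as described above for Windows, the function returns false no matter what the application does to its own flags register.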
It is possible to muck about and try to get this to work by creating a device driver, see:
Old New Thing - Disabling the program crash dialog archive
and the comments that follow. Note that this post is over a decade old, so some of the links are dead.
You might also find this comment (and some of the other answers in this question) to be useful:
Larry Osterman - 07-28-2004 2:22 AM
We actually built a version of NT with alignment exceptions turned on for x86 (you can do that as Skywing mentioned).
We quickly turned it off, because of the number of apps that broke :)
As an alternative to AC for finding slowdowns due to unaligned accesses, you can use hardware performance counter events on Intel CPUs for mem_inst_retired.split_loads and mem_inst_retired.split_stores to find loads/stores that split across a cache-line boundary.
perf record -c 10 -e mem_inst_retired.split_stores,mem_inst_retired.split_loads ./a.out should be useful on Linux. -c 10 records a sample every 10 HW events. If your program does a lot of unaligned accesses and you only want to find the real hotspots, leave it at the default. But -c 10 can get useful data even on a tiny binary that calls printf once. Other perf options like -g to record parent functions on each sample work as usual, and could be useful.
On Windows, use whatever tool you prefer for looking at perf counters. VTune is popular.
Modern Intel CPUs (P6 family and newer) have no penalty for misalignment within a cache line (see https://agner.org/optimize/). In fact, such loads/stores are even guaranteed to be atomic (up to 8 bytes) on Intel CPUs. So AC is stricter than necessary, but it will help find potentially-risky accesses that could be page-splits or cache-line splits with differently-aligned data.
AMD CPUs may have penalties for crossing a 16-byte boundary within a 64-byte cache line. I'm not familiar with what hardware counters are available there. Beware that profiling on Intel HW won't necessarily find slowdowns that occur on AMD CPUs, if the offending access never crosses a cache line boundary.
See How can I accurately benchmark unaligned access speed on x86_64? for some details on the penalties, including my testing on 4k-split latency and throughput on Skylake.
See also http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ for possible penalties to store-forwarding efficiency for misaligned loads/stores on Intel/AMD.
Running normal binaries with AC set is not always practical. Compiler-generated code might choose to use an unaligned 8-byte load or store to copy multiple struct members, or to store some literal data.
gcc -O3 -mtune=generic (i.e. the default with optimization enabled) assumes that cache-line splits are cheap enough to be worth the risk of using unaligned accesses instead of multiple narrow accesses like the source does. Page-splits got much cheaper in Skylake, down from roughly 100 to 150 cycles in Haswell to about 10 cycles in Skylake (about the same penalty as cache-line splits), because apparently Intel found they were less rare than they previously thought.
Many optimized library functions (like memcpy) use unaligned integer accesses. e.g. glibc's memcpy, for a 6-byte copy, would do 2 overlapping 4-byte loads from the start/end of the buffer, then 2 overlapping stores. (It doesn't have a special case for exactly 6 bytes to do a dword + word, just increasing powers of 2). This comment in the source explains its strategies.
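The overlapping-access trick described above can be sketched in portable C. This is only an illustration of the idea, not glibc's actual code; copy_4_to_8 is a hypothetical name, and the fixed-size memcpy calls stand in for the unaligned 4-byte loads and stores that a compiler would typically emit for them on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, 4 <= n <= 8, using two 4-byte
 * loads and two 4-byte stores that overlap whenever n < 8.
 * Both loads happen before either store, as in the description above. */
static void copy_4_to_8(void *dst, const void *src, size_t n)
{
    assert(n >= 4 && n <= 8);
    uint32_t head, tail;
    memcpy(&head, src, 4);                                 /* first 4 bytes */
    memcpy(&tail, (const unsigned char *)src + n - 4, 4);  /* last 4 bytes  */
    memcpy(dst, &head, 4);
    memcpy((unsigned char *)dst + n - 4, &tail, 4);
}
```

For n = 6 the second load starts at offset 2, so at most one of the two accesses can be 4-byte aligned. That is exactly the kind of access that would fault with AC set.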
So even if your OS would let you enable AC, you might need a special version of libraries to not trigger AC all over the place for stuff like small memcpy.
SIMD
Alignment when looping sequentially over an array really matters for AVX512, where a vector is the same width as a cache line. If your pointers are misaligned, every access is a cache-line split, not just every other with AVX2. Aligned is always better, but for many algorithms with a decent amount of computation mixed with memory access, it only makes a significant difference with AVX512.
(So with AVX1/2, it's often good to just use unaligned loads, instead of always doing extra work to check alignment and go scalar until an alignment boundary. Especially if your data is usually aligned but you want the function to still work marginally slower in case it isn't.)
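A minimal sketch of the "just use unaligned loads" approach, assuming x86 with SSE. It uses 128-bit _mm_loadu_ps rather than AVX so that it builds without extra compiler flags; the same structure applies to _mm256_loadu_ps. The function name sum_floats_unaligned is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: sum n floats starting at p, with no alignment
 * requirement on p. There is no scalar prologue to reach an alignment
 * boundary; the hardware handles any misaligned vector loads. */
static float sum_floats_unaligned(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));  /* unaligned load */

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    for (; i < n; ++i)  /* scalar tail for the last n % 4 elements */
        total += p[i];
    return total;
}
```

If the data happens to be aligned, the unaligned-load instructions run at the same speed as aligned ones on modern CPUs, so the only cost is paid when the pointer really is misaligned.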
Scattered misaligned accesses that cross a cache line boundary essentially have twice the cache footprint from touching both lines, if the lines aren't otherwise touched.
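The cache-line-split arithmetic in this section can be checked with a few lines of C. This sketch assumes 64-byte cache lines; crosses_cache_line and count_splits are hypothetical helper names, and addresses are modelled as plain integers rather than real pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CACHE_LINE = 64 };  /* assumed line size in bytes */

/* Nonzero if an access of `size` bytes at `addr` touches two cache lines. */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    return (addr % CACHE_LINE) + size > CACHE_LINE;
}

/* Count how many of n back-to-back accesses of `width` bytes,
 * starting at address `base`, split across a cache line. */
static size_t count_splits(uintptr_t base, size_t width, size_t n)
{
    size_t splits = 0;
    for (size_t i = 0; i < n; ++i)
        splits += crosses_cache_line(base + i * width, width);
    return splits;
}
```

With a base address that is 16 bytes past a line boundary, count_splits(16, 32, 8) gives 4 (every other 32-byte AVX2 access splits), while count_splits(16, 64, 8) gives 8 (every 64-byte AVX512 access splits).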
Checking for 16, 32 or 64 byte alignment with SIMD is simple in asm: just use [v]movdqa alignment-required loads/stores, or legacy-SSE memory source operands for instructions like paddb xmm0, [rdi], instead of vmovdqu or VEX-coded memory source operands like vpaddb xmm0, xmm1, [rdi], which let hardware handle the case of misalignment if/when it occurs.
But in C with intrinsics, some compilers (MSVC and ICC) compile alignment-required intrinsics like _mm_load_si128 into [v]movdqu, never using [v]movdqa, so that's annoying if you actually wanted to use alignment-required loads.
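One way to get alignment checking regardless of which instruction the compiler picks is to test the pointer in software before the load. A minimal sketch, assuming x86 with SSE; is_aligned and checked_load_ps are hypothetical helper names, not part of any intrinsics header.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical helper: nonzero if p is a multiple of `align`,
 * where `align` must be a power of two. */
static int is_aligned(const void *p, uintptr_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical wrapper: assert 16-byte alignment, then load.
 * The assert fires even if the compiler emits movups for _mm_load_ps. */
static __m128 checked_load_ps(const float *p)
{
    assert(is_aligned(p, 16) && "checked_load_ps: pointer not 16-byte aligned");
    return _mm_load_ps(p);
}
```

Defining NDEBUG compiles the assert away, so the check is only paid for in debug builds. Unlike a faulting movaps, this also works when the load gets folded into an AVX memory source operand.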
Of course, _mm256_load_si256 can fold into an AVX memory source operand like vpaddb ymm0, ymm1, [rdi] with any compiler, including GCC/clang, and the same goes for the 128-bit _mm_load_si128 any time AVX and optimization are enabled. But store intrinsics that don't get optimized away entirely do get done with vmovdqa / vmovaps, so at least you can verify store alignment.
To verify load alignment with AVX, you can disable optimization so you'll get separate load / spill into __m256i temporary / reload.
This works on 64-bit Intel CPUs. It may fail on some AMD CPUs:
pushfq
bts qword ptr [rsp], 12h ; set AC bit of rflags
popfq
It will not work right away on 32-bit CPUs; these first require a kernel driver to set the AM bit of CR0, and then:
pushfd
bts dword ptr [esp], 12h
popfd
