Does there exist an equivalent of the x86 PAUSE instruction, which is placed within busy waiting loops to improve performance, particularly on SMT machines, on PowerPC?
In the Linux kernel we have this in arch/powerpc/include/asm/processor.h:
/* Macros for adjusting thread priority (hardware multi-threading) */
#define HMT_very_low() asm volatile("or 31,31,31 # very low priority")
#define HMT_low() asm volatile("or 1,1,1 # low priority")
#define HMT_medium_low() asm volatile("or 6,6,6 # medium low priority")
#define HMT_medium() asm volatile("or 2,2,2 # medium priority")
#define HMT_medium_high() asm volatile("or 5,5,5 # medium high priority")
#define HMT_high() asm volatile("or 3,3,3 # high priority")
I'm not familiar with x86 PAUSE, but it sounds like "or 31,31,31" is what you want.
Which powerpc processor are you doing this on? For SMT it must be POWER5, 6 or 7?
The PowerPC inside of Cell will recognize certain NOP encodings as indication to adjust the relative priority of the two physical threads in the core. The documentation lists the cctpl, cctpm, and cctph extended mnemonics for these special NOPs.
From the look of other Google results, it looks like perhaps the IBM RS64 line had similar special NOP instructions, so this functionality has probably been in "Book IV" of various IBM PowerPC chips for quite some time.
The Power ISA 2.06 document has additional special NOP definitions in chapter 3 with extended mnemonics like yield, mdoio, and mdoom. It also defines the same NOPs as cctpl and cctpm from Cell.
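As a rough illustration of how these priority NOPs could stand in for PAUSE in a spin loop, here is a minimal sketch, assuming GCC or clang inline asm. The names cpu_relax and spin_until_set are hypothetical, and the PowerPC branch simply pairs the HMT_low and HMT_medium encodings quoted above; it is not taken from any particular kernel version.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* cpu_relax(): hint to the core that this thread is only busy-waiting. */
#if defined(__x86_64__) || defined(__i386__)
#  define cpu_relax() __asm__ volatile("pause" ::: "memory")
#elif defined(__powerpc__) || defined(__powerpc64__)
/* Drop to low SMT priority, then go back to medium (HMT_low / HMT_medium). */
#  define cpu_relax() __asm__ volatile("or 1,1,1\n\tor 2,2,2" ::: "memory")
#else
/* Unknown target (assumption): fall back to a plain compiler barrier. */
#  define cpu_relax() __asm__ volatile("" ::: "memory")
#endif

/* Hypothetical helper: poll *flag until it is set or max_spins polls have
   seen it clear. Returns the number of polls that saw the flag clear. */
static unsigned long spin_until_set(atomic_bool *flag, unsigned long max_spins)
{
    unsigned long spins = 0;
    while (!atomic_load_explicit(flag, memory_order_acquire) && spins < max_spins) {
        cpu_relax();
        spins++;
    }
    return spins;
}
```

A waiter would call spin_until_set(&ready, limit) while another thread stores true to ready with release ordering; the bound is only there so the sketch cannot hang.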
The _mm_load_ps() SSE intrinsic is defined as aligned, throwing exception if the address is not aligned. However, it seems visual studio generates unaligned read instead.
Since not all compilers are made the same, this hides bugs. It would be nice to be able to turn the actual aligned operations on, even though the performance hit that used to be there doesn't seem to be there anymore.
In other words, writing code:
__m128 p1 = _mm_load_ps(data);
currently produces:
movups xmm0,xmmword ptr [eax]
expected result:
movaps xmm0,xmmword ptr [eax]
(I was asked by microsoft to ask here)
MSVC and ICC only use instructions that do alignment checking when they fold a load into a memory source operand without AVX enabled, like addps xmm0, [rax]. SSE memory source operands require alignment, unlike AVX. But you can't reliably control when this happens, and in debug builds it generally doesn't.
As Mysticial points out in "Visual Studio 2017: _mm_load_ps often compiled to movups", another case is NT load/store, because there is no unaligned version.
If your code is compatible with clang-cl, have Visual Studio use it instead of MSVC. It's a modified version of clang that tries to act more like MSVC. But like GCC, clang uses aligned load and store instructions for aligned intrinsics.
Either disable optimization, or make sure AVX is not enabled; otherwise it could fold a _mm_load_ps into a memory source operand like vaddps xmm0, [rax], which doesn't require alignment because it's the AVX version. This may be a problem if your code also uses AVX intrinsics in the same file, because clang requires that you enable ISA extensions for intrinsics you want to use; unlike MSVC and ICC, the compiler won't emit asm instructions for an extension that isn't enabled, even with intrinsics.
A debug build should work even with AVX enabled, especially if you _mm_load_ps or _mm256_load_ps into a separate variable in a separate statement, not v=_mm_add_ps(v, _mm_load_ps(ptr));
With MSVC itself, for debugging purposes only (usually very big speed penalty for stores), you could substitute normal loads/stores with NT. Since they're special, the compiler won't fold loads into memory source operands for ALU instructions, so this can maybe work even with AVX with optimization enabled.
// alignment_debug.h (untested)
// #include this *after* immintrin.h
#ifdef DEBUG_SIMD_ALIGNMENT
#pragma message("using slow alignment-debug SIMD instructions to work around MSVC/ICC limitations")
// SSE4.1 MOVNTDQA doesn't do anything special on normal WB memory, only WC
// On WB, it's just a slower MOVDQA, wasting an ALU uop.
#define _mm_load_si128 _mm_stream_load_si128
#define _mm_load_ps(ptr) _mm_castsi128_ps(_mm_stream_load_si128((const __m128i*)(ptr)))
#define _mm_load_pd(ptr) _mm_castsi128_pd(_mm_stream_load_si128((const __m128i*)(ptr)))
// SSE1/2 MOVNTPS / PD / MOVNTDQ evict data from cache if it was hot, and bypass cache
#define _mm_store_ps _mm_stream_ps // SSE1 movntps
#define _mm_store_pd _mm_stream_pd // SSE2 movntpd is a waste of space vs. the ps encoding, but whatever
#define _mm_store_si128 _mm_stream_si128 // SSE2 movntdq
// and repeat for _mm256_... versions with _mm256_castsi256_ps
// and _mm512_... versions
// edit welcome if anyone tests this and adds those versions
#endif
Related: for auto-vectorization with MSVC (and gcc/clang), see Alex's answer on Alignment attribute to force aligned load/store in auto-vectorization of GCC/CLang
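Another way to catch misaligned pointers regardless of which instruction the compiler picks is to check the address explicitly in debug builds. This is a minimal sketch and not something the answer above proposes: is_aligned and checked_load_ps are hypothetical names, and the check costs an extra test and branch per load unless NDEBUG is defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p is a multiple of `alignment` (which must be a power of two). */
static inline bool is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
/* Hypothetical wrapper: assert 16-byte alignment, then do the normal load.
   The assert compiles away when NDEBUG is defined. */
static inline __m128 checked_load_ps(const float *p)
{
    assert(is_aligned(p, 16));
    return _mm_load_ps(p);
}
#endif
```

Calling checked_load_ps(data) in place of _mm_load_ps(data) makes a debug build abort on a misaligned pointer even when the compiler emits movups.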
If I have a chip that is subject to the Intel jcc erratum, how can I enable the mitigation in gcc (which adjusts branch locations to avoid the problematic alignment), and which gcc versions support it?
By compiler:
GCC: -Wa,-mbranches-within-32B-boundaries
clang (10+): -mbranches-within-32B-boundaries compiler option directly, not -Wa.
MSVC: /QIntel-jcc-erratum See Intel JCC Erratum - what is the effect of prefixes used for mitigation?
ICC: TODO, look for docs.
The GNU toolchain does mitigation in the assembler, with as -mbranches-within-32B-boundaries, which enables (GAS manual: x86 options):
-malign-branch-boundary=32 (care about 32-byte boundaries). Except the manual says this option takes an exponent, not a direct power of 2, so probably it's actually ...boundary=5.
-malign-branch=jcc+fused+jmp (the default which does not include any of +call+ret+indirect)
-malign-branch-prefix-size=5 (up to 5 segment prefixes per insn).
So the relevant GCC invocation is gcc -Wa,-mbranches-within-32B-boundaries
Unfortunately, GCC -mtune=skylake doesn't enable this.
GAS's strategy seems to be to pad as early as possible after the last alignment directive (e.g. .p2align) or after the last jcc/jmp that can end before a 32B boundary. I guess that might end up with padding in outer loops, before or after inner loops, maybe helping them fit in fewer uop cache lines? (Skylake also has its LSD loop buffer disabled, so a tiny loop split across two uop cache lines can run at best 2 cycles per iteration, instead of 1.)
It can lead to quite a large amount of padding with long macro-fused jumps, such as with -fstack-protector-strong which in recent GCC uses sub rdx,QWORD PTR fs:0x28 / jnz (earlier GCC used to use xor, which can't fuse even on Intel). That's 11 bytes total of sub + jnz, so could require 11 bytes of CS prefixes in the worst case to shift it to the start of a new 32B block. Example showing 8 CS prefixes in the insns before it: https://godbolt.org/z/n1dYGMdro
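The worst-case figure above can be checked with a little arithmetic. The sketch below assumes the erratum condition as I understand it (a jump, or a macro-fused pair, is affected if it crosses a 32-byte boundary or ends exactly on one) and computes how many bytes of padding would push such an instruction to the start of the next 32-byte block. jcc_padding_needed is a hypothetical helper, not GAS's actual algorithm, and it assumes the length is between 1 and 31 bytes.

```c
#include <stdint.h>

/* Padding (in bytes) needed before an instruction of `len` bytes starting at
   offset `start` so that it neither crosses nor ends on a 32-byte boundary.
   Assumes 1 <= len <= 31. */
static unsigned jcc_padding_needed(uint64_t start, unsigned len)
{
    uint64_t first_block = start / 32;
    /* Block holding the byte just past the end; if it differs from the
       first block, the instruction crosses or touches the boundary. */
    uint64_t end_block = (start + len) / 32;
    if (first_block == end_block)
        return 0;
    /* Otherwise, shift the whole instruction to the next 32-byte block. */
    return (unsigned)(32 - start % 32);
}
```

For the 11-byte sub + jnz pair, a start offset of 21 within a block makes it end exactly on the boundary, and that case needs the full 11 bytes of padding.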
GCC doesn't know instruction sizes; it only prints text. That's why it needs GAS to support stuff like .p2align 4,,10 to align by 16 if that will take at most 10 bytes of padding, to implement the alignment heuristics it wants to use. (Often followed by .p2align 3 to unconditionally align by 8.)
as has other fun options that aren't on by default, like -Os to optimize hand-written asm like mov $1, %rax => mov $1, %eax / xor %rax,%rax => %eax / test $1, %eax => al and even EVEX => VEX for stuff like vmovdqa64 => vmovdqa.
Also stuff like -msse2avx to always use VEX prefixes even when the mnemonic isn't v..., and -momit-lock-prefix=yes which could be used to build std::atomic code for a uniprocessor system.
And -mfence-as-lock-add=yes to assemble mfence into lock addl $0x0, (%rsp). But insanely it also does that for sfence and even lfence, so it's unusable in code that uses lfence as an execution barrier, which is the primary use-case for lfence. e.g. for retpolines or timing like lfence;rdtsc.
as also has CPU feature-level checking with -march=znver3 for example, or .arch directives. And -mtune=CPU, although IDK what that does. Perhaps set NOP strategy?
In Xcode (Version 10.1 (10B61)), I used the macros below to detect AVX512 support.
#ifdef __SSE4_1__
#error "sse4_1"
#endif
#ifdef __AVX__
#error "avx"
#endif
#ifdef __AVX2__
#error "avx2"
#endif
#ifdef __AVX512__
#error "avx512"
#endif
In the default Build Settings, SSE4_1 is active, but AVX and AVX2 are not. Adding -mavx in Build Settings --> Apple Clang - Custom Compiler Flags --> Other C Flags enables AVX, and further adding -mavx2 enables both AVX and AVX2, but -mavx512 fails with: Unknown argument: '-mavx512'.
How do you enable avx512 and detect it?
It seems like there are a few macros to detect avx512.
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512F__ 1
#define __AVX512VL__ 1
What are the differences between them?
AVX512 isn't a single extension, and doesn't have a specific-enough meaning in this context to be useful. Compilers only deal with specific CPU features, like AVX512F, AVX512DQ, AVX512CD, etc.
All CPUs that support any AVX512 extensions must support AVX512F, the "Foundation". AVX512F is the baseline AVX512 extension that other AVX512 extensions build on.
In code that wants to use AVX512 intrinsics, you should look at https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 and pick a set of extensions that are available together on one CPU you care about, e.g. F + CD and VL, DQ, BW on currently-available Skylake-X.
Then for example use #if defined(__AVX512BW__) && defined(__AVX512VL__) before code that uses vpermt2w on 256-bit vectors or something. __AVX512(anything)__ implies __AVX512F__; that's the one extension you don't have to check for separately.
But if you only use AVX512F instructions, then yeah, just check for that macro.
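A compile-time dispatch along those lines might look like the sketch below. The enum and the function name compiled_simd_level are hypothetical; only the __AVX512BW__, __AVX512VL__, __AVX512F__ and __AVX2__ macros come from the compiler. It reports which code path the translation unit was built for, which is a different question from what the CPU running the binary supports.

```c
/* Hypothetical helper: report which SIMD code path this translation unit
   was compiled for, based on the compiler's predefined feature macros. */
enum simd_level {
    SIMD_BASELINE = 0,
    SIMD_AVX2 = 1,
    SIMD_AVX512F = 2,
    SIMD_AVX512BW_VL = 3
};

static enum simd_level compiled_simd_level(void)
{
#if defined(__AVX512BW__) && defined(__AVX512VL__)
    /* Safe to use e.g. vpermt2w on 256-bit vectors in this branch. */
    return SIMD_AVX512BW_VL;
#elif defined(__AVX512F__)
    return SIMD_AVX512F;
#elif defined(__AVX2__)
    return SIMD_AVX2;
#else
    return SIMD_BASELINE;
#endif
}
```

Building with -march=skylake-avx512 would be expected to select the first branch, while a default build would typically select the last.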
You should pretty much never use -mavx512f directly: use -march=skylake-avx512, -march=knl, or -march=native. Or in future, -march=icelake or whatever.
The compiler knows which CPUs support which sets of extensions (or can detect which extensions the machine you're compiling on supports). There are a lot of them, and leaving out important ones like AVX512VL (support for AVX512 instructions on 128-bit and 256-bit vectors) or Xeon Phi's AVX512ER (fast 1/x and 1/sqrt(x) with twice the precision of the normal AVX512 14-bit versions) could hurt performance significantly. Especially AVX512ER is very important if you do any division or log/exp on Xeon Phi, because full-precision division is very slow on KNL compared to Skylake.
-march=x implies -mtune=x, enabling tuning options relevant for the target as well. KNL is basically Silvermont with AVX512 bolted on, and has significant differences from -mtune=skylake-avx512.
These are the same reasons you should generally not use -mfma -mavx2 directly, except that there are currently no AMD CPUs with AVX512, so there are only 2 main tuning targets (Xeon Phi and mainstream Skylake/CannonLake/Icelake), and they also support different sets of AVX512 extensions. There is unfortunately no -mtune=generic-avx2 tuning setting, but Ryzen supports almost all extensions that Haswell does (and the ones it doesn't GCC / clang won't use automatically, like transactional memory), so -march=haswell might be reasonable to make code tuned for CPUs with FMA, AVX2, popcnt, etc, without suffering too much on Ryzen.
Also relevant (for GCC, maybe not clang currently. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html):
-mprefer-vector-width=256 auto-vectorize with 256-bit vectors by default, in case most of the time is spent in non-vectorized loops. Using 512-bit vectors reduces the max turbo clock speed by a significant amount on Intel Xeon CPUs (maybe not as much on i9 desktop versions of Skylake-X), so it can be a net slowdown to use 512-bit vectors in small scattered bits of your program. So 256 is the default for tune=skylake-avx512 in GCC, but KNL uses 512.
-mprefer-avx-128 the old version of the -mprefer-vector-width= option, before AVX512 existed.
Using AVX512 mask registers, 32 vector registers, and/or its new instructions, can be a significant win even at the same vector width, so it makes sense to enable AVX512 even if you don't want to use 512-bit vector width. (Although sometimes code using intrinsics or auto-vectorization will compile in a worse way, instead of better, if AVX512 compare-into-register versions of comparison are available at all. But hopefully anti-optimization bugs like that will be sorted out as AVX512 becomes more widely used.)
GCC/G++ of MingW gives Relocation Errors when Building Applications with Large Global or Static Data.
Understanding the x64 code models
References to both code and data on x64 are done with instruction-relative (RIP-relative in x64 parlance) addressing modes. The offset from RIP in these instructions is limited to 32 bits. The small code model promises to the compiler that 32-bit relative offsets should be enough for all code and data references in the compiled object. The large code model, on the other hand, tells it not to make any assumptions and use absolute 64-bit addressing modes for code and data references. To make things more interesting, there's also a middle road, called the medium code model.
For the example program below, the code fails to compile despite adding the option -mcmodel=medium or -mcmodel=large:
#define SIZE 16384
float a[SIZE][SIZE], b[SIZE][SIZE];
int main(){
return 0;
}
gcc -mcmodel=medium example.c fails to compile with MinGW/Cygwin on Windows, and the example also fails with the Intel compiler on Windows and with MSVC.
You are limited to 32 bits for an offset, but this is a signed offset. So in practice, you are actually limited to 2GiB. You asked why this is not possible, but your two arrays alone total 2GiB (1GiB each) and there are things in the data segment other than just your arrays. C is a high level language. You get the ease of just being able to define a main function and you get all of these other things for free: standard input and output, etc. The C runtime implements this for you and all of this consumes stack space and room in your data segment. For example, if I build this on x86_64-pc-linux-gnu my .bss is 0x80000020 bytes in size, an additional 32 bytes. (I've erased PE information from my brain, so I don't remember how those are laid out.)
I don't remember much about the various machine models, but it's probably helpful to note that the x86_64 instruction set doesn't even contain instructions (that I'm aware of, although I'm not an x86 assembly expert) to access any register-relative address beyond a signed 32-bit value. For example, when you want to cram that much stuff on the stack, gcc has to do weird things like this stack pointer allocation:
movabsq $-10000000016, %r11
addq %r11, %rsp
You can't addq $-10000000016, %rsp because it's more than a signed 32-bit offset. The same applies to RIP-relative addressing:
movq 10000000016(%rip), %rax # No such addressing mode: displacement exceeds 32 bits
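The size arithmetic can be spelled out in a few lines of C. This is only a sketch of the numbers involved: array_bytes and fits_in_disp32 are hypothetical helpers, and the check models nothing more than the signed 32-bit displacement limit discussed above.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIZE 16384

/* Bytes occupied by one float[SIZE][SIZE] array from the question. */
static uint64_t array_bytes(void)
{
    return (uint64_t)SIZE * SIZE * sizeof(float);
}

/* True if a non-negative offset is encodable as a signed 32-bit displacement. */
static bool fits_in_disp32(uint64_t offset)
{
    return offset <= (uint64_t)INT32_MAX;
}
```

One array is 2^30 bytes (1GiB) and still fits; the two together are 2^31 bytes, one byte more than the largest positive 32-bit displacement, before counting anything else in the data segment.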
I'm curious to see if my 64-bit application suffers from alignment faults.
From Windows Data Alignment on IPF, x86, and x64 archive:
In Windows, an application program that generates an alignment fault will raise an exception, EXCEPTION_DATATYPE_MISALIGNMENT.
On the x64 architecture, the alignment exceptions are disabled by default, and the fix-ups are done by the hardware. The application can enable alignment exceptions by setting a couple of register bits, in which case the exceptions will be raised unless the user has the operating system mask the exceptions with SEM_NOALIGNMENTFAULTEXCEPT. (For details, see the AMD Architecture Programmer's Manual Volume 2: System Programming.)
[Ed. emphasis mine]
On the x86 architecture, the operating system does not make the alignment fault visible to the application. On these two platforms, you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
On the Itanium, by default, the operating system (OS) will make this exception visible to the application, and a termination handler might be useful in these cases. If you do not set up a handler, then your program will hang or crash. In Listing 3, we provide an example that shows how to catch the EXCEPTION_DATATYPE_MISALIGNMENT exception.
Ignoring the direction to consult the AMD Architecture Programmer's Manual, I will instead consult the Intel 64 and IA-32 Architectures Software Developer's Manual:
5.10.5 Checking Alignment
When the CPL is 3, alignment of memory references can be checked by setting the AM flag in the CR0 register and the AC flag in the EFLAGS register. Unaligned memory references generate alignment exceptions (#AC). The processor does not generate alignment exceptions when operating at privilege level 0, 1, or 2. See Table 6-7 for a description of the alignment requirements when alignment checking is enabled.
Excellent. I'm not sure what that means, but excellent.
Then there's also:
2.5 CONTROL REGISTERS
Control registers (CR0, CR1, CR2, CR3, and CR4; see Figure 2-6) determine operating mode of the processor and the characteristics of the currently executing task. These registers are 32 bits in all 32-bit modes and compatibility mode.
In 64-bit mode, control registers are expanded to 64 bits. The MOV CRn instructions are used to manipulate the register bits. Operand-size prefixes for these instructions are ignored.
The control registers are summarized below, and each architecturally defined control field in these control registers are described individually. In Figure 2-6, the width of the register in 64-bit mode is indicated in parenthesis (except for CR0).
CR0 — Contains system control flags that control operating mode and states of the processor
AM
Alignment Mask (bit 18 of CR0) — Enables automatic alignment checking when set; disables alignment checking when clear. Alignment checking is performed only when the AM flag is set, the AC flag in the EFLAGS register is set, CPL is 3, and the processor is operating in either protected or virtual-8086 mode.
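The conditions quoted above reduce to a couple of bit tests. The sketch below only models them on plain integers, since user code cannot read CR0. The macro and function names are hypothetical; the bit position for AM comes from the quoted manual text, and AC is likewise bit 18 of EFLAGS/RFLAGS, so both masks are 1 << 18 = 0x40000.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_AM    (UINT64_C(1) << 18)  /* Alignment Mask, bit 18 of CR0 */
#define RFLAGS_AC (UINT64_C(1) << 18)  /* Alignment Check, bit 18 of EFLAGS/RFLAGS */

/* Models the rule from the manual: #AC is only raised when CR0.AM is set,
   the AC flag is set, and the code runs at privilege level 3. */
static bool alignment_check_active(uint64_t cr0, uint64_t rflags, unsigned cpl)
{
    return (cr0 & CR0_AM) != 0 && (rflags & RFLAGS_AC) != 0 && cpl == 3;
}
```

Note that a mask of 0x20000 would be bit 17, not bit 18, so it does not enable alignment checking.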
I tried
The language I am actually using is Delphi, but pretend it's language agnostic pseudocode:
void UnmaskAlignmentExceptions()
{
asm
mov rax, cr0; //copy CR0 flags into RAX
or rax, 0x40000; //set bit 18 (AM); 1 shl 18 = 0x40000, not 0x20000
mov cr0, rax; //copy flags back
}
The first instruction
mov rax, cr0;
fails with a Privileged Instruction exception.
How to enable alignment exceptions for my process on x64?
PUSHF
I discovered that the x86 has the instruction:
PUSHF, POPF: Push/pop first 16-bits of EFLAGS on/off the stack
PUSHFD, POPFD: Push/pop all 32-bits of EFLAGS on/off the stack
That then led me to the x64 version:
PUSHFQ, POPFQ: Push/pop the RFLAGS quad on/off the stack
(In 64-bit world the EFLAGS are renamed RFLAGS).
So I wrote:
void EnableAlignmentExceptions()
{
asm
PUSHFQ; //Push RFLAGS quadword onto the stack
POP RAX; //Pop them flags into RAX
OR RAX, $40000; //set bit 18 (AC=Alignment Check) of the flags; $20000 would be bit 17
PUSH RAX; //Push the modified flags back onto the stack
POPFQ; //Pop the stack back into RFLAGS;
}
And it didn't crash or trigger a protection exception. I have no idea if it does what I want it to.
Bonus Reading
How to catch data-alignment faults on x86 (aka SIGBUS on Sparc) (unrelated question; x86 not x64, Ubuntu not Windows, gcc vs not)
Applications running on x64 have access to a flag register (sometimes referred to as EFLAGS). Bit 18 in this register allows applications to get exceptions when alignment errors occur. So in theory, all a program has to do to enable exceptions for alignment errors is modify the flags register.
However
In order for that to actually work, the operating system kernel must set cr0's bit 18 to allow it. And the Windows operating system doesn't do that. Why not? Who knows?
Applications cannot set values in the control register. Only the kernel can do this. Device drivers run inside the kernel, so they can set this too.
It is possible to muck about and try to get this to work by creating a device driver, see:
Old New Thing - Disabling the program crash dialog archive
and the comments that follow. Note that this post is over a decade old, so some of the links are dead.
You might also find this comment (and some of the other answers in this question) to be useful:
Larry Osterman - 07-28-2004 2:22 AM
We actually built a version of NT with alignment exceptions turned on for x86 (you can do that as Skywing mentioned).
We quickly turned it off, because of the number of apps that broke :)
As an alternative to AC for finding slowdowns due to unaligned accesses, you can use hardware performance counter events on Intel CPUs for mem_inst_retired.split_loads and mem_inst_retired.split_stores to find loads/stores that split across a cache-line boundary.
perf record -c 10 -e mem_inst_retired.split_stores,mem_inst_retired.split_loads ./a.out should be useful on Linux. -c 10 records a sample every 10 HW events. If your program does a lot of unaligned accesses and you only want to find the real hotspots, leave it at the default. But -c 10 can get useful data even on a tiny binary that calls printf once. Other perf options like -g to record parent functions on each sample work as usual, and could be useful.
On Windows, use whatever tool you prefer for looking at perf counters. VTune is popular.
Modern Intel CPUs (P6 family and newer) have no penalty for misalignment within a cache line. https://agner.org/optimize/. In fact, such loads/stores are even guaranteed to be atomic (up to 8 bytes), on Intel CPUs. So AC is stricter than necessary, but it will help find potentially-risky accesses that could be page-splits or cache-line splits with differently-aligned data.
AMD CPUs may have penalties for crossing a 16-byte boundary within a 64-byte cache line. I'm not familiar with what hardware counters are available there. Beware that profiling on Intel HW won't necessarily find slowdowns that occur on AMD CPUs, if the offending access never crosses a cache line boundary.
See How can I accurately benchmark unaligned access speed on x86_64? for some details on the penalties, including my testing on 4k-split latency and throughput on Skylake.
See also http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ for possible penalties to store-forwarding efficiency for misaligned loads/stores on Intel/AMD.
Running normal binaries with AC set is not always practical. Compiler-generated code might choose to use an unaligned 8-byte load or store to copy multiple struct members, or to store some literal data.
gcc -O3 -mtune=generic (i.e. the default with optimization enabled) assumes that cache-line splits are cheap enough to be worth the risk of using unaligned accesses instead of multiple narrow accesses like the source does. Page-splits got much cheaper in Skylake, down from ~100 to 150 cycles in Haswell to ~10 cycles in Skylake (about the same penalty as CL splits), because apparently Intel found they were less rare than they previously thought.
Many optimized library functions (like memcpy) use unaligned integer accesses. e.g. glibc's memcpy, for a 6-byte copy, would do 2 overlapping 4-byte loads from the start/end of the buffer, then 2 overlapping stores. (It doesn't have a special case for exactly 6 bytes to do a dword + word, just increasing powers of 2). This comment in the source explains its strategies.
So even if your OS would let you enable AC, you might need a special version of libraries to not trigger AC all over the place for stuff like small memcpy.
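The overlapping-access trick can be sketched in portable C. This is an illustration of the idea, not glibc's actual code: copy_4_to_8 is a hypothetical helper that handles any length from 4 to 8 bytes with two 4-byte loads and two 4-byte stores, and the fixed-size memcpy calls are what compilers typically turn into single unaligned accesses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 4 <= n <= 8, using two possibly-overlapping 4-byte accesses.
   Both loads happen before either store, so overlap within dst is harmless. */
static void copy_4_to_8(void *dst, const void *src, size_t n)
{
    uint32_t head, tail;
    memcpy(&head, src, 4);                        /* first 4 bytes */
    memcpy(&tail, (const char *)src + n - 4, 4);  /* last 4 bytes  */
    memcpy(dst, &head, 4);
    memcpy((char *)dst + n - 4, &tail, 4);
}
```

For n = 6 the second load starts at offset 2, so at least one of the two accesses is not 4-byte aligned whatever the alignment of src, which is exactly the kind of access that AC would trap.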
SIMD
Alignment when looping sequentially over an array really matters for AVX512, where a vector is the same width as a cache line. If your pointers are misaligned, every access is a cache-line split, not just every other with AVX2. Aligned is always better, but for many algorithms with a decent amount of computation mixed with memory access, it only makes a significant difference with AVX512.
(So with AVX1/2, it's often good to just use unaligned loads, instead of always doing extra work to check alignment and go scalar until an alignment boundary. Especially if your data is usually aligned but you want the function to still work marginally slower in case it isn't.)
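The "every access vs. every other access" claim is easy to verify by counting. The sketch below assumes 64-byte cache lines; crosses_cache_line and count_line_splits are hypothetical helpers that work on plain addresses and touch no memory.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size in bytes */

/* True if the access [addr, addr + size) touches two cache lines (size >= 1). */
static bool crosses_cache_line(uint64_t addr, size_t size)
{
    return addr / CACHE_LINE != (addr + size - 1) / CACHE_LINE;
}

/* Count how many of n_vectors back-to-back vector accesses of vec_bytes each,
   starting at address `start`, split across a cache-line boundary. */
static size_t count_line_splits(uint64_t start, size_t vec_bytes, size_t n_vectors)
{
    size_t splits = 0;
    for (size_t i = 0; i < n_vectors; i++)
        if (crosses_cache_line(start + i * vec_bytes, vec_bytes))
            splits++;
    return splits;
}
```

With a start address that is 16 bytes past a line boundary, eight 64-byte accesses all split, while only four of eight 32-byte accesses do.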
Scattered misaligned accesses that cross a cache-line boundary essentially have twice the cache footprint from touching both lines, if the lines aren't otherwise touched.
Checking for 16, 32 or 64 byte alignment with SIMD is simple in asm: just use [v]movdqa alignment-required loads/stores, or legacy-SSE memory source operands for instructions like paddb xmm0, [rdi]. Instead of vmovdqu or VEX-coded memory source operands like vpaddb xmm0, xmm1, [rdi] which let hardware handle the case of misalignment if/when it occurs.
But in C with intrinsics, some compilers (MSVC and ICC) compile alignment-required intrinsics like _mm_load_si128 into [v]movdqu, never using [v]movdqa, so that's annoying if you actually wanted to use alignment-required loads.
Of course, _mm256_load_si256 or 128 can fold into an AVX memory source operand for vpaddb ymm0, ymm1, [rdi] with any compiler including GCC/clang, same for 128-bit any time AVX and optimization are enabled. But store intrinsics that don't get optimized away entirely do get done with vmovdqa / vmovaps, so at least you can verify store alignment.
To verify load alignment with AVX, you can disable optimization so you'll get separate load / spill into __m256i temporary / reload.
This works on 64-bit Intel CPUs. It may fail on some AMD CPUs:
pushfq
bts qword ptr [rsp], 12h ; set AC bit of rflags
popfq
It will not work right away on 32-bit CPUs; those first require a kernel driver to set the AM bit of CR0, and then:
pushfd
bts dword ptr [esp], 12h
popfd