Using a specific zmm register in inline asm - gcc

Can I tell gcc-style inline assembly to put my __m512i variable into a specific zmm register, like zmm31?

On targets that have no specific-register constraints at all (like ARM), use local register variables to make a broad constraint pick a specific register for asm statements. The compiler can still optimize otherwise, because the only documented, guaranteed effect of a register-asm local is on asm inputs/outputs.
In practice the compiler will prefer the specified register even when there's no asm, so you can write code that appears to work but isn't safe in general, like register int ebx asm("ebx"); return ebx;. The GCC documentation is what makes a behaviour guaranteed / future-proof, even though current gcc prefers the specified register strongly enough to waste instructions when the constraint isn't compatible with it (see below).
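For example, on AArch64 (where GNU C has no constraint that names one particular register) the same register-asm pattern is what pins an operand. A minimal sketch, not from the original answer, with illustrative names:
#include <stdint.h>
int64_t add_one_in_x1(int64_t arg) {
    register int64_t in  asm("x0") = arg;   // make the "r" input use x0
    register int64_t out asm("x1");         // make the "r" output use x1
    asm("add %0, %1, #1" : "=r"(out) : "r"(in));
    return out;
}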
Anyway, this use of register-asm local variables is the only thing they're guaranteed to work for:
#include <immintrin.h>
__m512i foo() {
    register __m512i z31 asm("zmm31") = _mm512_set1_epi32(123);
    register __m512i z30 asm("zmm30");
    asm("vmovdqa64 %1, %0 # from inline asm"
        : "=v"(z30)
        : "v"(z31)
    );
    return z30;
}
On the Godbolt compiler explorer, this compiles to the following with clang6.0:
# clang -O3 -march=skylake-avx512
vbroadcastss .LCPI0_0(%rip), %zmm31 # zmm31 = [1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43]
vmovdqa64 %zmm31, %zmm30 # from inline asm
vmovaps %zmm30, %zmm0
retq
and gcc8.2:
# gcc -O3 -march=skylake-avx512
foo():
movl $123, %eax
vpbroadcastd %eax, %zmm31
vmovdqa64 %zmm31, %zmm30 # from inline asm
vmovdqa64 %zmm30, %zmm0
ret
Note the "v" constraints which allow any EVEX vector register (0..31), unlike "x" which only allows the first 16. "x" is documented as "any SSE register", but also applies to AVX YMM registers. https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.
Using "x" for this didn't result in any warnings, but with gcc "x" won vs. the register-variable declaration, so it chose %zmm2 and %zmm1 (strangely not zmm0 so an extra move was required). The register-asm declaration thus did cost us efficiency.
With clang it still used zmm31 and zmm30, apparently violating the "x" constraint, so it would have failed to assemble if you'd used an instruction with no EVEX version on the XMM or YMM part of the register operand, like AVX2 vpcmpeqd ymm,ymm,ymm (compare into vector, not compare into mask). (In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?).
//#ifndef __clang__
__m512i broken_with_clang() {
    register __m512i z31 asm("zmm31") = _mm512_set1_epi32(123);
    register __m512i z30 asm("zmm30") = _mm512_setzero_si512();
    // notice that gcc still inits these in zmm31 and 30, *then* copies
    // so register asm costs us efficiency.
    // AVX512 only has compares into k registers, not into YMM registers.
    asm("vpcmpeqd %t1, %t0, %t0 # from inline asm. input was %0"
        : "+x"(z30)
        : "x"(z31)
    );
    return z30;
}
//#endif
With clang we get an error for each operand; I guess clang doesn't support the t modifier to get the YMM name of the register (it fails with clang6.0 even if I remove the register ... asm() stuff entirely).
<source>:21:9: error: invalid operand in inline asm: 'vpcmpeqd ${1:t}, ${0:t}, ${0:t} # from inline asm. input was $0'
asm("vpcmpeqd %t1, %t0, %t0 # from inline asm. input was %0"
^
...
<source>:21:9: error: unknown token in expression
<inline asm>:1:11: note: instantiated into assembly here
vpcmpeqd , , # from inline asm. input was %zmm30
But gcc compiles it just fine:
broken_with_clang():
movl $123, %eax
vpbroadcastd %eax, %zmm31
vpxord %xmm30, %xmm30, %xmm30
vmovdqa64 %zmm30, %zmm1 # extra overhead because of register asm
vmovdqa64 %zmm31, %zmm2 # which didn't match the constraints
vpcmpeqd %ymm2, %ymm1, %ymm1 # from inline asm. input was %zmm1
vmovdqa64 %zmm1, %zmm0 # extra overhead because gcc didn't pick zmm0
ret

Related

How get EIP from x86 inline assembly by gcc

I want to get the value of EIP from the following code, but it fails to compile.
Command:
gcc -o xxx x86_inline_asm.c -m32 && ./xxx
Contents of x86_inline_asm.c:
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
    unsigned int eip_val;
    __asm__("mov %0,%%eip":"=r"(eip_val));
    return 0;
}
How can I use inline assembly to read the value of EIP in a way that compiles successfully for x86? How should the code be modified?
This sounds unlikely to be useful (vs. just taking the address of the whole function like void *tmp = main), but it is possible.
Just get a label address, or use . (the address of the current line), and let the linker worry about getting the right immediate into the machine code. So you're not architecturally reading EIP, just reading the value it currently has from an immediate.
asm volatile("mov $., %0" : "=r"(address_of_mov_instruction) );
AT&T syntax is mov src, dst, so what you wrote would be a jump if it assembled.
Architecturally, EIP = the end of an instruction while it's executing, so arguably you should do
asm volatile(
    "mov $1f, %0 \n\t"   // reference label 1 forward
    "1:"                 // GAS local label
    : "=r"(address_after_mov)
);
I'm using asm volatile in case this asm statement gets duplicated multiple times inside the same function by inlining or something. If you want each case to get a different address, it has to be volatile. Otherwise the compiler can assume that all instances of this asm statement produce the same output. Normally that will be fine.
Architecturally, 32-bit mode has no EIP-relative addressing mode (like 64-bit RIP-relative LEA), so the only good way to actually read EIP is call / pop (see Reading program counter directly). EIP is not a general-purpose register, so you can't just use it as the source or destination of a mov or any other instruction.
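If you do want the call / pop approach, a minimal sketch (not from the original answer; compile with -m32):
void *get_eip(void) {
    void *eip;
    asm volatile("call 1f \n\t"   // call pushes the address of the next instruction
                 "1: pop %0"      // pop it into the output register
                 : "=r"(eip));
    return eip;
}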
But really you don't need inline asm for this at all.
Is it possible to store the address of a label in a variable and use goto to jump to it? shows how to use the GNU C extension where &&label takes its address.
int foo;
void *addr_inside_function() {
    foo++;
lab1: ;                  // labels only go on statements, not declarations
    void *tmp = &&lab1;
    foo++;
    return tmp;
}
There's nothing you can safely do with this address outside the function; I returned it just as an example to make the compiler put a label in the asm and see what happens. Without a goto to that label, it can still optimize the function pretty aggressively, but you might find it useful as an input for an asm goto(...) somewhere else in the function.
But anyway, it compiles on Godbolt to this asm
# gcc -O3 -m32
addr_inside_function:
.L2:
addl $2, foo
movl $.L2, %eax
ret
#clang -O3 -m32
addr_inside_function:
movl foo, %eax
leal 1(%eax), %ecx
movl %ecx, foo
.Ltmp0: # Block address taken
addl $2, %eax
movl %eax, foo
movl $.Ltmp0, %eax # retval = label address
retl
So clang loads the global, computes foo+1 and stores it, then after the label computes foo+2 and stores that (instead of loading twice). So you still can't usefully jump to the label from anywhere, because it depends on having foo's old value in eax, and on the desired behaviour being to store foo+2.
I don't know the gcc inline assembly syntax for this, but for MASM:
call next0
next0: pop eax ;eax = eip for this line
In the case of MASM, $ represents the current location, and since call is a 5-byte instruction, an alternative syntax without a label would be:
call $+5
pop eax

In Clang/LLVM x86-64 inline assembly, how do I say I clobbered the x87/media state?

I'm writing some x86-64 inline assembly that might affect the floating point and media (SSE, MMX, etc.) state, but I don't feel like saving and restoring the state myself. Does Clang/LLVM have a clobber constraint for that?
(I'm not too familiar with the x86-64 architecture or inline assembly, so it was hard to know what to search for. More details in case this is an XY problem: I'm working on a simple coroutine library in Rust. When we switch tasks, we need to store the old CPU state and load the new state, and I'd like to write as little assembly as possible. My guess is that letting the compiler take care of saving and restoring state is the simplest way to do that.)
If your coroutine looks like an opaque (non-inline) function call, the compiler will already assume the FP state is clobbered (except for control regs like MXCSR and the x87 control word (rounding mode)), because all the FP regs are call-clobbered in the normal function calling convention.
Except for Windows, where xmm6..15 are call-preserved.
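So one approach is to keep the switch as an opaque call and let the calling convention do the work. A hedged sketch, where swap_context is a hypothetical out-of-line asm routine (not something defined in this answer):
// Saves the current task's callee-saved state and restores the next task's;
// written in a separate .s file, so the compiler treats it as a normal call.
extern void swap_context(void *save_ctx, const void *load_ctx);

void yield_to(void *current, const void *next) {
    // Being a regular (non-inlined) call, the compiler already assumes all
    // call-clobbered FP/vector registers are destroyed across it.
    swap_context(current, next);
}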
Also beware that if you're putting a call inside inline asm, there's no way to tell the compiler that your asm clobbers the red zone (128 bytes below RSP in the x86-64 System V ABI). You could compile that file with -mno-redzone or use add rsp, -128 before call to skip over the red-zone that belongs to the compiler-generated code.
To declare clobbers on the FP state, you have to name all the registers separately.
"xmm0", "xmm1", ..., "xmm15" (clobbering xmm0 counts as clobbering ymm0/zmm0).
For good measure you should also name "mm0", ..., "mm7" as well (MMX), in case your code inlines into some legacy code using MMX intrinsics.
To clobber the x87 stack as well, "st" is how you refer to st(0) in the clobber list. The rest of the registers have their normal names for GAS syntax, "st(1)", ..., "st(7)".
https://stackoverflow.com/questions/39728398/how-to-specify-clobbered-bottom-of-the-x87-fpu-stack-with-extended-gcc-assembly
You never know: it is possible to compile with clang -mfpmath=387, or to use 387 via long double.
(Hopefully no code uses -mfpmath=387 in 64-bit mode and MMX intrinsics at the same time; the following test-case looks slightly broken with gcc in that case.)
#include <immintrin.h>
float gvar;
int testclobber(float f, char *p)
{
    int arg1 = 1, arg2 = 2;
    f += gvar;                           // with -mno-sse, this will be in an x87 register
    __m64 mmx_var = *(const __m64*)p;    // MMX
    mmx_var = _mm_unpacklo_pi8(mmx_var, mmx_var);
    // x86-64 System V calling convention
    unsigned long long retval;
    asm volatile ("add $-128, %%rsp \n\t"   // skip red zone. -128 fits in an imm8
                  "call whatever \n\t"
                  "sub $-128, %%rsp \n\t"
                  // FIXME should probably align the stack in here somewhere
        : "=a"(retval)              // returns in RAX
        : "D" (arg1), "S" (arg2)    // input args in registers
        : "rcx", "rdx", "r8", "r9", "r10", "r11"   // call-clobbered integer regs
        // call clobbered FP regs, *NOT* including MXCSR
        , "mm0", "mm1", "mm2", "mm3", "mm4", "mm5", "mm6", "mm7"                       // MMX
        , "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)", "st(6)", "st(7)"          // x87
        // SSE/AVX: clobbering any results in a redundant vzeroupper with gcc?
        , "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"
        , "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15"
#ifdef __AVX512F__
        , "zmm16", "zmm17", "zmm18", "zmm19", "zmm20", "zmm21", "zmm22", "zmm23"
        , "zmm24", "zmm25", "zmm26", "zmm27", "zmm28", "zmm29", "zmm30", "zmm31"
        , "k0", "k1", "k2", "k3", "k4", "k5", "k6", "k7"
#endif
#ifdef __MPX__
        , "bnd0", "bnd1", "bnd2", "bnd3"
#endif
        , "memory"   // reads/writes of globals and pointed-to data can't reorder across the asm (at compile time; runtime StoreLoad reordering is still a thing)
    );
    // Use the MMX var after the asm: compiler has to spill/reload the reg it was in
    *(__m64*)p = mmx_var;
    _mm_empty();   // emms
    gvar = f;      // memory clobber prevents hoisting this ahead of the asm.
    return retval;
}
source + asm on the Godbolt compiler explorer
By commenting out one of the lines of clobbers, we can see that the corresponding spill/reload goes away in the asm. E.g. commenting the x87 st .. st(7) clobbers makes code that leaves f + gvar in st0, for just an fst dword [gvar] after the call.
Similarly, commenting the mm0 line lets gcc and clang keep mmx_var in mm0 across the call. Since the ABI requires that the FPU is in x87 mode, not MMX, on call / ret, this isn't really sufficient: the compiler will spill/reload around the asm, but it won't insert an emms for us. But by the same token, it would be an error for a function using MMX to call your co-routine without doing _mm_empty() first, so maybe this isn't a real problem.
I haven't experimented with __m256 variables to see if it inserts a vzeroupper before the asm, to avoid possible SSE/AVX slowdowns.
If we comment the xmm8..15 line, we see the version that isn't using x87 for float keeps it in xmm8, because now it thinks it has some non-clobbered xmm regs. If we comment both sets of lines, it assumes xmm0 lives across the asm, so this works as a test of the clobbers.
asm output with all clobbers in place
It saves/restores RBX (to hold the pointer arg across the asm statement), which happens to re-align the stack by 16. That's another problem with using call from inline asm: I don't think alignment of RSP is guaranteed.
# from clang7.0 -march=skylake-avx512 -mmpx
testclobber: # #testclobber
push rbx
vaddss xmm0, xmm0, dword ptr [rip + gvar]
vmovss dword ptr [rsp - 12], xmm0 # 4-byte Spill (because of xmm0..15 clobber)
mov rbx, rdi # save pointer for after asm
movq mm0, qword ptr [rdi]
punpcklbw mm0, mm0 # mm0 = mm0[0,0,1,1,2,2,3,3]
movq qword ptr [rsp - 8], mm0 # 8-byte Spill (because of mm0..7 clobber)
mov edi, 1
mov esi, 2
add rsp, -128
call whatever
sub rsp, -128
movq mm0, qword ptr [rsp - 8] # 8-byte Reload
movq qword ptr [rbx], mm0
emms # note this didn't happen before call
vmovss xmm0, dword ptr [rsp - 12] # 4-byte Reload
vmovss dword ptr [rip + gvar], xmm0
pop rbx
ret
Notice that because of the "memory" clobber in the asm statement, *p and gvar are read before the asm, but written after. Without that, the optimizer could sink the load or hoist the store so no local variable was live across the asm statement. But now the optimizer needs to assume that the asm statement itself might read the old value of gvar and/or modify it. (And assume that p points to memory that's also globally accessible somehow, because we didn't use __restrict.)

How to have GCC combine "move r10, r3; store r10" into a "store r3"?

I'm working Power9 and utilizing the hardware random number generator instruction called DARN. I have the following inline assembly:
uint64_t val;
__asm__ __volatile__ (
"xor 3,3,3 \n" // r3 = 0
"addi 4,3,-1 \n" // r4 = -1, failure
"1: \n"
".byte 0xe6, 0x05, 0x61, 0x7c \n" // r3 = darn 3, 1
"cmpd 3,4 \n" // r3 == -1?
"beq 1b \n" // retry on failure
"mr %0,3 \n" // val = r3
: "=g" (val) : : "r3", "r4", "cc"
);
I had to add a mr %0,3 with "=g" (val) because I could not get GCC to produce expected code with "=r3" (val). Also see Error: matching constraint not valid in output operand.
A disassembly shows:
(gdb) b darn.cpp : 36
(gdb) r v
...
Breakpoint 1, DARN::GenerateBlock (this=<optimized out>,
output=0x7fffffffd990 "\b", size=0x100) at darn.cpp:77
77 DARN64(output+i*8);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.ppc64le libgcc-4.8.5-28.el7_5.1.ppc64le libstdc++-4.8.5-28.el7_5.1.ppc64le
(gdb) disass
Dump of assembler code for function DARN::GenerateBlock(unsigned char*, unsigned long):
...
0x00000000102442b0 <+48>: addi r10,r8,-8
0x00000000102442b4 <+52>: rldicl r10,r10,61,3
0x00000000102442b8 <+56>: addi r10,r10,1
0x00000000102442bc <+60>: mtctr r10
=> 0x00000000102442c0 <+64>: xor r3,r3,r3
0x00000000102442c4 <+68>: addi r4,r3,-1
0x00000000102442c8 <+72>: darn r3,1
0x00000000102442cc <+76>: cmpd r3,r4
0x00000000102442d0 <+80>: beq 0x102442c8 <DARN::GenerateBlock(unsigned char*, unsigned long)+72>
0x00000000102442d4 <+84>: mr r10,r3
0x00000000102442d8 <+88>: stdu r10,8(r9)
Notice GCC faithfully reproduces the:
0x00000000102442d4 <+84>: mr r10,r3
0x00000000102442d8 <+88>: stdu r10,8(r9)
How do I get GCC to fold the two instructions into:
0x00000000102442d8 <+84>: stdu r3,8(r9)
GCC will never remove text that's part of the asm template; it doesn't even parse it other than substituting in for %operand. It's literally just a text substitution before the asm is sent to the assembler.
You have to leave out the mr from your inline asm template, and tell gcc that your output is in r3 (or use a memory-destination output operand, but don't do that). If your inline-asm template ever starts or ends with mov instructions, you're usually doing it wrong.
Use register uint64_t foo asm("r3"); to force "=r"(foo) to pick r3 on platforms that don't have specific-register constraints.
(Despite ISO C++17 removing the register keyword, this GNU extension still works with -std=c++17. You can also use register uint64_t foo __asm__("r3"); if you want to avoid the asm keyword. You probably still need to treat register as a reserved word in source that uses this extension; that's fine. ISO C++ removing it from the base language doesn't force implementations to not use it as part of an extension.)
Or better, don't hard-code a register number. Use an assembler that supports the DARN instruction. (But apparently it's so new that even up-to-date clang lacks it, and you'd only want this inline asm as a fallback for gcc too old to support the __builtin_darn() intrinsic)
Using these constraints also lets you remove the register setup from the template: set foo=0 / bar=-1 in C before the inline asm statement, and use "+r"(foo).
But note that darn's output register is write-only. There's no need to zero r3 first. I found a copy of IBM's POWER ISA instruction set manual that is new enough to include darn here: https://wiki.raptorcs.com/w/images/c/cb/PowerISA_public.v3.0B.pdf#page=96
In fact, you don't need to loop inside the asm at all, you can leave that to the C and only wrap the one asm instruction, like inline-asm is designed for.
uint64_t random_asm() {
    register uint64_t val asm("r3");
    do {
        //__asm__ __volatile__ ("darn 3, 1");
        __asm__ __volatile__ (".byte 0x7c, 0x61, 0x05, 0xe6 # gcc asm operand = %0\n" : "=r" (val));
    } while(val == -1ULL);
    return val;
}
compiles cleanly (on the Godbolt compiler explorer) to
random_asm():
.L6: # compiler-generated label, no risk of name clashes
.byte 0x7c, 0x61, 0x05, 0xe6 # gcc asm operand = 3
cmpdi 7,3,-1 # compare-immediate
beq 7,.L6
blr
Just as tight as your loop, with less setup. (Are you sure you even need to zero r3 before the asm instruction?)
This function can inline anywhere you want it to, allowing gcc to emit a store instruction that reads r3 directly.
In practice, you'll want to use a retry counter, as advised in the manual (see the sketch after the quote below): if the hardware RNG is broken, it might give you failure forever, so you should have a fallback to a PRNG. (Same for x86's rdrand.)
Deliver A Random Number (darn) - Programming Note
When the error value is obtained, software is expected to repeat the operation. If a non-error value has not been obtained after several attempts, a software random number generation method should be used. The recommended number of attempts may be implementation specific. In the absence of other guidance, ten attempts should be adequate.
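A hedged sketch of that retry logic wrapped around the single-instruction asm statement above; software_rng_fallback is a hypothetical PRNG, not part of the original answer:
#include <stdint.h>

uint64_t software_rng_fallback(void);            // hypothetical PRNG fallback

uint64_t random_with_retry(void) {
    for (int tries = 0; tries < 10; tries++) {   // "ten attempts should be adequate"
        register uint64_t val __asm__("r3");
        __asm__ __volatile__ (".byte 0x7c, 0x61, 0x05, 0xe6" : "=r" (val));  // darn 3, 1
        if (val != -1ULL)
            return val;                          // success
    }
    return software_rng_fallback();              // hardware RNG appears broken
}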
xor-zeroing is not efficient on most fixed-instruction-width ISAs, because a mov-immediate is just as short so there's no need to detect and special-case an xor. (And thus CPU designs don't spend transistors on it). Moreover, dependency rules for the PPC asm equivalent of C++11 std::memory_order_consume require it to carry a dependency on the input register, so it couldn't be dependency-breaking even if the designers wanted it to. xor-zeroing is only a thing on x86 and maybe a few other variable-width ISAs.
Use li r3, 0 like gcc does for int foo(){return 0;} https://godbolt.org/z/-gHI4C.

Is it necessary to initialize all the used registers in inline assembly?

I am testing simple inline assembly code using gcc. And I find the result of the following code unexpected:
#include <stdio.h>
int main(void) {
    unsigned x0 = 0, x1 = 1, x2 = 2;
    __asm__ volatile("movl %1, %0;\n\t"
                     "movl %2, %1"
                     : "=r"(x0), "+r"(x1)
                     : "r"(x2)
                     :);
    printf("%u, %u\n", x0, x1);
    return 0;
}
The printed result is 1, 1, rather than the expected 1, 2. Then I compiled the code with the -S option and found that gcc generated the code as
movl %eax, %edx;
movl %edx, %eax;
%0 and %2 are using the same register. Why?
I want gcc to generate, say,
movl %eax, %edx;
movl %ecx, %eax;
If I add "0"(x1) to the input constraints, gcc will generate the code above. Does it mean that all registers need to be initialized before being used in inline assembly?
Moving my comment to an 'Answer' so this question can be closed.
To prevent the compiler from re-using a register for both an input and an output, you can use the early-clobber constraint (for example "=&r"(x)), which informs the compiler that the register associated with the parameter is written before the instruction is finished using the input operands.
Sharing a register can be a good thing (since it reduces the number of registers that must be made available before calling your asm), but it can also cause problems (as you have seen). So, either make sure you have finished using all the inputs before writing to the output, or use & to tell the compiler not to do this optimization.
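Applied to the code in the question, that looks something like this (a sketch; only the "=&r" early clobber differs from the original):
#include <stdio.h>
int main(void) {
    unsigned x0 = 0, x1 = 1, x2 = 2;
    __asm__ volatile("movl %1, %0\n\t"
                     "movl %2, %1"
                     : "=&r"(x0), "+r"(x1)   // "&": %0 gets its own register
                     : "r"(x2));
    printf("%u, %u\n", x0, x1);              // now prints 1, 2
    return 0;
}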
For completeness, let me also point out that using inline asm is usually a bad idea.

Use both SSE2 intrinsics and gcc inline assembler

I have tried to mix SSE2 intrinsics and inline assembler in gcc. But if I specify a variable as the xmm0 register as input, then in some cases I get a compiler error. Example:
#include <emmintrin.h>
int main() {
    __m128i test = _mm_setzero_si128();
    asm ("pxor %%xmm0, %%xmm0" : : "xmm0" (test) : );
}
When compiled with gcc version 4.6.1 I get:
>gcc asm_xmm.c
asm_xmm.c: In function ‘main’:
asm_xmm.c:10:3: error: matching constraint references invalid operand number
asm_xmm.c:7:5: error: matching constraint references invalid operand number
The strange thing is that in some cases, where I have other input variables/registers, it suddenly works with xmm0 as input but not xmm1, etc. And in another case I was able to specify xmm0-xmm4 but not above. A little confused/frustrated about this :S
Thanks :)
You should let the compiler do the register assignment. Here's an example of pshufb (for gcc too old to have tmmintrin for SSSE3):
static inline __m128i __attribute__((always_inline))
_mm_shuffle_epi8(__m128i xmm, __m128i xmm_shuf)
{
__asm__("pshufb %1, %0" : "+x" (xmm) : "xm" (xmm_shuf));
return xmm;
}
Note the "x" qualifier on the arguments and simply %0 in the assembly itself, where the compiler will substitute in the register it selected.
Be careful to use the right modifiers. "+x" means xmm is both an input and an output parameter. If you are sloppy with these modifiers (eg using "=x" meaning output only when you needed "+x") you will run into cases where it sometimes works and sometimes doesn't.
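For example, a read-modify-write instruction like paddd needs "+x" on its destination operand. A hedged sketch in the same style, not from the original answer (requires <emmintrin.h>, as in the question above):
static inline __m128i __attribute__((always_inline))
my_add_epi32(__m128i a, __m128i b)
{
    // paddd both reads and writes %0, so "a" must be "+x" (input and output).
    // With "=x" (output only) the compiler could hand us an uninitialized register.
    __asm__("paddd %1, %0" : "+x" (a) : "xm" (b));
    return a;
}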
