Use both SSE2 intrinsics and gcc inline assembler - gcc

I have tried to mix SSE2 intrinsics and inline assembler in gcc. But if I specify a variable as xmm0/register as input then in some cases I get a compiler error. Example:
#include <emmintrin.h>
int main() {
__m128i test = _mm_setzero_si128();
asm ("pxor %%xmm0, %%xmm0" : : "xmm0" (test) : );
}
When compiled with gcc version 4.6.1 I get:
>gcc asm_xmm.c
asm_xmm.c: In function ‘main’:
asm_xmm.c:10:3: error: matching constraint references invalid operand number
asm_xmm.c:7:5: error: matching constraint references invalid operand number
The strange thing is that in same cases where I have other input variables/registers then it suddenly works with xmm0 as input but not xmm1, etc. And in another case I was able to specify xmm0-xmm4 but not above. A little confused/frustrated about this :S
Thanks :)

You should let the compiler do the register assignment. Here's an example of pshufb (for gcc too old to have tmmintrin for SSSE3):
static inline __m128i __attribute__((always_inline))
_mm_shuffle_epi8(__m128i xmm, __m128i xmm_shuf)
{
__asm__("pshufb %1, %0" : "+x" (xmm) : "xm" (xmm_shuf));
return xmm;
}
Note the "x" qualifier on the arguments and simply %0 in the assembly itself, where the compiler will substitute in the register it selected.
Be careful to use the right modifiers. "+x" means xmm is both an input and an output parameter. If you are sloppy with these modifiers (eg using "=x" meaning output only when you needed "+x") you will run into cases where it sometimes works and sometimes doesn't.

Related

Confusion about different clobber description for arm inline assembly

I'm learning ARM inline assembly, and is confused about a very simple function: assign the value of x to y (both are int type), on arm32 and arm64 why different clobber description required?
Here is the code:
#include <arm_neon.h>
#include <stdio.h>
void asm_test()
{
int x = 10;
int y = 0;
#ifdef __aarch64__
asm volatile(
"mov %w[in], %w[out]"
: [out] "=r"(y)
: [in] "r"(x)
: "r0" // r0 not working, but r1 or x1 works
);
#else
asm volattile(
"mov %[in], %[out]"
: [out] "=r"(y)
: [in] "r"(x)
: "r0" // r0 works, but r1 not working
);
#endif
printf("y is %d\n", y);
}
int main() {
arm_test();
return 0;
}
Tested on my rooted android phone, for arm32, r0 generates correct result but r1 won't. For arm64, r1 or x1 generate correct result, and r0 won't. Why on arm32 and arm64 they are different? What is the concrete rule for this and where can I find it?
ARM / AArch64 syntax is mov dst, src
Your asm statement only works if the compiler happens to pick the same register for both "=r" output and "r" input (or something like that, given extra copies of x floating around).
Different clobbers simply perturb the compiler's register-allocation choices. Look at the generated asm (gcc -S or on https://godbolt.org/, especially with -fverbose-asm.)
Undefined Behaviour from getting the constraints mismatched with the instructions in the template string can still happen to work; never assume that an asm statement is correct just because it works with one set of compiler options and surrounding code.
BTW, x86 AT&T syntax does use mov src, dst, and many GNU C inline-asm examples / tutorials are written for that. Assembly language is specific to the ISA and the toolchain, but a lot of architectures have an instruction called mov. Seeing a mov does not mean this is an ARM example.
Also, you don't actually need a mov instruction to use inline asm to copy a valid. Just tell the compiler you want the input to be in the same register it picks for the output, whatever that happens to be:
// not volatile: has no side effects and produces the same output if the input is the same; i.e. the output is a pure function of the input.
asm (""
: "=r"(output) // pick any register
: "0"(input) // pick the same register as operand 0
: // no clobbers
);

Inline assembly multiplication "undefined reference" on inputs

Trying to multiply 400 by 2 with inline assembly, using the fact imul implicity multiplies by eax. However, i'm getting "undefined reference" compile errors to $1 and $2
int c;
int a = 400;
int b = 2;
__asm__(
".intel_syntax;"
"mov eax, $1;"
"mov ebx, $2;"
"imul %0, ebx;"
".att_syntax;"
: "=r"(c)
: "r" (a), "r" (b)
: "eax");
std::cout << c << std::endl;
Do not use fixed registers in inline asm, especially if you have not listed them as clobbers and have not made sure inputs or outputs don't overlap them. (This part is basically a duplicate of segmentation fault(core dumped) error while using inline assembly)
Do not switch syntax in inline assembly as the compiler will substitute wrong syntax. Use -masm=intel if you want intel syntax.
To reference arguments in an asm template string use % not $ prefix. There's nothing special about $1; it gets treated as a symbol name just like if you'd used my_extern_int_var. When linking, the linker doesn't find a definition for a $1 symbol.
Do not mov stuff around unnecessarily. Also remember that just because something seems to work in a certain environment, that doesn't guarantee it's correct and will work everywhere every time. Doubly so for inline asm. You have to be careful. Anyway, a fixed version could look like:
__asm__(
"imul %0, %1"
: "=r"(c)
: "r" (a), "0" (b)
: );
Has to be compiled using -masm=intel. Notice b has been put into the same register as c.
using the fact imul implicity multiplies by eax
That's not true for the normal 2-operand form of imul. It works the same as other instructions, doing dst *= src so you can use any register, and not waste uops writing the high half anywhere if you don't even want it.

Using a specific zmm register in inline asm

Can I tell gcc-style inline assembly to put my __m512i variable into a specific zmm register, like zmm31?
Like on targets where the are no specific-register constraints at all (like ARM), use local register variables to get broad constraints to pick a specific register for asm statements. The compiler can still optimize otherwise, because the only documented guaranteed effect of a register-local is for asm inputs/outputs.
The compiler will prefer the specified register even if there's no asm, though. (So you can write code that appears to work but isn't safe in general with stuff like register int ebx asm("ebx"); return ebx;. GCC documentation is what makes a behaviour guaranteed / future-proof, even if current gcc prefers using the specified register strongly enough to waste instructions when the constraint isn't compatible with the specified register, see below.)
Anyway, this use of register-asm local variables is the only thing they're guaranteed to work for:
#include <immintrin.h>
__m512i foo() {
register __m512i z31 asm("zmm31") = _mm512_set1_epi32(123);
register __m512i z30 asm("zmm30");
asm("vmovdqa64 %1, %0 # from inline asm"
: "=v"(z30)
: "v"(z31)
);
return z30;
}
On the Godbolt compiler explorer, compiles to this with clang6.0:
# clang -O3 -march=skylake-avx512
vbroadcastss .LCPI0_0(%rip), %zmm31 # zmm31 = [1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43]
vmovdqa64 %zmm31, %zmm30 # from inline asm
vmovaps %zmm30, %zmm0
retq
and gcc8.2:
# gcc -O3 -march=skylake-avx512
foo():
movl $123, %eax
vpbroadcastd %eax, %zmm31
vmovdqa64 %zmm31, %zmm30 # from inline asm
vmovdqa64 %zmm30, %zmm0
ret
Note the "v" constraints which allow any EVEX vector register (0..31), unlike "x" which only allows the first 16. "x" is documented as "any SSE register", but also applies to AVX YMM registers. https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.
Using "x" for this didn't result in any warnings, but with gcc "x" won vs. the register-variable declaration, so it chose %zmm2 and %zmm1 (strangely not zmm0 so an extra move was required). The register-asm declaration thus did cost us efficiency.
With clang it still used zmm31 and zmm30, apparently violating the "x" constraint, so it would have failed to assemble if you'd used an instruction with no EVEX version on the XMM or YMM part of the register operand, like AVX2 vpcmpeqd ymm,ymm,ymm (compare into vector, not compare into mask). (In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?).
//#ifndef __clang__
__m512i broken_with_clang() {
register __m512i z31 asm("zmm31") = _mm512_set1_epi32(123);
register __m512i z30 asm("zmm30") = _mm512_setzero_si512();
// notice that gcc still inits these in zmm31 and 30, *then* copies
// so register asm costs us efficiency.
// AVX512 only has compares into k registers, not into YMM registers.
asm("vpcmpeqd %t1, %t0, %t0 # from inline asm. input was %0"
: "+x"(z30)
: "x"(z31)
);
return z30;
}
//#endif
With clang we get an error for each operand; I guess clang doesn't support t modifiers to get the YMM name of the register (because it fails with clang6.0 even if I remove the register ... asm() stuff entirely.)
<source>:21:9: error: invalid operand in inline asm: 'vpcmpeqd ${1:t}, ${0:t}, ${0:t} # from inline asm. input was $0'
asm("vpcmpeqd %t1, %t0, %t0 # from inline asm. input was %0"
^
...
<source>:21:9: error: unknown token in expression
<inline asm>:1:11: note: instantiated into assembly here
vpcmpeqd , , # from inline asm. input was %zmm30
But gcc compiles it just fine:
broken_with_clang():
movl $123, %eax
vpbroadcastd %eax, %zmm31
vpxord %xmm30, %xmm30, %xmm30
vmovdqa64 %zmm30, %zmm1 # extra overhead because of register asm
vmovdqa64 %zmm31, %zmm2 # which didn't match the constraints
vpcmpeqd %ymm2, %ymm1, %ymm1 # from inline asm. input was %zmm1
vmovdqa64 %zmm1, %zmm0 # extra overhead because gcc didn't pick zmm0
ret

Can't call fseek with inline assembly

#include "stdio.h"
void fseek(void *, int, int);
main () {
FILE* f = fopen("myfile", "rb");
asm("push 2");
asm("push 0");
asm("push f");
asm("call fseek");
asm("add esp, 12");
}
gcc -masm=intel call.c
call.c:(.text+0x2c): undefined reference to `f'
call.c:(.text+0x31): undefined reference to `fseek'
I have been trying to use AT/T syntax but got the same result.
Well you can not write like this, since there is no grantee that symbol f would exist in the generated assembly -- it's merely a symbol in C.
The solution is to use GCC's extended asm syntax. For example, push f could be rewrited into this:
asm volatile ("pushl %0"
: /* no output operands */
: "m" (f)
: /* no clobbered operands */);
As for the function call fseek, I believed your code shall be alright (at least in my experience and on my laptop it works just now). What's your platform info? Do you have glibc or similar things providing the standard libraries of C?
Also Please notice you're using a weird declaration of fseek since it shall at least have a return value according to the C specification.
Just for your information, you may try this style of an indirect call:
asm volatile ("call *%0"
: /* no output operands */
: "r"(fseek)
: /* no clobbered operands */);

How to suppress "warning: control reaches end of non-void function"

I have some PowerPC assembly code translated with a gcc cross compiler with this function:
uint32_t fill_cache(void)
{
__asm__ ("addi 3, 0, 0\n"); /* R3 = 0 */
/* More asm here modifying R3 and filling the cache lines. */
}
which, under the PowerPC EABI, returns the value computed in R3. When compiling I get
foo.c:105: warning: control reaches end of non-void function
Is there a way to teach gcc that a value is actually returned? Or is there a way to suppress the warning (without removing -Wall or adding -Wno-*)? I would like to very selectively suppress this warning for only this function in order to leave the general warning level as high as possible.
It is not an option to make this function return void since the value computed is required by the caller.
Solution 1: with diagnostic pragmas you can locally suppress certain diagnostic checks. The specific option (which also is implied by -Wall) that complains for no return in a non-void function is -Wreturn-type. So the specific code to suppress the warning is:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wreturn-type"
/* Your code here */
#pragma GCC diagnostic pop
You can find out which option is causing the warning by compiling with -fdiagnostics-show-option. It will simply append the option to the warning message.
Solution 2: define a register variable and put it in the desired register. Refer to the variable in an inline assembler template, with the resulting code:
uint32_t fill_cache(void)
{
register uint32_t cacheVal __asm__ ("r3");
__asm__ __volatile__ ("addi %0, 0, 0" : "=r" (cacheVal));
/* More code here */
return cacheVal;
}
The volatile modifier is to ensure that the instruction is not removed or in some other way affected undesirably by the optimization strategy.
Solution 2 is preferred for at least two reasons:
The value of a no returning non-void function is undefined as far as the standard is concerned.
There's no risk of suppressing (new) diagnostic warnings there was no intention to suppress in the first place.
Function could be declared as naked, in this case compiler would not generate prolog & epilog and would assume that programmer preserves all necessary registers and puts output value into correct register(s) before return.
uint32_t fill_cache(void) __attribute__((naked)); // Declaration
// attribute should be specified in declaration not in implementation
uint32_t fill_cache(void)
{
__asm__ ("addi 3, 0, 0\n"); /* R3 = 0 */
/* More asm here modifying R3 and filling the cache lines. */
}
A bit late but maybe someone will step in this as well :)
PS: For my best knowledge __asm__ as well as __volatile__ are std=c89 syntax. Practically there is not difference between __asm__ & asm in GNU GCC. But the modern approach is underscoreless style: asm volatile.
asm_language

Resources