Inline assembly multiplication "undefined reference" on inputs - gcc

Trying to multiply 400 by 2 with inline assembly, using the fact that imul implicitly multiplies by eax. However, I'm getting "undefined reference" compile errors for $1 and $2:
int c;
int a = 400;
int b = 2;
__asm__(
".intel_syntax;"
"mov eax, $1;"
"mov ebx, $2;"
"imul %0, ebx;"
".att_syntax;"
: "=r"(c)
: "r" (a), "r" (b)
: "eax");
std::cout << c << std::endl;

Do not use fixed registers in inline asm, especially if you have not listed them as clobbers and have not made sure inputs or outputs don't overlap them. (This part is basically a duplicate of segmentation fault(core dumped) error while using inline assembly)
Do not switch syntax dialects inside inline assembly, as the compiler will substitute operands in the wrong syntax (and the compiler-generated code around your asm will also be in the wrong syntax). Use -masm=intel if you want Intel syntax.
To reference arguments in an asm template string use % not $ prefix. There's nothing special about $1; it gets treated as a symbol name just like if you'd used my_extern_int_var. When linking, the linker doesn't find a definition for a $1 symbol.
Do not mov stuff around unnecessarily. Also remember that just because something seems to work in a certain environment, that doesn't guarantee it's correct and will work everywhere every time. Doubly so for inline asm. You have to be careful. Anyway, a fixed version could look like:
__asm__(
"imul %0, %1"
: "=r"(c)
: "r" (a), "0" (b)
: );
Has to be compiled using -masm=intel. Notice b has been put into the same register as c.
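If you'd rather not build with -masm=intel, a minimal sketch of the same thing in the default AT&T syntax (where the operand order is src, dst) would be:
__asm__(
"imul %1, %0"   // AT&T order: dst *= src, so c (preloaded with b) *= a
: "=r"(c)
: "r" (a), "0" (b)
: );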
using the fact that imul implicitly multiplies by eax
That's not true for the normal 2-operand form of imul. It works the same as other instructions, doing dst *= src so you can use any register, and not waste uops writing the high half anywhere if you don't even want it.

Related

Confusion about different clobber description for arm inline assembly

I'm learning ARM inline assembly and am confused about a very simple function that assigns the value of x to y (both are int): why is a different clobber description required on arm32 and arm64?
Here is the code:
#include <arm_neon.h>
#include <stdio.h>
void asm_test()
{
int x = 10;
int y = 0;
#ifdef __aarch64__
asm volatile(
"mov %w[in], %w[out]"
: [out] "=r"(y)
: [in] "r"(x)
: "r0" // r0 not working, but r1 or x1 works
);
#else
asm volatile(
"mov %[in], %[out]"
: [out] "=r"(y)
: [in] "r"(x)
: "r0" // r0 works, but r1 not working
);
#endif
printf("y is %d\n", y);
}
int main() {
asm_test();
return 0;
}
Tested on my rooted Android phone: for arm32, r0 generates a correct result but r1 won't. For arm64, r1 or x1 generates a correct result, and r0 won't. Why are arm32 and arm64 different here? What is the concrete rule for this, and where can I find it?
ARM / AArch64 syntax is mov dst, src
Your asm statement only works if the compiler happens to pick the same register for both "=r" output and "r" input (or something like that, given extra copies of x floating around).
Different clobbers simply perturb the compiler's register-allocation choices. Look at the generated asm (gcc -S or on https://godbolt.org/, especially with -fverbose-asm.)
Undefined Behaviour from getting the constraints mismatched with the instructions in the template string can still happen to work; never assume that an asm statement is correct just because it works with one set of compiler options and surrounding code.
BTW, x86 AT&T syntax does use mov src, dst, and many GNU C inline-asm examples / tutorials are written for that. Assembly language is specific to the ISA and the toolchain, but a lot of architectures have an instruction called mov. Seeing a mov does not mean this is an ARM example.
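For reference, a sketch of the question's AArch64 statement with the operands in the correct order (destination first); the arm32 branch needs the same swap:
asm volatile(
"mov %w[out], %w[in]"   // AArch64: mov dst, src
: [out] "=r"(y)
: [in] "r"(x)
);   // no clobber needed for this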
Also, you don't actually need a mov instruction to use inline asm to copy a value. Just tell the compiler you want the input to be in the same register it picks for the output, whatever that happens to be:
// not volatile: has no side effects and produces the same output if the input is the same; i.e. the output is a pure function of the input.
asm (""
: "=r"(output) // pick any register
: "0"(input) // pick the same register as operand 0
: // no clobbers
);

How to have GCC combine "move r10, r3; store r10" into a "store r3"?

I'm working on Power9 and utilizing the hardware random number generator instruction called DARN. I have the following inline assembly:
uint64_t val;
__asm__ __volatile__ (
"xor 3,3,3 \n" // r3 = 0
"addi 4,3,-1 \n" // r4 = -1, failure
"1: \n"
".byte 0xe6, 0x05, 0x61, 0x7c \n" // r3 = darn 3, 1
"cmpd 3,4 \n" // r3 == -1?
"beq 1b \n" // retry on failure
"mr %0,3 \n" // val = r3
: "=g" (val) : : "r3", "r4", "cc"
);
I had to add a mr %0,3 with "=g" (val) because I could not get GCC to produce expected code with "=r3" (val). Also see Error: matching constraint not valid in output operand.
A disassembly shows:
(gdb) b darn.cpp : 36
(gdb) r v
...
Breakpoint 1, DARN::GenerateBlock (this=<optimized out>,
output=0x7fffffffd990 "\b", size=0x100) at darn.cpp:77
77 DARN64(output+i*8);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.ppc64le libgcc-4.8.5-28.el7_5.1.ppc64le libstdc++-4.8.5-28.el7_5.1.ppc64le
(gdb) disass
Dump of assembler code for function DARN::GenerateBlock(unsigned char*, unsigned long):
...
0x00000000102442b0 <+48>: addi r10,r8,-8
0x00000000102442b4 <+52>: rldicl r10,r10,61,3
0x00000000102442b8 <+56>: addi r10,r10,1
0x00000000102442bc <+60>: mtctr r10
=> 0x00000000102442c0 <+64>: xor r3,r3,r3
0x00000000102442c4 <+68>: addi r4,r3,-1
0x00000000102442c8 <+72>: darn r3,1
0x00000000102442cc <+76>: cmpd r3,r4
0x00000000102442d0 <+80>: beq 0x102442c8 <DARN::GenerateBlock(unsigned char*, unsigned long)+72>
0x00000000102442d4 <+84>: mr r10,r3
0x00000000102442d8 <+88>: stdu r10,8(r9)
Notice GCC faithfully reproduces the:
0x00000000102442d4 <+84>: mr r10,r3
0x00000000102442d8 <+88>: stdu r10,8(r9)
How do I get GCC to fold the two instructions into:
0x00000000102442d8 <+84>: stdu r3,8(r9)
GCC will never remove text that's part of the asm template; it doesn't even parse it other than substituting in for %operand. It's literally just a text substitution before the asm is sent to the assembler.
You have to leave out the mr from your inline asm template, and tell gcc that your output is in r3 (or use a memory-destination output operand, but don't do that). If your inline-asm template ever starts or ends with mov instructions, you're usually doing it wrong.
Use register uint64_t foo asm("r3"); to force "=r"(foo) to pick r3 on platforms that don't have specific-register constraints.
(Despite ISO C++17 removing the register keyword, this GNU extension still works with -std=c++17. You can also use register uint64_t foo __asm__("r3"); if you want to avoid the asm keyword. You probably still need to treat register as a reserved word in source that uses this extension; that's fine. ISO C++ removing it from the base language doesn't force implementations to not use it as part of an extension.)
Or better, don't hard-code a register number. Use an assembler that supports the DARN instruction. (But apparently it's so new that even up-to-date clang lacks it, and you'd only want this inline asm as a fallback for gcc too old to support the __builtin_darn() intrinsic.)
Using these constraints also lets you remove the register-setup instructions: do foo = 0 / bar = -1 in C before the inline asm statement and use "+r"(foo).
But note that darn's output register is write-only. There's no need to zero r3 first. I found a copy of IBM's POWER ISA instruction set manual that is new enough to include darn here: https://wiki.raptorcs.com/w/images/c/cb/PowerISA_public.v3.0B.pdf#page=96
In fact, you don't need to loop inside the asm at all, you can leave that to the C and only wrap the one asm instruction, like inline-asm is designed for.
uint64_t random_asm() {
register uint64_t val asm("r3");
do {
//__asm__ __volatile__ ("darn 3, 1");
__asm__ __volatile__ (".byte 0x7c, 0x61, 0x05, 0xe6 # gcc asm operand = %0\n" : "=r" (val));
} while(val == -1ULL);
return val;
}
compiles cleanly (on the Godbolt compiler explorer) to
random_asm():
.L6: # compiler-generated label, no risk of name clashes
.byte 0x7c, 0x61, 0x05, 0xe6 # gcc asm operand = 3
cmpdi 7,3,-1 # compare-immediate
beq 7,.L6
blr
Just as tight as your loop, with less setup. (Are you sure you even need to zero r3 before the asm instruction?)
This function can inline anywhere you want it to, allowing gcc to emit a store instruction that reads r3 directly.
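For example, a hedged usage sketch (the function and variable names here are illustrative, not from the original code) where inlining lets the compiler store r3 straight to memory:
#include <stdint.h>
#include <stddef.h>
void fill_random(uint64_t *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = random_asm();   // random_asm() inlines, so the store can read r3 directly
}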
In practice, you'll want to use a retry counter, as advised in the manual: if the hardware RNG is broken, it might give you failure forever so you should have a fallback to a PRNG. (Same for x86's rdrand)
Deliver A Random Number (darn) - Programming Note
When the error value is obtained, software is
expected to repeat the operation. If a non-error
value has not been obtained after several attempts,
a software random number generation method
should be used. The recommended number of
attempts may be implementation specific. In the
absence of other guidance, ten attempts should be
adequate.
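A minimal sketch of that advice in C, reusing the .byte encoding from random_asm() above (the fallback function name is hypothetical, not from the answer):
#include <stdint.h>
uint64_t software_prng_fallback(void);   // hypothetical software PRNG, not shown here
uint64_t random_with_retry(void) {
    for (int tries = 0; tries < 10; tries++) {   // the manual suggests ~10 attempts
        register uint64_t val asm("r3");
        __asm__ __volatile__ (".byte 0x7c, 0x61, 0x05, 0xe6  # darn 3, 1" : "=r" (val));
        if (val != -1ULL)                        // all-ones means the hardware RNG failed
            return val;
    }
    return software_prng_fallback();             // give up on the hardware RNG
}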
xor-zeroing is not efficient on most fixed-instruction-width ISAs, because a mov-immediate is just as short so there's no need to detect and special-case an xor. (And thus CPU designs don't spend transistors on it). Moreover, dependency rules for the PPC asm equivalent of C++11 std::memory_order_consume require it to carry a dependency on the input register, so it couldn't be dependency-breaking even if the designers wanted it to. xor-zeroing is only a thing on x86 and maybe a few other variable-width ISAs.
Use li r3, 0 like gcc does for int foo(){return 0;} https://godbolt.org/z/-gHI4C.

How to move a 64bit pointer into the RAX register?

I have the following code in a GNU C program:
void *segment = malloc(1024);
asm volatile("mov $%0, %%rax" : : "r" (segment));
And I get the following error:
Error: illegal immediate register operand %rax
What is wrong with %rax?
While FrankH's points are valid, strictly speaking the cause of this error is the dollar sign. In AT&T syntax, a dollar sign denotes an immediate constant, so "mov $1, %%eax" would work. However, your code generates:
mov $%rax, %rax
$%rax is meaningless and generates an error. This will resolve the error:
void *segment = malloc(1024);
asm volatile("mov %0, %%rax" : : "r" (segment));
Since malloc will return its value in rax, this will (most likely) generate "mov %rax, %rax".
In other words, it will still be meaningless, unsafe and inefficient, but it will compile without error.
Assuming this code is intended to be more than an experiment to teach you something about using asm, you will need to provide more details to get a more useful answer.
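If the real goal is just to have the pointer sitting in rax for a later instruction inside the same asm statement, one hedged alternative sketch is to let a constraint put it there instead of writing the mov yourself ("a" is the eax/rax register class on x86):
void *segment = malloc(1024);
asm volatile("# %0 is the pointer, already in rax here"
             :                   // no outputs
             : "a" (segment));   // "a" constraint = rax on x86-64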

Use both SSE2 intrinsics and gcc inline assembler

I have tried to mix SSE2 intrinsics and inline assembler in gcc, but if I try to specify xmm0 as the register for an input variable, in some cases I get a compiler error. Example:
#include <emmintrin.h>
int main() {
__m128i test = _mm_setzero_si128();
asm ("pxor %%xmm0, %%xmm0" : : "xmm0" (test) : );
}
When compiled with gcc version 4.6.1 I get:
>gcc asm_xmm.c
asm_xmm.c: In function ‘main’:
asm_xmm.c:10:3: error: matching constraint references invalid operand number
asm_xmm.c:7:5: error: matching constraint references invalid operand number
The strange thing is that in some cases where I have other input variables/registers it suddenly works with xmm0 as input but not xmm1, etc. And in another case I was able to specify xmm0-xmm4 but not above. A little confused/frustrated about this :S
Thanks :)
You should let the compiler do the register assignment. Here's an example of pshufb (for gcc too old to have tmmintrin for SSSE3):
static inline __m128i __attribute__((always_inline))
_mm_shuffle_epi8(__m128i xmm, __m128i xmm_shuf)
{
__asm__("pshufb %1, %0" : "+x" (xmm) : "xm" (xmm_shuf));
return xmm;
}
Note the "x" qualifier on the arguments and simply %0 in the assembly itself, where the compiler will substitute in the register it selected.
Be careful to use the right modifiers. "+x" means xmm is both an input and an output operand. If you are sloppy with these modifiers (e.g. using "=x", meaning output-only, when you needed "+x"), you will run into cases where it sometimes works and sometimes doesn't.
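A short usage sketch of the wrapper above (the byte-reversing shuffle mask is just an illustration, not from the original answer):
#include <emmintrin.h>
// Reverse the 16 bytes of v using the inline-asm _mm_shuffle_epi8 wrapper above.
static __m128i reverse_bytes(__m128i v)
{
    // mask[i] = 15 - i, so each result byte i comes from source byte 15 - i
    __m128i mask = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    return _mm_shuffle_epi8(v, mask);
}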

Using C arrays in inline GCC assembly

I'd like to use two arrays passed into a C function, as below, in assembly using a GCC compiler (Xcode on Mac). It has been many years since I've written assembly, so I'm sure this is an easy fix.
The first line here is fine. The second line fails. I'm trying to do the following, A[0] += x[0]*x[0], and I want to do this for many elements in the array with different indices. I'm only showing one here. How do I use a read/write array in the assembly block?
And if there is a better approach to do this, I'm open ears.
inline void ArrayOperation(float A[36], const float x[8])
{
float tmp;
__asm__ ( "fld %1; fld %2; fmul; fstp %0;" : "=r" (tmp) : "r" (x[0]), "r" (x[0]) );
__asm__ ( "fld %1; fld %2; fadd; fstp %0;" : "=r" (A[0]) : "r" (A[0]), "r" (tmp) );
// ...
}
The reason why the code fails is not because of arrays, but because of the way fld and fst instructions work. This is the code you want:
float tmp;
__asm__ ( "flds %1; fld %%st(0); fmulp; " : "=t" (tmp) : "m" (x[0]) );
__asm__ ( "flds %1; fadds %2;" : "=t" (A[0]) : "m" (A[0]), "m" (tmp) );
fld and fst instructions need a memory operand. Also, you need to specify if you want to load float (flds), double (fldl) or long double (fldt). As for the output operands, I just use a constraint =t, which simply tells the compiler that the result is on the top of the register stack, i.e. ST(0).
Arithmetic operations have either no operands (fmulp), or a single memory operand (but then you have to specify the size again, fmuls, fadds etc.).
You can read more about inline assembler, GNU Assembler in general, and see the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
Of course, it is best to get rid of the temporary variable:
__asm__ ( "flds %1; fld %%st(0); fmulp; fadds %2;" : "=t" (A[0]) : "m" (x[0]), "m" (A[0]));
Though if a performance improvement is what you're after, you don't need to use assembler: GCC is completely capable of producing this code on its own. You might also consider using vector SSE instructions and other simple optimization techniques, such as breaking the dependency chains in the calculations; see Agner Fog's optimization manuals.
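For comparison, a sketch of the plain-C version the answer is alluding to, which GCC can compile (and often vectorize at -O2/-O3 with SSE enabled) without any inline asm:
inline void ArrayOperation(float A[36], const float x[8])
{
    A[0] += x[0] * x[0];   // same operation as the asm version
    // ...repeat for the other index combinations as needed
}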