In all flavors of GCC, local variables that don't fit into registers are stored on the stack. For accessing them, one uses constructs like [ESP+n] or [EBP-n], where n might involve an offset within the variable.
When passing such variables to GCC inline assembly as operands, a spare register is used to store the calculated address. Is there a way to designate operands as "the base register of this variable" and/or "the offset of this variable relative to the base register"?
If you do something like
int stackvar;
...
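asm ("..." : : "r"(stackvar));
you force GCC to load stackvar into a register. If you also allow the m constraint, you don't: GCC may then pass the variable as a memory operand such as [ESP+n], with no spare register needed:
int stackvar;
...
asm ("..." : : "rm"(stackvar));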
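A minimal sketch of the difference, assuming an x86 target and AT&T syntax. The name add_local is hypothetical, and the non-x86 branch is only there so the snippet still builds on other targets. With "rm", GCC is free to substitute a stack slot directly into the instruction instead of first copying it into a register:

```c
#include <stdint.h>

/* Hypothetical helper: adds a local (possibly stack-resident) value to acc. */
static inline uint32_t add_local(uint32_t acc, uint32_t stackvar)
{
#if defined(__i386__) || defined(__x86_64__)
    /* "rm": the compiler may pick a register OR a memory operand for src. */
    __asm__ ("addl %[src], %[dst]"
             : [dst] "+r" (acc)        /* read-write register operand */
             : [src] "rm" (stackvar)
             : "cc");                  /* add modifies the flags */
#else
    acc += stackvar;                   /* portable fallback, not inline asm */
#endif
    return acc;
}
```

Compiling with gcc -S shows whether %[src] was filled in with a register name or with a stack address.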
While learning gcc inline assembly I was playing a bit with memory access. I'm trying to read a value from an array using a value from a different array as index.
Both arrays are initialized to something.
Initialization:
uint8_t* index = (uint8_t*)malloc(256);
memset(index, 33, 256);
uint8_t* data = (uint8_t*)malloc(256);
memset(data, 44, 256);
Array access:
unsigned char read(void *index,void *data) {
unsigned char value;
asm __volatile__ (
" movzb (%1), %%edx\n"
" movzb (%2, %%edx), %%eax\n"
: "=r" (value)
: "c" (index), "c" (data)
: "%eax", "%edx");
return value;
}
This is how I use the function:
unsigned char value = read(index, data);
Now I would expect it to return 44, but it actually returns some random value. Am I reading from uninitialized memory? Also, I'm not sure how to tell the compiler that it should assign the value from eax to the variable value.
You told the compiler you were going to put the output in %0, and it could pick any register for that "=r". But instead you never write %0 in your template.
And you use two temporaries for no apparent reason when you could have used %0 as the temporary.
As usual, you can debug your inline asm by adding comments like # 0 = %0 and looking at the compiler's asm output (not disassembly, just gcc -S to see what it fills in), e.g. # 0 = %ecx. (You didn't use an early-clobber "=&r", so it can pick the same register as one of the inputs.)
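A small sketch of that debugging trick, assuming an x86 target where # starts an assembler comment. The function name peek_operands is made up for illustration. The template contains nothing but a comment, so the asm statement emits no instructions and only shows which registers were chosen:

```c
/* Hypothetical helper: the asm body is only a comment, so it changes nothing. */
static inline unsigned long peek_operands(unsigned long a, unsigned long b)
{
    unsigned long out = a + b;
#if defined(__i386__) || defined(__x86_64__)
    /* In the gcc -S output this appears as e.g. "# out = %rax, a = %rdi, ..." */
    __asm__ ("# out = %0, a = %1, b = %2"
             : "+r" (out)
             : "r" (a), "r" (b));
#endif
    return out;
}
```

Build with gcc -S -O2 and search the .s file for the comment text to see the substitutions.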
Also, this has 2 other bugs:
It doesn't compile. Requesting 2 different operands in ECX with "c" constraints can't work unless the compiler can prove at compile-time that they have the same value so %1 and %2 can be the same register. https://godbolt.org/z/LgR4xS
You dereference pointer inputs without telling the compiler you're reading the pointed-to memory. Use a "memory" clobber or dummy memory operands. See: How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
Or better https://gcc.gnu.org/wiki/DontUseInlineAsm because it's useless for this; just let GCC emit the movzb loads itself. unsigned char* is safe from strict-aliasing UB so you can safely cast any pointer to unsigned char* and dereference it, without even having to use memcpy or other hacks to fight against language rules for wider unaligned or type-punned accesses.
But if you insist on inline asm, read manuals and tutorials, links at https://stackoverflow.com/tags/inline-assembly/info. You can't just throw code at the wall until it sticks with inline asm: you must understand why your code is safe to have any hope of it being safe. There are many ways for inline asm to happen to work but actually be broken, or be waiting to break with different surrounding code.
This is a safe and not totally terrible version (other than the unavoidable optimization-defeating parts of inline asm). You do still want a movzbl load for both loads, even though the return value is only 8 bits. movzbl is the natural efficient way to load a byte, replacing instead of merging with the old contents of a full register.
#include <stdint.h> // for uintptr_t
unsigned char read(void *index, void *data)
{
uintptr_t value;
asm (
" movzb (%[idx]), %k[out] \n\t"
" movzb (%[arr], %[out]), %k[out]\n"
: [out] "=&r" (value) // early-clobber output
: [idx] "r" (index), [arr] "r" (data)
: "memory" // we deref some inputs as pointers
);
return value;
}
Note the early-clobber on the output: this stops gcc from picking the same register for output as one of the inputs. It would be safe for it to destroy the [idx] register with the first load, but I don't know how to tell GCC that in one asm statement. You could split your asm statement into two separate ones, each with their own input and output operands, connecting the output of the first to the input of the 2nd via a local variable. Then neither one would need early-clobber because they're just wrapping single instructions like GNU C inline asm syntax is designed to do nicely.
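The two-statement variant could look roughly like this. It is a sketch, assuming x86 and AT&T syntax; read_two_step and the operand names are hypothetical, and the #else branch is a plain C fallback for other targets. Each statement wraps one instruction, so a plain "=r" output is enough:

```c
#include <stdint.h>

static inline unsigned char read_two_step(const void *index, const void *data)
{
#if defined(__i386__) || defined(__x86_64__)
    uintptr_t idx, value;
    /* First load: the output may share a register with the pointer input. */
    __asm__ ("movzbl (%[ptr]), %k[out]"
             : [out] "=r" (idx)
             : [ptr] "r" (index)
             : "memory");              /* we dereference an input pointer */
    /* Second load: idx connects the two statements via a local variable. */
    __asm__ ("movzbl (%[arr], %[i]), %k[out]"
             : [out] "=r" (value)
             : [arr] "r" (data), [i] "r" (idx)
             : "memory");
    return (unsigned char)value;
#else
    const unsigned char *ip = index, *dp = data;
    return dp[*ip];                    /* portable fallback */
#endif
}
```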
On Godbolt, with a test caller, you can see how it inlines / optimizes when called twice, with i386 clang and x86-64 gcc. For example, asking for index in a register forces an LEA, instead of letting the compiler see the deref and pick an addressing mode for *index. There's also an extra movzbl %al, %eax done by the compiler when adding to an unsigned sum, because we used a narrow return type.
I used uintptr_t value so this can compile for 32-bit and 64-bit x86. There's no harm in making the output from the asm statement wider than the return value of the function, and that saves us from having to use size modifiers like movzbl (%1), %k0 to get GCC to print the 32-bit register name (like EAX) if it chose AL for an 8-bit output variable, for example.
I did decide to actually use %k[out] for the benefit of 64-bit mode: we want movzbl (%rdi), %eax, not movzb (%rdi), %rax (wasting a REX prefix).
You might as well declare the function to return unsigned int or uintptr_t, though, so the compiler knows that it doesn't have to redo zero-extension. OTOH sometimes it can help the compiler to know that the value-range is only 0..255. You could tell it that you produce a correctly zero-extended value using if(retval>255) __builtin_unreachable() or something. Or you could just not use inline asm.
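A sketch of that range hint, assuming GCC or clang (__builtin_unreachable is a GNU extension, not ISO C). The wrapper name assume_byte_range is hypothetical. Passing a value above 255 would be undefined behaviour, so this is only valid when the producer really guarantees the range:

```c
/* Hypothetical wrapper: promises the optimizer that retval is in 0..255. */
static inline unsigned assume_byte_range(unsigned retval)
{
    if (retval > 255)
        __builtin_unreachable();   /* tells the compiler this path never runs */
    return retval;
}
```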
You don't need asm volatile. (Assuming you want to let it optimize away if the result is unused, or be hoisted out of loops for constant inputs). You only need a "memory" clobber so if it does get used, the compiler knows that it reads memory.
(A "memory" clobber counts as all memory being an input, and all memory being an output. So it can't CSE, e.g. hoist out of a loop, because as far as the compiler knows one invocation might read something a previous one wrote. So in practice a "memory" clobber is about as bad as asm volatile. Even two back-to-back calls to this function without touching the input array force the compiler to emit the instructions twice.)
You could avoid this with dummy memory-input operands so the compiler knows this asm block doesn't modify memory, only read it. But if you actually care about efficiency, you shouldn't be using inline asm for this.
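For illustration, a version with dummy memory inputs might look like the following. This is a sketch under assumptions: x86 with AT&T syntax, and a table of exactly 256 bytes (the array cast states that size to the compiler). read_dummy_mem is a hypothetical name, and the #else branch is a plain C fallback:

```c
#include <stdint.h>

static inline unsigned char read_dummy_mem(const unsigned char *index,
                                           const unsigned char *data)
{
#if defined(__i386__) || defined(__x86_64__)
    uintptr_t value;
    __asm__ ("movzbl (%[idx]), %k[out]\n\t"
             "movzbl (%[arr], %[out]), %k[out]"
             : [out] "=&r" (value)                /* early-clobber output */
             : [idx] "r" (index), [arr] "r" (data),
               "m" (*index),                      /* dummy: asm reads *index */
               "m" (*(const unsigned char (*)[256]) data) /* dummy: reads data[0..255] */
             );                                   /* no "memory" clobber needed */
    return (unsigned char)value;
#else
    return data[*index];                          /* portable fallback */
#endif
}
```

The dummy operands are never referenced in the template; they only tell the compiler which memory the statement reads.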
But like I said there is zero reason to use inline asm:
This will do exactly the same thing in 100% portable and safe ISO C:
// safe from strict-aliasing violations
// because unsigned char* can alias anything
inline
unsigned char read(void *index, void *data) {
unsigned idx = *(unsigned char*)index;
unsigned char * dp = data;
return dp[idx];
}
You could cast one or both pointers to volatile unsigned char* if you insist on the access happening every time and not being optimized away.
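A minimal sketch of the volatile variant (read_volatile is a hypothetical name). Both loads go through volatile-qualified pointers, so the compiler has to perform them on every call:

```c
/* Both accesses are volatile, so neither load can be optimized away or merged. */
static inline unsigned char read_volatile(const void *index, const void *data)
{
    unsigned idx = *(const volatile unsigned char *)index;
    const volatile unsigned char *dp = data;
    return dp[idx];
}
```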
Or maybe even to atomic<unsigned char> * depending on what you're doing. (That's a hack, prefer C++20 atomic_ref to atomically load/store on objects that are normally not atomic.)
How to force GCC to pass 128bits/256bits struct as function param in xmm/ymm register?
ie. if my struct is 256bits wide (UnsignedLongLongStruct below)
(I know if I use intrinsics to make a packed integer, gcc is smart enough to put it into %ymm register, but can I do it with struct ?)
typedef struct {
unsigned long long ull1;
unsigned long long ull2;
unsigned long long ull3;
unsigned long long ull4;
} UnsignedLongLongStruct;
void func1( UnsignedLongLongStruct unsignedLongLongStruct ) {
....
}
TL;DR: It seems the calling conventions explicitly mention __m256 and friends to be placed in the ymm regs.
In X86-64 System V ABI, point 3.2.3, you can check how parameters are passed. My reading is that only __m256 arguments will be turned into one SSE and 3 SSEUP 8-byte chunks, which allows them to be passed in a ymm register.
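A plain struct of four unsigned long long does not get that classification: it is larger than two eightbytes and is not one SSE chunk followed by SSEUP chunks, so it is classified MEMORY and passed in memory, which is what we see in clang, gcc, and icc: Test program on godbolt
In order to pass it in a register, as I read the calling conventions, it seems that you have to pass it as a __m256 (or a variant of it).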
The calling conventions are a bit of a mess across different platforms and compilers. You should pass the input to your function by value as an __m256.
If it's a trivial function and you want to ensure GCC inlines it, you could declare it with the always_inline attribute to avoid any unnecessary loads/stores:
inline __attribute__((always_inline)) __m256 foo(__m256 const input);
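One way to get from the struct to a vector-typed value is sketched below. It uses a GNU C generic vector type instead of __m256i so that it builds without <immintrin.h>; that substitution, and the names u64x4, to_vector and sum_lanes, are assumptions for illustration. Whether the value actually travels in a ymm register depends on AVX being enabled (e.g. -mavx) and on the call not being inlined away:

```c
#include <string.h>

typedef struct {
    unsigned long long ull1, ull2, ull3, ull4;
} UnsignedLongLongStruct;

/* Assumption: GNU C vector extension, 4 x 64-bit lanes = 256 bits. */
typedef unsigned long long u64x4 __attribute__((vector_size(32)));

/* Copy the struct's bytes into a vector value (memcpy avoids aliasing issues). */
static inline __attribute__((always_inline))
u64x4 to_vector(const UnsignedLongLongStruct *s)
{
    u64x4 v;
    memcpy(&v, s, sizeof v);
    return v;
}

/* Takes the 256-bit value by value, as the answer suggests for __m256. */
static inline __attribute__((always_inline))
unsigned long long sum_lanes(u64x4 v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```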
For the following statement inside function func(), I'm trying to figure out the variable name (which is 'dictionary' in the example) that points to the malloc'ed memory region.
void func() {
uint64_t * dictionary = (uint64_t *) malloc ( sizeof(uint64_t) * 128 );
}
The instrumented malloc() can record the start address and size of the allocation. However, it has no knowledge of the variable 'dictionary' that the result will be assigned to. Are there any compiler features that can help solve this problem, without modifying the compiler to instrument such assignment statements?
One approach I've been thinking about is to exploit the fact that the variable 'dictionary' and the function 'malloc' are on the same source line or next to each other, since DWARF provides line information.
One thing you can do with Clang and LLVM is emit the code with debug information and then look for malloc calls. These will be assigned to LLVM values, which can be traced (when not compiled with optimizations, that is) to the original C/C++ source code via the debug information metadata.
I'm programming for Windows in assembly in NASM, and I found this in the code:
extern _ExitProcess@4
;Rest of code...
; ...
call _ExitProcess@4
What does the @4 mean in the declaration and call of a winapi library function?
The winapi uses the __stdcall calling convention. The caller pushes all the arguments on the stack from right to left, the callee pops them again to cleanup the stack, typically with a RET n instruction.
It is the antipode of the __cdecl calling convention, the common default in C and C++ code where the caller cleans up the stack, typically with an ADD ESP,n instruction after the CALL. The advantage of __stdcall is that it generates more compact code: just one cleanup instruction in the called function instead of one after every call to the function. But it has one big disadvantage: it is dangerous.
The danger lurks in the calling code having been compiled with an out-dated declaration of the function, typically when the function was changed by adding an argument, for example. This ends very poorly: beyond the function trying to use an argument that is not available, the new function pops too many arguments off the stack. This imbalances the stack, causing not just the callee to fail but the caller as well. Extremely hard to diagnose.
So they did something about that, they decorated the name of the function. First with a leading _underscore, as is done for __cdecl functions. And appended @n, where n is the operand of the RET instruction at the end of the function. Or in other words, the number of bytes taken by the arguments on the stack.
This provides a linker diagnostic when there's a mismatch, a change in a foo(int) function to foo(int, int) for example generates the name _foo@8. The calling code not yet recompiled will look for a _foo@4 function. The linker fails, it cannot find that symbol. Disaster avoided.
The name decoration scheme for C is documented at Format of a C Decorated Name. A decorated name containing an @ character is used for the __stdcall calling convention:
__stdcall: Leading underscore (_) and a trailing at sign (@) followed by a number representing the number of bytes in the parameter list
Tools like Dependency Walker are capable of displaying both decorated and undecorated names.
Unofficial documentation can be found here: Name Decoration
It's a name decoration specifying the total size of the function's arguments:
The name is followed by the at sign (@) followed by the number of bytes (in decimal) in the argument list.
(source)
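The naming rule itself is simple enough to write down as code. The helper below is purely illustrative (stdcall_decorated_name is a hypothetical name, not a toolchain API): it builds the 32-bit __stdcall decorated name from the undecorated name and the number of argument bytes, using the at sign (@) as the separator.

```c
#include <stdio.h>

/* Hypothetical helper: writes "_<name>@<arg_bytes>" into buf.
   Returns what snprintf returns: the length of the full decorated name. */
static int stdcall_decorated_name(char *buf, size_t size,
                                  const char *name, unsigned arg_bytes)
{
    return snprintf(buf, size, "_%s@%u", name, arg_bytes);
}
```

ExitProcess takes a single 4-byte argument, so the helper yields _ExitProcess@4; a foo(int, int) with two 4-byte arguments yields _foo@8.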
Output registers in inline assembly must be declared with the "=" constraint modifier, meaning "write-only" [1]. What exactly does this mean - is it truly forbidden to read and modify them within the assembly? For example, consider this code:
uint8_t one ()
{
uint8_t res;
asm("ldi %[res],0\n"
"inc %[res]\n"
: [res] "=r" (res)
);
return res;
}
The assembly sets the output register to 0 then increments it. Is this breaking the "write-only" constraint?
UPDATE
I'm seeing problems where my inline asm breaks when I change it to work directly on an output register, as opposed to using r16 for the computation and finally mov'ing r16 into the output register. The code is here: http://ideone.com/JTpYma . It prints results to serial, you just need to define F_CPU and BAUD. The problem appears only with gcc-4.8.0, not with gcc-4.7.2.
[1] http://www.nongnu.org/avr-libc/user-manual/inline_asm.html
The compiler doesn't care whether you read it or not, it just won't put the initial value of the variable into the register. Your example is entirely legal, but people often wrongly expect to get the result 2 from this code:
uint8_t one ()
{
uint8_t res = 1;
asm("inc %[res]\n"
: [res] "=r" (res)
);
return res;
}
Since it's only an output constraint, the initial value of res is not guaranteed to be loaded into the register. In fact, the initializer may even be optimized away on the assumption that the asm block will overwrite it anyway. The above code is compiled to this by my version of avr-gcc:
inc r24
ret
As you can see, the compiler indeed removed the load of 1 into res (and hence into r24), so the result is undefined.
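If the asm is meant to read the variable's previous value, the operand has to be declared read-write with "+r" instead of "=r". The sketch below shows this on x86 rather than AVR so that it can be tried on a desktop machine; that switch of target and the name increment are assumptions for illustration. On AVR only the constraint would change, to [res] "+r" (res), with the inc %[res] template kept as it is.

```c
/* "+r": the register holds res on entry AND receives the result. */
static inline unsigned increment(unsigned res)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("incl %[res]"
             : [res] "+r" (res)
             :
             : "cc");              /* inc modifies the flags */
#else
    res++;                         /* portable fallback, not inline asm */
#endif
    return res;
}
```

With "+r" the initializer can no longer be dropped, so increment(1) gives 2 as expected.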
Update
The problem with the updated program in the question is that it also has an input register operand. By default the compiler assumes that all inputs are consumed before the outputs are assigned so it's safe to allocate overlapping registers. That's clearly not the case for your example. You should use an "early clobber" modifier (&) for the output. This is what the manual has to say about that:
& Means (in a particular alternative) that this operand is an
earlyclobber operand, which is modified before the instruction is
finished using the input operands. Therefore, this operand may not lie
in a register that is used as an input operand or as part of any
memory address.
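A sketch of a case where the modifier matters, again on x86 for convenience (add_early is a hypothetical name). The output is written by the first instruction while input b is still needed by the second, so without & the compiler would be allowed to put out and b in the same register:

```c
static inline unsigned add_early(unsigned a, unsigned b)
{
    unsigned out;
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("movl %[a], %[out]\n\t"    /* writes out before b has been read */
             "addl %[b], %[out]"
             : [out] "=&r" (out)        /* early-clobber: never overlaps a or b */
             : [a] "r" (a), [b] "r" (b)
             : "cc");
#else
    out = a + b;                        /* portable fallback, not inline asm */
#endif
    return out;
}
```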
Nobody said gcc inline asm was easy :D