gcc, __atomic_exchange seems to produce non-atomic asm, why?

I am working on a nice tool which requires the atomic swap of two different 64-bit values. On the amd64 architecture this is possible with the XCHG instruction (xchgq in AT&T syntax; see the instruction set reference, warning: it is a long pdf).
Correspondingly, gcc has atomic builtins which should ideally do the same, as documented for example here.
Using these two docs I produced the following simple C function for atomically swapping two 64-bit values:
#include <stdint.h>
typedef uint64_t u64;  /* assumed typedef, not shown in the original */

void theExchange(u64* a, u64* b) {
    __atomic_exchange(a, b, b, __ATOMIC_SEQ_CST);
}
(Btw, it wasn't really clear to me why an "atomic exchange" needs 3 operands.)
It seemed a little fishy to me that the gcc __atomic_exchange builtin takes 3 operands, so I tested its asm output. I compiled with gcc -O6 -masm=intel -S and got the following output:
.LHOTB0:
        .p2align 4,,15
        .globl  theExchange
        .type   theExchange, @function
theExchange:
.LFB16:
        .cfi_startproc
        mov     rax, QWORD PTR [rsi]
        xchg    rax, QWORD PTR [rdi]    /* WTF? */
        mov     QWORD PTR [rsi], rax
        ret
        .cfi_endproc
.LFE16:
        .size   theExchange, .-theExchange
        .section        .text.unlikely
As we can see, the resulting function contains not a single data move but three separate data movements. Thus, as I understand this asm code, this function is not really atomic as a whole.
How is this possible? Maybe I misunderstood some of the docs? I admit, the gcc builtin documentation wasn't really clear to me.

This is the generic version of __atomic_exchange_n (type *ptr, type val, int memorder), where only the exchange operation on ptr is atomic; the reading of val is not. In the generic version, void __atomic_exchange (type *ptr, type *val, type *ret, int memorder), the value is accessed via pointer: it stores *val into *ptr and returns the old *ptr through *ret, which is why three operands are needed, but the atomicity still applies only to *ptr. The pointer interface exists so that the same call works with multiple sizes, when the compiler has to call an external helper:
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in
function if the specific data type size makes that possible;
otherwise, an external call is left to be resolved at run time. This
external call is the same format with the addition of a ‘size_t’
parameter inserted as the first parameter indicating the size of the
object being pointed to. All objects must be the same size.
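For comparison, here is a minimal sketch (my own illustration, using the _n variant) that does the same thing as theExchange above; the exchange on *a is atomic, but the read and write of *b around it are not, which is exactly what the generated mov/xchg/mov sequence reflects:

#include <stdint.h>

typedef uint64_t u64;  /* assumed, as in the question */

void theExchange_n(u64 *a, u64 *b) {
    /* Atomically store *b into *a, returning the old *a ... */
    u64 old = __atomic_exchange_n(a, *b, __ATOMIC_SEQ_CST);
    /* ... then write the old value back to *b, non-atomically. */
    *b = old;
}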

Related

Why does generated assembly mov edi to variable on stack?

I am a newcomer to assembly trying to understand the objdump of the following function:
int nothing(int num) {
    return num;
}
This is the result (linux, x86-64, gcc 8):
push rbp
mov rbp,rsp
mov DWORD PTR [rbp-0x4],edi
mov eax,DWORD PTR [rbp-0x4]
pop rbp
ret
My questions are:
1. Where does edi come from? Reading through some intro docs, I was under the impression that [rbp-0x4] would contain num.
2. From the above, apparently edi contains the argument. But then what role does [rbp-0x4] play? Why not just mov eax, edi?
Thanks!
Where does edi come from?
... From the above, apparently edi contains the argument.
This is the calling convention (the System V AMD64 ABI, used by Linux and many other OSs):
Code for these OSs passes the first integer parameter in rdi. The result (the return value) is passed back in rax.
And because your C compiler makes int 32 bits wide, only the low 32 bits of rdi and rax are used, which are named edi and eax.
(Code for Windows passes the first parameter in rcx instead...)
But then what role does [rbp-0x4] play?
Using rbp here has mainly historic reasons. In 16-bit code (as used on PCs in the 1980s and 1990s) it was not possible to address data on the stack using the sp register (which corresponds to rsp). The only register that allowed easy addressing of values on the stack was bp (corresponding to rbp).
And even in 32- or 64-bit code it is more difficult to write a compiler that addresses local variables (on the stack) using rsp rather than rbp, because rsp moves with every push and pop.
The compiler generates the first 3 assembler instructions before it has analyzed what the C function actually does. It puts the value on the stack because you could write something like address = &num in the code; that is only possible when num is located in memory, not when it is held in a register.
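A hypothetical illustration of that point (the function name is mine):

/* Taking a parameter's address forces it into memory: a register
   has no address, so &num only works if num has a stack slot. */
int nothing_with_address(int num) {
    int *address = &num;
    return *address;
}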
Why not just mov eax, edi?
If you tell the compiler to optimize the code (for example with -O2), it analyzes the content of the C function before generating the first assembler instruction. It will find out that it is not required to put the value into memory.
In this case the code will indeed look like this:
mov eax, edi
ret

Can I get rid of a sign-extend between CTZ and addition to a pointer?

For code such as this:
#include <stdint.h>
char* ptrAdd(char* ptr, uint32_t x)
{
    return ptr + (uint32_t)__builtin_ctz(x);
}
GCC generates a sign-extension: (godbolt link)
xor eax, eax
rep bsf eax, esi
cdqe ; sign-extend eax into rax
add rax, rdi
ret
This is, of course, completely redundant - this is blatantly sign-extending an unsigned integer. Can I convince GCC not to do this?
The problem has existed since GCC 4.9.0; before that, gcc emitted an explicit zero-extension instead, which is equally redundant.
A partial solution is to use the 64-bit version of ctz, along with a -march argument so that tzcnt is used instead of bsf, like so:
char* ptrAdd(char* ptr, uint32_t x)
{
    return ptr + __builtin_ctzl(x);
}
This results in no sign extension:
ptrAdd(char*, unsigned int):
mov eax, esi
tzcnt rax, rax
add rax, rdi
ret
It has a mov (to do the 32 to 64-bit zero extension) which replaced a zeroing xor in the 32-bit version (which was there to work around the tzcnt false-dependency-on-destination issue). Those are about the same cost, but the mov is more likely to disappear after inlining. The result of a 64-bit tzcnt is the same as a 32-bit one, except for the case of zero input which is undefined (as far as the gcc intrinsic goes, not tzcnt).
Unfortunately, without a -march argument that lets the compiler use tzcnt, it will use bsf, and in that case it still emits the sign extension.
It seems that the origin of the differing behavior between bsf and tzcnt is that with bsf the instruction's behavior is undefined for a zero input. So, in principle, the instruction could return anything, even values outside the range 0 to 63 that we would normally expect. Combined with the fact that the return value is declared as int, simply omitting the sign extension could lead to "impossible" situations like (__builtin_ctzl(x) & 0xff) == 0xdeadbeef.
Now per the gcc docs, zero input to __builtin_ctzl has an "undefined result" - but it isn't clear if this is the same as C/C++ "undefined behavior" where anything can happen (which would allow impossible things), or just means "some unspecified value".
You can read about this on the gcc bugzilla, where an issue has been open for about 7 years.
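One workaround that is sometimes suggested (a sketch, and not guaranteed to help on every GCC version) is to assert the result's range explicitly, so the compiler can prove the sign bit is never set:

#include <stdint.h>

char* ptrAddHint(char* ptr, uint32_t x)
{
    unsigned idx = (unsigned)__builtin_ctz(x);
    /* Promise GCC that idx is in 0..31; with that range known,
       any sign/zero extension is provably redundant. */
    if (idx > 31)
        __builtin_unreachable();
    return ptr + idx;
}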

GCC compiled assembly

I am trying to learn assembly language by example, i.e. by compiling simple C files with GCC using the -S option, Intel syntax, and CFI directives disabled (every other free way is extremely confusing).
My C file is literally just int main() {return 0;}, but GCC spits out this:
.file "simpleCTest.c"
.intel_syntax noprefix
.def ___main; .scl 2; .type 32; .endef
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
push ebp
mov ebp, esp
and esp, -16
call ___main
mov eax, 0
leave
ret
.ident "GCC: (GNU) 5.3.0"
My real question is: why does the main function have any processor instructions at all (push ebp, mov ebp, esp, etc.)? Are these even necessary (I guess it is a way of managing data to set up/tear down the program, but I'm not sure)? Why doesn't it just issue a ret statement after the main function? Also, why are there TWO main symbols (_main & ___main)?
To sum it up, why is it not just like this?
.def _main
_main:
    mov eax, 0    ; (for return integer)
    ret
GCC spits out this
This would probably be a bit clearer if you actually had your main function do some things, oddly enough, including calling another function.
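For instance, a hypothetical variant like this exercises both the frame and the call setup that the prologue provides:

#include <stdio.h>

int main(void) {
    int x = 42;          /* a local, addressed relative to ebp   */
    printf("%d\n", x);   /* a call, which wants an aligned stack */
    return 0;
}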
Your compiled code first saves the old frame pointer (push ebp) and then sets up a frame by which to reference its stack variables (mov ebp, esp). This would be used if you had variables that could be referred to as ebp plus a constant, for instance. Then it aligns the stack to a multiple of 16 bytes with the and instruction; that is, it declines to use from 0 to 15 bytes of the provided stack, so that [esp] is aligned to a multiple of 16 bytes. This is important because of the calling conventions in use.
The ending opcode leave is shorthand for mov esp, ebp followed by pop ebp: it copies the saved frame base back over the current stack pointer and then restores the caller's base pointer.
My real question is why does the main function have any processor instructions
It's setting things up for work that you aren't doing (but that nontrivial programs would do), rather than producing the most optimized "return 0" program it could. By keeping a base pointer that is essentially a backup of the original stack pointer, the program is free to refer to local variables as an offset plus the base pointer (including implied things you aren't using, like the argument count, the argument vector, and the environment pointer), and by keeping a stack pointer that is a multiple of 16, it is free to make calls to functions according to its calling standard. (As for the TWO mains: ___main is not a second main but a libgcc initialization hook that this toolchain calls at the start of main to run constructors.)

Some questions about prologue/calling a function gcc intel x86

I don't quite understand the gcc prologue, especially for main.
Why is there the instruction and esp, 0xfffffff0? I know what it does, but why is it necessary?
When we call a function we first have to push the arguments, but why does gcc not use the push instruction, and uses movs instead? Moreover, with those movs it creates empty padding, which looks like a waste of memory. Why so?
Finally, gcc first uses the sub instruction on esp in order to "reserve" memory for the stack, but what makes sure that this memory is not used by another program, for instance?
I think I understood the theory quite well, but I couldn't find a document that explains more about memory in practice (how the memory of several programs doesn't overlap, ...). Thank you for your answers.
PS: I've added the assembly code and the C++ code:
Dump of assembler code for function main(int, char**):
0x08048657 <+0>: push ebp
0x08048658 <+1>: mov ebp,esp
0x0804865a <+3>: and esp,0xfffffff0
0x0804865d <+6>: sub esp,0x20
0x08048660 <+9>: mov DWORD PTR [esp+0x1c],0x3
0x08048668 <+17>: mov BYTE PTR [esp+0x1b],0x61
=> 0x0804866d <+22>: mov DWORD PTR [esp],0x8048771
0x08048674 <+29>: call 0x804863c <p(char*)>
0x08048679 <+34>: mov eax,0x0
0x0804867e <+39>: leave
0x0804867f <+40>: ret
End of assembler dump.
int main(int argc, char *argv[]) {
    int b = 3;
    char c = 'a';
    p("hello woooooooooorld !!");
}
The stack alignment is only done for main, the rest of the functions just keep the alignment required by the ABI.
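A quick hypothetical experiment to confirm this: compile a file containing both main and a helper function (e.g. gcc -m32 -S -masm=intel) and compare the two prologues; only main should contain the realigning and.

/* helper can rely on the 16-byte alignment the ABI guarantees at
   function entry, so its prologue needs no realignment; main
   historically could not trust its entry alignment and realigns
   itself with "and esp, 0xfffffff0". */
void helper(int x) { (void)x; }

int main(void) {
    helper(1);
    return 0;
}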
The compiler uses mov instructions for locals so they can be accessed in any order. For outgoing function arguments you can ask for push instructions using the -mpush-args compiler option, which might produce smaller code.
As for the wasted memory, you probably didn't compile with optimizations enabled (which would of course eliminate your b and c altogether since they are not used ;))
Each process has its own virtual memory address space, so there is no chance of anybody else using the memory allocated from the stack.

How does gcc know the register size to use in inline assembly?

I have the inline assembly code:
#define read_msr(index, buf) asm volatile ("rdmsr" : "=d"(buf[1]), "=a"(buf[0]) : "c"(index))
The code using this macro:
u32 buf[2];
read_msr(0x173, buf);
I found the disassembly is (using gnu toolchain):
mov eax,0x173
mov ecx,eax
rdmsr
mov DWORD PTR [rbp-0xc],edx
mov DWORD PTR [rbp-0x10],eax
My question: 0x173 is less than 0xffff, so why does gcc not use mov cx, 0x173? Does gcc analyze the following rdmsr instruction? Will gcc always know the correct register size?
It depends on the size of the value or variable passed.
If you pass a "short int" it will set "cx" and read the data from "ax" and "dx" (if buf is a short int, too).
For char it would access "cl" and so on.
So "c" refers to the "ecx" register, but this is accessed with "ecx", "cx", or "cl" depending on the size of the access, which I think makes sense.
To test you can try passing (unsigned short)0x173, it should change the code.
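For instance, a sketch of that test (my illustration; exact codegen varies by GCC version):

#include <stdint.h>

/* Same "c" constraint, but a 16-bit operand type, so gcc should
   load the index with "mov cx, ..." instead of "mov ecx, ...".
   Note this is only useful for observing the codegen: RDMSR
   really consumes all of ECX, so leaving the upper bits
   unspecified would be a bug in real use. */
#define read_msr16(index, buf) \
    asm volatile ("rdmsr" : "=d"((buf)[1]), "=a"((buf)[0]) \
                          : "c"((uint16_t)(index)))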
There is no analysis of the inline assembly; after operand substitution, it is copied directly to the output assembly, syntax errors included. There is also no default register size tied to whether you have a 32- or 64-bit target; that would be way too limiting.
I think the answer is that the current default data size is 32-bit. In 64-bit long mode the default data size is also 32-bit, unless you use a REX.W prefix.
Intel specifies the RDMSR instruction as using (all of) ECX to determine the model specific register. That being the case, and apparently as specified by your macro, GCC has every reason to load your constant into the full ECX.
So the question about why it doesn't load CX seems completely inappropriate. It looks like GCC is generating the right code.
(You didn't ask why it stages the load of ECX inefficiently by using EAX; I don't know the answer to that).
