I used inline assembly in C to call the _GetStdHandle@4 function to get the output handle, then used the _WriteFile@20 function to write my string to the console using the handle returned by _GetStdHandle@4.
I used pushl in my source code to pass the parameters to each function, but something is wrong: WriteFile returns error 6 (invalid handle) even though the handle is valid. So the problem must be in how I pass the arguments to _WriteFile with pushl. In this code I used the "g" constraint for each argument, because I see no reason to move the parameters into registers and then push the registers. If I use "r" instead, the program works without any problem, but that moves each parameter into a register first and then pushes the register, which is exactly what I want to avoid.
This code prints nothing, and the problem comes from WriteFile. If I use "r" for the WriteFile parameters, the string is printed. Why can't I use "g" so that the parameters are not moved into registers?
typedef void * HANDLE;

#define GetStdHandle(result, handle) \
    __asm ( \
        "pushl %1\n\t" \
        "call _GetStdHandle@4" \
        : "=a" (result) \
        : "g" (handle))

#define WriteFile(result, handle, buf, buf_size, written_bytes) \
    __asm ( \
        "pushl $0\n\t" \
        "pushl %1\n\t" \
        "pushl %2\n\t" \
        "pushl %3\n\t" \
        "pushl %4\n\t" \
        "call _WriteFile@20" \
        : "=a" (result) \
        : "g" (written_bytes), "g" (buf_size), "g" (buf), "g" (handle))

int main()
{
    HANDLE handle;
    int write_result;
    unsigned long written_bytes;

    GetStdHandle(handle, -11);
    if(handle != INVALID_HANDLE_VALUE)
    {
        WriteFile(write_result, handle, "Hello", 5, & written_bytes);
    }
    return 0;
}
The assembly code for this program is:
.file "main.c"
.def ___main; .scl 2; .type 32; .endef
.section .rdata,"dr"
LC0:
.ascii "Hello\0"
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB25:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
subl $16, %esp
call ___main
/APP
pushl $-11
call _GetStdHandle@4
# 0 "" 2
/NO_APP
movl %eax, 12(%esp)
cmpl $-1, 12(%esp)
je L2
leal 4(%esp), %eax
/APP
pushl $0
pushl %eax
pushl $5
pushl $LC0
pushl 12(%esp)
call _WriteFile@20
# 0 "" 2
/NO_APP
movl %eax, 8(%esp)
L2:
movl $0, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
LFE25:
.ident "GCC: (MinGW.org GCC-6.3.0-1) 6.3.0"
What is the problem?
I would question the need for calling the WINAPI through wrappers like this rather than calling them directly. You can declare prototypes for the stdcall calling convention with
__attribute__((stdcall))
If you don't need to use inline assembly you shouldn't. GCC's inline assembly is hard to get right. Getting it wrong can make the code appear to work until one day it doesn't, especially if optimizations are enabled. David Wohlferd has a good article on why you shouldn't use inline assembly if you don't need to.
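As a rough sketch of what calling the API directly could look like, the snippet below declares the two functions with hand-written stdcall prototypes instead of wrapping them in inline assembly. The helper name write_console and the simplified parameter types are assumptions made for illustration (normally <windows.h> supplies the real declarations), and the non-Windows branch is only there so the snippet also builds on other systems.

```c
#include <stddef.h>

#ifdef _WIN32
/* Hand-written stand-ins for the declarations in <windows.h>. */
typedef void *HANDLE;
#define INVALID_HANDLE_VALUE ((HANDLE)-1)
#define STD_OUTPUT_HANDLE ((unsigned long)-11)

/* stdcall: the callee removes its own arguments from the stack. */
__attribute__((stdcall)) HANDLE GetStdHandle(unsigned long nStdHandle);
__attribute__((stdcall)) int WriteFile(HANDLE hFile, const void *buffer,
                                       unsigned long bytesToWrite,
                                       unsigned long *bytesWritten,
                                       void *overlapped);

/* Hypothetical helper: returns the number of bytes written, or -1. */
long write_console(const char *text, unsigned long length)
{
    unsigned long written = 0;
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);

    if (out == INVALID_HANDLE_VALUE)
        return -1;
    if (!WriteFile(out, text, length, &written, NULL))
        return -1;
    return (long)written;
}
#else
/* Fallback so the sketch compiles off Windows; not part of the answer. */
#include <unistd.h>

long write_console(const char *text, unsigned long length)
{
    return (long)write(STDOUT_FILENO, text, length);
}
#endif
```

Calling write_console("Hello", 5) from main lets the compiler generate the pushes and the call itself, so there are no constraints or clobbers to get wrong.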
The primary problem can be seen in this section of generated code:
pushl $0
pushl %eax
pushl $5
pushl $LC0
pushl 12(%esp)
call _WriteFile@20
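GCC has computed the memory operand (handle) for the first parameter as 12(%esp). The problem is that the previous pushes have altered ESP, so the offset 12(%esp) no longer refers to where handle is stored. Each pushl lowers ESP by 4; after the four preceding pushes ESP is 16 bytes lower, so 12(%esp) now addresses the $0 that was pushed first, and WriteFile receives 0 as its handle (hence error 6, invalid handle). The real handle now sits at 28(%esp).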
To get around this problem you can pass memory addresses through registers or as immediates (if possible). Rather than use g constraint which includes m (memory constraints), simply use ri for registers and immediates. This prevents memory operands from being generated. If you pass pointers through registers you will also need to add the "memory" clobber.
The STDCALL(WINAPI) calling convention allows a function to destroy EAX, ECX, and EDX (AKA the volatile registers). It is possible that GetStdHandle and WriteFile will clobber ECX and EDX as well as return a value in EAX. You need to ensure that ECX and EDX are listed as clobbers as well (or have a constraint that marks it as output), otherwise the compiler may assume the values in those registers are the same before and after the inline assembly blocks are completed. If they are different it could cause subtle bugs.
With these changes your code could look something like:
#define INVALID_HANDLE_VALUE ((void *)-1)
typedef void *HANDLE;

/* Only one push, and %1 is used before ESP changes, so "g" is safe here. */
#define GetStdHandle(result, handle) \
    __asm ( \
        "pushl %1\n\t" \
        "call _GetStdHandle@4" \
        : "=a" (result) \
        : "g" (handle) \
        : "ecx", "edx")

/* "ri" keeps every operand out of memory, so no ESP-relative operand
   can go stale while the pushes move ESP. */
#define WriteFile(result, handle, buf, buf_size, written_bytes) \
    __asm __volatile ( \
        "pushl $0\n\t" \
        "pushl %1\n\t" \
        "pushl %2\n\t" \
        "pushl %3\n\t" \
        "pushl %4\n\t" \
        "call _WriteFile@20" \
        : "=a" (result) \
        : "ri" (written_bytes), "ri" (buf_size), "ri" (buf), "ri" (handle) \
        : "memory", "ecx", "edx")

int main()
{
    HANDLE handle;
    int write_result;
    unsigned long written_bytes;

    GetStdHandle(handle, -11);
    if(handle != INVALID_HANDLE_VALUE)
    {
        WriteFile(write_result, handle, "Hello", 5, &written_bytes);
    }
    return 0;
}
Notes:
I marked the WriteFile inline assembly as __volatile so that the optimizer can't remove the entire inline assembly if it thinks result isn't being used. The compiler doesn't know that the call has a side effect (the display is updated), so without __volatile it may drop the inline assembly entirely.
GetStdHandle doesn't have a problem with potential memory operands because there are no further uses of constraints after the initial push %1. The problem you are encountering is only an issue when ESP has been modified (via a PUSH/POP or change to ESP directly) and there is a possible use of a memory constraint in that inline assembly afterwards.
Hi, I'm writing some interrupt handlers.
I'm on Ubuntu 18.04 and using gcc 7.3.0.
Currently I use the following prefix and suffix to wrap my Interrupt Service Routines (the ones without an error code):
#define Ent_Int __asm__ __volatile__ ("\
pushq %rax\n\t\
movq %es, %rax\n\t\
pushq %rax\n\t\
movq %ds, %rax\n\t\
pushq %rax\n\t\
pushq %rbx\n\t\
pushq %rcx\n\t\
pushq %rdx\n\t\
pushq %rbp\n\t\
pushq %rdi\n\t\
pushq %rsi\n\t\
pushq %r8\n\t\
pushq %r9\n\t\
pushq %r10\n\t\
pushq %r11\n\t\
pushq %r12\n\t\
pushq %r13\n\t\
pushq %r14\n\t\
pushq %r15\n\t\
movq $0x10, %rdi\n\t\
movq %rdi, %es\n\t\
movq %rdi, %ds\n\t");
#define Ret_Int __asm__ __volatile__ ("\
popq %r15\n\t\
popq %r14\n\t\
popq %r13\n\t\
popq %r12\n\t\
popq %r11\n\t\
popq %r10\n\t\
popq %r9\n\t\
popq %r8\n\t\
popq %rsi\n\t\
popq %rdi\n\t\
popq %rbp\n\t\
popq %rdx\n\t\
popq %rcx\n\t\
popq %rbx\n\t\
popq %rax\n\t\
movq %rax, %ds\n\t\
popq %rax\n\t\
movq %rax, %es\n\t\
popq %rax\n\t\
leave\n\t\
iretq");
And the function looks like this:
void Div_Eorr(Int_Info_No_Err STK)
{
Ent_Int;
color_printk(RED,BLACK,"do_divide_error(0),ERROR_CODE:%#016lx,RSP:%#016lx,RIP:%#016lx\n", 0, STK.RSP, *((unsigned long *)(&STK)-1));
while(1);
Ret_Int;
}
However, from OSDev I found that __attribute__((interrupt)) could make life easier.
But when I compile the function with it, there is an error message:
./OSFiles/Codes/INT.c:288:1: sorry, unimplemented: SSE instructions aren't allowed in interrupt service routine
There is no SSE instruction inside the function, though. So I would like to ask:
1. How should I correctly compile the ISR?
2. How does gcc distinguish interrupts with and without an error code?
I would be really grateful if anyone could illustrate the detailed differences in the compiled assembly with and without the interrupt attribute.
I'm trying to implement a timer routine in 32-bit assembler on OS X Lion using sigaction() & setitimer() system calls. The idea is to set a timer using setitimer() & then have the generated alarm signal invoke a handler function previously setup through sigaction(). I have such a mechanism working on Linux, but cannot seem to get it working on OS X. I know the system call convention is different between OS X & Linux, and that OS X has a 16 byte alignment requirement. Despite compensating for these, I'm still not able to get it working (usually a "Bus error: 10" error). Thinking I did something wrong with the alignment, I wrote a simple C program that does what I want & then used clang 3.2 to generate the assembly code. Then I modified the machine-generated assembly by replacing the calls to sigaction() & setitimer() with the appropriate system & int $0x80 calls, as well as stack alignment instructions. The resulting program still doesn't work.
Here is the C program, sigaction.c, that I used to generate the assembly. Note that I commented out the printf & sleep stuff so the resulting assembly code would be easier to read:
//#include <stdio.h>
#include <signal.h>
#include <sys/time.h>

struct sigaction action;

void handler(int arg) {
    // printf("HERE!\n");
}

int main() {
    action.__sigaction_u.__sa_handler = handler;
    action.sa_mask = 0;
    action.sa_flags = 0;
    // printf("sigaction size: %d\n", sizeof(action));
    int fd = sigaction(14, &action, 0);

    struct itimerval timer;
    timer.it_interval.tv_sec = 1;
    timer.it_interval.tv_usec = 0;
    timer.it_value.tv_sec = 1;
    timer.it_value.tv_usec = 0;
    // printf("itimerval size: %d\n", sizeof(timer));
    fd = setitimer(0, &timer, 0);

    while (1) {
        // sleep(60);
    }
    return 0;
}
Here is the assembly code generated using "clang -arch i386 -S sigaction.c" on the above file:
.section __TEXT,__text,regular,pure_instructions
.globl _handler
.align 4, 0x90
_handler: ## @handler
## BB#0:
pushl %ebp
movl %esp, %ebp
pushl %eax
movl 8(%ebp), %eax
movl %eax, -4(%ebp)
addl $4, %esp
popl %ebp
ret
.globl _main
.align 4, 0x90
_main: ## @main
## BB#0:
pushl %ebp
movl %esp, %ebp
pushl %esi
subl $52, %esp
calll L1$pb
L1$pb:
popl %eax
movl $14, %ecx
movl L_action$non_lazy_ptr-L1$pb(%eax), %edx
movl $0, %esi
leal _handler-L1$pb(%eax), %eax
movl $0, -8(%ebp)
movl %eax, (%edx)
movl $0, 4(%edx)
movl $0, 8(%edx)
movl $14, (%esp)
movl %edx, 4(%esp)
movl $0, 8(%esp)
movl %esi, -36(%ebp) ## 4-byte Spill
movl %ecx, -40(%ebp) ## 4-byte Spill
calll _sigaction
movl $0, %ecx
leal -32(%ebp), %edx
movl %eax, -12(%ebp)
movl $1, -32(%ebp)
movl $0, -28(%ebp)
movl $1, -24(%ebp)
movl $0, -20(%ebp)
movl $0, (%esp)
movl %edx, 4(%esp)
movl $0, 8(%esp)
movl %ecx, -44(%ebp) ## 4-byte Spill
calll _setitimer
movl %eax, -12(%ebp)
LBB1_1: ## =>This Inner Loop Header: Depth=1
jmp LBB1_1
.comm _action,12,2 ## @action
.section __IMPORT,__pointers,non_lazy_symbol_pointers
L_action$non_lazy_ptr:
.indirect_symbol _action
.long 0
.subsections_via_symbols
If I compile the assembly code using "clang -arch i386 sigaction.s -o sigaction", debug it using lldb & place a breakpoint in the handler function, the handler function is indeed called every second, so I know the assembly code is correct (ditto for the C code).
Now if I replace the call to sigaction() with:
# calll _sigaction
movl $0x2e, %eax
subl $0x04, %esp
int $0x80
addl $0x04, %esp
and the call to setitimer() with:
# calll _setitimer
movl $0x53, %eax
subl $0x04, %esp
int $0x80
addl $0x04, %esp
the assembly code no longer works, and generates the same "Bus error: 10" that my hand-coded assembly code does.
I've tried removing the subl/addl instructions that I'm using to align the stack as well as changing the values to make sure the stack is aligned on 16-byte boundaries, but nothing seems to work. I either get the bus error, a segmentation fault, or the code just hangs without calling the handler function.
One thing I did notice during debugging is that the sigaction call appears to have a lengthy wrapper around the underlying system call. If you disassemble both functions from within lldb, you will see sigaction() has a lengthy wrapper but setitimer does not. Not sure this means anything, but perhaps the sigaction() wrapper is massaging the data before passing it along. I tried debugging that code, but haven't found anything definitive yet.
If anyone knows how to get the above assembly code working by replacing the sigaction() & setitimer() functions with the appropriate system calls, it would be greatly appreciated. I can then take those changes & apply them to my hand-coded routines.
Thanks.
Update: I stripped down my hand-written assembly code to a manageable size & was able to get it working using the sigaction() & setitimer() library calls, but still haven't figured out why the syscalls don't work. Here's the code (timer.s):
.globl _main
.data
.set ITIMER_REAL, 0x00
.set SIGALRM, 0x0e
.set SYS_SIGACTION, 0x2e
.set SYS_SETITIMER, 0x53
.set TRAP, 0x80
itimerval:
interval_tv_sec:
.long 0
interval_tv_usec:
.long 0
value_tv_sec:
.long 0
value_tv_usec:
.long 0
sigaction:
sa_handler:
.long handler
sa_mask:
.long 0
sa_flags:
.long 0
.text
handler:
pushl %ebp
movl %esp, %ebp
movl %ebp, %esp
popl %ebp
ret
_main:
pushl %ebp
movl %esp, %ebp
subl $0x0c, %esp
movl $SIGALRM, %ebx
movl $sigaction, %ecx
movl $0x00, %edx
pushl %edx
pushl %ecx
pushl %ebx
# subl $0x04, %esp
call _sigaction
# movl $SYS_SIGACTION, %eax
# int $0x80
addl $0x0c, %esp
# addl $0x10, %esp
movl $ITIMER_REAL, %ebx
movl $0x01, interval_tv_sec # Successive calls every 1 second
movl $0x00, interval_tv_usec
movl $0x01, value_tv_sec # Initial call in 1 second
movl $0x00, value_tv_usec
movl $itimerval, %ecx
movl $0x00, %edx
pushl %edx
pushl %ecx
pushl %ebx
# subl $0x04, %esp
call _setitimer
# movl $SYS_SETITIMER, %eax
# int $0x80
addl $0x0c, %esp
# addl $0x10, %esp
loop:
jmp loop
When compiled with "clang -arch i386 timer.s -o timer" & debugged with lldb, the handler routine is called every second. I left my efforts at making the code work with syscalls in the code - they are commented out around the sigaction() & setitimer() calls. If for no other reason than to educate myself (and others), I would still like to get the sys call version working if possible, and if not, understand the reason why it doesn't work.
Thanks again.
Update 2: I got the setitimer syscall working. Here's the modified code:
pushl %edx
pushl %ecx
pushl %ebx
subl $0x04, %esp
movl $SYS_SETITIMER, %eax
int $0x80
addl $0x10, %esp
But the same edits do not work for the sigaction sys call, which leads me back to my original conclusion - the sigaction() library function is doing something extra before making the actual syscall. This snippet from dtruss seems to suggest the same:
With sigaction() syscall (not working):
sigaction(0xE, 0x2030, 0x0) = 0 0
setitimer(0x0, 0x2020, 0x0) = 0 0
With sigaction() library call (working):
sigaction(0xE, 0xBFFFFC40, 0x0) = 0 0
setitimer(0x0, 0x2028, 0x0) = 0 0
As you can see, the 2nd argument is different between the two versions. It seems the address of the sigaction structure (0x2030) is passed directly when using the syscall, but something else is passed when using the library call. I'm guessing that the "something else" is generated in the sigaction() library function.
Update 3: I discovered that the same exact problem exists on FreeBSD 9.1. The setitimer syscall works, but the sigaction syscall does not. Like OS X, the sigaction() library call does work.
BSD has a few sigaction syscalls - so far, I've only tried the same one I was using in OS X - 0x2e. Perhaps one of the other sigaction syscalls will work. Knowing that BSD has the same behavior will make this easier to track down, as I can pull the C source code. Plus this opens the problem up to a much wider group of people who may already know what the problem is.
Based on my understanding of how syscalls work coupled with the fact that sigaction does work on Linux, I can't help but to think I am doing something wrong in my code. However, the fact that replacing the int $0x80 call with the sigaction() library function causes my code to work seems to contradict this. There is an entire chapter on assembly language programming in the FreeBSD developer manual, as well as a section on making system calls, so what I'm doing should be possible:
https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/x86-system-calls.html
Unless someone can point out what, if anything, I am doing wrong, I think the next step is for me to look at the BSD sources for sigaction(). As I mentioned previously, I looked at the disassembled version of sigaction on OS X & found it to be quite lengthy compared to other syscalls & filled with magic numbers. Hopefully looking at the C code will make clear what it's doing that causes it to work. In the end, it could be something as simple as passing in the wrong sigaction struct (there are several of them) or failing to set some bit somewhere.
I did gcc -S on the very complex program below on x86_64:
int main() {
    int x = 3;
    x = 5;
    return 0;
}
And what I got was:
.file "main.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $3, -4(%rbp)
movl $5, -4(%rbp)
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-3)"
.section .note.GNU-stack,"",@progbits
I was wondering if someone could help me understand the output or refer me to a link that explains it. Specifically, what do cfi, LFB0, LFE0 and leave mean? All I could find regarding these is this post, but I couldn't fully understand what it was for. Also, what does ret do in this case? I'm guessing it returns to __libc_start_main(), which in turn would call do_exit(); is that correct?
Those .cfi_something directives result in generation of additional data by the compiler. This data helps traverse the call stack when an instruction causes an exception, so the exception handler (if any) can be found and correctly executed. The call stack information is useful for debugging. This data most probably goes into a separate section of the executable. It's not inserted between the instructions of your code.
.LFsomething: are just regular labels that are probably referenced by that extra exception-related data.
leave and ret are CPU instructions.
leave is equivalent to:
movq %rbp, %rsp
popq %rbp
and it undoes the effect of these two instructions
pushq %rbp
movq %rsp, %rbp
and instructions that allocate space on the stack by subtracting something from rsp.
ret returns from the function. It pops the return address from the stack and jumps to that address. If it was __libc_start_main() that called main(), then it returns there.
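To make the role of the return address a little more concrete, here is a small sketch that relies on the GCC/Clang builtin __builtin_return_address. The function name address_ret_will_jump_to is made up for this illustration, and the noinline attribute is an assumption needed so that a real call and ret take place.

```c
#include <stddef.h>

/* Returns the address that this function's own ret instruction will
   jump to, i.e. the return address the caller's call pushed.
   noinline keeps the compiler from folding the function into its caller. */
__attribute__((noinline))
void *address_ret_will_jump_to(void)
{
    return __builtin_return_address(0);
}
```

Printing the result with printf("%p\n", address_ret_will_jump_to()) from main and comparing it with a disassembly should show an address just past the corresponding call instruction.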
.LFB0 and .LFE0 are nothing but local labels.
.cfi_startproc marks the beginning of each function, and .cfi_endproc marks its end.
These assembler directives help the assembler put debugging and stack unwinding information into the executable.
The leave instruction is an x86 instruction that restores the calling function's stack frame.
And lastly after the ret instruction, the following things happen:
%rip contains return address
%rsp points at arguments pushed by caller that didn't fit in the six registers used to pass arguments on amd64 (%rdi, %rsi, %rdx, %rcx, %r8, %r9)
called function may have trashed arguments
%rax contains return value (or trash if function is void) (or %rax and %rdx contain the return value if its size is >8 bytes but <=16 bytes)
%r10, %r11 may be trashed
%rbp, %rbx, %r12, %r13, %r14, %r15 must contain contents from time of call
Additional information can be found here (SO question) and here (standards PDFs).
Or, on 32-bit:
%eip contains return address
%esp points at arguments pushed by caller
called function may have trashed arguments
%eax contains return value (or trash if function is void)
%ecx, %edx may be trashed
%ebp, %ebx, %esi, %edi must contain contents from time of call
I've been using gcc's Intel-compatible builtins (like __sync_fetch_and_add) for quite some time, using my own atomic template. The "__sync" functions are now officially considered "legacy".
C++11 supports std::atomic<> and its descendants, so it seems reasonable to use that instead, since it makes my code standard compliant, and the compiler will produce the best code either way, in a platform independent manner, that is almost too good to be true.
Incidentally, I'd only have to text-replace atomic with std::atomic, too. There's a lot in std::atomic (re: memory models) that I don't really need, but default parameters take care of that.
Now for the bad news. As it turns out, the generated code is, from what I can tell, ... utter crap, and not even atomic at all. Even a minimum example that increments a single atomic variable and outputs it has no fewer than 5 non-inlined function calls to ___atomic_flag_for_address, ___atomic_flag_wait_explicit, and __atomic_flag_clear_explicit (fully optimized), and on the other hand, there is not a single atomic instruction in the generated executable.
What gives? There is of course always the possibility of a compiler bug, but with the huge number of reviewers and users, such rather drastic things are generally unlikely to go unnoticed. Which means, this is probably not a bug, but intended behaviour.
What is the "rationale" behind so many function calls, and how is atomicity implemented without atomicity?
As-simple-as-it-can-get example:
#include <atomic>

int main()
{
    std::atomic_int a(5);
    ++a;
    __builtin_printf("%d", (int)a);
    return 0;
}
produces the following .s:
movl $5, 28(%esp) #, a._M_i
movl %eax, (%esp) # tmp64,
call ___atomic_flag_for_address #
movl $5, 4(%esp) #,
movl %eax, %ebx #, __g
movl %eax, (%esp) # __g,
call ___atomic_flag_wait_explicit #
movl %ebx, (%esp) # __g,
addl $1, 28(%esp) #, MEM[(__i_type *)&a]
movl $5, 4(%esp) #,
call _atomic_flag_clear_explicit #
movl %ebx, (%esp) # __g,
movl $5, 4(%esp) #,
call ___atomic_flag_wait_explicit #
movl 28(%esp), %esi # MEM[(const __i_type *)&a], __r
movl %ebx, (%esp) # __g,
movl $5, 4(%esp) #,
call _atomic_flag_clear_explicit #
movl $LC0, (%esp) #,
movl %esi, 4(%esp) # __r,
call _printf #
(...)
.def ___atomic_flag_for_address; .scl 2; .type 32; .endef
.def ___atomic_flag_wait_explicit; .scl 2; .type 32; .endef
.def _atomic_flag_clear_explicit; .scl 2; .type 32; .endef
... and the mentioned functions look e.g. like this in objdump:
004013c4 <__atomic_flag_for_address>:
mov 0x4(%esp),%edx
mov %edx,%ecx
shr $0x2,%ecx
mov %edx,%eax
shl $0x4,%eax
add %ecx,%eax
add %edx,%eax
mov %eax,%ecx
shr $0x7,%ecx
mov %eax,%edx
shl $0x5,%edx
add %ecx,%edx
add %edx,%eax
mov %eax,%edx
shr $0x11,%edx
add %edx,%eax
and $0xf,%eax
add $0x405020,%eax
ret
The others are somewhat simpler, but I don't find a single instruction that would really be atomic (other than some spurious xchg which are atomic on X86, but these seem to be rather NOP/padding, since it's xchg %ax,%ax following ret).
I'm absolutely not sure what such a rather complicated function is needed for, and how it's meant to make anything atomic.
It is an inadequate compiler build.
Check your c++config.h. It should look like this, but it doesn't:
/* Define if builtin atomic operations for bool are supported on this host. */
#define _GLIBCXX_ATOMIC_BUILTINS_1 1
/* Define if builtin atomic operations for short are supported on this host.
*/
#define _GLIBCXX_ATOMIC_BUILTINS_2 1
/* Define if builtin atomic operations for int are supported on this host. */
#define _GLIBCXX_ATOMIC_BUILTINS_4 1
/* Define if builtin atomic operations for long long are supported on this
host. */
#define _GLIBCXX_ATOMIC_BUILTINS_8 1
These macros are defined or not depending on configure tests, which check host machine support for the __sync_XXX functions. These tests are in libstdc++-v3/acinclude.m4, AC_DEFUN([GLIBCXX_ENABLE_ATOMIC_BUILTINS] ....
On your installation, it's evident from the MEM[(__i_type *)&a] put in the assembly file by -fverbose-asm that the compiler uses macros from atomic_0.h, for example:
#define _ATOMIC_LOAD_(__a, __x) \
({typedef __typeof__(_ATOMIC_MEMBER_) __i_type; \
__i_type* __p = &_ATOMIC_MEMBER_; \
__atomic_flag_base* __g = __atomic_flag_for_address(__p); \
__atomic_flag_wait_explicit(__g, __x); \
__i_type __r = *__p; \
atomic_flag_clear_explicit(__g, __x); \
__r; })
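The locking scheme behind that macro can be sketched in C11 as follows. This is only an illustration of the idea (hash the object's address to one of a few shared flags, spin until the flag is acquired, do a plain memory access, release the flag). The names flag_for_address, locked_fetch_add and locked_load and the hash itself are assumptions for this sketch, not libstdc++'s actual code.

```c
#include <stdatomic.h>
#include <stdint.h>

#define FLAG_COUNT 16
#define F ATOMIC_FLAG_INIT
/* A small table of spin flags shared by every emulated atomic object. */
static atomic_flag flag_table[FLAG_COUNT] = {
    F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F
};
#undef F

/* Picks a flag from the object's address (simple stand-in hash). */
static atomic_flag *flag_for_address(const void *p)
{
    return &flag_table[((uintptr_t)p >> 2) % FLAG_COUNT];
}

/* Adds delta to *p under the flag and returns the previous value. */
int locked_fetch_add(int *p, int delta)
{
    atomic_flag *g = flag_for_address(p);
    int old;

    while (atomic_flag_test_and_set_explicit(g, memory_order_acquire))
        ; /* spin until the flag is free */
    old = *p;
    *p = old + delta; /* ordinary, non-atomic add, protected by the flag */
    atomic_flag_clear_explicit(g, memory_order_release);
    return old;
}

/* Reads *p under the same flag. */
int locked_load(int *p)
{
    atomic_flag *g = flag_for_address(p);
    int value;

    while (atomic_flag_test_and_set_explicit(g, memory_order_acquire))
        ;
    value = *p;
    atomic_flag_clear_explicit(g, memory_order_release);
    return value;
}
```

In this sketch the only atomic operation is the test-and-set on the flag. The increment itself is an ordinary add, which matches the plain addl $1, 28(%esp) between the wait and clear calls in the question's listing.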
With a properly built compiler, with your example program, c++ -m32 -std=c++0x -S -O2 -march=core2 -fverbose-asm should produce something like this:
movl $5, 28(%esp) #, a.D.5442._M_i
lock addl $1, 28(%esp) #,
mfence
movl 28(%esp), %eax # MEM[(const struct __atomic_base *)&a].D.5442._M_i, __ret
mfence
movl $.LC0, (%esp) #,
movl %eax, 4(%esp) # __ret,
call printf #
There are two implementations: one that uses the __sync primitives and one that does not, plus a mixture of the two that uses only some of those primitives. Which one is selected depends on the macros _GLIBCXX_ATOMIC_BUILTINS_1, _GLIBCXX_ATOMIC_BUILTINS_2, _GLIBCXX_ATOMIC_BUILTINS_4 and _GLIBCXX_ATOMIC_BUILTINS_8.
At least the first one is needed for the mixed implementation, and all are needed for the fully atomic one. Whether they are defined seems to depend on the target machine (they may not be defined for -march=i386 and should be defined for -march=i686).
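One way to get a hint about whether the target supports the builtins is the compiler's own predefined macro __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4. The sketch below is an assumption-laden illustration: the helper names are made up, and this macro is a compiler-level indicator, not the configure test that libstdc++ itself runs.

```c
/* Returns 1 when the compiler reports a native 4-byte compare-and-swap
   for the current target, 0 otherwise. */
int has_native_4byte_cas(void)
{
#ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
    return 1;
#else
    return 0;
#endif
}

/* Atomically adds 1 to *value and returns the previous value, using
   the legacy __sync builtin mentioned in the question. On a target
   without native support this may turn into a library call instead. */
int sync_increment(int *value)
{
    return __sync_fetch_and_add(value, 1);
}
```

Compiling this with different -march settings and inspecting the output of gcc -S should show whether sync_increment becomes a lock-prefixed instruction or an external call.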