This simple code should copy the string "c" in the string "d", changing only the first char to 'x':
#include <stdio.h>
#include <stdlib.h>
int main(void) {
char c[5] = "abcd", d[5];
__asm__(
"leal %1, %%ebx\n"
"leal %0, %%ecx\n"
"movb $'x', (%%ecx)\n"
"movb 1(%%ebx), %%al\n"
"movb %%al, 1(%%ecx)\n"
"movb 2(%%ebx), %%al\n"
"movb %%al, 2(%%ecx)\n"
"movb 3(%%ebx), %%al\n"
"movb %%al, 3(%%ecx)\n"
"movb $0, 4(%%ecx)\n"
:"=m"(c)
:"m"(d)
:"%ebx", "%ecx", "%eax"
);
printf("%s\n", d);
return 0;
}
But it gives a segmentation fault error. I believe my problem is with the constraints, but I can't figure how to fix this.
What is the right way, and how can I change this code to work?
Yes, the input/output operands are wrong. The format is this:
__asm__("<instructions>\n\t"
: OutputOperands
: InputOperands
: Clobbers);
You have the inputs and outputs backwards. You have c as an output, when it should be an input (since you're reading from it). You have d as an input, when it should be an output (since you're writing c to it).
Thus, your inline assembly should be written as follows:
__asm__("leal %1, %%ebx\n\t"
"leal %0, %%ecx\n\t"
"movb $'x', (%%ecx)\n\t"
"movb 1(%%ebx), %%al\n\t"
"movb %%al, 1(%%ecx)\n\t"
"movb 2(%%ebx), %%al\n\t"
"movb %%al, 2(%%ecx)\n\t"
"movb 3(%%ebx), %%al\n\t"
"movb %%al, 3(%%ecx)\n\t"
"movb $0, 4(%%ecx)"
: "=m" (d)
: "m" (c)
: "%ebx", "%ecx", "%eax"
);
But, you are also not making the most efficient use of the operands. You have several manual load operations (lea) that you've written in assembly. You don't need to write these; that's the whole point of the Gnu extended inline assembly syntax—the compiler will generate the necessary load and store instructions for you. Not only does that make the code simpler and easier to write and maintain, but it also makes it more efficient, because the compiler can better schedule/arrange the loads and stores within surrounding code, and skip the lea instructions entirely.
Making these modifications to use operands more efficiently, as well as using names for the operands to make the code easier to read, you would have:
__asm__("movb $'x', (%[d])\n\t"
"movb 1(%[c]), %%al\n\t"
"movb %%al, 1(%[d])\n\t"
"movb 2(%[c]), %%al\n\t"
"movb %%al, 2(%[d])\n\t"
"movb 3(%[c]), %%al\n\t"
"movb %%al, 3(%[d])\n\t"
"movb $0, 4(%[d])"
: "=m" (d) // dummy arg: tell the compiler we write all of d[]
: [c] "r" (c)
, [d] "r" (d)
, "m" (c) // unused dummy arg: tell the compiler we read all of c[]
: "%eax"
);
We're asking the compiler for the pointers to be in registers with the r constraint, so we can still choose the addressing mode (reg+displacement) ourselves, as in your original. This causes the compiler to implicitly generate the two required lea instructions. Not only does this make the code simpler to write, but it also lets the compiler choose which registers it wants to use, which can make the code more efficient. (For example, it needs d in %rdi as an arg for printf. Compiler-generated setup for asm statements is optimized along with normal code, so it doesn't have to repeat this work like it would if you wrote the lea explicitly in asm. Leave as much as possible to the compiler, so it can optimize away when possible.)
Note that asking for a pointer with an r constraint doesn't imply that you dereference it. Thus, we use "m" and "=m" dummy memory operands to tell the compiler what memory is read and written, so it will ensure that the contents match program-order even in a more complex case where your function is inlined into another function that modifies c[] and d[] before and after. This works well because c[] and d[] are true C arrays, with static size. It wouldn't work if they were just pointers that you got from function args. In that case, "=m" (d) would tell the compiler that the asm writes a pointer value into a memory location, not the pointed-to contents. "=m" (*d) would tell the compiler that the asm writes the first byte. As the official docs point out, you could write something ugly using a GNU C statement-expression like:
{"m"( ({ const struct { char x[5]; } *p = (const void *)c ; *p; }) )}
Or, you could instead use a "memory" clobber to tell the compiler that all memory must be in sync. With no output operands at all, the asm block would be implicitly __volatile__, which also prevents reordering. But if you had one unused dummy output to let the compiler choose a scratch register (see below), and didn't manually use __volatile__, then the compiler would prove to itself that you never use the results and optimize out the entire block of inline assembly! (It's better to tell the compiler in as much detail as possible how your asm interacts with C variables, rather than relying on __volatile__.)
Letting the compiler choose the addressing mode will work fine for us. It avoids an extra compiler-generated lea instruction ahead of the asm block, and it simplifies the constraints because we actually use the memory operands instead of separately asking for pointers in registers.
(The compiler could still have avoided an lea in the other version if it put c[] or d[] at esp+0, so the pointer-register operand could be esp).
__asm__("movb $'x', %[d]\n\t"
"movb 1 + %[c], %%al\n\t"
"movb %%al, 1 + %[d]\n\t"
"movb 2 + %[c], %%al\n\t"
"movb %%al, 2 + %[d]\n\t"
"movb 3 + %[c], %%al\n\t"
"movb %%al, 3 + %[d]\n\t"
"movb $0, 4 + %[d]"
: [d] "=&m" (d) // not sure if early-clobber is needed,
// e.g. if the compiler would otherwise be allowed to put an output memory operand at the same address as an input operand.
// It's an error with gcc 4.7 and earlier, but gcc that old also doesn't accept "m"(c) as an input memory operand
: [c] "m" (c)
: "%eax"
);
See also Looping over arrays with inline assembly for more discussion of picking addressing mode yourself vs. using "m" constraints to let the compiler pick. (If you don't want to get into that level of optimization, you probably shouldn't be using inline asm in the first place.)
The compiler will turn 3 + %[c] into something like 3 + 6(%rsp), which the assembler will evaluate the same as 9(%rsp). Fortunately, it's not a syntax error if the substitution ends up producing 3 + (%rdi). (You do get a warning, though: Warning: missing operand; zero assumed).
It would also be correct to use an "o" constraint to request an "offsetable" memory operand, but all x86 addressing modes are offsetable (you can add a compile-time-constant displacement and they're still valid), so "m" should always work. (It would be nice if "o" would add an explicit 0 to avoid the assembler warning, but it doesn't).
But we're not done with possible optimizations yet. We're still forcing the compiler to clobber the eax register when we don't actually need to use that one—any general-purpose register will do. So, we introduce another output, this time a write-only (but early-clobber) temporary stored in a register:
char temp;
__asm__("movb $'x', %[d]\n\t"
"movb 1 + %[c], %[temp]\n\t"
"movb %[temp], 1 + %[d]\n\t"
"movb 2 + %[c], %[temp]\n\t"
"movb %[temp], 2 + %[d]\n\t"
"movb 3 + %[c], %[temp]\n\t"
"movb %[temp], 3 + %[d]\n\t"
"movb $0, 4 + %[d]"
: [d] "=&m" (d)
, [temp] "=&r" (temp)
: [c] "m" (c)
: // no clobbers
);
The early-clobber is necessary to stop the compiler from choosing a register that is also used in the addressing-modes for c or d. The asm syntax is designed to efficiently wrap a single instruction which reads all its inputs before writing any of its outputs.
Okay, we've made the interface between the inline assembly block and the surrounding compiler-generated code pretty much optimal—but let's look at the actual assembly language instructions we're using inside of it. These are far from optimal: we're writing one byte at a time when we could be writing four bytes at a time! (And, on a 64-bit build, we could be writing eight bytes at a time, but that wouldn't help us here.) So, let's just do:
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl 1 + %[c], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d)
, [temp] "=&r" (temp)
: [c] "m" (c)
:
);
This writes the first byte (an 'x' character) into d, and then copies 4 bytes from c into d. That will include the terminating NUL character from c (automatically appended to string literals by a C or C++ compiler), so the string in d is already NUL-terminated without needing to append an additional byte.
Shorter and faster, except for the store-forwarding stall from reading the last 4 bytes of c[] right after the compiler-generated initialization code stored the first 4 bytes and then a separate byte store of the terminating 0. You wouldn't have this problem if you used static const char c[] = "abcd";, (because then it would be in static storage instead of stored to the stack with mov-immediate every time the function runs), or if c[] was a function arg that probably wasn't just written. Out-of-order execution can hide the store-forwarding stall latency, so it's probably worth it if c[] is usually not just-written.
Notice that we are not reading from the first character of c—we just offset it as part of the movl instruction. We could tell the compiler about that to allow it to optimize by moving stores to c[0] across the asm statement. We could even ask for a [cplus1] "r" (&c[1]) input operand, which would be good if we needed the address in a register. (See the original version of this answer for that.)
Since it's exactly 4 bytes, we can cast to a 4-byte integer type, rather than defining a struct with a char[4] member or something. Remember that a memory operand refers to a value in memory, so you have to dereference a pointer. Arrays are a special case: "m" (c) references the 5-byte contents of c[], not the 4 or 8-byte pointer value. But as soon as we start casting, we just have a pointer. Even a function argument like int foo(const char c[static 5]) works like a char*, not a char [5]. Anyway, the *(const uint32_t*)&c[1] is 4 bytes in memory from c[1] to c[3]. GCC warns about strict-aliasing with that cast, so maybe a struct { char c[4]; } would be better. (gcc8-snapshot 20170628 doesn't warn. Maybe the code is fine, or maybe the warning is broken in that unstable gcc version.)
// tightest constraints possible: 4 byte input memory operand, 5 byte output operand
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl %[cplus1], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d) // references the contents of the whole array, not the pointer-value or just d[0]
, [temp] "=&r" (temp)
: [cplus1] "m" (*(const uint32_t*)&c[1])
:
);
The code is looking pretty good now. Here's the code for the full function, as generated by GCC 6.3 on the Godbolt Compiler Explorer (with -O3 -m32 to generate 32-bit code like in the question):
subl $40, %esp
movl $1684234849, 18(%esp) # store 'abcd' into c
movb $0, 22(%esp) # NUL-terminate c
# begin inline-asm block
movb $'x', 23(%esp) # write initial 'x' into d[0]
movl 19(%esp), %eax # get 4 characters starting at c[1]
movl %eax, 1 + 23(%esp) # write those 4 characters into d, starting at d[1]
# end inline-asm block
leal 23(%esp), %eax # load address of c[1] into EAX register
pushl %eax # push address of d[0] onto stack
call puts # call 'puts' to output string. printf("%s\n", d) optimizes to this.
xorl %eax, %eax
addl $44, %esp
ret
gcc decides to save a register by delaying the lea until after the asm block. With -m64, it does lea before the asm, but it still uses a stack-pointer address instead of the register it just set up. That lets the loads/stores run without waiting for the lea to execute, but it also wastes code-size. Since lea is fast, it's not what I'd do if writing by hand.
The "r" constraint version uses two separate subl instructions to reserve stack space: subl $28, %esp before initializing c[], and subl $12, %esp right before the asm block. This is just a missed optimization by the compiler, unlike the extra lea which is unavoidable.
Notice that this is much much worse than the asm you'd get from the much more sensible:
d[0] = 'x';
memcpy(&d[1], &c[1], 4);
In that case, c[] optimizes away entirely and you get almost the same code that char d[] = "xbcd"; would produce. (See test_memcpy() in the Godbolt link above). The inline-asm version is only useful as an example or template for wrapping other memory-to-memory instruction sequences.
So how do we test that we got all the constraints right, allowing the compiler to optimize as far as correctness allows but no further? In this case, storing into c[] and d[] before and after the asm statement provides a good check. Recent gcc versions really will combine those stores into a single store either before or after if the constraints allow it. (clang won't, though.)
int optimize_test(void) {
// static // const
char c[5] = "abcd";
char d[5];
c[3] = 'O'; // **not** optimized away: part of the 32-bit input memory operand
c[0] = '0'; // merged with the c[0]='1' after the asm, because the asm doesn't read this part of c[]
d[3] = 'E'; // optimized away because the whole d[] is an output-only operand
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl %[cplus1], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d) // references the contents of the whole array, not the pointer-value or just d[0]
, [temp] "=&r" (temp)
: [cplus1 "m" (*(const uint32_t*)&c[1])
:
);
c[0] = '1'; // these dead stores into c[] are not optimized away, for some reason. (Even with memcpy instead of an asm statement.)
c[3] = 'M';
d[3] = 'D';
printf("%s\n", d);
return 0;
}
There are a couple of additional tweaks that you could do with the inline assembly. For example, our clobbers are telling the compiler that it cannot re-use one of the input registers for the temp register, but it actually could. But these are all pretty subtle. If you actually cared about getting the best possible code from the compiler, you'd write the above code in C like I just showed.
There are many reasons not to use inline assembly, including performance: you'll probably just defeat the compiler's ability to optimize. If the compiler isn't doing a good job somewhere (for a specific compiler version for a specific target architecture), often you can coax it into making better assembly by just changing the C source, without resorting to inline asm. (Although it's often possible for an expert that really knows what they're doing to beat the compiler, this often requires writing the entire loop in asm and requires a significant investment in time. And if you don't know what you're doing, you can easily make it slower.)
If you're interested in learning assembly language, you should be using an assembler to write the code, not a C compiler. This is all just busy-work! It took me way too long to write this answer, and had to get help from other experts to ensure that I got all of the constraints and clobbers precisely correct so as to cause optimal code to be generated, and I know what I'm doing! This would have been a 2-minute task in assembly:
lea eax, DWORD PTR [d]
lea edx, DWORD PTR [c+1]
mov BYTE PTR [eax], 'x'
mov edx, DWORD PTR [edx]
mov DWORD PTR [eax+1], edx
…and you can easily verify that it is correct!
Extra notes from #PeterCordes: If we can assume that these strings are constants/literals, then this would actually be much better:
mov DWORD PTR [d], 'xbcd' ; 0x64636278
mov BYTE PTR [d+4], 0
where d can be any addressing mode, for example [esp+6]. If we just want to pass the string to a function, writing in pure asm lets us do things like this that the compiler wouldn't, giving excellent code size and performance:
push 0 ; includes 3 extra bytes of 0 padding, but gcc was leaving more than that already
push 'xbcd' ; ESP is now pointing to the string data we just pushed
push esp ; pushes the old value. (push stack-pointer costs 1 extra uop on Intel CPUs, and AMD Ryzen, but the LEA or MOV we avoid would also be a uop).
call puts
Making the compiler store into c[] and then reloading that inside the asm statement is just silly. You could achieve this by passing in the data as a 4-byte integer with an "ri" constraint. Or maybe using if (__builtin_constant_p(data)) { } else { } to branch on whether the data was a compile-time constant or not.
If the contents of c[] aren't supposed to be a compile-time constant, and if we can assume an offset load from c[] won't cause a store-forwarding stall, the general idea of Cody's final version is good:
lea rdi, [d] ; or "mov edi, OFFSET d" if you don't need a 64-bit RIP-relative LEA for PIC code
mov edx, DWORD PTR [c+1] ; load before store to avoid any potential false dependency
mov BYTE PTR [rdi], 'x'
mov DWORD PTR [rdi+1], edx
The lea is only worth it if we need d's address in a register afterwards (which we do in this case for printf / puts). Otherwise it's better to just use [d] and [d+1], even if the addressing mode needs a 32-bit displacement. (It doesn't in this case, since c and d are both on the stack).
Or, if there's padding after d[], and targeting 64-bit, we could load 8 bytes from c (if you know that the load won't cross into another page—a cache-line split on the load or store might also make this not worth it for perf reasons):
lea rdi, [d]
mov rdx, QWORD PTR [c]
mov QWORD PTR [rdi], rdx
mov BYTE PTR [rdi], 'x' ; overlapping store: rewrite the first byte
On some CPUs, e.g. Intel since Ivy Bridge, this will be good even if c[] was just written (avoids the store-forwarding stall):
mov edx, DWORD PTR [c]
mov dl, 'x' ; modify the low byte. reading edx later will cause a partial-reg stall on older Intel CPUs
mov byte ptr[d+4], 0
mov dword ptr[d], edx
There are other ways to replace the first byte, e.g. AND and OR, which avoid problems on older Intel CPUs.
This has the advantage that reading multiple bytes at once from the start of d[] won't suffer a store-forwarding stall, since the first 4 bytes are written with a store aligned to the start of d[].
Combining both previous ideas:
mov rdx, QWORD PTR [c]
mov dl, 'x'
mov QWORD PTR [d], rdx
As usual, the optimal choice strongly depends on context (surrounding code), and on target CPU microarchitecture (Nehalem vs. Skylake vs. Silvermont vs. Bulldozer vs. Ryzen ...)
First of all, your code for string copy did not result in an exception, when I built using gcc and executed on my Windows PC. However, the string copy was not happening because your code appears to assume that register ecx points to variable d, when it actually points to variable c. The following code copies string contents of variable c to d, then replaces the first character in array d, with x. Try Compiling with gcc.
#include <stdio.h>
#include <stdlib.h>
int main(void) {
char c[5] = "abcd", d[5];
__asm__(
"leal %1, %%ebx\n"
"leal %0, %%ecx\n"
"movb (%%ecx), %%al\n"
"movb %%al, (%%ebx)\n"
"movb 1(%%ecx), %%al\n"
"movb %%al, 1(%%ebx)\n"
"movb 2(%%ecx), %%al\n"
"movb %%al, 2(%%ebx)\n"
"movb 3(%%ecx), %%al\n"
"movb %%al, 3(%%ebx)\n"
"movb 4(%%ecx), %%al\n"
"movb %%al, 4(%%ebx)\n"
"movb $'x', (%%ebx)\n"
:"=m"(c)
:"m"(d)
:"%ebx", "%ecx", "%eax"
);
printf("String d is: %s\n", d);
printf("String c remains: %s\n", c);
return 0;
}
When using MinGW gcc compiler on Windows PC, the following out put is produced:
> gcc testAsm.c
> .\a.exe
String d is: xbcd
String c remains: abcd
Is it "legal" to have a gcc inline asm statement without the actual instruction?
For example, is the asm statement "legal"? Will it introduce undefined behaviour?
int main(){
int *p = something;
asm("":"=m"(p));
return 0;
}
int main(){
int *p = 0;
asm("":"=m"(p));
return 0;
}
Compiles without any errors, but it is unnecessary:
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq $0, -8(%rbp)
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
GCC completely ignores that empty asm statement.
In that specific example, the compiler can see that none of the outputs of the asm (ie p) are ever used. Since the asm is not volatile (and has at least 1 output), the compiler is free to completely discard the statement during optimization.
It may also be worth mentioning that on i386, all extended asm statements (ie ones with parameters) always implicitly clobber fpsr (floating point flags) and eflags (think: cc clobber). If the statement isn't completely discarded (for example if it is volatile), this might have an effect. If it does, it will at worst be a tiny loss of efficiency, not incorrect results.
So to sum up:
Yes it is legal.
No it doesn't introduce undefined behavior EXCEPT that the value of p could be undefined since you are saying that you are overwriting the contents, but you aren't actually putting anything in it.
It can, theoretically, introduce tiny inefficiencies, but it probably won't.
Could someone help me understand the assembler given in https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
It goes like this:
uint64_t msr;
asm volatile ( "rdtsc\n\t" // Returns the time in EDX:EAX.
"shl $32, %%rdx\n\t" // Shift the upper bits left.
"or %%rdx, %0" // 'Or' in the lower bits.
: "=a" (msr)
:
: "rdx");
How is it different from:
uint64_t msr;
asm volatile ( "rdtsc\n\t"
: "=a" (msr));
Why do we need shift and or operations and what does rdx at the end do?
EDIT: added what is still unclear to the original question.
What does "\n\t" do?
What do ":" do?
delimiters output/input/clobbers...
Is rdx at the end equal to 0?
Just to recap. First line loads the timestamp in registers eax and edx. Second line shifts the value in eax and stores in rdx. Third line ors the value in edx with the value in rdx and saves it in rdx. Fourth line assigns the value in rdx to my variable. The last line sets rdx to 0.
Why are the first three lines without ":"?
They are a template. First line with ":" is output, second is optional input and third one is optional list of clobbers (changed registers).
Is a actually eax and d - edx? Is this hard-coded?
Thanks again! :)
EDIT2: Answered some of my questions...
uint64_t msr;
asm volatile ( "rdtsc\n\t" // Returns the time in EDX:EAX.
"shl $32, %%rdx\n\t" // Shift the upper bits left.
"or %%rdx, %0" // 'Or' in the lower bits.
: "=a" (msr)
:
: "rdx");
Because the rdtsc instruction returns it's results in edx and eax, instead of a straight 64-bit register on a 64-bit machine (See the intel system's programming manual for more information; it's an x86 instruction), the 2nd
instruction shifts the rdx register to the left 32 bits so that edx will be on the upper 32 bits instead of the lower 32 bits.
"=a" (msr) will move the contents of eax into msr (the %0), i.e. into the lower 32 bits of it, so in total you have edx (higher 32 bits) and eax (lower 32 bits) into rdx which is msr.
rdx is a clobber which will represent the msr C variable.
It's similar to doing the following in C:
static inline uint64_t rdtsc(void)
{
uint32_t eax, edx;
asm volatile("rdtsc\n\t", "=a" (eax), "=d" (edx));
return (uint64_t)eax | (uint64_t)edx << 32;
}
And:
uint64_t msr;
asm volatile ( "rdtsc\n\t"
: "=a" (msr));
This one, will just give you the contents of eax into msr.
EDIT:
1) "\n\t" is for the generated assembly to look clearer and error-free, so that you don't end up with things like movl $1, %eaxmovl $2, %ebx
2) Is rdx at the end equal to 0? The left shift does this, it removes the bits that are already in rdx.
3) Is a actually eax and d - edx? Is this hard-coded? Yes, there is a table that describes what characters represents what register, e.g. "D" would be rdi, "c" would be ecx, ...
rdtsc returns timestamp in a pair of 32-bit registers (EDX and EAX). First snippet combines them into single 64-bit register (RDX) which is mapped to msr variable.
Second snippet is the wrong one. I'm not sure about what will happen: either it won't be compiled at all, or only part of msr variable will be updated.
I am trying to familiarise myself with x86 assembly using GCC's inline assembler. I'm trying to add two numbers (a and b) and store the result in c. I have four slightly different attempts, three of which work; the last doesn't produce the expected result.
The first two examples use an intermediate register, and these both work fine. The third and fourth examples try to add the two values directly without the intermediate register, but the results vary depending on the optimization level and the order in which I add the input values. What am I getting wrong?
Environment is:
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)
First, the variables are declared as follows:
int a = 4;
int b = 7;
int c;
Example 1:
asm(" movl %1,%%eax;"
" addl %2,%%eax;"
" movl %%eax,%0;"
: "=r" (c)
: "r" (a), "r" (b)
: "%eax"
);
printf("a=%d, b=%d, c=%d\n", a, b, c);
// output: a=4, b=7, c=11
Example 2:
asm(" movl %2,%%eax;"
" addl %1,%%eax;"
" movl %%eax,%0;"
: "=r" (c)
: "r" (a), "r" (b)
: "%eax"
);
printf("a=%d, b=%d, c=%d\n", a, b, c);
// output: a=4, b=7, c=11
Example 3:
asm(" movl %2,%0;"
" addl %1,%0;"
: "=r" (c)
: "r" (a), "r" (b)
);
printf("a=%d, b=%d, c=%d\n", a, b, c);
// output with -O0: a=4, b=7, c=11
// output with -O3: a=4, b=7, c=14
Example 4:
// this one appears to calculate a+a instead of a+b
asm(" movl %1,%0;"
" addl %2,%0;"
: "=r" (c)
: "r" (a), "r" (b)
);
printf("a=%d, b=%d, c=%d\n", a, b, c);
// output with -O0: a=4, b=7, c=8
// output with -O3: a=4, b=7, c=11
SOLVED. Matthew Slattery's answer is correct. Before, it was trying to reuse eax for both b and c:
movl -4(%rbp), %edx
movl -8(%rbp), %eax
movl %edx, %eax
addl %eax, %eax
With Matthew's suggested fix in place, it now uses ecx to hold c separately.
movl -4(%rbp), %edx
movl -8(%rbp), %eax
movl %edx, %ecx
addl %eax, %ecx
By default, gcc will assume that an inline asm block will finish using the input operands before updating the output operands. This means that both an input and an output may be assigned to the same register.
But, that isn't necessarily the case in your examples 3 and 4.
e.g. in example 3:
asm(" movl %2,%0;"
" addl %1,%0;"
: "=r" (c)
: "r" (a), "r" (b)
);
...you have updated c (%0) before reading a (%1). If gcc happens to assign the same register to both %0 and %1, then it will calculate c = b; c += c, and hence will fail in exactly the way you observe:
printf("a=%d, b=%d, c=%d\n", a, b, c);
// output with -O0: a=4, b=7, c=11
// output with -O3: a=4, b=7, c=14
You can fix it by telling gcc that the output operand may be used before the inputs are consumed, by adding the "&" modifier to the operand, like this:
asm(" movl %2,%0;"
" addl %1,%0;"
: "=&r" (c)
: "r" (a), "r" (b)
);
(See "Constraint Modifier Characters" in the gcc docs.)
Hoi, i do not see a problem there and it compiles and works fine here. However, a small hint: I got confused with the unnamed variables/registers quite soon, so I decided on using named ones. The add thingy you could for instance implement like this:
static inline void atomicAdd32(volInt32 *dest, int32_t source) {
// IMPLEMENTS: add m32, r32
__asm__ __volatile__(
"lock; addl %[in], %[out]"
: [out] "+m"(*dest)
: [in] "ir"(source)//, "[out]" "m"(*dest)
);
return;
}
(you can just ignore the atomic/ lock things for now), that makes clear what happens:
1) what registers are writeable, readable or both
2) what is used (memory, registers), which can be important when it comes to performance and clock cycles, as register operations are quicker than those accessing the memory.
Cheers,
G.
P.S.: Have you checked whether your compiler rearranges code?
I am using cmpxchg (compare-and-exchange) in i686 architecture for 32 bit compare and swap as follows.
(Editor's note: the original 32-bit example was buggy, but the question isn't about it. I believe this version is safe, and as a bonus compiles correctly for x86-64 as well. Also note that inline asm isn't needed or recommended for this; __atomic_compare_exchange_n or the older __sync_bool_compare_and_swap work for int32_t or int64_t on i486 and x86-64. But this question is about doing it with inline asm, in case you still want to.)
// note that this function doesn't return the updated oldVal
static int CAS(int *ptr, int oldVal, int newVal)
{
unsigned char ret;
__asm__ __volatile__ (
" lock\n"
" cmpxchgl %[newval], %[mem]\n"
" sete %0\n"
: "=q" (ret), [mem] "+m" (*ptr), "+a" (oldVal)
: [newval]"r" (newVal)
: "memory"); // barrier for compiler reordering around this
return ret; // ZF result, 1 on success else 0
}
What is the equivalent for x86_64 architecture for 64 bit compare and swap
static int CAS(long *ptr, long oldVal, long newVal)
{
unsigned char ret;
// ?
return ret;
}
The x86_64 instruction set has the cmpxchgq (q for quadword) instruction for 8-byte (64 bit) compare and swap.
There's also a cmpxchg8b instruction which will work on 8-byte quantities but it's more complex to set up, needing you to use edx:eax and ecx:ebx rather than the more natural 64-bit rax. The reason this exists almost certainly has to do with the fact Intel needed 64-bit compare-and-swap operations long before x86_64 came along. It still exists in 64-bit mode, but is no longer the only option.
But, as stated, cmpxchgq is probably the better option for 64-bit code.
If you need to cmpxchg a 16 byte object, the 64-bit version of cmpxchg8b is cmpxchg16b. It was missing from the very earliest AMD64 CPUs, so compilers won't generate it for std::atomic::compare_exchange on 16B objects unless you enable -mcx16 (for gcc). Assemblers will assemble it, though, but beware that your binary won't run on the earliest K8 CPUs. (This only applies to cmpxchg16b, not to cmpxchg8b in 64-bit mode, or to cmpxchgq).
cmpxchg8b
__forceinline int64_t interlockedCompareExchange(volatile int64_t & v,int64_t exValue,int64_t cmpValue)
{
__asm {
mov esi,v
mov ebx,dword ptr exValue
mov ecx,dword ptr exValue + 4
mov eax,dword ptr cmpValue
mov edx,dword ptr cmpValue + 4
lock cmpxchg8b qword ptr [esi]
}
}
The x64 architecture supports a 64-bit compare-exchange using the good, old cmpexch instruction. Or you could also use the somewhat more complicated cmpexch8b instruction (from the "AMD64 Architecture Programmer's Manual Volume 1: Application Programming"):
The CMPXCHG instruction compares a
value in the AL or rAX register with
the first (destination) operand, and
sets the arithmetic flags (ZF, OF, SF,
AF, CF, PF) according to the result.
If the compared values are equal, the
source operand is loaded into the
destination operand. If they are not
equal, the first operand is loaded
into the accumulator. CMPXCHG can be
used to try to intercept a semaphore,
i.e. test if its state is free, and if
so, load a new value into the
semaphore, making its state busy. The
test and load are performed
atomically, so that concurrent
processes or threads which use the
semaphore to access a shared object
will not conflict.
The CMPXCHG8B
instruction compares the 64-bit values
in the EDX:EAX registers with a 64-bit
memory location. If the values are
equal, the zero flag (ZF) is set, and
the ECX:EBX value is copied to the
memory location. Otherwise, the ZF
flag is cleared, and the memory value
is copied to EDX:EAX.
The CMPXCHG16B
instruction compares the 128-bit value
in the RDX:RAX and RCX:RBX registers
with a 128-bit memory location. If the
values are equal, the zero flag (ZF)
is set, and the RCX:RBX value is
copied to the memory location.
Otherwise, the ZF flag is cleared, and
the memory value is copied to rDX:rAX.
Different assembler syntaxes may need to have the length of the operations specified in the instruction mnemonic if the size of the operands can't be inferred. This may be the case for GCC's inline assembler - I don't know.
usage of cmpxchg8B from AMD64 Architecture Programmer's Manual V3:
Compare EDX:EAX register to 64-bit memory location. If equal, set the zero flag (ZF) to 1 and copy the ECX:EBX register to the memory location. Otherwise,
copy the memory location to EDX:EAX and clear the zero flag.
I use cmpxchg8B to implement a simple mutex lock function in x86-64 machine. here is the code
.text
.align 8
.global mutex_lock
mutex_lock:
pushq %rbp
movq %rsp, %rbp
jmp .L1
.L1:
movl $0, %edx
movl $0, %eax
movl $0, %ecx
movl $1, %ebx
lock cmpxchg8B (%rdi)
jne .L1
popq %rbp
ret