I am trying to familiarise myself with x86 assembly using GCC's inline assembler. I'm trying to add two numbers (a and b) and store the result in c. I have four slightly different attempts, three of which work; the last doesn't produce the expected result.
The first two examples use an intermediate register, and these both work fine. The third and fourth examples try to add the two values directly without the intermediate register, but the results vary depending on the optimization level and the order in which I add the input values. What am I getting wrong?
Environment is:
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)
First, the variables are declared as follows:
int a = 4;
int b = 7;
int c;
Example 1:
asm(" movl %1,%%eax;"
" addl %2,%%eax;"
" movl %%eax,%0;"
: "=r" (c)
: "r" (a), "r" (b)
: "%eax"
);
printf("a=%d, b=%d, c=%d\n", a, b, c);
// output: a=4, b=7, c=11
Example 2:
asm(" movl %2,%%eax;"
" addl %1,%%eax;"
" movl %%eax,%0;"
: "=r" (c)
: "r" (a), "r" (b)
: "%eax"
);
printf("a=%d, b=%d, c=%d\n", a, b, c);
// output: a=4, b=7, c=11
Example 3:
asm(" movl %2,%0;"
" addl %1,%0;"
: "=r" (c)
: "r" (a), "r" (b)
);
printf("a=%d, b=%d, c=%d\n", a, b, c);
// output with -O0: a=4, b=7, c=11
// output with -O3: a=4, b=7, c=14
Example 4:
// this one appears to calculate a+a instead of a+b
asm(" movl %1,%0;"
" addl %2,%0;"
: "=r" (c)
: "r" (a), "r" (b)
);
printf("a=%d, b=%d, c=%d\n", a, b, c);
// output with -O0: a=4, b=7, c=8
// output with -O3: a=4, b=7, c=11
SOLVED. Matthew Slattery's answer is correct. Before, it was trying to reuse eax for both b and c:
movl -4(%rbp), %edx
movl -8(%rbp), %eax
movl %edx, %eax
addl %eax, %eax
With Matthew's suggested fix in place, it now uses ecx to hold c separately.
movl -4(%rbp), %edx
movl -8(%rbp), %eax
movl %edx, %ecx
addl %eax, %ecx
By default, gcc will assume that an inline asm block will finish using the input operands before updating the output operands. This means that both an input and an output may be assigned to the same register.
But, that isn't necessarily the case in your examples 3 and 4.
e.g. in example 3:
asm(" movl %2,%0;"
" addl %1,%0;"
: "=r" (c)
: "r" (a), "r" (b)
);
...you have updated c (%0) before reading a (%1). If gcc happens to assign the same register to both %0 and %1, then it will calculate c = b; c += c, and hence will fail in exactly the way you observe:
printf("a=%d, b=%d, c=%d\n", a, b, c);
// output with -O0: a=4, b=7, c=11
// output with -O3: a=4, b=7, c=14
You can fix it by telling gcc that the output operand may be used before the inputs are consumed, by adding the "&" modifier to the operand, like this:
asm(" movl %2,%0;"
" addl %1,%0;"
: "=&r" (c)
: "r" (a), "r" (b)
);
(See "Constraint Modifier Characters" in the gcc docs.)
Hi, I don't see a problem there; it compiles and works fine here. However, a small hint: I got confused by the unnamed operands quite quickly, so I decided to use named ones. The add you could, for instance, implement like this:
static inline void atomicAdd32(volInt32 *dest, int32_t source) {
// IMPLEMENTS: add m32, r32
__asm__ __volatile__(
"lock; addl %[in], %[out]"
: [out] "+m"(*dest)
: [in] "ir"(source)//, "[out]" "m"(*dest)
);
return;
}
(you can just ignore the atomic/lock things for now). This makes clear what happens:
1) which operands are writable, readable, or both
2) what is used (memory or registers), which can be important for performance and clock cycles, as register operations are quicker than those that access memory.
Cheers,
G.
P.S.: Have you checked whether your compiler rearranges code?
case 1->
int a;
std :: cout << a << endl; // prints 0
case 2->
int a;
std :: cout << &a << " " << a << endl; // 0x7ffc057370f4 32764
Whenever I print the address of the variable, it isn't initialized to the default value; why is that?
I thought the value of a in case 2 was junk, but every time I run the code it shows values like 32764, 32765, 32766, 32767; are these still junk values?
Variables in C++ are not initialized to a default value, hence there's no way to determine the value. You can read more about it here.
I'm afraid the accepted answer does not touch the main point of the question:
why
int a;
std :: cout << a << endl; // prints 0
always prints 0, as if a was initialized to its default value, whereas in
int a;
std :: cout << &a << " " << a << endl; // 0x7ffc057370f4 32764
the compiler produces some junk value for a.
Yes, in both cases we have an example of undefined behavior, and ANY value of a is possible, so why is it always 0 in Case 1?
First of all remember that a C/C++ compiler is free to modify the source code in an arbitrary way as long as the meaning of the program remains the same. So, if you write
int a;
std :: cout << a << endl; // prints 0
the compiler is free to assume that a need not be associated with any real RAM cells. You don't read it, nor do you write to a. So the compiler is free to allocate the memory for a in one of its registers. In such a case a has no address and is functionally equivalent to something as weird as a "named, addressless temporary". However, in Case 2 you ask the compiler to print the address of a. In that case the compiler cannot ignore the request and generates code for the memory where a would be allocated, even though the value of a can be junk.
The next factor is optimization. You can either switch it off completely in Debug compilation mode or turn on aggressive optimization in Release mode. So, you can expect that your simple code will behave differently whether you compile it as Debug or Release. Moreover, since it is undefined behavior, your code may run differently if compiled with different compilers or even different versions of the same compiler.
I prepared a version of your program that is a bit easier to analyze:
#include <iostream>
int f()
{
int a;
return a; // prints 0
}
int g()
{
int a;
return reinterpret_cast<long long int>(&a) + a; // prints junk
}
int main() { std::cout << f() << " " << g() << "\n"; }
Function g differs from f in that it uses the address of the uninitialized variable a. I tested it in Godbolt Compiler Explorer: https://godbolt.org/z/os8b583ss You can switch there between various compilers and various optimization options. Please do experiment yourself. For Debug and gcc or clang, use -O0 or -g; for Release use -O3.
For the newest (trunk) gcc, we have the following assembly equivalent:
f():
xorl %eax, %eax
ret
g():
leaq -4(%rsp), %rax
addl -4(%rsp), %eax
ret
main:
subq $24, %rsp
xorl %esi, %esi
movl $_ZSt4cout, %edi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
leaq 12(%rsp), %rsi
movl $_ZSt4cout, %edi
addl 12(%rsp), %esi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xorl %eax, %eax
addq $24, %rsp
ret
Please notice that f() was reduced to a trivial setting of the eax register to zero (for any value of integer a, a xor a equals 0); eax is the register in which this function returns its value. Hence 0 in Release. Well, actually, no, the compiler is even smarter: it never calls f()! Instead, it zeroes the esi register that is used in the call to operator<<. Similarly, g is replaced by reading 12(%rsp), once as a value and once as an address. This generates a random value for a and rather similar values for &a. AFAIK, they're a bit randomized to make life harder for attackers targeting our code.
Now the same code in Debug:
f():
pushq %rbp
movq %rsp, %rbp
movl -4(%rbp), %eax
popq %rbp
ret
g():
pushq %rbp
movq %rsp, %rbp
leaq -4(%rbp), %rax
movl %eax, %edx
movl -4(%rbp), %eax
addl %edx, %eax
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
call f()
movl %eax, %esi
movl $_ZSt4cout, %edi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
call g()
movl %eax, %esi
movl $_ZSt4cout, %edi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
movl $0, %eax
popq %rbp
ret
You can now clearly see, even without knowing the 386 assembly (I don't know it either) that in Debug mode (-g) the compiler performs no optimization at all. In f() it reads a (4 bytes below the frame pointer register value, -4(%rbp)) and moves it to the "result register" eax. In g(), the same is done, but a is read once as a value and once as an address. Moreover, both f() and g() are called in main(). In this compiler mode, the program produces "random" results for a (try it yourself!).
To make things even more interesting, here's f() as compiled by clang (trunk) in Release:
f(): # #f()
retq
g(): # #g()
retq
Can you see? These functions are so trivial to clang that it generated no code for them. Moreover, it did not zero the register corresponding to a, so, unlike g++, clang produces a random value for a (in both Release and Debug).
You can go with your experiments even further and find that what clang produces for f depends on whether f or g is called first in main.
Now you should have a better understanding of what Undefined Behavior is.
I am running the code below and suffering from two problems:
1) The moment I change movl (used to copy values from the registers) to movq, I get the gas error Error: operand size mismatch for movq. In plain assembly I have seen this solved by adding a qword prefix or the like, but that also fails to satisfy gcc.
uint64_t cpuid_0(uint64_t* _rax, uint64_t* _rbx, uint64_t* _rcx, uint64_t* _rdx){
int a, b, c, d;
*_rax = 0x0;
__asm__
__volatile__
(
"movq $0, %%rax\n"
"cpuid\n"
"movl %%eax, %0\n"
"movl %%ebx, %1\n"
"movl %%ecx, %2\n"
"movl %%edx, %3\n"
: "=r" (a), "=r" (b), "=r" (c), "=r" (d)
: "0" (a)
);
*_rax=a;*_rbx=b;*_rcx=c;*_rdx=d;
return *_rax;
}
2) I want to eliminate the extra copy operations, so I modified the constraint specification in my code:
uint64_t cpuid_0(uint64_t* _rax, uint64_t* _rbx, uint64_t* _rcx, uint64_t* _rdx){
int a, b, c, d;
*_rax = 0x0;
__asm__
__volatile__
(
"movq $0, %%rax\n"
"cpuid\n"
"movl %%eax, %0\n"
"movl %%ebx, %1\n"
"movl %%ecx, %2\n"
"movl %%edx, %3\n"
: "+m" (*_rax), "=m" (*_rbx), "=m" (*_rcx), "=m" (_rdx)
: "0" (*_rax)
);
*_rax=a;*_rbx=b;*_rcx=c;*_rdx=d;
return *_rax;
}
This gives me a host of errors like those below:
warning: matching constraint does not allow a register
error: inconsistent operand constraints in an ‘asm’
Also, I assume __volatile__ could be removed in this small code.
It's the input "0" (*_rax) which is foxing it... it seems that "0" does not work with a "=m" memory constraint, nor with "+m". (I do not know why.)
Changing your second function to compile and work:
uint32_t cpuid_0(uint32_t* _eax, uint32_t* _ebx, uint32_t* _ecx, uint32_t* _edx)
{
__asm__
(
"mov $0, %%eax\n"
"cpuid\n"
"mov %%eax, %0\n"
"mov %%ebx, %1\n"
"mov %%ecx, %2\n"
"mov %%edx, %3\n"
: "=m" (*_eax), "=m" (*_ebx), "=m" (*_ecx), "=m" (*_edx)
: //"0" (*_eax) -- not required and throws errors !!
: "%rax", "%rbx", "%rcx", "%rdx" // ESSENTIAL "clobbers"
) ;
return *_eax ;
}
where that:
does everything as uint32_t, for consistency.
discards the redundant int a, b, c, d;
omits the "0" input, which in any case was not being used.
declares simple "=m" output for (*_eax)
"clobbers" all "%rax", "%rbx", "%rcx", "%rdx"
discards the redundant volatile.
The clobbers are essential, because without them the compiler has no idea that those registers are affected.
The above compiles to:
push %rbx # compiler (now) knows %rbx is "clobbered"
mov %rdx,%r8 # likewise %rdx
mov %rcx,%r9 # ditto %rcx
mov $0x0,%eax # the __asm__(....
cpuid
mov %eax,(%rdi)
mov %ebx,(%rsi)
mov %ecx,(%r8)
mov %edx,(%r9) # ....) ;
mov (%rdi),%eax
pop %rbx
retq
NB: without the "clobbers" compiles to:
mov $0x0,%eax
cpuid
mov %eax,(%rdi)
mov %ebx,(%rsi)
mov %ecx,(%rdx)
mov %edx,(%rcx)
mov (%rdi),%eax
retq
which is shorter, but sadly doesn't work !!
You could also (version 2):
struct cpuid
{
uint32_t eax ;
uint32_t ebx ;
uint32_t ecx ;
uint32_t edx ;
};
uint32_t cpuid_0(struct cpuid* cid)
{
uint32_t eax ;
__asm__
(
"mov $0, %%eax\n"
"cpuid\n"
"mov %%ebx, %1\n"
"mov %%ecx, %2\n"
"mov %%edx, %3\n"
: "=a" (eax), "=m" (cid->ebx), "=m" (cid->ecx), "=m" (cid->edx)
:: "%ebx", "%ecx", "%edx"
) ;
return cid->eax = eax ;
}
which compiles to something very slightly shorter:
push %rbx
mov $0x0,%eax
cpuid
mov %ebx,0x4(%rdi)
mov %ecx,0x8(%rdi)
mov %edx,0xc(%rdi)
pop %rbx
mov %eax,(%rdi)
retq
Or you could do something more like your first version (version 3):
uint32_t cpuid_0(struct cpuid* cid)
{
uint32_t eax, ebx, ecx, edx ;
eax = 0 ;
__asm__(" cpuid\n" : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx));
cid->edx = edx ;
cid->ecx = ecx ;
cid->ebx = ebx ;
return cid->eax = eax ;
}
which compiles to:
push %rbx
xor %eax,%eax
cpuid
mov %ebx,0x4(%rdi)
mov %edx,0xc(%rdi)
pop %rbx
mov %ecx,0x8(%rdi)
mov %eax,(%rdi)
retq
This version uses the "+a", "=b" etc. magic to tell the compiler to allocate specific registers to the various variables. This reduces the amount of assembler to the bare minimum, which is generally a Good Thing. [Note that the compiler knows that xor %eax,%eax is better (and shorter) than mov $0,%eax and thinks there is some advantage to doing the pop %rbx earlier.]
Better yet -- following comment by #Peter Cordes (version 4):
uint32_t cpuid_1(struct cpuid* cid)
{
__asm__
(
"xor %%eax, %%eax\n"
"cpuid\n"
: "=a" (cid->eax), "=b" (cid->ebx), "=c" (cid->ecx), "=d" (cid->edx)
) ;
return cid->eax ;
}
where the compiler figures out that cid->eax is already in %eax, and so compiles to:
push %rbx
xor %eax,%eax
cpuid
mov %ebx,0x4(%rdi)
mov %eax,(%rdi)
pop %rbx
mov %ecx,0x8(%rdi)
mov %edx,0xc(%rdi)
retq
which is the same as version 3, apart from a small difference in the order of the instructions.
FWIW: an __asm__() is defined to be:
asm asm-qualifiers (AssemblerTemplate : OutputOperands [ : InputOperands [ : Clobbers ] ] )
The key to inline assembler is to understand that the compiler:
has no idea what the AssemblerTemplate part means.
It does expand the %xx placeholders, but understands nothing else.
does understand the OutputOperands, InputOperands (if any) and Clobbers (if any)...
...these tell the compiler what the assembler needs as parameters, and how to expand the various %xx.
...but these also tell the compiler what the AssemblerTemplate does, in terms that the compiler understands.
So, what the compiler understands is a sort of "data flow". It understands that the assembler takes a number of inputs, returns a number of outputs and (may) as a side effect "clobber" some registers and/or amounts of memory. Armed with this information, the compiler can integrate the "black box" assembler sequence with the code generated around it. Among other things the compiler will:
allocate registers for output and input operands
and arrange for the inputs to be in the required registers (as required).
NB: the compiler looks at the assembler as a single operation, where all inputs are consumed before any outputs are generated. If an input is not used after the __asm__(), the compiler may allocate the same register as an input and as an output. Hence the need for the so-called "earlyclobber".
move the "black box" around wrt the surrounding code, maintaining the dependencies the assembler has on the sources of its inputs and the dependencies the code that follows has on the assembler's outputs.
discard the "black box" altogether if nothing seems to depend on its outputs !
I am currently learning GCC's extended inline assembly. I wrote an A + B function and want to detect the ZF flag, but things behave strangely.
The compiler I use is gcc 7.3.1 on x86-64 Arch Linux.
I started from the following code, this code will correctly print the a + b.
int a, b, sum;
scanf("%d%d", &a, &b);
asm volatile (
"movl %1, %0\n"
"addl %2, %0\n"
: "=r"(sum)
: "r"(a), "r"(b)
: "cc"
);
printf("%d\n", sum);
Then I simply added a variable to check flags, it gives me wrong output.
int a, b, sum, zero;
scanf("%d%d", &a, &b);
asm volatile (
"movl %2, %0\n"
"addl %3, %0\n"
: "=r"(sum), "=@ccz"(zero)
: "r"(a), "r"(b)
: "cc"
);
printf("%d %d\n", sum, zero);
The GAS assembly output is
movl -24(%rbp), %eax # %eax = a
movl -20(%rbp), %edx # %edx = b
#APP
# 6 "main.c" 1
movl %eax, %edx
addl %edx, %edx
# 0 "" 2
#NO_APP
sete %al
movzbl %al, %eax
movl %edx, -16(%rbp) # sum = %edx
movl %eax, -12(%rbp) # zero = %eax
This time, the sum will become a + a. But when I just exchanged %2 and %3, the output will be correct a + b.
Then I checked various gcc version (It seems clang does not support it when output is a flag) on wandbox.org, from version 4.5.4 to version 4.7.4 gives the correct result a + b, and starting from version 4.8.1 the outputs are all a + a.
My question is: did I write the wrong code or is there anything wrong with gcc?
The problem is that you clobber %0 before all the inputs (%2 in your case) are consumed:
"movl %1, %0\n"
"addl %2, %0\n"
%0 is being modified by the first MOV before %2 has been consumed. It is possible for an optimizing compiler to re-use a register for an input constraint that was used for an output constraint. In your case one of the compilers chose to use the same register for %2 and %0 which caused the erroneous results.
The way to get around this problem, where a register is modified before all the inputs are consumed, is to mark the output constraint with a &. The & is a modifier denoting Early Clobber:
‘&’
Means (in a particular alternative) that this operand is an earlyclobber operand, which is written before the instruction is finished using the input operands. Therefore, this operand may not lie in a register that is read by the instruction or as part of any memory address.
‘&’ applies only to the alternative in which it is written. In constraints with multiple alternatives, sometimes one alternative requires ‘&’ while others do not. See, for example, the ‘movdf’ insn of the 68000.
An operand which is read by the instruction can be tied to an earlyclobber operand if its only use as an input occurs before the early result is written. Adding alternatives of this form often allows GCC to produce better code when only some of the read operands can be affected by the earlyclobber. See, for example, the ‘mulsi3’ insn of the ARM.
Furthermore, if the earlyclobber operand is also a read/write operand, then that operand is written only after it’s used.
‘&’ does not obviate the need to write ‘=’ or ‘+’. As earlyclobber operands are always written, a read-only earlyclobber operand is ill-formed and will be rejected by the compiler.
The change to your code would be to modify "=r"(sum) to be "=&r"(sum). This will prevent the compiler from using the register used for the output constraint for one of the input constraints.
Word of warning. GCC Inline Assembly is powerful and evil. Very easy to get wrong if you don't know what you are doing. Only use it if you must, avoid it if you can.
In the GCC cdecl calling convention, can I rely on the arguments I pushed onto the stack to be the same after the call has returned? Even when mixing ASM and C and with optimization (-O2) enabled?
In a word: No.
Consider this code:
__cdecl int foo(int a, int b)
{
a = 5;
b = 6;
return a + b;
}
int main()
{
return foo(1, 2);
}
This produced this asm output (compiled with -O0):
movl $5, 8(%ebp)
movl $6, 12(%ebp)
movl 8(%ebp), %edx
movl 12(%ebp), %eax
addl %edx, %eax
popl %ebp
ret
So it is quite possible for a __cdecl function to stomp on the stack values.
That's not even counting the possibility of inlining or other optimization magic where things may not end up on the stack in the first place.
Context
Linux 64bit. GCC 4.8.2.
Gas assembly. AT&T syntax.
I just read this answer.
The code:
int operand1, operand2, sum, accumulator;
operand1 = 10; operand2 = 15;
__asm__ volatile ("movl %1, %0\n\t"
"addl %2, %0"
: "=r" (sum) /* output operands */
: "r" (operand1), "r" (operand2) /* input operands */
: "0"); /* clobbered operands */
accumulator = sum;
__asm__ volatile ("addl %1, %0\n\t"
"addl %2, %0"
: "=r" (accumulator)
: "0" (accumulator), "r" (operand1), "r" (operand2)
: "0");
Compiled with no optimizations of course.
I made my experiments with valgrind --tool=cachegrind ./my_bin
Actually, if I replace
"0" (accumulator), "r" (operand1), "r" (operand2)
With
"0" (accumulator), "m" (operand1), "m" (operand2)
I get one less instruction == one cpu cycle saved, because there is no register manipulation.
Now, replacing
"0" (accumulator), "r" (operand1), "r" (operand2)
With
"r" (accumulator), "r" (operand1), "r" (operand2)
I get 1 cpu cycle shaved as well.
So
"r" (accumulator), "m" (operand1), "m" (operand2)
Saves 2 cpu cycles.
Questions
1) Why should we use at least one register if registers slow things down? Is there really a risk of an overwrite or something?
2) Why the heck does "0" instead of "r" slow things down? It is not logical to me, since we just reference the same value (which is accumulator). GCC should not output different code! "r" could imply choosing another register -> nonsense && slow.
Without getting into an asm tutorial, I thought it might be better to look at code generation with and without optimization. I'm using OSX, which is basically the same ABI as x86-64 Linux.
First: you're finding sum <- op1 + op2,
followed by: acc <- sum; acc <- acc + op1 + op2,
which we can just replace with: acc <- sum + op1 + op2; don't need: acc = sum;
(this was broken by the way - op1, op2 are %2, %3 respectively, and %1 'aliases' %0)
This still isn't a particularly efficient use of inline assembly, but just to fix things up a bit into something that can be examined:
int test_fn (void)
{
int op1 = 10, op2 = 15, sum, acc;
__asm__ ("movl %k1, %k0\n\taddl %k2, %k0"
: "=&r" (sum) : "r" (op1), "r" (op2));
__asm__ ("addl %k2, %k0\n\taddl %k3, %k0"
: "=r" (acc) : "0" (sum), "r" (op1), "r" (op2));
return acc;
}
Without optimization: gcc -Wall -c -S src.c (comments are mine)
pushq %rbp
movq %rsp, %rbp
movl $10, -4(%rbp) # store 10 -> mem (op1)
movl $15, -8(%rbp) # store 15 -> mem (op2)
# asm(1)
movl -4(%rbp), %edx # load op1 -> reg (%1)
movl -8(%rbp), %ecx # load op2 -> reg (%2)
movl %edx, %eax # mov %1 to %0
addl %ecx, %eax # add %2 to %0
movl %eax, -12(%rbp) # store %0 -> mem (sum)
# asm(2)
movl -12(%rbp), %eax # load sum -> reg (%1 = %0)
movl -4(%rbp), %edx # load op1 -> reg (%2)
movl -8(%rbp), %ecx # load op2 -> reg (%3)
addl %edx, %eax # add %2 to %0
addl %ecx, %eax # add %3 to %0
movl %eax, -16(%rbp) # store %0 -> mem (acc)
movl -16(%rbp), %eax # load acc -> return value.
popq %rbp
ret
The compiler has made no effort to keep intermediate results in registers. It simply saves them back to temporary memory on the stack, and loads again as needed. It's fairly easy to follow though.
Let's apply your change to asm(2) inputs: "0" (sum), "m" (op1), "m" (op2)
...
# asm(2)
movl -4(%rbp), %eax # load sum -> reg (%1 = %0)
addl -12(%rbp), %eax # add op1 (mem) to %0
addl -16(%rbp), %eax # add op2 (mem) to %0
movl %eax, -8(%rbp) # store %0 -> mem (acc)
...
The memory locations are a bit different, but that doesn't matter. The fact that there's a form of add with reg <- reg + mem means we don't need to load to a register first. So indeed it does save an instruction, but we're still reading from and writing to memory.
With optimization: gcc -Wall -O2 -c -S src.c
movl $10, %edx
movl $15, %ecx
# asm(1)
movl %edx, %eax
addl %ecx, %eax
# asm(2)
addl %edx, %eax
addl %ecx, %eax
ret
There's no memory access. Everything is done in registers. That's as fast as it gets. No cache access, no main memory, etc. If we apply the change to use "m" constraints as we did in the unoptimized case:
movl $10, -8(%rsp)
movl $15, %ecx
movl $10, %edx
movl $15, -4(%rsp)
# asm(1)
movl %edx, %eax
addl %ecx, %eax
# asm(2)
addl -8(%rsp), %eax
addl -4(%rsp), %eax
ret
We're back to forcing the use of memory. Needlessly storing and loading operands for asm(2). It's not that valgrind was wrong - just the inference that register use was responsible for slowing things down.