This simple code should copy the string "c" into the string "d", changing only the first char to 'x':
#include <stdio.h>
#include <stdlib.h>
int main(void) {
char c[5] = "abcd", d[5];
__asm__(
"leal %1, %%ebx\n"
"leal %0, %%ecx\n"
"movb $'x', (%%ecx)\n"
"movb 1(%%ebx), %%al\n"
"movb %%al, 1(%%ecx)\n"
"movb 2(%%ebx), %%al\n"
"movb %%al, 2(%%ecx)\n"
"movb 3(%%ebx), %%al\n"
"movb %%al, 3(%%ecx)\n"
"movb $0, 4(%%ecx)\n"
:"=m"(c)
:"m"(d)
:"%ebx", "%ecx", "%eax"
);
printf("%s\n", d);
return 0;
}
But it gives a segmentation fault error. I believe my problem is with the constraints, but I can't figure out how to fix this.
What is the right way, and how can I change this code to work?
Yes, the input/output operands are wrong. The format is this:
__asm__("<instructions>\n\t"
: OutputOperands
: InputOperands
: Clobbers);
You have the inputs and outputs backwards. You have c as an output, when it should be an input (since you're reading from it). You have d as an input, when it should be an output (since you're writing c to it).
Thus, your inline assembly should be written as follows:
__asm__("leal %1, %%ebx\n\t"
"leal %0, %%ecx\n\t"
"movb $'x', (%%ecx)\n\t"
"movb 1(%%ebx), %%al\n\t"
"movb %%al, 1(%%ecx)\n\t"
"movb 2(%%ebx), %%al\n\t"
"movb %%al, 2(%%ecx)\n\t"
"movb 3(%%ebx), %%al\n\t"
"movb %%al, 3(%%ecx)\n\t"
"movb $0, 4(%%ecx)"
: "=m" (d)
: "m" (c)
: "%ebx", "%ecx", "%eax"
);
But you are also not making the most efficient use of the operands. You have several manual load operations (lea) that you've written in assembly. You don't need to write these; that's the whole point of the GNU extended inline assembly syntax: the compiler will generate the necessary load and store instructions for you. Not only does that make the code simpler and easier to write and maintain, but it also makes it more efficient, because the compiler can better schedule/arrange the loads and stores within surrounding code, and skip the lea instructions entirely.
Making these modifications to use operands more efficiently, as well as using names for the operands to make the code easier to read, you would have:
__asm__("movb $'x', (%[d])\n\t"
"movb 1(%[c]), %%al\n\t"
"movb %%al, 1(%[d])\n\t"
"movb 2(%[c]), %%al\n\t"
"movb %%al, 2(%[d])\n\t"
"movb 3(%[c]), %%al\n\t"
"movb %%al, 3(%[d])\n\t"
"movb $0, 4(%[d])"
: "=m" (d) // dummy arg: tell the compiler we write all of d[]
: [c] "r" (c)
, [d] "r" (d)
, "m" (c) // unused dummy arg: tell the compiler we read all of c[]
: "%eax"
);
We're asking the compiler for the pointers to be in registers with the r constraint, so we can still choose the addressing mode (reg+displacement) ourselves, as in your original. This causes the compiler to implicitly generate the two required lea instructions. Not only does this make the code simpler to write, but it also lets the compiler choose which registers it wants to use, which can make the code more efficient. (For example, it needs d in %rdi as an arg for printf. Compiler-generated setup for asm statements is optimized along with normal code, so it doesn't have to repeat this work like it would if you wrote the lea explicitly in asm. Leave as much as possible to the compiler, so it can optimize away when possible.)
Note that asking for a pointer with an r constraint doesn't imply that you dereference it. Thus, we use "m" and "=m" dummy memory operands to tell the compiler what memory is read and written, so it will ensure that the contents match program-order even in a more complex case where your function is inlined into another function that modifies c[] and d[] before and after. This works well because c[] and d[] are true C arrays, with static size. It wouldn't work if they were just pointers that you got from function args. In that case, "=m" (d) would tell the compiler that the asm writes a pointer value into a memory location, not the pointed-to contents. "=m" (*d) would tell the compiler that the asm writes the first byte. As the official docs point out, you could write something ugly using a GNU C statement-expression like:
{"m"( ({ const struct { char x[5]; } *p = (const void *)c ; *p; }) )}
Or, you could instead use a "memory" clobber to tell the compiler that all memory must be in sync. With no output operands at all, the asm block would be implicitly __volatile__, which also prevents reordering. But if you had one unused dummy output to let the compiler choose a scratch register (see below), and didn't manually use __volatile__, then the compiler would prove to itself that you never use the results and optimize out the entire block of inline assembly! (It's better to tell the compiler in as much detail as possible how your asm interacts with C variables, rather than relying on __volatile__.)
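For the pointer case described above, the dummy memory operands have to describe the pointed-to bytes rather than the pointer itself. The following is only a minimal sketch under stated assumptions: the helper name copy_replace_first and the wrapper type struct five are hypothetical (they are not part of the question), and the plain C branch is a fallback for non-x86 targets.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 5-byte wrapper type, used only to give the memory operands a size. */
struct five { char x[5]; };

/* Hypothetical helper: dst[0] = 'x', dst[1..3] = src[1..3], dst[4] = 0. */
static void copy_replace_first(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    char tmp; /* scratch byte register chosen by the compiler */
    __asm__("movb $'x', (%[d])\n\t"
            "movb 1(%[s]), %[t]\n\t"
            "movb %[t], 1(%[d])\n\t"
            "movb 2(%[s]), %[t]\n\t"
            "movb %[t], 2(%[d])\n\t"
            "movb 3(%[s]), %[t]\n\t"
            "movb %[t], 3(%[d])\n\t"
            "movb $0, 4(%[d])"
            : [t] "=&q" (tmp),                   /* early-clobber: must not overlap d or s */
              "=m" (*(struct five *)dst)         /* dummy: the asm writes 5 bytes at dst */
            : [d] "r" (dst),
              [s] "r" (src),
              "m" (*(const struct five *)src));  /* dummy: the asm reads the bytes at src */
#else
    /* Portable fallback with the same effect. */
    dst[0] = 'x';
    memcpy(dst + 1, src + 1, 3);
    dst[4] = '\0';
#endif
}
```

Calling copy_replace_first(d, c) with the question's arrays should leave "xbcd" in d. The "q" constraint is used instead of "r" so that a byte-addressable register is also picked in 32-bit builds.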
Letting the compiler choose the addressing mode will work fine for us. It avoids an extra compiler-generated lea instruction ahead of the asm block, and it simplifies the constraints because we actually use the memory operands instead of separately asking for pointers in registers.
(The compiler could still have avoided an lea in the other version if it put c[] or d[] at esp+0, so the pointer-register operand could be esp).
__asm__("movb $'x', %[d]\n\t"
"movb 1 + %[c], %%al\n\t"
"movb %%al, 1 + %[d]\n\t"
"movb 2 + %[c], %%al\n\t"
"movb %%al, 2 + %[d]\n\t"
"movb 3 + %[c], %%al\n\t"
"movb %%al, 3 + %[d]\n\t"
"movb $0, 4 + %[d]"
: [d] "=&m" (d) // not sure if early-clobber is needed,
// e.g. if the compiler would otherwise be allowed to put an output memory operand at the same address as an input operand.
// It's an error with gcc 4.7 and earlier, but gcc that old also doesn't accept "m"(c) as an input memory operand
: [c] "m" (c)
: "%eax"
);
See also Looping over arrays with inline assembly for more discussion of picking addressing mode yourself vs. using "m" constraints to let the compiler pick. (If you don't want to get into that level of optimization, you probably shouldn't be using inline asm in the first place.)
The compiler will turn 3 + %[c] into something like 3 + 6(%rsp), which the assembler will evaluate the same as 9(%rsp). Fortunately, it's not a syntax error if the substitution ends up producing 3 + (%rdi). (You do get a warning, though: Warning: missing operand; zero assumed).
It would also be correct to use an "o" constraint to request an "offsetable" memory operand, but all x86 addressing modes are offsetable (you can add a compile-time-constant displacement and they're still valid), so "m" should always work. (It would be nice if "o" would add an explicit 0 to avoid the assembler warning, but it doesn't).
But we're not done with possible optimizations yet. We're still forcing the compiler to clobber the eax register when we don't actually need to use that one—any general-purpose register will do. So, we introduce another output, this time a write-only (but early-clobber) temporary stored in a register:
char temp;
__asm__("movb $'x', %[d]\n\t"
"movb 1 + %[c], %[temp]\n\t"
"movb %[temp], 1 + %[d]\n\t"
"movb 2 + %[c], %[temp]\n\t"
"movb %[temp], 2 + %[d]\n\t"
"movb 3 + %[c], %[temp]\n\t"
"movb %[temp], 3 + %[d]\n\t"
"movb $0, 4 + %[d]"
: [d] "=&m" (d)
, [temp] "=&r" (temp)
: [c] "m" (c)
: // no clobbers
);
The early-clobber is necessary to stop the compiler from choosing a register that is also used in the addressing-modes for c or d. The asm syntax is designed to efficiently wrap a single instruction which reads all its inputs before writing any of its outputs.
Okay, we've made the interface between the inline assembly block and the surrounding compiler-generated code pretty much optimal—but let's look at the actual assembly language instructions we're using inside of it. These are far from optimal: we're writing one byte at a time when we could be writing four bytes at a time! (And, on a 64-bit build, we could be writing eight bytes at a time, but that wouldn't help us here.) So, let's just do:
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl 1 + %[c], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d)
, [temp] "=&r" (temp)
: [c] "m" (c)
:
);
This writes the first byte (an 'x' character) into d, and then copies 4 bytes from c into d. That will include the terminating NUL character from c (automatically appended to string literals by a C or C++ compiler), so the string in d is already NUL-terminated without needing to append an additional byte.
Shorter and faster, except for the store-forwarding stall from reading the last 4 bytes of c[] right after the compiler-generated initialization code stored the first 4 bytes and then a separate byte store of the terminating 0. You wouldn't have this problem if you used static const char c[] = "abcd";, (because then it would be in static storage instead of stored to the stack with mov-immediate every time the function runs), or if c[] was a function arg that probably wasn't just written. Out-of-order execution can hide the store-forwarding stall latency, so it's probably worth it if c[] is usually not just-written.
Notice that we are not reading from the first character of c—we just offset it as part of the movl instruction. We could tell the compiler about that to allow it to optimize by moving stores to c[0] across the asm statement. We could even ask for a [cplus1] "r" (&c[1]) input operand, which would be good if we needed the address in a register. (See the original version of this answer for that.)
Since it's exactly 4 bytes, we can cast to a 4-byte integer type, rather than defining a struct with a char[4] member or something. Remember that a memory operand refers to a value in memory, so you have to dereference a pointer. Arrays are a special case: "m" (c) references the 5-byte contents of c[], not the 4 or 8-byte pointer value. But as soon as we start casting, we just have a pointer. Even a function argument like int foo(const char c[static 5]) works like a char*, not a char [5]. Anyway, the *(const uint32_t*)&c[1] is 4 bytes in memory from c[1] to c[3]. GCC warns about strict-aliasing with that cast, so maybe a struct { char c[4]; } would be better. (gcc8-snapshot 20170628 doesn't warn. Maybe the code is fine, or maybe the warning is broken in that unstable gcc version.)
// tightest constraints possible: 4 byte input memory operand, 5 byte output operand
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl %[cplus1], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d) // references the contents of the whole array, not the pointer-value or just d[0]
, [temp] "=&r" (temp)
: [cplus1] "m" (*(const uint32_t*)&c[1])
:
);
The code is looking pretty good now. Here's the code for the full function, as generated by GCC 6.3 on the Godbolt Compiler Explorer (with -O3 -m32 to generate 32-bit code like in the question):
subl $40, %esp
movl $1684234849, 18(%esp) # store 'abcd' into c
movb $0, 22(%esp) # NUL-terminate c
# begin inline-asm block
movb $'x', 23(%esp) # write initial 'x' into d[0]
movl 19(%esp), %eax # get 4 characters starting at c[1]
movl %eax, 1 + 23(%esp) # write those 4 characters into d, starting at d[1]
# end inline-asm block
leal 23(%esp), %eax # load address of d[0] into EAX register
pushl %eax # push address of d[0] onto stack
call puts # call 'puts' to output string. printf("%s\n", d) optimizes to this.
xorl %eax, %eax
addl $44, %esp
ret
gcc decides to save a register by delaying the lea until after the asm block. With -m64, it does lea before the asm, but it still uses a stack-pointer address instead of the register it just set up. That lets the loads/stores run without waiting for the lea to execute, but it also wastes code-size. Since lea is fast, it's not what I'd do if writing by hand.
The "r" constraint version uses two separate subl instructions to reserve stack space: subl $28, %esp before initializing c[], and subl $12, %esp right before the asm block. This is just a missed optimization by the compiler, unlike the extra lea which is unavoidable.
Notice that this is much much worse than the asm you'd get from the much more sensible:
d[0] = 'x';
memcpy(&d[1], &c[1], 4);
In that case, c[] optimizes away entirely and you get almost the same code that char d[] = "xbcd"; would produce. (See test_memcpy() in the Godbolt link above). The inline-asm version is only useful as an example or template for wrapping other memory-to-memory instruction sequences.
So how do we test that we got all the constraints right, allowing the compiler to optimize as far as correctness allows but no further? In this case, storing into c[] and d[] before and after the asm statement provides a good check. Recent gcc versions really will combine those stores into a single store either before or after if the constraints allow it. (clang won't, though.)
int optimize_test(void) {
// static // const
char c[5] = "abcd";
char d[5];
c[3] = 'O'; // **not** optimized away: part of the 32-bit input memory operand
c[0] = '0'; // merged with the c[0]='1' after the asm, because the asm doesn't read this part of c[]
d[3] = 'E'; // optimized away because the whole d[] is an output-only operand
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl %[cplus1], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d) // references the contents of the whole array, not the pointer-value or just d[0]
, [temp] "=&r" (temp)
: [cplus1] "m" (*(const uint32_t*)&c[1])
:
);
c[0] = '1'; // these dead stores into c[] are not optimized away, for some reason. (Even with memcpy instead of an asm statement.)
c[3] = 'M';
d[3] = 'D';
printf("%s\n", d);
return 0;
}
There are a couple of additional tweaks that you could do with the inline assembly. For example, our clobbers are telling the compiler that it cannot re-use one of the input registers for the temp register, but it actually could. But these are all pretty subtle. If you actually cared about getting the best possible code from the compiler, you'd write the above code in C like I just showed.
There are many reasons not to use inline assembly, including performance: you'll probably just defeat the compiler's ability to optimize. If the compiler isn't doing a good job somewhere (for a specific compiler version for a specific target architecture), often you can coax it into making better assembly by just changing the C source, without resorting to inline asm. (Although it's often possible for an expert that really knows what they're doing to beat the compiler, this often requires writing the entire loop in asm and requires a significant investment in time. And if you don't know what you're doing, you can easily make it slower.)
If you're interested in learning assembly language, you should be using an assembler to write the code, not a C compiler. This is all just busy-work! It took me way too long to write this answer, and I had to get help from other experts to ensure that I got all of the constraints and clobbers precisely correct so as to cause optimal code to be generated, and I know what I'm doing! This would have been a 2-minute task in assembly:
lea eax, DWORD PTR [d]
lea edx, DWORD PTR [c+1]
mov BYTE PTR [eax], 'x'
mov edx, DWORD PTR [edx]
mov DWORD PTR [eax+1], edx
…and you can easily verify that it is correct!
Extra notes from @PeterCordes: If we can assume that these strings are constants/literals, then this would actually be much better:
mov DWORD PTR [d], 'xbcd' ; 0x64636278
mov BYTE PTR [d+4], 0
where d can be any addressing mode, for example [esp+6]. If we just want to pass the string to a function, writing in pure asm lets us do things like this that the compiler wouldn't, giving excellent code size and performance:
push 0 ; includes 3 extra bytes of 0 padding, but gcc was leaving more than that already
push 'xbcd' ; ESP is now pointing to the string data we just pushed
push esp ; pushes the old value. (push stack-pointer costs 1 extra uop on Intel CPUs, and AMD Ryzen, but the LEA or MOV we avoid would also be a uop).
call puts
Making the compiler store into c[] and then reloading that inside the asm statement is just silly. You could achieve this by passing in the data as a 4-byte integer with an "ri" constraint. Or maybe using if (__builtin_constant_p(data)) { } else { } to branch on whether the data was a compile-time constant or not.
If the contents of c[] aren't supposed to be a compile-time constant, and if we can assume an offset load from c[] won't cause a store-forwarding stall, the general idea of Cody's final version is good:
lea rdi, [d] ; or "mov edi, OFFSET d" if you don't need a 64-bit RIP-relative LEA for PIC code
mov edx, DWORD PTR [c+1] ; load before store to avoid any potential false dependency
mov BYTE PTR [rdi], 'x'
mov DWORD PTR [rdi+1], edx
The lea is only worth it if we need d's address in a register afterwards (which we do in this case for printf / puts). Otherwise it's better to just use [d] and [d+1], even if the addressing mode needs a 32-bit displacement. (It doesn't in this case, since c and d are both on the stack).
Or, if there's padding after d[], and targeting 64-bit, we could load 8 bytes from c (if you know that the load won't cross into another page—a cache-line split on the load or store might also make this not worth it for perf reasons):
lea rdi, [d]
mov rdx, QWORD PTR [c]
mov QWORD PTR [rdi], rdx
mov BYTE PTR [rdi], 'x' ; overlapping store: rewrite the first byte
On some CPUs, e.g. Intel since Ivy Bridge, this will be good even if c[] was just written (avoids the store-forwarding stall):
mov edx, DWORD PTR [c]
mov dl, 'x' ; modify the low byte. reading edx later will cause a partial-reg stall on older Intel CPUs
mov byte ptr[d+4], 0
mov dword ptr[d], edx
There are other ways to replace the first byte, e.g. AND and OR, which avoid problems on older Intel CPUs.
This has the advantage that reading multiple bytes at once from the start of d[] won't suffer a store-forwarding stall, since the first 4 bytes are written with a store aligned to the start of d[].
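The AND/OR variant mentioned above can also be sketched in plain C, which leaves instruction selection to the compiler. This is an illustration, not the answer's code: the helper name copy_replace_first_word is hypothetical, the byte-order check is an added assumption to keep the sketch correct on big-endian targets, and memcpy is used for the 4-byte load and store to stay clear of alignment and aliasing problems.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 4 bytes of c, replace the first byte with 'x',
 * store them with one 4-byte store, then terminate d separately. */
static void copy_replace_first_word(char d[5], const char c[5])
{
    uint32_t w;

    assert(d != NULL && c != NULL);
    memcpy(&w, c, 4);                        /* one 4-byte load from the start of c */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    w = (w & 0x00FFFFFFu) | ((uint32_t)'x' << 24);
#else
    w = (w & 0xFFFFFF00u) | (uint32_t)'x';   /* little-endian (x86): c[0] is the low byte */
#endif
    memcpy(d, &w, 4);                        /* one 4-byte store at the start of d */
    d[4] = '\0';
}
```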
Combining both previous ideas:
mov rdx, QWORD PTR [c]
mov dl, 'x'
mov QWORD PTR [d], rdx
As usual, the optimal choice strongly depends on context (surrounding code), and on target CPU microarchitecture (Nehalem vs. Skylake vs. Silvermont vs. Bulldozer vs. Ryzen ...)
First of all, your string-copy code did not raise an exception when I built it with gcc and ran it on my Windows PC. However, the string copy was not happening, because your code appears to assume that register ecx points to variable d when it actually points to variable c. The following code copies the contents of c to d, then replaces the first character in d with x. Try compiling it with gcc.
#include <stdio.h>
#include <stdlib.h>
int main(void) {
char c[5] = "abcd", d[5];
__asm__(
"leal %1, %%ebx\n"
"leal %0, %%ecx\n"
"movb (%%ecx), %%al\n"
"movb %%al, (%%ebx)\n"
"movb 1(%%ecx), %%al\n"
"movb %%al, 1(%%ebx)\n"
"movb 2(%%ecx), %%al\n"
"movb %%al, 2(%%ebx)\n"
"movb 3(%%ecx), %%al\n"
"movb %%al, 3(%%ebx)\n"
"movb 4(%%ecx), %%al\n"
"movb %%al, 4(%%ebx)\n"
"movb $'x', (%%ebx)\n"
:"=m"(c)
:"m"(d)
:"%ebx", "%ecx", "%eax"
);
printf("String d is: %s\n", d);
printf("String c remains: %s\n", c);
return 0;
}
When using the MinGW gcc compiler on a Windows PC, the following output is produced:
> gcc testAsm.c
> .\a.exe
String d is: xbcd
String c remains: abcd
What is the correct use of multiple input and output operands in extended GCC asm under register constraint? Consider this minimal version of my problem. The following brief extended asm code in GCC, AT&T syntax:
int input0 = 10;
int input1 = 15;
int output0 = 0;
int output1 = 1;
asm volatile("mov %[input0], %[output0]\t\n"
"mov %[input1], %[output1]\t\n"
: [output0] "=r" (output0), [output1] "=r" (output1)
: [input0] "r" (input0), [input1] "r" (input1)
:);
printf("output0: %d\n", output0);
printf("output1: %d\n", output1);
The syntax appears correct based on https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html However, I must have overlooked something or be committing some trivial mistake that I for some reason can't see.
The output with GCC 5.3.0 p1.0 (no compiler arguments) is:
output0: 10
output1: 10
Expected output is:
output0: 10
output1: 15
Looking at it in GDB shows:
0x0000000000400581 <+43>: mov eax,DWORD PTR [rbp-0x10]
0x0000000000400584 <+46>: mov edx,DWORD PTR [rbp-0xc]
0x0000000000400587 <+49>: mov edx,eax
0x0000000000400589 <+51>: mov eax,edx
0x000000000040058b <+53>: mov DWORD PTR [rbp-0x8],edx
0x000000000040058e <+56>: mov DWORD PTR [rbp-0x4],eax
From what I can see it loads eax with input0 and edx with input1. It then overwrites edx with eax and eax with edx, making these equal. It then writes these back into output0 and output1.
If I use a memory constraint (=m) instead of a register constraint (=r) for the output, it gives the expected output and the assembly looks more reasonable.
The problem is that GCC assumes that all output operands are only written at the end of the instruction, after all input operands have been consumed. This means it can use the same operand (e.g. a register) as both an input operand and an output operand, which is what is happening here. The solution is to mark [output0] with an early-clobber constraint so that GCC knows that it's written to before the end of the asm statement.
For example:
asm volatile("mov %[input0], %[output0]\t\n"
"mov %[input1], %[output1]\t\n"
: [output0] "=&r" (output0), [output1] "=r" (output1)
: [input0] "r" (input0), [input1] "r" (input1)
:);
You don't need to mark [output1] as early clobber because it's only written to at the end of the instruction so it doesn't matter if it uses the same register as [input0] or [input1].
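As a self-contained way to try the early-clobber fix, here is a minimal sketch. The wrapper function copy_pair and its out-parameters are hypothetical names added for illustration, and the plain C branch is only a fallback for non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper around the two-mov asm statement from the question. */
static void copy_pair(int input0, int input1, int *out0, int *out1)
{
    int output0, output1;

    assert(out0 != NULL && out1 != NULL);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("mov %[input0], %[output0]\n\t"
            "mov %[input1], %[output1]"
            : [output0] "=&r" (output0),  /* written before input1 is read: early-clobber */
              [output1] "=r" (output1)    /* written last, so it may share an input's register */
            : [input0] "r" (input0), [input1] "r" (input1));
#else
    output0 = input0;
    output1 = input1;
#endif
    *out0 = output0;
    *out1 = output1;
}
```

With the "&" removed from [output0], the compiler is again free to reuse the register of [input1] for [output0], which can reproduce the 10/10 result from the question.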
I have an asm loop guaranteed not to go over 128 iterations that I want to unroll with a PC-relative jump. The idea is to unroll each iteration in reverse order and then jump however far into the loop it needs to be. The code would look like this:
#define __mul(i) \
"movq -"#i"(%3,%5,8),%%rax;" \
"mulq "#i"(%4,%6,8);" \
"addq %%rax,%0;" \
"adcq %%rdx,%1;" \
"adcq $0,%2;"
asm("jmp (128-count)*size_of_one_iteration" // I need to figure this jump out
__mul(127)
__mul(126)
__mul(125)
...
__mul(1)
__mul(0)
: "+r"(lo),"+r"(hi),"+r"(overflow)
: "r"(a.data),"r"(b.data),"r"(i-k),"r"(k)
: "%rax","%rdx");
Is something like this possible with gcc inline assembly?
In gcc inline assembly, you can use labels and have the assembler sort out the jump target for you. Something like (contrived example):
int max(int a, int b)
{
int result;
__asm__ __volatile__(
"movl %1, %0\n"
"cmpl %2, %0\n"
"jge a_is_larger\n"
"movl %2, %0\n"
"a_is_larger:\n"
: "=&r"(result) : "r"(a), "r"(b) : "cc");
return (result);
}
That's one thing. The other thing you could do to avoid multiplication is to make the assembler align the blocks for you, say, at a multiple of 32 bytes (I don't think the instruction sequence fits into 16 Bytes), like:
#define __mul(i) \
".align 32\n" \
".Lmul" #i ":\n" \
"movq -" #i "(%3,%5,8),%%rax\n"\
"mulq " #i "(%4,%6,8)\n" \
"addq %%rax,%0\n" \
"adcq %%rdx,%1\n" \
"adcq $0,%2\n"
This will simply pad the instruction stream with nop. If you do choose not to align these blocks, you can still, in your main expression, use the generated local labels to find the size of the assembly blocks:
#ifdef UNALIGNED
__asm__ ("imul $(.Lmul0-.Lmul1), %[label]\n"
#else
__asm__ ("shlq $5, %[label]\n"
#endif
"leaq .Lmulblkstart(%%rip), %[dummy]\n" /* PC-relative in 64bit */
"addq %[label], %[dummy]\n"
"jmp *%[dummy]\n" /* indirect jump to the computed address */
".align 32\n"
".Lmulblkstart:\n"
__mul(127)
...
__mul(0)
: ... [dummy]"=r"(dummy) : [label]"r"((128-count)))
And for the case where count is a compile-time constant, you can even do:
__asm__("jmp .Lmul" #count "\n" ...);
A small note at the end:
Aligning the blocks is a good idea if the autogenerated __mul() sequences can have different lengths. For the constants 0..127 that you use, that won't be the case because they all fit into a byte, but if you scale them up they would need wider displacements and the instruction block would grow alongside. By padding the instruction stream, the jump-table technique can still be used.
This isn't a direct answer, but have you considered using a variant of Duff's Device instead of inline assembly? That would take the form of a switch statement:
switch(iterations) {
case 128: /* code for i=128 here */
case 127: /* code for i=127 here */
case 126: /* code for i=126 here */
/* ... */
case 1: /* code for i=1 here*/
break;
default: die("too many cases");
}
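To make the fall-through idea concrete, here is a scaled-down sketch with at most 4 iterations instead of 128. The function name dot_unrolled and the simple a[i]*b[i] indexing are assumptions made for illustration; the question's loop uses different indexing and a three-word accumulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of a[i] * b[i] for i < count, with count in [0, 4].
 * The switch jumps into the middle of the unrolled sequence, so exactly
 * `count` multiply-accumulate steps run, without any loop counter. */
static uint64_t dot_unrolled(const uint32_t *a, const uint32_t *b, size_t count)
{
    uint64_t acc = 0;

    assert(count <= 4);
    switch (count) {
    case 4: acc += (uint64_t)a[3] * b[3]; /* fall through */
    case 3: acc += (uint64_t)a[2] * b[2]; /* fall through */
    case 2: acc += (uint64_t)a[1] * b[1]; /* fall through */
    case 1: acc += (uint64_t)a[0] * b[0]; /* fall through */
    default: break;
    }
    return acc;
}
```

The compiler typically turns such a dense switch into a jump table or computed branch, which is the effect the hand-written PC-relative jump in the question is after.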
Sorry I can't provide the answer in ATT syntax, I hope you can easily perform the translations.
If you have the count in RCX and you can have a label just after __mul(0) then you could do this:
; rcx must be in [0..128] range.
imul ecx, ecx, -size_of_one_iteration ; Notice the multiplier is negative (using ecx is faster, the upper half of RCX will be automatically cleared by CPU)
lea rcx, [rcx + the_label] ; There is no memory read here
jmp rcx
Hope this helps.
EDIT:
I made a mistake yesterday. I've assumed that referencing a label in [rcx + the_label] is resolved as [rcx + rip + disp] but it is not since there is no such addressing mode (only [rip + disp32] exists)
This code should work; additionally, it leaves rcx untouched and destroys rax and rdx instead (but your code does not seem to read them before writing to them first):
; rcx must be in [0..128] range.
imul rdx, rcx, -size_of_one_iteration ; Notice the multiplier is negative. The 64-bit form is needed here: a negative 32-bit result in edx would be zero-extended, not sign-extended, into RDX
lea rax, [the_label] ; PC-relative addressing (There is no memory read here)
add rax, rdx
jmp rax
If I have the following C++ code to compare two 128-bit unsigned integers, with inline amd-64 asm:
struct uint128_t {
uint64_t lo, hi;
};
inline bool operator< (const uint128_t &a, const uint128_t &b)
{
uint64_t temp;
bool result;
__asm__(
"cmpq %3, %2;"
"sbbq %4, %1;"
"setc %0;"
: // outputs:
/*0*/"=r,1,2"(result),
/*1*/"=r,r,r"(temp)
: // inputs:
/*2*/"r,r,r"(a.lo),
/*3*/"emr,emr,emr"(b.lo),
/*4*/"emr,emr,emr"(b.hi),
"1"(a.hi));
return result;
}
Then it will be inlined quite efficiently, but with one flaw. The return value is done through the "interface" of a general register with a value of 0 or 1. This adds two or three unnecessary extra instructions and detracts from a compare operation that would otherwise be fully optimized. The generated code will look something like this:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
setc al
movzx eax, al
test eax, eax
jnz is_lessthan
If I use "sbb %0,%0" with an "int" return value instead of "setc %0" with a "bool" return value, there's still two extra instructions:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
sbb eax, eax
test eax, eax
jnz is_lessthan
What I want is this:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
jc is_lessthan
GCC extended inline asm is wonderful, otherwise. But I want it to be just as good as an intrinsic function would be, in every way. I want to be able to directly return a boolean value in the form of the state of a CPU flag or flags, without having to "render" it into a general register.
Is this possible, or would GCC (and the Intel C++ compiler, which also allows this form of inline asm to be used) have to be modified or even refactored to make it possible?
Also, while I'm at it — is there any other way my formulation of the compare operator could be improved?
Here we are almost 7 years later, and YES, gcc finally added support for "outputting flags" (added in 6.1.0, released ~April 2016). The detailed docs are here, but in short, it looks like this:
/* Test if bit 0 is set in 'value' */
char a;
asm("bt $0, %1"
: "=@ccc" (a)
: "r" (value) );
if (a)
blah;
To understand =@ccc: The output constraint (which requires =) is of type @cc followed by the condition code to use (in this case c to reference the carry flag).
Ok, this may not be an issue for your specific case anymore (since gcc now supports comparing 128bit data types directly), but (currently) 1,326 people have viewed this question. Apparently there's some interest in this feature.
Now I personally favor the school of thought that says don't use inline asm at all. But if you must, yes you can (now) 'output' flags.
FWIW.
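Applied to the 128-bit comparison from the question, a flag output could look like the following sketch. It assumes a compiler that defines __GCC_ASM_FLAG_OUTPUTS__ (GCC 6 or later) on x86-64 and falls back to plain C otherwise; the names struct u128 and u128_less are hypothetical stand-ins for the question's uint128_t and operator<.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct u128 { uint64_t lo, hi; };

/* Hypothetical helper: true if *a < *b as unsigned 128-bit values. */
static inline bool u128_less(const struct u128 *a, const struct u128 *b)
{
    assert(a != NULL && b != NULL);
#if defined(__x86_64__) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    uint64_t hi = a->hi; /* overwritten by sbb, hence the "+r" operand */
    bool below;
    __asm__("cmpq %[blo], %[alo]\n\t"   /* CF = (a.lo < b.lo) */
            "sbbq %[bhi], %[ahi]"       /* CF = borrow out of a.hi - b.hi - CF */
            : "=@ccc" (below), [ahi] "+r" (hi)
            : [alo] "r" (a->lo), [blo] "rme" (b->lo), [bhi] "rme" (b->hi));
    return below;
#else
    return a->hi < b->hi || (a->hi == b->hi && a->lo < b->lo);
#endif
}
```

Because the result is described as the carry flag itself, the compiler is free to branch on it directly instead of materializing 0 or 1 in a register first; whether it does so depends on the compiler version and the surrounding code.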
I don't know a way to do this. You may or may not consider this an improvement:
inline bool operator< (const uint128_t &a, const uint128_t &b)
{
register uint64_t temp = a.hi;
__asm__(
"cmpq %2, %1;"
"sbbq $0, %0;"
: // outputs:
/*0*/"=r"(temp)
: // inputs:
/*1*/"r"(a.lo),
/*2*/"mr"(b.lo),
"0"(temp));
return temp < b.hi;
}
It produces something like:
mov rdx, [r14]
mov rax, [r14+8]
cmp rdx, [r15]
sbb rax, 0
cmp rax, [r15+8]
jc is_lessthan
I am using cmpxchg (compare-and-exchange) in i686 architecture for 32 bit compare and swap as follows.
(Editor's note: the original 32-bit example was buggy, but the question isn't about it. I believe this version is safe, and as a bonus compiles correctly for x86-64 as well. Also note that inline asm isn't needed or recommended for this; __atomic_compare_exchange_n or the older __sync_bool_compare_and_swap work for int32_t or int64_t on i486 and x86-64. But this question is about doing it with inline asm, in case you still want to.)
// note that this function doesn't return the updated oldVal
static int CAS(int *ptr, int oldVal, int newVal)
{
unsigned char ret;
__asm__ __volatile__ (
" lock\n"
" cmpxchgl %[newval], %[mem]\n"
" sete %0\n"
: "=q" (ret), [mem] "+m" (*ptr), "+a" (oldVal)
: [newval]"r" (newVal)
: "memory"); // barrier for compiler reordering around this
return ret; // ZF result, 1 on success else 0
}
What is the equivalent for x86_64 architecture for 64 bit compare and swap
static int CAS(long *ptr, long oldVal, long newVal)
{
unsigned char ret;
// ?
return ret;
}
The x86_64 instruction set has the cmpxchgq (q for quadword) instruction for 8-byte (64 bit) compare and swap.
There's also a cmpxchg8b instruction which will work on 8-byte quantities but it's more complex to set up, needing you to use edx:eax and ecx:ebx rather than the more natural 64-bit rax. The reason this exists almost certainly has to do with the fact Intel needed 64-bit compare-and-swap operations long before x86_64 came along. It still exists in 64-bit mode, but is no longer the only option.
But, as stated, cmpxchgq is probably the better option for 64-bit code.
If you need to cmpxchg a 16 byte object, the 64-bit version of cmpxchg8b is cmpxchg16b. It was missing from the very earliest AMD64 CPUs, so compilers won't generate it for std::atomic::compare_exchange on 16B objects unless you enable -mcx16 (for gcc). Assemblers will assemble it, though, but beware that your binary won't run on the earliest K8 CPUs. (This only applies to cmpxchg16b, not to cmpxchg8b in 64-bit mode, or to cmpxchgq).
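A 64-bit counterpart of the CAS function in the question, using cmpxchgq, might look like this. It is a sketch modeled on the 32-bit version above rather than a vetted implementation: the name cas64 is hypothetical, and the fallback branch uses the __atomic_compare_exchange_n builtin mentioned in the editor's note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if *ptr was old_val and has been replaced
 * by new_val, else 0. Like the 32-bit version, it does not hand back the
 * value that was actually found in memory. */
static int cas64(int64_t *ptr, int64_t old_val, int64_t new_val)
{
    assert(ptr != NULL);
#if defined(__GNUC__) && defined(__x86_64__)
    unsigned char ok;
    __asm__ __volatile__("lock cmpxchgq %[newval], %[mem]\n\t"
                         "sete %[ok]"
                         : [ok] "=q" (ok), [mem] "+m" (*ptr), "+a" (old_val)
                         : [newval] "r" (new_val)
                         : "memory", "cc"); /* compiler barrier; flags are modified */
    return ok; /* ZF result: 1 on success, else 0 */
#else
    return __atomic_compare_exchange_n(ptr, &old_val, new_val, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#endif
}
```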
cmpxchg8b
__forceinline int64_t interlockedCompareExchange(volatile int64_t & v,int64_t exValue,int64_t cmpValue)
{
__asm {
mov esi,v
mov ebx,dword ptr exValue
mov ecx,dword ptr exValue + 4
mov eax,dword ptr cmpValue
mov edx,dword ptr cmpValue + 4
lock cmpxchg8b qword ptr [esi]
}
}
The x64 architecture supports a 64-bit compare-exchange using the good, old cmpxchg instruction. Or you could also use the somewhat more complicated cmpxchg8b instruction (from the "AMD64 Architecture Programmer's Manual Volume 1: Application Programming"):
The CMPXCHG instruction compares a value in the AL or rAX register with the first (destination) operand, and sets the arithmetic flags (ZF, OF, SF, AF, CF, PF) according to the result. If the compared values are equal, the source operand is loaded into the destination operand. If they are not equal, the first operand is loaded into the accumulator. CMPXCHG can be used to try to intercept a semaphore, i.e. test if its state is free, and if so, load a new value into the semaphore, making its state busy. The test and load are performed atomically, so that concurrent processes or threads which use the semaphore to access a shared object will not conflict.
The CMPXCHG8B instruction compares the 64-bit values in the EDX:EAX registers with a 64-bit memory location. If the values are equal, the zero flag (ZF) is set, and the ECX:EBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to EDX:EAX.
The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX and RCX:RBX registers with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set, and the RCX:RBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to rDX:rAX.
Different assembler syntaxes may need to have the length of the operations specified in the instruction mnemonic if the size of the operands can't be inferred. This may be the case for GCC's inline assembler - I don't know.
Usage of cmpxchg8b, from the AMD64 Architecture Programmer's Manual, Volume 3:
Compare EDX:EAX register to 64-bit memory location. If equal, set the zero flag (ZF) to 1 and copy the ECX:EBX register to the memory location. Otherwise, copy the memory location to EDX:EAX and clear the zero flag.
I use cmpxchg8b to implement a simple mutex lock function on an x86-64 machine. Here is the code:
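.text
.align 8
.global mutex_lock
mutex_lock:
pushq %rbp
movq %rsp, %rbp
pushq %rbx # rbx is callee-saved in the System V ABI, so preserve it
.L1:
movl $0, %edx # expected value EDX:EAX = 0 (unlocked)
movl $0, %eax
movl $0, %ecx # new value ECX:EBX = 1 (locked)
movl $1, %ebx
lock cmpxchg8b (%rdi)
jne .L1 # lock was already taken: retry
popq %rbx
popq %rbp
ret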
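For comparison, the same spin loop can be sketched in C with the GCC/Clang __atomic builtins, which lets the compiler pick the compare-exchange instruction and handle register saving. The function names mutex_lock_c and mutex_unlock_c, the unlock function itself, and the chosen memory orders are assumptions added for illustration; they are not taken from the assembly version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: spin until the 64-bit lock word changes from 0 to 1. */
static void mutex_lock_c(int64_t *lock_word)
{
    int64_t expected;

    assert(lock_word != NULL);
    do {
        expected = 0; /* a failed exchange overwrites `expected`, so reset it */
    } while (!__atomic_compare_exchange_n(lock_word, &expected, 1, 0,
                                          __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
}

/* Hypothetical helper: release the lock by storing 0. */
static void mutex_unlock_c(int64_t *lock_word)
{
    assert(lock_word != NULL);
    __atomic_store_n(lock_word, 0, __ATOMIC_RELEASE);
}
```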