How do I return a byte or short from inline assembly - gcc

When I compile:
short foo (short x)
{
    __asm ("mov %0, 0" : "=r" (x));
    return x;
}
I get
foo:
    mov  r0, 0
    sxth r0, r0
    bx   lr
I don't want the extra sxth (sign-extend halfword) instruction.
It is there because the ABI (the ARM AEABI in this case) requires the function's return value to be sign-extended to a full register, but GCC doesn't know what my assembly code might have done to the register it gave me.
How do I tell GCC that I promise to leave the register correctly sign-extended, so that it doesn't have to insert this extra instruction?
I would have thought that it could infer this from the fact that I gave it a short as the expression for the "=r" constraint.

Related

How does the inline assembly in this compare-exchange function work? (%H modifier on ARM)

static inline unsigned long long __cmpxchg64(unsigned long long *ptr,
                                             unsigned long long old,
                                             unsigned long long new)
{
    unsigned long long oldval;
    unsigned long res;

    prefetchw(ptr);

    __asm__ __volatile__(
    "1: ldrexd  %1, %H1, [%3]\n"
    "   teq     %1, %4\n"
    "   teqeq   %H1, %H4\n"
    "   bne     2f\n"
    "   strexd  %0, %5, %H5, [%3]\n"
    "   teq     %0, #0\n"
    "   bne     1b\n"
    "2:"
    : "=&r" (res), "=&r" (oldval), "+Qo" (*ptr)
    : "r" (ptr), "r" (old), "r" (new)
    : "cc");

    return oldval;
}
I find in the GNU manual (Extended Asm) that 'H' in '%H1' means 'Add 8 bytes to an offsettable memory reference'.
But I think that to load doubleword-long data into oldval (a long long value), it should add 4 bytes to '%1' (the low 32 bits of oldval) to reach the high 32 bits of oldval. So what is my mistake?
I find in the GNU manual (Extended Asm) that 'H' in '%H1' means 'Add 8 bytes to an offsettable memory reference'.
That table of template modifiers is for x86 only. It is not applicable to ARM.
The template modifiers for ARM are unfortunately not documented in the GCC manual, but they are defined in the armclang manual and GCC conforms to those definitions as far as I can tell. So the correct meaning of the H template modifier here is:
The operand must use the r constraint, and must be a 64-bit integer or floating-point type. The operand is printed as the highest-numbered register holding half of the value.
Now this makes sense. Operand 1 to the inline asm is oldval, which is of type unsigned long long, 64 bits, so the compiler will allocate two consecutive 32-bit general-purpose registers for it. Let's say they are r4 and r5, as in this compiled output. Then %1 will expand to r4, and %H1 will expand to r5, which is exactly what the ldrexd instruction needs. Likewise, %4 and %H4 expand to r2 and r3, and %5 and %H5 expand to fp and ip, which are alternative names for r11 and r12.
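To see the modifier in isolation, here is a minimal sketch of my own (an illustration, not from the answer; it assumes a little-endian 32-bit ARM target):
/* %0 prints the lower-numbered register of the pair holding a 64-bit
   operand; %H0 prints the higher-numbered one. */
unsigned long long make_pair(void)
{
    unsigned long long v;
    __asm__("mov %0, #1\n\t"    /* low 32 bits  */
            "mov %H0, #2"       /* high 32 bits */
            : "=r" (v));
    return v;   /* 0x0000000200000001 */
}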
The answer by frant explains what a compare-exchange is supposed to do. (The spelling cmpxchg might come from the mnemonic for the x86 compare-exchange instruction.) And if you read through the code now, you should see that it does exactly that. The teq; teqeq; bne between ldrexd and strexd will abort the store if old and *ptr were unequal. And the teq; bne after strexd will cause a retry if the exclusive store failed, which happens if there was an intervening access to *ptr (by another core, interrupt handler, etc). That is how atomicity is ensured.
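To make the calling pattern concrete, here is a hedged sketch of a typical retry loop built on top of this primitive (atomic_add64 is my illustrative name, not kernel code):
/* Add n to *p atomically. __cmpxchg64 returns the value that was in *p
   when it attempted the exchange; the store happened only if that value
   equals 'old'. */
static unsigned long long atomic_add64(unsigned long long *p,
                                       unsigned long long n)
{
    unsigned long long old, seen;
    do {
        old  = *p;                            /* snapshot current value */
        seen = __cmpxchg64(p, old, old + n);  /* try old -> old + n     */
    } while (seen != old);                    /* lost a race: retry     */
    return old;
}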

How do I copy a string to another string with inline assembly?

This simple code should copy the string c into the string d, changing only the first char to 'x':
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char c[5] = "abcd", d[5];
    __asm__(
        "leal %1, %%ebx\n"
        "leal %0, %%ecx\n"
        "movb $'x', (%%ecx)\n"
        "movb 1(%%ebx), %%al\n"
        "movb %%al, 1(%%ecx)\n"
        "movb 2(%%ebx), %%al\n"
        "movb %%al, 2(%%ecx)\n"
        "movb 3(%%ebx), %%al\n"
        "movb %%al, 3(%%ecx)\n"
        "movb $0, 4(%%ecx)\n"
        : "=m" (c)
        : "m" (d)
        : "%ebx", "%ecx", "%eax"
    );
    printf("%s\n", d);
    return 0;
}
But it gives a segmentation fault. I believe my problem is with the constraints, but I can't figure out how to fix this.
What is the right way, and how can I change this code to work?
Yes, the input/output operands are wrong. The format is this:
__asm__("<instructions>\n\t"
: OutputOperands
: InputOperands
: Clobbers);
You have the inputs and outputs backwards. You have c as an output, when it should be an input (since you're reading from it). You have d as an input, when it should be an output (since you're writing to it).
Thus, your inline assembly should be written as follows:
__asm__("leal %1, %%ebx\n\t"
"leal %0, %%ecx\n\t"
"movb $'x', (%%ecx)\n\t"
"movb 1(%%ebx), %%al\n\t"
"movb %%al, 1(%%ecx)\n\t"
"movb 2(%%ebx), %%al\n\t"
"movb %%al, 2(%%ecx)\n\t"
"movb 3(%%ebx), %%al\n\t"
"movb %%al, 3(%%ecx)\n\t"
"movb $0, 4(%%ecx)"
: "=m" (d)
: "m" (c)
: "%ebx", "%ecx", "%eax"
);
But you are also not making the most efficient use of the operands. You have written several manual address computations (lea) in assembly. You don't need to write these; that's the whole point of the GNU extended inline assembly syntax: the compiler will generate the necessary load and store instructions for you. Not only does that make the code simpler and easier to write and maintain, it also makes it more efficient, because the compiler can better schedule the loads and stores within the surrounding code, and can skip the lea instructions entirely.
Making these modifications to use operands more efficiently, as well as using names for the operands to make the code easier to read, you would have:
__asm__("movb $'x', (%[d])\n\t"
"movb 1(%[c]), %%al\n\t"
"movb %%al, 1(%[d])\n\t"
"movb 2(%[c]), %%al\n\t"
"movb %%al, 2(%[d])\n\t"
"movb 3(%[c]), %%al\n\t"
"movb %%al, 3(%[d])\n\t"
"movb $0, 4(%[d])"
: "=m" (d) // dummy arg: tell the compiler we write all of d[]
: [c] "r" (c)
, [d] "r" (d)
, "m" (c) // unused dummy arg: tell the compiler we read all of c[]
: "%eax"
);
We're asking the compiler for the pointers to be in registers with the r constraint, so we can still choose the addressing mode (reg+displacement) ourselves, as in your original. This causes the compiler to implicitly generate the two required lea instructions. Not only does this make the code simpler to write, but it also lets the compiler choose which registers it wants to use, which can make the code more efficient. (For example, it needs d in %rdi as an arg for printf. Compiler-generated setup for asm statements is optimized along with normal code, so it doesn't have to repeat this work like it would if you wrote the lea explicitly in asm. Leave as much as possible to the compiler, so it can optimize away when possible.)
Note that asking for a pointer with an r constraint doesn't imply that you dereference it. Thus, we use "m" and "=m" dummy memory operands to tell the compiler what memory is read and written, so it will ensure that the contents match program-order even in a more complex case where your function is inlined into another function that modifies c[] and d[] before and after. This works well because c[] and d[] are true C arrays, with static size. It wouldn't work if they were just pointers that you got from function args. In that case, "=m" (d) would tell the compiler that the asm writes a pointer value into a memory location, not the pointed-to contents. "=m" (*d) would tell the compiler that the asm writes the first byte. As the official docs point out, you could write something ugly using a GNU C statement-expression like:
{"m"( ({ const struct { char x[5]; } *p = (const void *)c ; *p; }) )}
Or, you could instead use a "memory" clobber to tell the compiler that all memory must be in sync. With no output operands at all, the asm block would be implicitly __volatile__, which also prevents reordering. But if you had one unused dummy output to let the compiler choose a scratch register (see below), and didn't manually use __volatile__, then the compiler would prove to itself that you never use the results and optimize out the entire block of inline assembly! (It's better to tell the compiler in as much detail as possible how your asm interacts with C variables, rather than relying on __volatile__.)
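For comparison, here is a hedged sketch of that "memory"-clobber alternative (my rewrite of the register-pointer version above; it is less precise, so it blocks more optimization than the dummy memory operands do):
__asm__ __volatile__("movb $'x', (%[d])\n\t"
                     "movb 1(%[c]), %%al\n\t"
                     "movb %%al, 1(%[d])\n\t"
                     "movb 2(%[c]), %%al\n\t"
                     "movb %%al, 2(%[d])\n\t"
                     "movb 3(%[c]), %%al\n\t"
                     "movb %%al, 3(%[d])\n\t"
                     "movb $0, 4(%[d])"
                     : /* no outputs */
                     : [c] "r" (c), [d] "r" (d)
                     : "%eax", "memory");   /* any memory may be read or written */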
Letting the compiler choose the addressing mode will work fine for us. It avoids an extra compiler-generated lea instruction ahead of the asm block, and it simplifies the constraints because we actually use the memory operands instead of separately asking for pointers in registers.
(The compiler could still have avoided an lea in the other version if it put c[] or d[] at esp+0, so the pointer-register operand could be esp).
__asm__("movb $'x', %[d]\n\t"
"movb 1 + %[c], %%al\n\t"
"movb %%al, 1 + %[d]\n\t"
"movb 2 + %[c], %%al\n\t"
"movb %%al, 2 + %[d]\n\t"
"movb 3 + %[c], %%al\n\t"
"movb %%al, 3 + %[d]\n\t"
"movb $0, 4 + %[d]"
: [d] "=&m" (d) // not sure if early-clobber is needed,
// e.g. if the compiler would otherwise be allowed to put an output memory operand at the same address as an input operand.
// It's an error with gcc 4.7 and earlier, but gcc that old also doesn't accept "m"(c) as an input memory operand
: [c] "m" (c)
: "%eax"
);
See also Looping over arrays with inline assembly for more discussion of picking addressing mode yourself vs. using "m" constraints to let the compiler pick. (If you don't want to get into that level of optimization, you probably shouldn't be using inline asm in the first place.)
The compiler will turn 3 + %[c] into something like 3 + 6(%rsp), which the assembler will evaluate the same as 9(%rsp). Fortunately, it's not a syntax error if the substitution ends up producing 3 + (%rdi). (You do get a warning, though: Warning: missing operand; zero assumed).
It would also be correct to use an "o" constraint to request an "offsetable" memory operand, but all x86 addressing modes are offsetable (you can add a compile-time-constant displacement and they're still valid), so "m" should always work. (It would be nice if "o" would add an explicit 0 to avoid the assembler warning, but it doesn't).
But we're not done with possible optimizations yet. We're still forcing the compiler to clobber the eax register when we don't actually need to use that one—any general-purpose register will do. So, we introduce another output, this time a write-only (but early-clobber) temporary stored in a register:
char temp;
__asm__("movb $'x', %[d]\n\t"
"movb 1 + %[c], %[temp]\n\t"
"movb %[temp], 1 + %[d]\n\t"
"movb 2 + %[c], %[temp]\n\t"
"movb %[temp], 2 + %[d]\n\t"
"movb 3 + %[c], %[temp]\n\t"
"movb %[temp], 3 + %[d]\n\t"
"movb $0, 4 + %[d]"
: [d] "=&m" (d)
, [temp] "=&r" (temp)
: [c] "m" (c)
: // no clobbers
);
The early-clobber is necessary to stop the compiler from choosing a register that is also used in the addressing-modes for c or d. The asm syntax is designed to efficiently wrap a single instruction which reads all its inputs before writing any of its outputs.
Okay, we've made the interface between the inline assembly block and the surrounding compiler-generated code pretty much optimal—but let's look at the actual assembly language instructions we're using inside of it. These are far from optimal: we're writing one byte at a time when we could be writing four bytes at a time! (And, on a 64-bit build, we could be writing eight bytes at a time, but that wouldn't help us here.) So, let's just do:
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl 1 + %[c], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d)
, [temp] "=&r" (temp)
: [c] "m" (c)
:
);
This writes the first byte (an 'x' character) into d, and then copies 4 bytes from c into d. That will include the terminating NUL character from c (automatically appended to string literals by a C or C++ compiler), so the string in d is already NUL-terminated without needing to append an additional byte.
Shorter and faster, except for the store-forwarding stall from reading the last 4 bytes of c[] right after the compiler-generated initialization code stored the first 4 bytes and then did a separate byte store of the terminating 0. You wouldn't have this problem if you used static const char c[] = "abcd"; (because then it would live in static storage instead of being stored to the stack with mov-immediate every time the function runs), or if c[] were a function arg that probably wasn't just written. Out-of-order execution can hide the store-forwarding stall latency, so it's probably worth it if c[] is usually not just-written.
Notice that we are not reading from the first character of c—we just offset it as part of the movl instruction. We could tell the compiler about that to allow it to optimize by moving stores to c[0] across the asm statement. We could even ask for a [cplus1] "r" (&c[1]) input operand, which would be good if we needed the address in a register. (See the original version of this answer for that.)
Since it's exactly 4 bytes, we can cast to a 4-byte integer type, rather than defining a struct with a char[4] member or something. Remember that a memory operand refers to a value in memory, so you have to dereference a pointer. Arrays are a special case: "m" (c) references the 5-byte contents of c[], not the 4 or 8-byte pointer value. But as soon as we start casting, we just have a pointer. Even a function argument like int foo(const char c[static 5]) works like a char*, not a char [5]. Anyway, the *(const uint32_t*)&c[1] is 4 bytes in memory from c[1] to c[3]. GCC warns about strict-aliasing with that cast, so maybe a struct { char c[4]; } would be better. (gcc8-snapshot 20170628 doesn't warn. Maybe the code is fine, or maybe the warning is broken in that unstable gcc version.)
// tightest constraints possible: a 4-byte input memory operand and a
// 5-byte output memory operand
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
        "movl %[cplus1], %[temp]\n\t"
        "movl %[temp], 1 + %[d]"
        : [d] "=&m" (d)   // references the contents of the whole array,
                          // not the pointer value or just d[0]
        , [temp] "=&r" (temp)
        : [cplus1] "m" (*(const uint32_t*)&c[1])
        :
        );
The code is looking pretty good now. Here's the code for the full function, as generated by GCC 6.3 on the Godbolt Compiler Explorer (with -O3 -m32 to generate 32-bit code like in the question):
subl $40, %esp
movl $1684234849, 18(%esp) # store 'abcd' into c
movb $0, 22(%esp) # NUL-terminate c
# begin inline-asm block
movb $'x', 23(%esp) # write initial 'x' into d[0]
movl 19(%esp), %eax # get 4 characters starting at c[1]
movl %eax, 1 + 23(%esp) # write those 4 characters into d, starting at d[1]
# end inline-asm block
leal 23(%esp), %eax # load address of d into EAX register
pushl %eax # push address of d[0] onto stack
call puts # call 'puts' to output string. printf("%s\n", d) optimizes to this.
xorl %eax, %eax
addl $44, %esp
ret
gcc decides to save a register by delaying the lea until after the asm block. With -m64, it does lea before the asm, but it still uses a stack-pointer address instead of the register it just set up. That lets the loads/stores run without waiting for the lea to execute, but it also wastes code-size. Since lea is fast, it's not what I'd do if writing by hand.
The "r" constraint version uses two separate subl instructions to reserve stack space: subl $28, %esp before initializing c[], and subl $12, %esp right before the asm block. This is just a missed optimization by the compiler, unlike the extra lea which is unavoidable.
Notice that this is much, much worse than the asm you'd get from the much more sensible:
d[0] = 'x';
memcpy(&d[1], &c[1], 4);
In that case, c[] optimizes away entirely and you get almost the same code that char d[] = "xbcd"; would produce. (See test_memcpy() in the Godbolt link above). The inline-asm version is only useful as an example or template for wrapping other memory-to-memory instruction sequences.
So how do we test that we got all the constraints right, allowing the compiler to optimize as far as correctness allows but no further? In this case, storing into c[] and d[] before and after the asm statement provides a good check. Recent gcc versions really will combine those stores into a single store either before or after if the constraints allow it. (clang won't, though.)
int optimize_test(void) {
    // static // const
    char c[5] = "abcd";
    char d[5];
    c[3] = 'O'; // **not** optimized away: part of the 32-bit input memory operand
    c[0] = '0'; // merged with the c[0]='1' after the asm, because the asm doesn't read this part of c[]
    d[3] = 'E'; // optimized away because the whole d[] is an output-only operand
    unsigned int temp;
    __asm__("movb $'x', %[d]\n\t"
            "movl %[cplus1], %[temp]\n\t"
            "movl %[temp], 1 + %[d]"
            : [d] "=&m" (d)   // references the contents of the whole array, not the pointer value or just d[0]
            , [temp] "=&r" (temp)
            : [cplus1] "m" (*(const uint32_t*)&c[1])
            :
            );
    c[0] = '1'; // these dead stores into c[] are not optimized away, for some reason. (Even with memcpy instead of an asm statement.)
    c[3] = 'M';
    d[3] = 'D';
    printf("%s\n", d);
    return 0;
}
There are a couple of additional tweaks that you could do with the inline assembly. For example, our clobbers are telling the compiler that it cannot re-use one of the input registers for the temp register, but it actually could. But these are all pretty subtle. If you actually cared about getting the best possible code from the compiler, you'd write the above code in C like I just showed.
There are many reasons not to use inline assembly, including performance: you'll probably just defeat the compiler's ability to optimize. If the compiler isn't doing a good job somewhere (for a specific compiler version for a specific target architecture), often you can coax it into making better assembly by just changing the C source, without resorting to inline asm. (Although it's often possible for an expert that really knows what they're doing to beat the compiler, this often requires writing the entire loop in asm and requires a significant investment in time. And if you don't know what you're doing, you can easily make it slower.)
If you're interested in learning assembly language, you should be using an assembler to write the code, not a C compiler. This is all just busy-work! It took me way too long to write this answer, and I had to get help from other experts to ensure that I got all of the constraints and clobbers precisely correct so as to cause optimal code to be generated, and I know what I'm doing! This would have been a 2-minute task in assembly:
lea eax, DWORD PTR [d]
lea edx, DWORD PTR [c+1]
mov BYTE PTR [eax], 'x'
mov edx, DWORD PTR [edx]
mov DWORD PTR [eax+1], edx
…and you can easily verify that it is correct!
Extra notes from @PeterCordes: If we can assume that these strings are constants/literals, then this would actually be much better:
mov DWORD PTR [d], 'xbcd' ; 0x64636278
mov BYTE PTR [d+4], 0
where d can be any addressing mode, for example [esp+6]. If we just want to pass the string to a function, writing in pure asm lets us do things like this that the compiler wouldn't, giving excellent code size and performance:
push 0 ; includes 3 extra bytes of 0 padding, but gcc was leaving more than that already
push 'xbcd' ; ESP is now pointing to the string data we just pushed
push esp ; pushes the old value. (push stack-pointer costs 1 extra uop on Intel CPUs, and AMD Ryzen, but the LEA or MOV we avoid would also be a uop).
call puts
Making the compiler store into c[] and then reloading that inside the asm statement is just silly. You could achieve this by passing in the data as a 4-byte integer with an "ri" constraint. Or maybe using if (__builtin_constant_p(data)) { } else { } to branch on whether the data was a compile-time constant or not.
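As a hedged sketch of that idea (store4 and its operand names are mine, not from the answer): with an "ri" constraint, a compile-time-constant value becomes an immediate store, while a runtime value arrives in a register.
#include <stdint.h>

/* gcc picks 'movl $imm32, mem' when data is a compile-time constant,
   or 'movl %reg, mem' otherwise. */
static inline void store4(uint32_t *dst, uint32_t data)
{
    __asm__("movl %[val], %[mem]"
            : [mem] "=m" (*dst)     /* exactly the 4 bytes written */
            : [val] "ri" (data));
}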
If the contents of c[] aren't supposed to be a compile-time constant, and if we can assume an offset load from c[] won't cause a store-forwarding stall, the general idea of Cody's final version is good:
lea rdi, [d] ; or "mov edi, OFFSET d" if you don't need a 64-bit RIP-relative LEA for PIC code
mov edx, DWORD PTR [c+1] ; load before store to avoid any potential false dependency
mov BYTE PTR [rdi], 'x'
mov DWORD PTR [rdi+1], edx
The lea is only worth it if we need d's address in a register afterwards (which we do in this case for printf / puts). Otherwise it's better to just use [d] and [d+1], even if the addressing mode needs a 32-bit displacement. (It doesn't in this case, since c and d are both on the stack).
Or, if there's padding after d[], and targeting 64-bit, we could load 8 bytes from c (if you know that the load won't cross into another page—a cache-line split on the load or store might also make this not worth it for perf reasons):
lea rdi, [d]
mov rdx, QWORD PTR [c]
mov QWORD PTR [rdi], rdx
mov BYTE PTR [rdi], 'x' ; overlapping store: rewrite the first byte
On some CPUs, e.g. Intel since Ivy Bridge, this will be good even if c[] was just written (avoids the store-forwarding stall):
mov edx, DWORD PTR [c]
mov dl, 'x' ; modify the low byte. reading edx later will cause a partial-reg stall on older Intel CPUs
mov byte ptr[d+4], 0
mov dword ptr[d], edx
There are other ways to replace the first byte, e.g. AND and OR, which avoid problems on older Intel CPUs.
This has the advantage that reading multiple bytes at once from the start of d[] won't suffer a store-forwarding stall, since the first 4 bytes are written with a store aligned to the start of d[].
Combining both previous ideas:
mov rdx, QWORD PTR [c]
mov dl, 'x'
mov QWORD PTR [d], rdx
As usual, the optimal choice strongly depends on context (surrounding code), and on target CPU microarchitecture (Nehalem vs. Skylake vs. Silvermont vs. Bulldozer vs. Ryzen ...)
First of all, your string-copy code did not produce an exception when I built it with gcc and ran it on my Windows PC. However, the copy was not happening, because your code assumes that register ecx points to variable d when it actually points to variable c. The following code copies the string contents of c into d, then replaces the first character of d with 'x'. Try compiling it with gcc.
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char c[5] = "abcd", d[5];
    __asm__(
        "leal %1, %%ecx\n"      /* ecx = address of c (the source) */
        "leal %0, %%ebx\n"      /* ebx = address of d (the destination) */
        "movb (%%ecx), %%al\n"
        "movb %%al, (%%ebx)\n"
        "movb 1(%%ecx), %%al\n"
        "movb %%al, 1(%%ebx)\n"
        "movb 2(%%ecx), %%al\n"
        "movb %%al, 2(%%ebx)\n"
        "movb 3(%%ecx), %%al\n"
        "movb %%al, 3(%%ebx)\n"
        "movb 4(%%ecx), %%al\n"
        "movb %%al, 4(%%ebx)\n"
        "movb $'x', (%%ebx)\n"
        : "=m" (d)              /* d is the output (written) */
        : "m" (c)               /* c is the input (read) */
        : "%ebx", "%ecx", "%eax"
    );
    printf("String d is: %s\n", d);
    printf("String c remains: %s\n", c);
    return 0;
}
Using the MinGW gcc compiler on a Windows PC, the following output is produced:
> gcc testAsm.c
> .\a.exe
String d is: xbcd
String c remains: abcd

What is the correct use of multiple input and output operands in extended GCC asm?

What is the correct use of multiple input and output operands in extended GCC asm with register constraints? Consider this minimal version of my problem, a brief GCC extended asm block in AT&T syntax:
int input0 = 10;
int input1 = 15;
int output0 = 0;
int output1 = 1;

asm volatile("mov %[input0], %[output0]\t\n"
             "mov %[input1], %[output1]\t\n"
             : [output0] "=r" (output0), [output1] "=r" (output1)
             : [input0] "r" (input0), [input1] "r" (input1)
             :);

printf("output0: %d\n", output0);
printf("output1: %d\n", output1);
The syntax appears correct based on https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. However, I must have overlooked something or be making some trivial mistake that I somehow can't see.
The output with GCC 5.3.0 p1.0 (no compiler arguments) is:
output0: 10
output1: 10
Expected output is:
output0: 10
output1: 15
Looking at it in GDB shows:
0x0000000000400581 <+43>: mov eax,DWORD PTR [rbp-0x10]
0x0000000000400584 <+46>: mov edx,DWORD PTR [rbp-0xc]
0x0000000000400587 <+49>: mov edx,eax
0x0000000000400589 <+51>: mov eax,edx
0x000000000040058b <+53>: mov DWORD PTR [rbp-0x8],edx
0x000000000040058e <+56>: mov DWORD PTR [rbp-0x4],eax
From what I can see it loads eax with input0 and edx with input1. It then overwrites edx with eax and eax with edx, making these equal. It then writes these back into output0 and output1.
If I use a memory constraint (=m) instead of a register constraint (=r) for the output, it gives the expected output and the assembly looks more reasonable.
The problem is that GCC assumes all output operands are written only at the end of the asm statement, after all input operands have been consumed. This means it can use the same register for an input operand and an output operand, which is what is happening here. The solution is to mark [output0] with an early-clobber constraint, so that GCC knows it is written before the end of the asm statement.
For example:
asm volatile("mov %[input0], %[output0]\t\n"
"mov %[input1], %[output1]\t\n"
: [output0] "=&r" (output0), [output1] "=r" (output1)
: [input0] "r" (input0), [input1] "r" (input1)
:);
You don't need to mark [output1] as early-clobber because it is only written at the very end of the asm statement, so it doesn't matter if it uses the same register as [input0] or [input1].
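For completeness, the memory-output variant the questioner mentioned would look like this (my reconstruction): each mov then stores directly to a distinct stack slot, so no output register can alias an input.
asm volatile("mov %[input0], %[output0]\t\n"
             "mov %[input1], %[output1]\t\n"
             : [output0] "=m" (output0), [output1] "=m" (output1)
             : [input0] "r" (input0), [input1] "r" (input1)
             :);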

How do I force gcc to use two different registers for operands in inline asm?

I'm using gcc 4.6.3 compiled for sam7s processor.
I need to use some inline assembly:
int res;
asm volatile ("MRS r0, CPSR   \n\t"
              "MOV %0, r0     \n\t"
              "BIC r0, r0, %1 \n\t"
              "MSR CPSR, r0   \n\t"
              : "=r" (res) : "r" (0xc0) : "r0" );
return res;
Which is translated by gcc to (comments added by me):
mov r3, #192 ; load 0xc0 to r3
str r0, [sl, #56] ; preserve value of r0?
mrs r0, CPSR ; load CPSR to r0
mov r3, r0 ; save r0 to "res"; r3 overwritten!
bic r0, r0, r3 ;
msr CPSR_fc, r0 ;
The problem is that the same register, r3, is used both for "%0" (res) and for "%1" (the constant 0xc0). Because of this, %1 is overwritten before it is used, and the code works incorrectly.
The question is: how can I forbid gcc from using the same register for input and output operands?
OK, I finally found it here:
& says that an output operand is written to before the inputs are read, so this output must not be the same register as any input. Without this, gcc may place an output and an input in the same register even if not required by a "0" constraint. This is very useful, but is mentioned here because it's specific to an alternative. Unlike = and %, but like ?, you have to include it with each alternative to which it applies.
After changing "=r" (res) to "=&r" (res) everything works fine.
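For reference, the working version (the code from the question with only the output constraint changed):
int res;
asm volatile ("MRS r0, CPSR   \n\t"
              "MOV %0, r0     \n\t"
              "BIC r0, r0, %1 \n\t"
              "MSR CPSR, r0   \n\t"
              : "=&r" (res)   /* early-clobber: written before %1 is last read */
              : "r" (0xc0)
              : "r0" );
return res;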

In GCC-style extended inline asm, is it possible to output a "virtualized" boolean value, e.g. the carry flag?

If I have the following C++ code to compare two 128-bit unsigned integers, with inline AMD64 asm:
struct uint128_t {
    uint64_t lo, hi;
};

inline bool operator< (const uint128_t &a, const uint128_t &b)
{
    uint64_t temp;
    bool result;
    __asm__(
        "cmpq %3, %2;"
        "sbbq %4, %1;"
        "setc %0;"
        : // outputs:
        /*0*/ "=r,1,2" (result),
        /*1*/ "=r,r,r" (temp)
        : // inputs:
        /*2*/ "r,r,r" (a.lo),
        /*3*/ "emr,emr,emr" (b.lo),
        /*4*/ "emr,emr,emr" (b.hi),
        "1" (a.hi));
    return result;
}
Then it will be inlined quite efficiently, but with one flaw. The return value is done through the "interface" of a general register with a value of 0 or 1. This adds two or three unnecessary extra instructions and detracts from a compare operation that would otherwise be fully optimized. The generated code will look something like this:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
setc al
movzx eax, al
test eax, eax
jnz is_lessthan
If I use "sbb %0,%0" with an "int" return value instead of "setc %0" with a "bool" return value, there's still two extra instructions:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
sbb eax, eax
test eax, eax
jnz is_lessthan
What I want is this:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
jc is_lessthan
GCC extended inline asm is wonderful, otherwise. But I want it to be just as good as an intrinsic function would be, in every way. I want to be able to directly return a boolean value in the form of the state of a CPU flag or flags, without having to "render" it into a general register.
Is this possible, or would GCC (and the Intel C++ compiler, which also allows this form of inline asm to be used) have to be modified or even refactored to make it possible?
Also, while I'm at it — is there any other way my formulation of the compare operator could be improved?
Here we are almost 7 years later, and YES, gcc finally added support for "outputting flags" (added in 6.1.0, released ~April 2016). The detailed docs are here, but in short, it looks like this:
/* Test if bit 0 is set in 'value' */
char a;
asm("bt $0, %1"
    : "=@ccc" (a)
    : "r" (value) );
if (a)
    blah;
To understand =@ccc: the output constraint (which requires =) is @cc followed by the condition code to use (in this case c, to reference the carry flag).
OK, this may not be an issue for your specific case anymore (since gcc now supports comparing 128-bit data types directly), but (currently) 1,326 people have viewed this question. Apparently there's some interest in this feature.
Now I personally favor the school of thought that says don't use inline asm at all. But if you must, yes you can (now) 'output' flags.
FWIW.
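Applied to the comparison from the question, a hedged sketch of what the feature enables (my adaptation; lt128 is a hypothetical name, and this assumes GCC 6.1+ targeting x86-64):
#include <stdint.h>
#include <stdbool.h>

struct uint128_t { uint64_t lo, hi; };

/* Output the carry flag directly, so no setc/movzx/test is emitted;
   when inlined, the compiler can branch on the result with a plain jc. */
static inline bool lt128(const struct uint128_t *a, const struct uint128_t *b)
{
    uint64_t hi = a->hi;                 /* clobbered by the sbbq */
    bool result;
    __asm__("cmpq %[blo], %[alo]\n\t"    /* CF = (a->lo < b->lo) */
            "sbbq %[bhi], %[ahi]"        /* CF = (a < b) as 128-bit unsigned */
            : "=@ccc" (result), [ahi] "+r" (hi)
            : [alo] "r" (a->lo), [blo] "rm" (b->lo), [bhi] "rm" (b->hi));
    return result;
}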
I don't know a way to do this. You may or may not consider this an improvement:
inline bool operator< (const uint128_t &a, const uint128_t &b)
{
    register uint64_t temp = a.hi;
    __asm__(
        "cmpq %2, %1;"
        "sbbq $0, %0;"
        : // outputs:
        /*0*/ "=r" (temp)
        : // inputs:
        /*1*/ "r" (a.lo),
        /*2*/ "mr" (b.lo),
        "0" (temp));
    return temp < b.hi;
}
It produces something like:
mov rdx, [r14]
mov rax, [r14+8]
cmp rdx, [r15]
sbb rax, 0
cmp rax, [r15+8]
jc is_lessthan
