GCC Inline Assembly side effects

GCC Inline Assembly side effects - gcc

Could someone explain me (in other words) the following section from GCC doc:
Here is a fictitious sum of squares instruction, that takes two pointers to floating point values in memory and produces a floating point register output. Notice that x, and y both appear twice in the asm parameters, once to specify memory accessed, and once to specify a base register used by the asm. You won’t normally be wasting a register by doing this as GCC can use the same register for both purposes. However, it would be foolish to use both %1 and %3 for x in this asm and expect them to be the same. In fact, %3 may well not be a register. It might be a symbolic memory reference to the object pointed to by x.
asm ("sumsq %0, %1, %2"
: "+f" (result)
: "r" (x), "r" (y), "m" (*x), "m" (*y));
Here is a fictitious *z++ = *x++ * *y++ instruction. Notice that the x, y and z pointer registers must be specified as input/output because the asm modifies them.
asm ("vecmul %0, %1, %2"
: "+r" (z), "+r" (x), "+r" (y), "=m" (*z)
: "m" (*x), "m" (*y));
In the first example what is the point for listing *x and *y in the input operands? The same doc states:
In particular, there is no way to specify that input operands get modified without also specifying them as output operands.
In the second example why use the input operands section at all? None of its operands is used in the assembly statement anyway.
And as a bonus, how could one change the following example from this SO post so there was no need for the volatile keyword?
void swap_2 (int *a, int *b)
{
int tmp0, tmp1;
__asm__ volatile (
"movl (%0), %k2\n\t" /* %2 (tmp0) = (*a) */
"movl (%1), %k3\n\t" /* %3 (tmp1) = (*b) */
"cmpl %k3, %k2\n\t"
"jle %=f\n\t" /* if (%2 <= %3) (at&t!) */
"movl %k3, (%0)\n\t"
"movl %k2, (%1)\n\t"
"%=:\n\t"
: "+r" (a), "+r" (b), "=r" (tmp0), "=r" (tmp1) :
: "memory" /* "cc" */ );
}
Thanks in advance. I'm struggling with this for two days now.

In the first example, *x and *y have to be listed as input operands so that GCC knows that the outcome of the instruction depends on them. Otherwise, GCC could move stores to *x and *y past the inline assembly fragment, which would then access uninitialized memory. This can be seen by compiling this example:
double
f (void)
{
double result;
double a = 5;
double b = 7;
double *x = &a;
double *y = &b;
asm ("sumsq %0, %1, %2"
: "+X" (result)
: "r" (x), "r" (y) /*, "m" (*x), "m" (*y)*/);
return result;
}
Which results in:
f:
leaq -16(%rsp), %rax
leaq -8(%rsp), %rdx
pxor %xmm0, %xmm0
#APP
# 8 "t.c" 1
sumsq %xmm0, %rax, %rdx
# 0 "" 2
#NO_APP
ret
The two leaq instructions just set up the registers to point into the uninitialized red zone on the stack. The assignments are gone.
The same is true for the second example as well.
I think you can use the same trick to eliminate the volatile. But I think it is not actually necessary here because there already is a "memory" clobber, which tells GCC that memory is read or written from inline assembly.

Related

How does the inline assembly in this compare-exchange function work? (%H modifier on ARM)

static inline unsigned long long __cmpxchg64(unsigned long long *ptr,unsigned long long old,unsigned long long new)
{
unsigned long long oldval;
unsigned long res;
prefetchw(ptr);
__asm__ __volatile__(
"1: ldrexd %1, %H1, [%3]\n"
" teq %1, %4\n"
" teqeq %H1, %H4\n"
" bne 2f\n"
" strexd %0, %5, %H5, [%3]\n"
" teq %0, #0\n"
" bne 1b\n"
"2:"
: "=&r" (res), "=&r" (oldval), "+Qo" (*ptr)
: "r" (ptr), "r" (old), "r" (new)
: "cc");
return oldval;
}
I find in gnu manual (extend extended-asm) that 'H' in '%H1' means 'Add 8 bytes to an offsettable memory reference'.
But I think if I want to load double word long data to oldval (a long long value), it should be add 4 bytes to the '%1' which is the low 32 bits of oldval as the high 32 bits of oldval. So what is my mistake?

I find in gnu manual(extend extended-asm) that 'H' in '%H1' means 'Add 8 bytes to an offsettable memory reference'.
That table of template modifiers is for x86 only. It is not applicable to ARM.
The template modifiers for ARM are unfortunately not documented in the GCC manual, but they are defined in the armclang manual and GCC conforms to those definitions as far as I can tell. So the correct meaning of the H template modifier here is:
The operand must use the r constraint, and must be a 64-bit integer or floating-point type. The operand is printed as the highest-numbered register holding half of the value.
Now this makes sense. Operand 1 to the inline asm is oldval which is of type unsigned long long, 64 bits, so the compiler will allocate two consecutive 32-bit general purpose registers for it. Let's say they are r4 and r5 as in this compiled output. Then %1 will expand to r4, and %H1 will expand to r5, which is exactly what the ldrexd instruction needs. Likewise, %4, %H4 expanded to r2, r3, and %5, %H5 expanded to fp, ip, which are alternative names for r11, r12.
The answer by frant explains what a compare-exchange is supposed to do. (The spelling cmpxchg might come from the mnemonic for the x86 compare-exchange instruction.) And if you read through the code now, you should see that it does exactly that. The teq; teqeq; bne between ldrexd and strexd will abort the store if old and *ptr were unequal. And the teq; bne after strexd will cause a retry if the exclusive store failed, which happens if there was an intervening access to *ptr (by another core, interrupt handler, etc). That is how atomicity is ensured.

gcc inline assembly behave strangely

I am learning GCC's extended inline assembly currently. I wrote an A + B function and wants to detect the ZF flag, but things behave strangely.
The compiler I use is gcc 7.3.1 on x86-64 Arch Linux.
I started from the following code, this code will correctly print the a + b.
int a, b, sum;
scanf("%d%d", &a, &b);
asm volatile (
"movl %1, %0\n"
"addl %2, %0\n"
: "=r"(sum)
: "r"(a), "r"(b)
: "cc"
);
printf("%d\n", sum);
Then I simply added a variable to check flags, it gives me wrong output.
int a, b, sum, zero;
scanf("%d%d", &a, &b);
asm volatile (
"movl %2, %0\n"
"addl %3, %0\n"
: "=r"(sum), "=#ccz"(zero)
: "r"(a), "r"(b)
: "cc"
);
printf("%d %d\n", sum, zero);
The GAS assembly output is
movl -24(%rbp), %eax # %eax = a
movl -20(%rbp), %edx # %edx = b
#APP
# 6 "main.c" 1
movl %eax, %edx
addl %edx, %edx
# 0 "" 2
#NO_APP
sete %al
movzbl %al, %eax
movl %edx, -16(%rbp) # sum = %edx
movl %eax, -12(%rbp) # zero = %eax
This time, the sum will become a + a. But when I just exchanged %2 and %3, the output will be correct a + b.
Then I checked various gcc version (It seems clang does not support it when output is a flag) on wandbox.org, from version 4.5.4 to version 4.7.4 gives the correct result a + b, and starting from version 4.8.1 the outputs are all a + a.
My question is: did I write the wrong code or is there anything wrong with gcc?

The problem is that you clobber %0 before all the inputs (%2 in your case) are consumed:
"movl %1, %0\n"
"addl %2, %0\n"
%0 is being modified by the first MOV before %2 has been consumed. It is possible for an optimizing compiler to re-use a register for an input constraint that was used for an output constraint. In your case one of the compilers chose to use the same register for %2 and %0 which caused the erroneous results.
To get around this problem of changing a register that is being modified before all the inputs are consumed is to mark the output constraint with a &. The & is a modifier denoting Early Clobber:
‘&’
Means (in a particular alternative) that this operand is an earlyclobber operand, which is written before the instruction is finished using the input operands. Therefore, this operand may not lie in a register that is read by the instruction or as part of any memory address.
‘&’ applies only to the alternative in which it is written. In constraints with multiple alternatives, sometimes one alternative requires ‘&’ while others do not. See, for example, the ‘movdf’ insn of the 68000.
A operand which is read by the instruction can be tied to an earlyclobber operand if its only use as an input occurs before the early result is written. Adding alternatives of this form often allows GCC to produce better code when only some of the read operands can be affected by the earlyclobber. See, for example, the ‘mulsi3’ insn of the ARM.
Furthermore, if the earlyclobber operand is also a read/write operand, then that operand is written only after it’s used.
‘&’ does not obviate the need to write ‘=’ or ‘+’. As earlyclobber operands are always written, a read-only earlyclobber operand is ill-formed and will be rejected by the compiler.
The change to your code would be to modify "=r"(sum) to be "=&r"(sum). This will prevent the compiler from using the register used for the output constraint for one of the input constraints.
Word of warning. GCC Inline Assembly is powerful and evil. Very easy to get wrong if you don't know what you are doing. Only use it if you must, avoid it if you can.

How do I copy a string to another string with inline assembly?

This simple code should copy the string "c" in the string "d", changing only the first char to 'x':
#include <stdio.h>
#include <stdlib.h>
int main(void) {
char c[5] = "abcd", d[5];
__asm__(
"leal %1, %%ebx\n"
"leal %0, %%ecx\n"
"movb $'x', (%%ecx)\n"
"movb 1(%%ebx), %%al\n"
"movb %%al, 1(%%ecx)\n"
"movb 2(%%ebx), %%al\n"
"movb %%al, 2(%%ecx)\n"
"movb 3(%%ebx), %%al\n"
"movb %%al, 3(%%ecx)\n"
"movb $0, 4(%%ecx)\n"
:"=m"(c)
:"m"(d)
:"%ebx", "%ecx", "%eax"
);
printf("%s\n", d);
return 0;
}
But it gives a segmentation fault error. I believe my problem is with the constraints, but I can't figure how to fix this.
What is the right way, and how can I change this code to work?

Yes, the input/output operands are wrong. The format is this:
__asm__("<instructions>\n\t"
: OutputOperands
: InputOperands
: Clobbers);
You have the inputs and outputs backwards. You have c as an output, when it should be an input (since you're reading from it). You have d as an input, when it should be an output (since you're writing c to it).
Thus, your inline assembly should be written as follows:
__asm__("leal %1, %%ebx\n\t"
"leal %0, %%ecx\n\t"
"movb $'x', (%%ecx)\n\t"
"movb 1(%%ebx), %%al\n\t"
"movb %%al, 1(%%ecx)\n\t"
"movb 2(%%ebx), %%al\n\t"
"movb %%al, 2(%%ecx)\n\t"
"movb 3(%%ebx), %%al\n\t"
"movb %%al, 3(%%ecx)\n\t"
"movb $0, 4(%%ecx)"
: "=m" (d)
: "m" (c)
: "%ebx", "%ecx", "%eax"
);
But, you are also not making the most efficient use of the operands. You have several manual load operations (lea) that you've written in assembly. You don't need to write these; that's the whole point of the Gnu extended inline assembly syntax—the compiler will generate the necessary load and store instructions for you. Not only does that make the code simpler and easier to write and maintain, but it also makes it more efficient, because the compiler can better schedule/arrange the loads and stores within surrounding code, and skip the lea instructions entirely.
Making these modifications to use operands more efficiently, as well as using names for the operands to make the code easier to read, you would have:
__asm__("movb $'x', (%[d])\n\t"
"movb 1(%[c]), %%al\n\t"
"movb %%al, 1(%[d])\n\t"
"movb 2(%[c]), %%al\n\t"
"movb %%al, 2(%[d])\n\t"
"movb 3(%[c]), %%al\n\t"
"movb %%al, 3(%[d])\n\t"
"movb $0, 4(%[d])"
: "=m" (d) // dummy arg: tell the compiler we write all of d[]
: [c] "r" (c)
, [d] "r" (d)
, "m" (c) // unused dummy arg: tell the compiler we read all of c[]
: "%eax"
);
We're asking the compiler for the pointers to be in registers with the r constraint, so we can still choose the addressing mode (reg+displacement) ourselves, as in your original. This causes the compiler to implicitly generate the two required lea instructions. Not only does this make the code simpler to write, but it also lets the compiler choose which registers it wants to use, which can make the code more efficient. (For example, it needs d in %rdi as an arg for printf. Compiler-generated setup for asm statements is optimized along with normal code, so it doesn't have to repeat this work like it would if you wrote the lea explicitly in asm. Leave as much as possible to the compiler, so it can optimize away when possible.)
Note that asking for a pointer with an r constraint doesn't imply that you dereference it. Thus, we use "m" and "=m" dummy memory operands to tell the compiler what memory is read and written, so it will ensure that the contents match program-order even in a more complex case where your function is inlined into another function that modifies c[] and d[] before and after. This works well because c[] and d[] are true C arrays, with static size. It wouldn't work if they were just pointers that you got from function args. In that case, "=m" (d) would tell the compiler that the asm writes a pointer value into a memory location, not the pointed-to contents. "=m" (*d) would tell the compiler that the asm writes the first byte. As the official docs point out, you could write something ugly using a GNU C statement-expression like:
{"m"( ({ const struct { char x[5]; } *p = (const void *)c ; *p; }) )}
Or, you could instead use a "memory" clobber to tell the compiler that all memory must be in sync. With no output operands at all, the asm block would be implicitly __volatile__, which also prevents reordering. But if you had one unused dummy output to let the compiler choose a scratch register (see below), and didn't manually use __volatile__, then the compiler would prove to itself that you never use the results and optimize out the entire block of inline assembly! (It's better to tell the compiler in as much detail as possible how your asm interacts with C variables, rather than relying on __volatile__.)
Letting the compiler choose the addressing mode will work fine for us. It avoids an extra compiler-generated lea instruction ahead of the asm block, and it simplifies the constraints because we actually use the memory operands instead of separately asking for pointers in registers.
(The compiler could still have avoided an lea in the other version if it put c[] or d[] at esp+0, so the pointer-register operand could be esp).
__asm__("movb $'x', %[d]\n\t"
"movb 1 + %[c], %%al\n\t"
"movb %%al, 1 + %[d]\n\t"
"movb 2 + %[c], %%al\n\t"
"movb %%al, 2 + %[d]\n\t"
"movb 3 + %[c], %%al\n\t"
"movb %%al, 3 + %[d]\n\t"
"movb $0, 4 + %[d]"
: [d] "=&m" (d) // not sure if early-clobber is needed,
// e.g. if the compiler would otherwise be allowed to put an output memory operand at the same address as an input operand.
// It's an error with gcc 4.7 and earlier, but gcc that old also doesn't accept "m"(c) as an input memory operand
: [c] "m" (c)
: "%eax"
);
See also Looping over arrays with inline assembly for more discussion of picking addressing mode yourself vs. using "m" constraints to let the compiler pick. (If you don't want to get into that level of optimization, you probably shouldn't be using inline asm in the first place.)
The compiler will turn 3 + %[c] into something like 3 + 6(%rsp), which the assembler will evaluate the same as 9(%rsp). Fortunately, it's not a syntax error if the substitution ends up producing 3 + (%rdi). (You do get a warning, though: Warning: missing operand; zero assumed).
It would also be correct to use an "o" constraint to request an "offsetable" memory operand, but all x86 addressing modes are offsetable (you can add a compile-time-constant displacement and they're still valid), so "m" should always work. (It would be nice if "o" would add an explicit 0 to avoid the assembler warning, but it doesn't).
But we're not done with possible optimizations yet. We're still forcing the compiler to clobber the eax register when we don't actually need to use that one—any general-purpose register will do. So, we introduce another output, this time a write-only (but early-clobber) temporary stored in a register:
char temp;
__asm__("movb $'x', %[d]\n\t"
"movb 1 + %[c], %[temp]\n\t"
"movb %[temp], 1 + %[d]\n\t"
"movb 2 + %[c], %[temp]\n\t"
"movb %[temp], 2 + %[d]\n\t"
"movb 3 + %[c], %[temp]\n\t"
"movb %[temp], 3 + %[d]\n\t"
"movb $0, 4 + %[d]"
: [d] "=&m" (d)
, [temp] "=&r" (temp)
: [c] "m" (c)
: // no clobbers
);
The early-clobber is necessary to stop the compiler from choosing a register that is also used in the addressing-modes for c or d. The asm syntax is designed to efficiently wrap a single instruction which reads all its inputs before writing any of its outputs.
Okay, we've made the interface between the inline assembly block and the surrounding compiler-generated code pretty much optimal—but let's look at the actual assembly language instructions we're using inside of it. These are far from optimal: we're writing one byte at a time when we could be writing four bytes at a time! (And, on a 64-bit build, we could be writing eight bytes at a time, but that wouldn't help us here.) So, let's just do:
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl 1 + %[c], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d)
, [temp] "=&r" (temp)
: [c] "m" (c)
:
);
This writes the first byte (an 'x' character) into d, and then copies 4 bytes from c into d. That will include the terminating NUL character from c (automatically appended to string literals by a C or C++ compiler), so the string in d is already NUL-terminated without needing to append an additional byte.
Shorter and faster, except for the store-forwarding stall from reading the last 4 bytes of c[] right after the compiler-generated initialization code stored the first 4 bytes and then a separate byte store of the terminating 0. You wouldn't have this problem if you used static const char c[] = "abcd";, (because then it would be in static storage instead of stored to the stack with mov-immediate every time the function runs), or if c[] was a function arg that probably wasn't just written. Out-of-order execution can hide the store-forwarding stall latency, so it's probably worth it if c[] is usually not just-written.
Notice that we are not reading from the first character of c—we just offset it as part of the movl instruction. We could tell the compiler about that to allow it to optimize by moving stores to c[0] across the asm statement. We could even ask for a [cplus1] "r" (&c[1]) input operand, which would be good if we needed the address in a register. (See the original version of this answer for that.)
Since it's exactly 4 bytes, we can cast to a 4-byte integer type, rather than defining a struct with a char[4] member or something. Remember that a memory operand refers to a value in memory, so you have to dereference a pointer. Arrays are a special case: "m" (c) references the 5-byte contents of c[], not the 4 or 8-byte pointer value. But as soon as we start casting, we just have a pointer. Even a function argument like int foo(const char c[static 5]) works like a char*, not a char [5]. Anyway, the *(const uint32_t*)&c[1] is 4 bytes in memory from c[1] to c[3]. GCC warns about strict-aliasing with that cast, so maybe a struct { char c[4]; } would be better. (gcc8-snapshot 20170628 doesn't warn. Maybe the code is fine, or maybe the warning is broken in that unstable gcc version.)
// tightest constraints possible: 4 byte input memory operand, 5 byte output operand
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl %[cplus1], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d) // references the contents of the whole array, not the pointer-value or just d[0]
, [temp] "=&r" (temp)
: [cplus1] "m" (*(const uint32_t*)&c[1])
:
);
The code is looking pretty good now. Here's the code for the full function, as generated by GCC 6.3 on the Godbolt Compiler Explorer (with -O3 -m32 to generate 32-bit code like in the question):
subl $40, %esp
movl $1684234849, 18(%esp) # store 'abcd' into c
movb $0, 22(%esp) # NUL-terminate c
# begin inline-asm block
movb $'x', 23(%esp) # write initial 'x' into d[0]
movl 19(%esp), %eax # get 4 characters starting at c[1]
movl %eax, 1 + 23(%esp) # write those 4 characters into d, starting at d[1]
# end inline-asm block
leal 23(%esp), %eax # load address of c[1] into EAX register
pushl %eax # push address of d[0] onto stack
call puts # call 'puts' to output string. printf("%s\n", d) optimizes to this.
xorl %eax, %eax
addl $44, %esp
ret
gcc decides to save a register by delaying the lea until after the asm block. With -m64, it does lea before the asm, but it still uses a stack-pointer address instead of the register it just set up. That lets the loads/stores run without waiting for the lea to execute, but it also wastes code-size. Since lea is fast, it's not what I'd do if writing by hand.
The "r" constraint version uses two separate subl instructions to reserve stack space: subl $28, %esp before initializing c[], and subl $12, %esp right before the asm block. This is just a missed optimization by the compiler, unlike the extra lea which is unavoidable.
Notice that this is much much worse than the asm you'd get from the much more sensible:
d[0] = 'x';
memcpy(&d[1], &c[1], 4);
In that case, c[] optimizes away entirely and you get almost the same code that char d[] = "xbcd"; would produce. (See test_memcpy() in the Godbolt link above). The inline-asm version is only useful as an example or template for wrapping other memory-to-memory instruction sequences.
So how do we test that we got all the constraints right, allowing the compiler to optimize as far as correctness allows but no further? In this case, storing into c[] and d[] before and after the asm statement provides a good check. Recent gcc versions really will combine those stores into a single store either before or after if the constraints allow it. (clang won't, though.)
int optimize_test(void) {
// static // const
char c[5] = "abcd";
char d[5];
c[3] = 'O'; // **not** optimized away: part of the 32-bit input memory operand
c[0] = '0'; // merged with the c[0]='1' after the asm, because the asm doesn't read this part of c[]
d[3] = 'E'; // optimized away because the whole d[] is an output-only operand
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl %[cplus1], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d) // references the contents of the whole array, not the pointer-value or just d[0]
, [temp] "=&r" (temp)
: [cplus1 "m" (*(const uint32_t*)&c[1])
:
);
c[0] = '1'; // these dead stores into c[] are not optimized away, for some reason. (Even with memcpy instead of an asm statement.)
c[3] = 'M';
d[3] = 'D';
printf("%s\n", d);
return 0;
}
There are a couple of additional tweaks that you could do with the inline assembly. For example, our clobbers are telling the compiler that it cannot re-use one of the input registers for the temp register, but it actually could. But these are all pretty subtle. If you actually cared about getting the best possible code from the compiler, you'd write the above code in C like I just showed.
There are many reasons not to use inline assembly, including performance: you'll probably just defeat the compiler's ability to optimize. If the compiler isn't doing a good job somewhere (for a specific compiler version for a specific target architecture), often you can coax it into making better assembly by just changing the C source, without resorting to inline asm. (Although it's often possible for an expert that really knows what they're doing to beat the compiler, this often requires writing the entire loop in asm and requires a significant investment in time. And if you don't know what you're doing, you can easily make it slower.)
If you're interested in learning assembly language, you should be using an assembler to write the code, not a C compiler. This is all just busy-work! It took me way too long to write this answer, and had to get help from other experts to ensure that I got all of the constraints and clobbers precisely correct so as to cause optimal code to be generated, and I know what I'm doing! This would have been a 2-minute task in assembly:
lea eax, DWORD PTR [d]
lea edx, DWORD PTR [c+1]
mov BYTE PTR [eax], 'x'
mov edx, DWORD PTR [edx]
mov DWORD PTR [eax+1], edx
…and you can easily verify that it is correct!
Extra notes from #PeterCordes: If we can assume that these strings are constants/literals, then this would actually be much better:
mov DWORD PTR [d], 'xbcd' ; 0x64636278
mov BYTE PTR [d+4], 0
where d can be any addressing mode, for example [esp+6]. If we just want to pass the string to a function, writing in pure asm lets us do things like this that the compiler wouldn't, giving excellent code size and performance:
push 0 ; includes 3 extra bytes of 0 padding, but gcc was leaving more than that already
push 'xbcd' ; ESP is now pointing to the string data we just pushed
push esp ; pushes the old value. (push stack-pointer costs 1 extra uop on Intel CPUs, and AMD Ryzen, but the LEA or MOV we avoid would also be a uop).
call puts
Making the compiler store into c[] and then reloading that inside the asm statement is just silly. You could achieve this by passing in the data as a 4-byte integer with an "ri" constraint. Or maybe using if (__builtin_constant_p(data)) { } else { } to branch on whether the data was a compile-time constant or not.
If the contents of c[] aren't supposed to be a compile-time constant, and if we can assume an offset load from c[] won't cause a store-forwarding stall, the general idea of Cody's final version is good:
lea rdi, [d] ; or "mov edi, OFFSET d" if you don't need a 64-bit RIP-relative LEA for PIC code
mov edx, DWORD PTR [c+1] ; load before store to avoid any potential false dependency
mov BYTE PTR [rdi], 'x'
mov DWORD PTR [rdi+1], edx
The lea is only worth it if we need d's address in a register afterwards (which we do in this case for printf / puts). Otherwise it's better to just use [d] and [d+1], even if the addressing mode needs a 32-bit displacement. (It doesn't in this case, since c and d are both on the stack).
Or, if there's padding after d[], and targeting 64-bit, we could load 8 bytes from c (if you know that the load won't cross into another page—a cache-line split on the load or store might also make this not worth it for perf reasons):
lea rdi, [d]
mov rdx, QWORD PTR [c]
mov QWORD PTR [rdi], rdx
mov BYTE PTR [rdi], 'x' ; overlapping store: rewrite the first byte
On some CPUs, e.g. Intel since Ivy Bridge, this will be good even if c[] was just written (avoids the store-forwarding stall):
mov edx, DWORD PTR [c]
mov dl, 'x' ; modify the low byte. reading edx later will cause a partial-reg stall on older Intel CPUs
mov byte ptr[d+4], 0
mov dword ptr[d], edx
There are other ways to replace the first byte, e.g. AND and OR, which avoid problems on older Intel CPUs.
This has the advantage that reading multiple bytes at once from the start of d[] won't suffer a store-forwarding stall, since the first 4 bytes are written with a store aligned to the start of d[].
Combining both previous ideas:
mov rdx, QWORD PTR [c]
mov dl, 'x'
mov QWORD PTR [d], rdx
As usual, the optimal choice strongly depends on context (surrounding code), and on target CPU microarchitecture (Nehalem vs. Skylake vs. Silvermont vs. Bulldozer vs. Ryzen ...)

First of all, your code for string copy did not result in an exception, when I built using gcc and executed on my Windows PC. However, the string copy was not happening because your code appears to assume that register ecx points to variable d, when it actually points to variable c. The following code copies string contents of variable c to d, then replaces the first character in array d, with x. Try Compiling with gcc.
#include <stdio.h>
#include <stdlib.h>
int main(void) {
char c[5] = "abcd", d[5];
__asm__(
"leal %1, %%ebx\n"
"leal %0, %%ecx\n"
"movb (%%ecx), %%al\n"
"movb %%al, (%%ebx)\n"
"movb 1(%%ecx), %%al\n"
"movb %%al, 1(%%ebx)\n"
"movb 2(%%ecx), %%al\n"
"movb %%al, 2(%%ebx)\n"
"movb 3(%%ecx), %%al\n"
"movb %%al, 3(%%ebx)\n"
"movb 4(%%ecx), %%al\n"
"movb %%al, 4(%%ebx)\n"
"movb $'x', (%%ebx)\n"
:"=m"(c)
:"m"(d)
:"%ebx", "%ecx", "%eax"
);
printf("String d is: %s\n", d);
printf("String c remains: %s\n", c);
return 0;
}
When using MinGW gcc compiler on Windows PC, the following out put is produced:
> gcc testAsm.c
> .\a.exe
String d is: xbcd
String c remains: abcd

Convert Pentium II timing code into inline assembly?

I am trying to use the following code in GCC. It is throwing errors(I guess because of __asm). Why is this simple and easy format is not working in GCC? Syntax of extended assembly is provided here. I am getting confused, when it comes to use of more variables in the inline assembly. Can some one convert the following program to appropriate form and give necessary explanation where ever there is use of variables.
int time, subtime;
float x = 5.0f;
__asm {
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax // Only the last value of subtime is kept
// subtime should now represent the overhead cost of the
// MOV and CPUID instructions
fld x
fld x
cpuid // Serialize execution
rdtsc // Read time stamp to EAX
mov time, eax
fdiv // Perform division
cpuid // Serialize again for time-stamp read
rdtsc
sub eax, time // Find the difference
mov time, eax
}
.

Your question is effectively a code conversion question, which is generally off-topic for Stackoverflow. An answer however may be beneficial to other readers.
This code is a conversion of the original source material, and is not meant as an enhancement. The actual FDIV/FDIVP and the FLD can be reduced to a single FLD and a FDIV/FDIVP since you are dividing a float value by itself. As Peter Cordes points out though, you can just load the top of stack with a value 1.0 with FLD1. This would work since dividing any number by itself (besides 0.0) will take the same time as dividing 5.0 by itself. This would remove the need for passing the variable x into the assembler template.
The code you are using is a variation of what was documented by Intel 20 years ago for the Pentium IIs. A discussion of what is going on for that processor is described. The variation is that the code you are using doesn't do the warm up described in that document. I do not believe this mechanism will work overly well on modern processors and OSes (be warned).
The code in question is intended to measure time it takes for a single FDIV instruction to complete. Assuming you actually want to convert this specific code you will have to use GCC extended assembler templates. Extended assembler templates are not easy to use for a first time GCC developer. For assembler code you might even consider putting the code into a separate assembly file, assemble it separately, and call it from C.
Assembler templates use input constraints and output constraints to pass data into and out of the template (unlike MSVC).It also uses a clobber list to specify registers that may have been altered that don't appear as an input or output. By default GCC inline assembly uses ATT syntax instead of INTEL.
The equivalent code using extended assembler with ATT syntax could look like this:
#include <stdio.h>
int main()
{
int time, subtime;
float x = 5.0f;
int temptime;
__asm__ (
"rdtsc\n\t"
"mov %%eax, %[subtime]\n\t"
"cpuid\n\t"
"rdtsc\n\t"
"sub %[subtime], %%eax\n\t"
"mov %%eax, %[subtime]\n\t"
/* Only the last value of subtime is kept
* subtime should now represent the overhead cost of the
* MOV and CPUID instructions */
"flds %[x]\n\t"
"flds %[x]\n\t" /* Alternatively use fst to make copy */
"cpuid\n\t" /* Serialize execution */
"rdtsc\n\t" /* Read time stamp to EAX */
"mov %%eax, %[temptime]\n\t"
"fdivp\n\t" /* Perform division */
"cpuid\n\t" /* Serialize again for time-stamp read */
"rdtsc\n\t"
"sub %[temptime], %%eax\n\t"
"fstp %%st(0)\n\t" /* Need to clear FPU stack before returning */
: [time]"=a"(time), /* 'time' is returned via the EAX register */
[subtime]"=r"(subtime), /* return reg for subtime */
[temptime]"=r"(temptime) /* Temporary reg for computation
This allows compiler to choose
a register for temporary use. Register
only for BOTH so subtime and temptime
calc are based on a mov reg, reg */
: [x]"m"(x) /* X is a MEMORY reference (required by FLD) */
: "ebx", "ecx", "edx"); /* Registers clobbered by CPUID
but not listed as input/output
operands */
time = time - subtime; /* Subtract the overhead */
printf ("%d\n", time); /* Print total time of divide to screen */
return 0;
}

gcc, icc and visual c, they all have very different syntax for inline assembler (This is not part of the C standard). The GCC is a bit more complex, but also more efficient, since you tell the compiler which registers are used for what, and which registers that are clobbered (used).
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
http://asm.sourceforge.net/articles/rmiyagi-inline-asm.txt
My gcc assembler is a bit rusty (a couple of years since I played with it), so there might be some mistakes there
int main(int argc, char *argv[])
{
int time=0, subtime = 100;
const float x = 5.0f;
asm (
"xorl %%eax, %%eax \n" /* make sure eax is a known value befeore cpuid */
"cpuid \n"
"rdtsc \n"
"movl %%eax, %[aSubtime] \n"
"cpuid \n"
"rdtsc \n"
"subl %[aSubtime], %%eax \n"
// subtime should now represent the overhead cost of the
// MOV and CPUID instructions
"fld %[ax] \n"
"fld %[ax] \n"
"cpuid \n" // Serialize execution
"rdtsc \n" // Read time stamp to EAX
"movl %%eax, %[atime] \n"
"fdivp \n" // Perform division
"cpuid \n" // Serialize again for time-stamp read
"rdtsc \n"
"subl %[atime], %%eax \n"
// "movl %%eax, %2 \n" Not needed, since we tell the compiler that asm exists with time in eax
: "=a" (time) /* time is outputed in eax */
: [aSubtime] "m" (subtime),
[ax] "m" (x),
[atime] "m" (time)
: "ebx", "ecx", "edx"
);
/* FPU is currently left in a pushed state here */
return 0;
}

How do I return a byte or short from inline assembly

When I compile:
short foo (short x)
{
__asm ("mov %0, 0" : "=r" (x));
return x;
}
I get
foo:
mov r0, 0
sxth r0, r0
bx lr
I don't want the extra instruction sxth (signed extract halfword).
The reason it is there is because the ABI (ARM AEABI in this case) requires that the function return value is sign extended to a full register, but GCC doesn't know what my assembly code might have done to the register it gave me.
How do I tell GCC that I promise to leave the register correctly sign extended so it doesn't have to insert this extra instruction?
I would have thought that it could infer this from the fact that I gave it a short as the expression for the "=r" constraint.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio