The difference between GCC inline assembly and VC inline assembly - gcc

I am migrating VC inline assembly code to GCC inline assembly code.
#ifdef _MSC_VER
// this is raw code.
__asm
{
    cmp cx, 0x2e
    mov dword ptr ds:[esi*4+0x57F0F0], edi
    jmp BW::BWFXN_RefundMin4ReturnAddress
}
#else
// this is my code.
asm
(
    "cmp $0x2e, %%cx\n\t"
    "movl %%edi, $ds:0x57F0F0(0, %%esi, 4)\n\t"
    "jmp %0"
    : /* no output */
    : "i"(BW::BWFXN_RefundGas3ReturnAddress)
    : "cx"
);
#endif
But I got these errors:
Error:junk `:0x57F0F0(0,%esi,4)' after expression
/var/.../cckr7pkp.s:3034: Error:operand type mismatch for `mov'
/var/.../cckr7pkp.s:3035: Error:operand type mismatch for `jmp'
Referring to the address operand syntax:
segment:displacement(base register, offset register, scalar multiplier)
is equivalent to
segment:[base register + displacement + offset register * scalar multiplier]
in Intel syntax.
I don't know where the issue is.

This is highly unlikely to work just from getting the syntax correct, because you're depending on values in registers being set to something before the asm statement, and you aren't using any input operands to make that happen. (And for some reason you need to set flags with cmp before jumping?)
If that fragment worked on its own somehow in MSVC, then your code depends on the choices made by MSVC's optimizer (as far as which C value is in which register), which seems insane.
Anyway, the first answer to any inline asm question is https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Now might be a good time to rewrite your thing in C (maybe with some __builtin functions if needed).
You should use asm volatile and a "memory" clobber at the very least. The compiler assumes that execution continues after the asm statement, but at least this will make sure it stores everything to memory before the asm, i.e. it's a full memory barrier (against compile-time reordering). But any destructors at the end of the function (or in the callers) won't run, and no stack cleanup will happen; there's really no way to make this safe.
You might be able to use asm goto, but that might only work for labels within the same function.
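For reference, here is a minimal asm goto sketch (my own illustration, not part of the original answer): the only jump targets allowed are C labels in the same function, listed after the final colon and referenced as %l[name].
// minimal asm goto sketch: branch to a C label in the same function
void maybe_bail(int x)
{
    asm goto("test %0, %0 \n\t"
             "jnz %l[bail]"
             : /* no output operands allowed with asm goto (before GCC 11) */
             : "r"(x)
             : "cc"
             : bail);
    return;
bail:
    /* the branch was taken */
    return;
}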
As far as syntax goes, leave out %%ds: because it's the default segment anyway. (Everything after $ds was considered junk because $ds is the address of the symbol ds. Register names start with %.) Also, just leave out the base entirely, instead of using zero. Use
"movl %%edi, 0x57F0F0( ,%%esi, 4) \n\t"
You could have got a disassembler to tell you how to write that, by assembling the Intel version and disassembling in AT&T syntax.
You can probably implement that store in pure C easily enough, e.g. int32_t *p = (int32_t *)0x57F0F0; p[foo]=bar;.
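Spelled out as a tiny helper (a sketch only; "index" and "value" are placeholders for whatever the original code keeps in esi and edi):
#include <stdint.h>

// volatile because 0x57F0F0 is a fixed external address the compiler
// knows nothing about.
static inline void store_at_table(uint32_t index, int32_t value)
{
    volatile int32_t *table = (volatile int32_t *)0x57F0F0;
    table[index] = value;
}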
For the jmp operand, use %c0 to get the address with no $ so the compiler's asm output is jmp 0x12345 instead of jmp $0x12345. See also https://stackoverflow.com/tags/inline-assembly/info for more links to guides + docs.
You can and should look at the gcc -O2 -S output to see what the compiler is feeding to the assembler. i.e. how exactly it's filling in the asm template.
I tested this on Godbolt to make sure it compiles, and to see the asm output + disassembly output:
void ext(void);
long foo(int a, int b) { return 0; }
static const unsigned my_addr = 0x000045678;
//__attribute__((noinline))
void testasm(void)
{
    asm volatile( // and still not safe in general
        "movl %%edi, 0x57F0F0( ,%%esi, 4) \n\t"
        "jmp %c[foo] \n\t"
        "jmp foo \n\t"
        "jmp 0x12345 \n\t"
        "jmp %c[addr] "
        : // no outputs
        : // "S" (value_for_esi), "D" (value_for_edi)
          [foo] "i" (foo),
          [addr] "i" (my_addr)
        : "memory" // prevents stores from sinking past this
    );
    // make sure gcc doesn't need to call any destructors here
    // or in our caller
    // because jumping away will mean they don't run
}
Notice that an "i" (foo) constraint and a %c[operand] (or %c0) will produce a jmp foo in the asm output, so you can emit a direct jmp by pretending you're using a function pointer.
This also works for absolute addresses. x86 machine code can't encode a direct jump to an absolute address (jmp rel32 is relative), but GAS asm syntax will let you write a jump target as an absolute numeric address. The linker will fill in the right rel32 offset to reach that absolute address from wherever the jmp ends up.
So your inline asm template just needs to produce jmp 0x12345 as input to the assembler to get a direct jump.
asm output for testasm:
movl %edi, 0x57F0F0( ,%esi, 4)
jmp foo #
jmp foo
jmp 0x12345
jmp 284280 # constant substituted by the compiler from a static const unsigned C variable
ret
disassembly output:
mov %edi,0x57f0f0(,%esi,4)
jmp 80483f0 <foo>
jmp 80483f0 <foo>
jmp 12345 <_init-0x8035f57>
jmp 45678 <_init-0x8002c24>
ret
Note that the jump targets decoded to absolute addresses in hex. (Godbolt doesn't give easy access to copy/paste the raw machine code, but you can see it on mouseover of the left column.)
This only works in position-dependent code (not PIC), otherwise absolute relocations aren't possible. Note that many recent Linux distros ship gcc set to use -pie by default to enable ASLR for 64-bit executables, so you may need -no-pie -fno-pie to make this work, or else ask for the address in a register (r constraint and jmp *%[addr]) to actually do an indirect jump to an absolute address instead of a relative jump.
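A sketch of that indirect-jump fallback (my own illustration of the "r" constraint approach mentioned above; 0x12345 is just the placeholder address from the example):
// indirect jump through a register: works even in PIE/PIC builds, where the
// linker can't resolve a rel32 to an absolute numeric address.
void jump_to_abs(void)
{
    void *target = (void *)0x12345;   // placeholder absolute address
    asm volatile("jmp *%[addr]"
                 :
                 : [addr] "r"(target)
                 : "memory");
    __builtin_unreachable();          // we never come back here
}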

Related

which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension

I am modelling a custom MOV instruction for the x86 architecture in the gem5 simulator. To test its implementation on the simulator, I need to compile my C code with inline assembly to create a binary. But since it is a custom instruction that has not been implemented in the GCC compiler, the compiler will throw an error. I know one way is to extend GCC to accept my custom x86 instruction, but I do not want to do that yet as it is more time consuming (I will do it afterwards).
As a temporary hack (just to check whether my implementation is worth it or not), I want to take an existing MOV instruction and change its underlying "micro ops" in the simulator, so as to trick GCC into accepting my "custom" instruction and compiling it.
There are many types of MOV instruction available in the x86 architecture (see the various MOV instructions in the x86 architecture reference).
So, coming to my question: which MOV instruction is the least used, so that I can edit its underlying micro-ops? Assume my workload just includes integers, i.e. it most probably won't be using the xmm and mmx registers, and my instruction mirrors the implementation of an existing MOV instruction.
Your best bet is regular mov with a prefix that GCC will never emit on its own. i.e. create a new mov encoding that includes a mandatory prefix in front of any other mov. Like how lzcnt is rep bsr.
Or if you're modifying GCC and as, you can add a new mnemonic that just uses otherwise-invalid (in 64-bit mode) single byte opcodes for memory-source, memory-dest, and immediate-source versions of mov. AMD64 freed up several opcodes, including the BCD instructions like AAM, and push/pop most segment registers. (x86-64 can still mov to/from Sregs, but there's just 1 opcode per direction, not 2 per Sreg for push ds/pop ds etc.)
Assuming my workload just includes integers, i.e. most probably won't be using the xmm and mmx registers
Bad assumption for XMM: GCC aggressively uses 16-byte movaps / movups instead of copying structs 4 or 8 bytes at a time. It's not at all rare to find vector mov instructions in scalar integer code as part of inline expansion of small known-length memcpy or struct / array init. Also, those mov instructions have at least 2-byte opcodes (SSE1 movaps is 0F 28), so a prefix in front of plain mov is the same size as your idea would have been.
However, you're right about MMX regs. I don't think modern GCC will ever emit movq mm0, mm1 or use MMX at all, unless you use MMX intrinsics. Definitely not when targeting 64-bit code.
Also mov to/from control regs (0f 21/23 /r) or debug registers (0f 20/22 /r) are both the mov mnemonic, but gcc will definitely never emit either on its own. Only available with GP register operands as the operand that isn't the debug or control register. So that's technically the answer to your title question, but probably not what you actually want.
GCC doesn't parse its inline asm template string, it just includes it in its asm text output to feed to the assembler after substituting for %number operands. So GCC itself is not an obstacle to emitting arbitrary asm text using inline asm.
And you can use .byte to emit arbitrary machine code.
Perhaps a good option would be to use a 0E byte as a prefix for your special mov encoding that you're going to make GEM decode specially. 0E is push CS in 32-bit mode, invalid in 64-bit mode. GCC will never emit either.
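As a minimal sketch of that (my own illustration): hand-emit the 0E byte with .byte in front of an ordinary mov, and let the assembler encode the mov's operands.
// 0E is push %cs in 32-bit mode and #UD in 64-bit mode, so only run this
// under the modified simulator.
static inline void custom_mov_store(int *dst, int src)
{
    asm(".byte 0x0e \n\t"
        "movl %1, %0"
        : "=m"(*dst)
        : "r"(src));
}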
Or just an F2 repne prefix; GCC will never emit repne in front of a mov opcode (where it doesn't apply), only in front of string instructions like movs. (F3 rep / repe means xrelease when used on a memory-destination instruction, so don't use that. https://www.felixcloutier.com/x86/xacquire:xrelease says that F2 repne is the xacquire prefix when used with locked instructions, which doesn't include mov to memory, so it will be silently ignored there.)
As usual, prefixes that don't apply have no documented behaviour, but in practice CPUs that don't understand a rep / repne ignore it. Some future CPU might understand it to mean something special, and that's exactly what you're doing with GEM.
Picking .byte 0x0e; instead of repne; might be a better choice if you want to guard against accidentally leaving these prefixes in a build you run on a real CPU. (It will #UD -> SIGILL in 64-bit mode, or usually crash from messing up the stack in 32-bit mode.) But if you do want to be able to run the exact same binary on a real CPU, with the same code alignment and everything, then an ignored REP prefix is ideal.
Using a prefix in front of a standard mov instruction has the advantage of letting the assembler encode the operands for you:
template<class T>
void fancymov(T& dst, T src) {
    // fixme: imm -> mem needs a size suffix, defeating template
    // unless you use Intel-syntax where the operand includes "dword ptr"
    asm("repne; movl %1, %0"
#if 1
        : "=m"(dst)
        : "ri" (src)
#else
        : "=g,r"(dst)
        : "ri,rmi" (src)
#endif
        : // no clobbers
    );
}

void test(int *dst, long src) {
    fancymov(*dst, (int)src);
    fancymov(dst[1], 123);
}
(Multi-alternative constraints let the compiler pick either reg/mem destination or reg/mem source. In practice it prefers the register destination even when that will cost it another instruction to do its own store, so that sucks.)
On the Godbolt compiler explorer, for the version that only allows a memory-destination:
test(int*, long):
repne; movl %esi, (%rdi) # F2 89 37
repne; movl $123, 4(%rdi) # F2 C7 47 04 7B 00 00 00
ret
If you wanted this to be usable for loads, I think you'd have to make 2 separate versions of the function and use the load version or store version manually, where appropriate, because GCC seems to want to use reg,reg whenever it can.
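A sketch of what a separate load-only helper could look like (plain C rather than the answer's template, and my own illustration rather than what's on the linked Godbolt):
// load version: memory source, register destination, same repne prefix.
static inline int fancymov_load(const int *src)
{
    int dst;
    asm("repne; movl %1, %0"
        : "=r"(dst)
        : "m"(*src));
    return dst;
}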
Or with the version allowing register outputs (or another version that returns the result as a T, see the Godbolt link):
test2(int*, long):
repne; mov %esi, %esi
repne; mov $123, %eax
movl %esi, (%rdi)
movl %eax, 4(%rdi)
ret

Can I get rid of a sign-extend between CTZ and addition to a pointer?

For code such as this:
#include <stdint.h>
char* ptrAdd(char* ptr, uint32_t x)
{
    return ptr + (uint32_t)__builtin_ctz(x);
}
GCC generates a sign-extension: (godbolt link)
xor eax, eax
rep bsf eax, esi
cdqe ; sign-extend eax into rax
add rax, rdi
ret
This is, of course, completely redundant - this is blatantly sign-extending an unsigned integer. Can I convince GCC not to do this?
The problem has existed since GCC 4.9.0; before that, it used to be an explicit zero-extension, which is also redundant.
A partial solution is to use the 64-bit version of ctz, along with a -march argument so that tzcnt is used instead of bsf, like so:
char* ptrAdd(char* ptr, uint32_t x)
{
    return ptr + __builtin_ctzl(x);
}
This results in no sign extension:
ptrAdd(char*, unsigned int):
mov eax, esi
tzcnt rax, rax
add rax, rdi
ret
It has a mov (to do the 32 to 64-bit zero extension) which replaced a zeroing xor in the 32-bit version (which was there to work around the tzcnt false-dependency-on-destination issue). Those are about the same cost, but the mov is more likely to disappear after inlining. The result of a 64-bit tzcnt is the same as a 32-bit one, except for the case of zero input which is undefined (as far as the gcc intrinsic goes, not tzcnt).
Unfortunately, without a -march argument that lets the compiler use tzcnt it will use bsf and in that case still does the sign extension.
It seems that the origin of the differing behavior between bsf and tzcnt is that when the bsf version is used, the instruction's behavior is undefined for an input of zero. So in principle the instruction could return anything, even values outside the range 0 to 63 that we would normally expect. Combined with the fact that the return value is declared as int, simply omitting the sign extension could lead to "impossible" situations like (__builtin_ctzl (x) & 0xff) == 0xdeadbeef.
Now per the gcc docs, zero input to __builtin_ctzl has an "undefined result" - but it isn't clear if this is the same as C/C++ "undefined behavior" where anything can happen (which would allow impossible things), or just means "some unspecified value".
You can read about this on the gcc bugzilla, where an issue has been open for about 7 years.

gcc, __atomic_exchange seems to produce non-atomic asm, why?

I am working on a nice tool, which requires the atomic swap of two different 64-bit values. On the amd64 architecture it is possible with the XCHGQ instruction (see here in doc, warning: it is a long pdf).
Correspondingly, gcc has some atomic builtins which would ideally do the same, as can be seen, for example, here.
Using these two docs I produced the following simple C function for the atomic swapping of two 64-bit values:
void theExchange(u64* a, u64* b) {
    __atomic_exchange(a, b, b, __ATOMIC_SEQ_CST);
};
(Btw, it wasn't really clear to me why an "atomic exchange" needs 3 operands.)
It seemed a little bit fishy to me that the gcc __atomic_exchange built-in uses 3 operands, so I tested its asm output. I compiled this with gcc -O6 -masm=intel -S and got the following output:
.LHOTB0:
.p2align 4,,15
.globl theExchange
.type theExchange, #function
theExchange:
.LFB16:
.cfi_startproc
mov rax, QWORD PTR [rsi]
xchg rax, QWORD PTR [rdi] /* WTF? */
mov QWORD PTR [rsi], rax
ret
.cfi_endproc
.LFE16:
.size theExchange, .-theExchange
.section .text.unlikely
As we can see, the resulting function contains not just a single data move but three different data movements. Thus, as I understand this asm code, this function won't really be atomic.
How is this possible? Maybe I misunderstood some of the docs? I admit the gcc builtin doc wasn't really clear to me.
This is the generic version of __atomic_exchange_n (type *ptr, type val, int memorder), where only the exchange operation on ptr is atomic; the reading of val is not. In the generic version, val is accessed via pointer, but the atomicity still does not apply to it. The pointer is there so that it will work with multiple sizes, when the compiler has to call an external helper:
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in
function if the specific data type size makes that possible;
otherwise, an external call is left to be resolved at run time. This
external call is the same format with the addition of a ‘size_t’
parameter inserted as the first parameter indicating the size of the
object being pointed to. All objects must be the same size.
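For comparison, a minimal sketch using the _n form (my own example): only the read-modify-write on *a is atomic; the value being written is just an ordinary local, which is exactly what the generated code above reflects.
#include <stdint.h>

// On x86-64 this compiles to a single xchg on the memory operand
// (plus register moves); only *a is touched atomically.
uint64_t swap_in(uint64_t *a, uint64_t newval)
{
    return __atomic_exchange_n(a, newval, __ATOMIC_SEQ_CST);
}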

Using GCC intel inline assembly to insert jmps to absolute 64 bit addresses

I am trying to perform a live patch on some 64-bit Intel code.
Basically, I'd like to replace a few instructions with a larger set of instructions.
The patching will be performed by a program written in C, and I'd like to make use of GCC's inline assembler for that.
What I am struggling with is how to calculate the jmp and call instructions, knowing only the absolute 64-bit addresses of the original code and of my replacement routine.
To give some background, I am trying to perform the patch that's outlined here: Hacking the OS X Kernel for Fun and Profiles.
But instead of modifying the kernel file, I am writing a KEXT that is already able to locate the relevant set of these 6 instructions that need to be changed:
mov %r15,%rdi
xor %esi,%esi
xor %edx,%edx
xor %ecx,%ecx
mov $0x1b,%r8d
callq <threadsignal+224>
I know the address of each of these above instructions.
Now, I would want to write the replacement code like this:
void patchcode ()
{
    current_thread (); // leaves result in rax, apparently
    __asm__ volatile (
        "xor %edi,%edi \n"
        "xor %esi,%esi \n"
        "mov %rax,%rdx \n" // current_thread()'s result
        "mov $0x4,%ecx \n"
        "pushl $0x01234567 \n"
        "pushl $0x89ABCDEF \n"
        "ret \n"
    );
}
The challenges for me are:
Apparently, the compiler puts some code at the start of the patchcode() function that changes the stack. That's not good, because I leave the routine with a jmp, not via the function's usual return path. How do I deal with that? Can I tell the compiler not to generate that lead-in code?
How do I code the two jmps: one jmp from the to-be-patched area to the above patchcode() function, and then the jmp inside patchcode() back to the instruction whose absolute address I only know at runtime? As you can see above, I imagine pushing two 32-bit values onto the stack (I'll patch those values at runtime once I know the actual address I want to jump to) and then jumping there using a ret instruction. Will that work?

How does gcc know the register size to use in inline assembly?

I have the inline assembly code:
#define read_msr(index, buf) asm volatile ("rdmsr" : "=d"(buf[1]), "=a"(buf[0]) : "c"(index))
The code using this macro:
u32 buf[2];
read_msr(0x173, buf);
I found the disassembly is (using gnu toolchain):
mov eax,0x173
mov ecx,eax
rdmsr
mov DWORD PTR [rbp-0xc],edx
mov DWORD PTR [rbp-0x10],eax
The question is: since 0x173 is less than 0xffff, why does gcc not use mov cx, 0x173? Does gcc analyze the following rdmsr instruction? Will gcc always know the correct register size?
It depends on the size of the value or variable passed.
If you pass a "short int" it will set "cx" and read the data from "ax" and "dx" (if buf is a short int, too).
For char it would access "cl" and so on.
So "c" refers to the "ecx" register, but this is accessed with "ecx", "cx", or "cl" depending on the size of the access, which I think makes sense.
To test, you can try passing (unsigned short)0x173; it should change the code.
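A sketch of that test (my own; the 16-bit operand is only there to observe the codegen difference, since, as the answer further down notes, rdmsr really consumes all of ECX, so you wouldn't want this in real code):
#include <stdint.h>

// With a 16-bit input, the "c" constraint operand is 16-bit, so the compiler
// sets up %cx rather than %ecx before the asm. Observation only: rdmsr itself
// still reads the whole ECX register.
static inline void read_msr16(uint16_t index, uint32_t buf[2])
{
    asm volatile("rdmsr" : "=d"(buf[1]), "=a"(buf[0]) : "c"(index));
}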
There is no analysis of the inline assembly (in fact, after text substitution it is copied directly to the output assembly, including any syntax errors). Also, there is no default register size that depends on whether you have a 32- or 64-bit target; that would be way too limiting.
I think the answer is that the current default operand size is 32-bit. In 64-bit long mode, the default operand size is also 32-bit, unless you use a rex.w prefix.
Intel specifies the RDMSR instruction as using (all of) ECX to determine the model specific register. That being the case, and apparently as specified by your macro, GCC has every reason to load your constant into the full ECX.
So the question about why it doesn't load CX seems completely inappropriate. It looks like GCC is generating the right code.
(You didn't ask why it stages the load of ECX inefficiently by using EAX; I don't know the answer to that).

Resources