Using GCC intel inline assembly to insert jmps to absolute 64 bit addresses - macos

I am trying to perform a live patch to some 64 bit intel code.
Basically, I like to replace a few instructions with a larger set of instructions.
The patching will be performed by a program written in C, and I like to make use of GCC's inline assembler for that.
What I am struggling with is how to calculate the jmp and call instructions, knowing only the absolute 64bit addresses of the original code and my replacement routine.
To give some background, I am trying to perform the patch that's outlined here: Hacking the OS X Kernel for Fun and Profiles.
But instead of modifying the kernel file, I am writing a KEXT that is already able to locate the relevant set of these 6 instructions that need to be changed:
mov %r15,%rdi
xor %esi,%esi
xor %edx,%edx
xor %ecx,%ecx
mov $0x1b,%r8d
callq <threadsignal+224>
I know the address of each of these above instructions.
Now, I would want to write the replacement code like this:
void patchcode ()
{
current_thread (); // leaves result in rax, apparently
__asm__ volatile (
"xor %edi,%edi \n"
"xor %esi,%esi \n"
"mov %rax,%rdx \n" // current_thread()'s result
"mov $0x4,%ecx \n"
"pushl $0x01234567 \n"
"pushl $0x89ABCDEF \n"
"ret \n"
);
}
The challenges for me are:
Apparently, the compiler puts some code at the start of the patchcode() function that changes the stack. That's not good because I leave the routine with a jmp, not by the function's usual return path. How do I deal with that, can I tell the compiler not to generate that lead-in code?
How do I code the two jmps, one jmp from the to-be-patched area to the above patchcode() function, and then the jmp inside patchcode() back to the instruction of which I only know the absolute address, at runtime? As you can see above, I imagine pushing two 32 bit values onto the stack (I'll patch those values at runtime once I know the actual address where I want to jump to) and then jump to them using a ret instruction. Will that work?

Related

which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension

I am modelling a custom MOV instruction in the X86 architecture in the gem5 simulator, to test its implementation on the simulator, I need to compile my C code using inline assembly to create a binary file. But since it a custom instruction which has not been implemented in the GCC compiler, the compiler will throw out an error. I know one way is to extend the GCC compiler to accept my custom X86 instruction, but I do not want to do it as it is more time consuming(but will do it afterwards).
As a temporary hack (just to check if my implementation is worth it or not). I want to edit an already MOV instruction while changing its underlying "micro ops" in the simulator so as to trick the GCC to accept my "custom" instruction and compile.
As they are many types of MOV instructions which are available in the x86 architecture. As they are various MOV Instructions in the 86 architecture reference.
Therefore coming to my question, which MOV instruction is the least used and that I can edit its underlying micro-ops. Assuming my workload just includes integers i.e. most probably wont be using the xmm and mmx registers and my instructions mirrors the same implementation of a MOV instruction.
Your best bet is regular mov with a prefix that GCC will never emit on its own. i.e. create a new mov encoding that includes a mandatory prefix in front of any other mov. Like how lzcnt is rep bsr.
Or if you're modifying GCC and as, you can add a new mnemonic that just uses otherwise-invalid (in 64-bit mode) single byte opcodes for memory-source, memory-dest, and immediate-source versions of mov. AMD64 freed up several opcodes, including the BCD instructions like AAM, and push/pop most segment registers. (x86-64 can still mov to/from Sregs, but there's just 1 opcode per direction, not 2 per Sreg for push ds/pop ds etc.)
Assuming my workload just includes integers i.e. most probably wont be using the xmm and mmx registers
Bad assumption for XMM: GCC aggressively uses 16-byte movaps / movups instead of copying structs 4 or 8 bytes at a time. It's not at all rare to find vector mov instructions in scalar integer code as part of inline expansion of small known-length memcpy or struct / array init. Also, those mov instructions have at least 2-byte opcodes (SSE1 0F 28 movaps, so a prefix in front of plain mov is the same size as your idea would have been).
However, you're right about MMX regs. I don't think modern GCC will ever emit movq mm0, mm1 or use MMX at all, unless you use MMX intrinsics. Definitely not when targeting 64-bit code.
Also mov to/from control regs (0f 21/23 /r) or debug registers (0f 20/22 /r) are both the mov mnemonic, but gcc will definitely never emit either on its own. Only available with GP register operands as the operand that isn't the debug or control register. So that's technically the answer to your title question, but probably not what you actually want.
GCC doesn't parse its inline asm template string, it just includes it in its asm text output to feed to the assembler after substituting for %number operands. So GCC itself is not an obstacle to emitting arbitrary asm text using inline asm.
And you can use .byte to emit arbitrary machine code.
Perhaps a good option would be to use a 0E byte as a prefix for your special mov encoding that you're going to make GEM decode specially. 0E is push CS in 32-bit mode, invalid in 64-bit mode. GCC will never emit either.
Or just an F2 repne prefix; GCC will never emit repne in front of a mov opcode (where it doesn't apply), only movs. (F3 rep / repe means xrelease when used on a memory-destination instruction so don't use that. https://www.felixcloutier.com/x86/xacquire:xrelease says that F2 repne is the xacquire prefix when used with locked instructions, which doesn't include mov to memory so it will be silently ignored there.)
As usual, prefixes that don't apply have no documented behaviour, but in practice CPUs that don't understand a rep / repne ignore it. Some future CPU might understand it to mean something special, and that's exactly what you're doing with GEM.
Picking .byte 0x0e; instead of repne; might be a better choice if you want to guard against accidentally leaving these prefixes in a build you run on a real CPU. (It will #UD -> SIGILL in 64-bit mode, or usually crash from messing up the stack in 32-bit mode.) But if you do want to be able to run the exact same binary on a real CPU, with the same code alignment and everything, then an ignored REP prefix is ideal.
Using a prefix in front of a standard mov instruction has the advantage of letting the assembler encode the operands for you:
template<class T>
void fancymov(T& dst, T src) {
// fixme: imm -> mem needs a size suffix, defeating template
// unless you use Intel-syntax where the operand includes "dword ptr"
asm("repne; movl %1, %0"
#if 1
: "=m"(dst)
: "ri" (src)
#else
: "=g,r"(dst)
: "ri,rmi" (src)
#endif
: // no clobbers
);
}
void test(int *dst, long src) {
fancymov(*dst, (int)src);
fancymov(dst[1], 123);
}
(Multi-alternative constraints let the compiler pick either reg/mem destination or reg/mem source. In practice it prefers the register destination even when that will cost it another instruction to do its own store, so that sucks.)
On the Godbolt compiler explorer, for the version that only allows a memory-destination:
test(int*, long):
repne; movl %esi, (%rdi) # F2 E9 37
repne; movl $123, 4(%rdi) # F2 C7 47 04 7B 00 00 00
ret
If you wanted this to be usable for loads, I think you'd have to make 2 separate versions of the function and use the load version or store version manually, where appropriate, because GCC seems to want to use reg,reg whenever it can.
Or with the version allowing register outputs (or another version that returns the result as a T, see the Godbolt link):
test2(int*, long):
repne; mov %esi, %esi
repne; mov $123, %eax
movl %esi, (%rdi)
movl %eax, 4(%rdi)
ret

Translating Go assembler to NASM

I came across the following Go code:
type Element [12]uint64
//go:noescape
func CSwap(x, y *Element, choice uint8)
//go:noescape
func Add(z, x, y *Element)
where the CSwap and Add functions are basically coming from an assembly, and look like the following:
TEXT ·CSwap(SB), NOSPLIT, $0-17
MOVQ x+0(FP), REG_P1
MOVQ y+8(FP), REG_P2
MOVB choice+16(FP), AL // AL = 0 or 1
MOVBLZX AL, AX // AX = 0 or 1
NEGQ AX // RAX = 0x00..00 or 0xff..ff
MOVQ (0*8)(REG_P1), BX
MOVQ (0*8)(REG_P2), CX
// Rest removed for brevity
TEXT ·Add(SB), NOSPLIT, $0-24
MOVQ z+0(FP), REG_P3
MOVQ x+8(FP), REG_P1
MOVQ y+16(FP), REG_P2
MOVQ (REG_P1), R8
MOVQ (8)(REG_P1), R9
MOVQ (16)(REG_P1), R10
MOVQ (24)(REG_P1), R11
// Rest removed for brevity
What I try to do is that translate the assembly to a syntax that is more familiar to me (I think mine is more like NASM), while the above syntax is Go assembler. Regarding the Add method I didn't have much problem, and translated it correctly (according to test results). It looks like this in my case:
.text
.global add_asm
add_asm:
push r12
push r13
push r14
push r15
mov r8, [reg_p1]
mov r9, [reg_p1+8]
mov r10, [reg_p1+16]
mov r11, [reg_p1+24]
// Rest removed for brevity
But, I have a problem when translating the CSwap function, I have something like this:
.text
.global cswap_asm
cswap_asm:
push r12
push r13
push r14
mov al, 16
mov rax, al
neg rax
mov rbx, [reg_p1+(0*8)]
mov rcx, [reg_p2+(0*8)]
But this doesn't seem to be quite correct, as I get error when compiling it. Any ideas how to translate the above CSwap assembly part to something like NASM?
EDIT (SOLUTION):
Okay, after the two answers below, and some testing and digging, I found out that the code uses the following three registers for parameter passing:
#define reg_p1 rdi
#define reg_p2 rsi
#define reg_p3 rdx
Accordingly, rdx has the value of the choice parameter. So, all that I had to do was use this:
movzx rax, dl // Get the lower 8 bits of rdx (reg_p3)
neg rax
Using byte [rdx] or byte [reg_3] was giving an error, but using dl seems to work fine for me.
Basic docs about Go's asm: https://golang.org/doc/asm. It's not totally equivalent to NASM or AT&T syntax: FP is a pseudo-register name for whichever register it decides to use as the frame pointer. (Typically RSP or RBP). Go asm also seems to omit function prologue (and probably epilogue) instructions. As #RossRidge comments, it's a bit more like a internal representation like LLVM IR than truly asm.
Go also has its own object-file format, so I'm not sure you can make Go-compatible object files with NASM.
If you want to call this function from something other than Go, you'll also need to port the code to a different calling convention. Go appears to be using a stack-args calling convention even for x86-64, unlike the normal x86-64 System V ABI or the x86-64 Windows calling convention. (Or maybe those mov function args into REG_P1 and so on instructions disappear when Go builds this source for a register-arg calling convention?)
(This is why you could you had to use movzx eax, dl instead of loading from the stack at all.)
BTW, rewriting this code in C instead of NASM would probably make even more sense if you want to use it with C. Small functions are best inlined and optimized away by the compiler.
It would be a good idea to check your translation, or get a starting point, by assembling with the Go assembler and using a disassembler.
objdump -drwC -Mintel or Agner Fog's objconv disassembler would be good, but they don't understand Go's object-file format. If Go has a tool to extract the actual machine code or get it in an ELF object file, do that.
If not, you could use ndisasm -b 64 (which treats input files as flat binaries, disassembling all the bytes as if they were instructions). You can specify an offset/length if you can find out where the function starts. x86 instructions are variable length, and disassembly will likely be "out of sync" at the start of the function. You might want to add a bunch of single-byte NOP instructions (kind of a NOP sled) for the disassembler, so if it decodes some 0x90 bytes as part of an immediate or disp32 for a long instruction that was really not part of the function, it will be in sync. (But the function prologue will still be messed up).
You might add some "signpost" instructions to your Go asm functions to make it easy to find the right place in the mess of crazy asm from disassembling metadata as instructions. e.g. put a pmuludq xmm0, xmm0 in there somewhere, or some other instruction with a unique mnemonic that you can search for which the Go code doesn't include. Or an instruction with an immediate that will stand out, like addq $0x1234567, SP. (An instruction that will crash so you don't forget to take it out again is good here.)
Or you could use gdb's built-in disassembler: add an instruction that will segfault (like a load from a bogus absolute address (movl 0, AX null-pointer deref), or a register holding a non-pointer value e.g. movl (AX), AX). Then you'll have an instruction-pointer value for the instructions in memory, and can disassemble from some point behind that. (Probably the function start will be 16-byte aligned.)
Specific instructions.
MOVBLZX AL, AX reads AL, so that's definitely an 8-bit operand. The size for AX is given by the L part of the mnemonic, meaning long for 32 bit, like in GAS AT&T syntax. (The gas mnemonic for that form of movzx is movzbl %al, %eax). See What does cltq do in assembly? for a table of cdq / cdqe and the AT&T equivalent, and the AT&T / Intel mnemonic for the equivalent MOVSX instruction.
The NASM instruction you want is movzx eax, al. Using rax as the destination would be a waste of a REX prefix. Using ax as the destination would be a mistake: it wouldn't zero-extend into the full register, and would leave whatever high garbage. Go asm syntax for x86 is very confusing when you're not used to it, because AX can mean AX, EAX, or RAX depending on the operand size.
Obviously mov rax, al isn't a possibility: Like most instructions, mov requires both its operands to be the same size. movzx is one of the rare exceptions.
MOVB choice+16(FP), AL is a byte load into AL, not an immediate move. choice+16 is a an offset from FP. This syntax is basically the same as AT&T addressing modes, with FP as a register and choice as an assemble-time constant.
FP is a pseudo-register name. It's pretty clear that it should simply be loading the low byte of the 3rd arg-passing slot, because choice is the name of a function arg. (In Go asm, choice is just syntactic sugar, or a constant defined as zero.)
Before a call instruction, rsp points at the first stack arg, so that + 16 is the 3rd arg. It appears that FP is that base address (and might actually be rsp+8 or something). After a call (which pushes an 8 byte return address), the 3rd stack arg is at rsp + 24. After more pushes, the offset will be even larger, so adjust as necessary to reach the right location.
If you're porting this function to be called with a standard calling convention, the 3 integer args will be passed in registers, with no stack args. Which 3 registers depends on whether you're building for Windows vs. non-Windows. (See Agner Fog's calling conventions doc: http://agner.org/optimize/)
BTW, a byte load into AL and then movzx eax, al is just dumb. Much more efficient on all modern CPUs to do it in one step with
movzx eax, byte [rsp + 24] ; or rbp+32 if you made a stack frame.
I hope the source in the question is from un-optimized Go compiler output? Or the assembler itself makes such optimizations?
I think you can translate these as just
mov rbx, [reg_p1]
mov rcx, [reg_p2]
Unless I'm missing some subtlety, the offsets which are zero can just be ignored. The *8 isn't a size hint since that's already in the instruction.
The rest of your code looks wrong though. The MOVB choice+16(FP), AL in the original is supposed to be fetching the choice argument into AL, but you're setting AL to a constant 16, and the code for loading the other arguments seems to be completely missing, as is the code for all of the arguments in the other function.

The different between with GCC inline assembly and VC

I am migrate VC inline assembly code into GCC inline assembly code.
#ifdef _MSC_VER
// this is raw code.
__asm
{
cmp cx, 0x2e
mov dword ptr ds:[esi*4+0x57F0F0], edi
jmp BW::BWFXN_RefundMin4ReturnAddress
}
#else
// this is my code.
asm
(
"cmp $0x2e, %%cx\n\t"
"movl %%edi, $ds:0x57F0F0(0, %%esi, 4)\n\t"
"jmp %0"
: /* no output */
: "i"(BW::BWFXN_RefundGas3ReturnAddress)
: "cx"
);
#endif
But I got error
Error:junk `:0x57F0F0(0,%esi,4)' after expression
/var/.../cckr7pkp.s:3034: Error:operand type mismatch for `mov'
/var/.../cckr7pkp.s:3035: Error:operand type mismatch for `jmp'
Refer the Address operand syntax
segment:displacement(base register, offset register, scalar multiplier)
is equivalent to
segment:[base register + displacement + offset register * scalar multiplier]
in Intel syntax.
I don't know where is the issue.
This is highly unlikely to work just from getting the syntax correct, because you're depending on values in registers being set to something before the asm statement, and you aren't using any input operands to make that happen. (And for some reason you need to set flags with cmp before jumping?)
If that fragment worked on its own somehow in MSVC, then your code depends on the choices made by MSVC's optimizer (as far as which C value is in which register), which seems insane.
Anyway, the first answer to any inline asm question is https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Now might be a good time to rewrite your thing in C (maybe with some __builtin functions if needed).
You should use asm volatile and a "memory" clobber at the very least. The compiler assumes that execution continues after the asm statement, but at least this will make sure it stores everything to memory before the asm, i.e. it's a full memory barrier (against compile-time reordering). But any destructors at the end of the function (or in the callers) won't run, and no stack cleanup will happen; there's really no way to make this safe.
You might be able to use asm goto, but that might only work for labels within the same function.
As far as syntax goes, leave out %%ds: because it's the default segment anyway. (Everything after $ds was considered junk because $ds is the address of the symbol ds. Register names start with %.) Also, just leave out the base entirely, instead of using zero. Use
"movl %%edi, 0x57F0F0( ,%%esi, 4) \n\t"
You could have got a disassembler to tell you how to write that, by assembling the Intel version and disassembling in AT&T syntax.
You can probably implement that store in pure C easily enough, e.g. int32_t *p = (int32_t *)0x57F0F0; p[foo]=bar;.
For the jmp operand, use %c0 to get the address with no $ so the compiler's asm output is jmp 0x12345 instead of jmp $0x12345. See also https://stackoverflow.com/tags/inline-assembly/info for more links to guides + docs.
You can and should look at the gcc -O2 -S output to see what the compiler is feeding to the assembler. i.e. how exactly it's filling in the asm template.
I tested this on Godbolt to make sure it compiles, and see the asm output + disassembly output
void ext(void);
long foo(int a, int b) { return 0; }
static const unsigned my_addr = 0x000045678;
//__attribute__((noinline))
void testasm(void)
{
asm volatile( // and still not safe in general
"movl %%edi, 0x57F0F0( ,%%esi, 4) \n\t"
"jmp %c[foo] \n\t"
"jmp foo \n\t"
"jmp 0x12345 \n\t"
"jmp %c[addr] "
: // no outputs
: // "S" (value_for_esi), "D" (value_for_edi)
[foo] "i" (foo),
[addr] "i" (my_addr)
: "memory" // prevents stores from sinking past this
);
// make sure gcc doesn't need to call any destructors here
// or in our caller
// because jumping away will mean they don't run
}
Notice that an "i" (foo) constraint and a %c[operand] (or %c0) will produce a jmp foo in the asm output, so you can emit a direct jmp by pretending you're using a function pointer.
This also works for absolute addresses. x86 machine code can't encode a direct jump, but GAS asm syntax will let you write a jump target as an absolute numeric address. The linker will fill in the right rel32 offset to reach the absolute address from wherever the jmp ends up.
So your inline asm template just needs to produce jmp 0x12345 as input to the assembler to get a direct jump.
asm output for testasm:
movl %edi, 0x57F0F0( ,%esi, 4)
jmp foo #
jmp foo
jmp 0x12345
jmp 284280 # constant substituted by the compiler from a static const unsigned C variable
ret
disassembly output:
mov %edi,0x57f0f0(,%esi,4)
jmp 80483f0 <foo>
jmp 80483f0 <foo>
jmp 12345 <_init-0x8035f57>
jmp 45678 <_init-0x8002c24>
ret
Note that the jump targets decoded to absolute addresses in hex. (Godbolt doesn't give easy access to copy/paste the raw machine code, but you can see it on mouseover of the left column.)
This only works in position-dependent code (not PIC), otherwise absolute relocations aren't possible. Note that many recent Linux distros ship gcc set to use -pie by default to enable ASLR for 64-bit executables, so you may need -no-pie -fno-pie to make this work, or else ask for the address in a register (r constraint and jmp *%[addr]) to actually do an indirect jump to an absolute address instead of a relative jump.

cedecl calling convention -- compiled asm instructions cause crash

Treat this more as pseudocode than anything. If there's some macro or other element that you feel should be included, let me know.
I'm rather new to assembly. I programmed on a pic processor back in college, but nothing since.
The problem here (segmentation fault) is the first instruction after "Compile function entrance, setup stack frame." or "push %ebp". Here's what I found out about those two instructions:
http://unixwiz.net/techtips/win32-callconv-asm.html
Save and update the %ebp :
Now that we're in the new function, we need a new local stack frame pointed to by %ebp, so this is done by saving the current %ebp (which belongs to the previous function's frame) and making it point to the top of the stack.
push ebp
mov ebp, esp // ebp « esp
Once %ebp has been changed, it can now refer directly to the function's arguments as 8(%ebp), 12(%ebp). Note that 0(%ebp) is the old base pointer and 4(%ebp) is the old instruction pointer.
Here's the code. This is from a JIT compiler for a project I'm working on. I'm doing this more for the learning experience than anything.
IL_CORE_COMPILE(avs_x86_compiler_compile)
{
X86GlobalData *gd = X86_GLOBALDATA(ctx);
ILInstruction *insn;
avs_debug(print("X86: Compiling started..."));
/* Initialize X86 Assembler opcode context */
x86_context_init(&gd->ctx, 4096, 1024*1024);
/* Compile function entrance, setup stack frame*/
x86_emit1(&gd->ctx, pushl, ebp);
x86_emit2(&gd->ctx, movl, esp, ebp);
/* Setup floating point rounding mode to integer truncation */
x86_emit2(&gd->ctx, subl, imm(8), esp);
x86_emit1(&gd->ctx, fstcw, disp(0, esp));
x86_emit2(&gd->ctx, movl, disp(0, esp), eax);
x86_emit2(&gd->ctx, orl, imm(0xc00), eax);
x86_emit2(&gd->ctx, movl, eax, disp(4, esp));
x86_emit1(&gd->ctx, fldcw, disp(4, esp));
for (insn=avs_il_tree_base(tree); insn != NULL; insn = insn->next) {
avs_debug(print("X86: Compiling instruction: %p", insn));
compile_opcode(gd, obj, insn);
}
/* Restore floating point rounding mode */
x86_emit1(&gd->ctx, fldcw, disp(0, esp));
x86_emit2(&gd->ctx, addl, imm(8), esp);
/* Cleanup stack frame */
x86_emit0(&gd->ctx, emms);
x86_emit0(&gd->ctx, leave);
x86_emit0(&gd->ctx, ret);
/* Link machine */
obj->run = (AvsRunnableExecuteCall) gd->ctx.buf;
return 0;
}
And when obj->run is called, it's called with obj as its only argument:
obj->run(obj);
If it helps, here are the instructions for the entire function call. It's basically an assignment operation: foo=3*0.2;. foo is pointing to a float in C.
0x8067990: push %ebp
0x8067991: mov %esp,%ebp
0x8067993: sub $0x8,%esp
0x8067999: fnstcw (%esp)
0x806799c: mov (%esp),%eax
0x806799f: or $0xc00,%eax
0x80679a4: mov %eax,0x4(%esp)
0x80679a8: fldcw 0x4(%esp)
0x80679ac: flds 0x806793c
0x80679b2: fsts 0x805f014
0x80679b8: fstps 0x8067954
0x80679be: fldcw (%esp)
0x80679c1: add $0x8,%esp
0x80679c7: emms
0x80679c9: leave
0x80679ca: ret
Edit: Like I said above, in the first instruction in this function, %ebp is void. This is also the instruction that causes the segmentation fault. Is that because it's void, or am I looking for something else?
Edit: Scratch that. I keep typing edp instead of ebp. Here are the values of ebp and esp.
(gdb) print $esp
$1 = (void *) 0xbffff14c
(gdb) print $ebp
$3 = (void *) 0xbffff168
Edit: Those values above are wrong. I should have used the 'x' command, like below:
(gdb) x/x $ebp
0xbffff168: 0xbffff188
(gdb) x/x $esp
0xbffff14c: 0x0804e481
Here's a reply from someone on a mailing list regarding this. Anyone care to illuminate what he means a bit? How do I check to see how the stack is set up?
An immediate problem I see is that the
stack pointer is not properly aligned.
This is 32-bit code, and the Intel
manual says that the stack should be
aligned at 32-bit addresses. That is,
the least significant digit in esp
should be 0, 4, 8, or c.
I also note that the values in ebp and
esp are very far apart. Typically,
they contain similar values --
addresses somewhere in the stack.
I would look at how the stack was set
up in this program.
He replied with corrections to the above comments. He was unable to see any problems after further input.
Another edit: Someone replied that the code page may not be marked executable. How can I insure it is marked as such?
The problem had nothing to do with the code. Adding -z execstack to the linker fixed the problem.
If push %ebp is causing a segfault, then your stack pointer isn't pointing at valid stack. How does control reach that point? What platform are you on, and is there anything odd about the runtime environment? At the entry to the function, %esp should point to the return address in the caller on the stack. Does it?
Aside from that, the whole function is pretty weird. You go out of your way to set the rounding bits in the fp control word, and then don't perform any operations that are affected by rounding. All the function does is copy some data, but uses floating-point registers to do it when you could use the integer registers just as well. And then there's the spurious emms, which you need after using MMX instructions, not after doing x87 computations.
Edit See Scott's (the original questioner) answer for the actual reason for the crash.

what is wrong with my version of _bittestandset

I am new to assembly language. It seems that gcc doesn't have _bittestandset function in intrin.h like MSVC does, so I implemented a new one. This one works fine in linux, but it goes wrong with mingw in winVista machine, the code is:
inline unsigned char _bittestandset(unsigned long * a, unsigned long b)
{
__asm__ ( "bts %1, %[b]"
: "=g"(*a)
: [b]"Ir"(b), "g"(*a) );
return 0;
}
Could you give some further explanation what's not working? Maybe a simple example of the usage or so. It's hard to guess what's wrong with the code..
One thing that looks cheesy so far: You execute a bit-test opcode but ignore the result. The bit that you test (and set) ends up in the carry flag after the opcode.
If you want to get the result you need an additional instruction to get the carry flag into some other output register. (SBB EAX, EAX or something like that).
Otherwise - if you don't need the result - it's much cheaper to replace the BTS instruction with three simpler assembler opcodes:
Something along these lines:
; bit-index in cl
; Value in ebx
; eax used as a temporary (trashed)
mov eax, 1
shl eax, cl
or ebx, eax
Since I don't have mingw with your exact version running a assembly-output of a simple testcase could give us some clue what's going wrong.
DON'T use the BTS instruction. It's horribly slow. A better way would be to implement it using simple shift+or. And as a bonus, you can do it without any assembly, in pure, portable C++.

Resources