unknown "ptr" variable in golang asm code - go

Recently I just begin to read the source code of atomic.LoadUint64, then I got an unknown variable "ptr" in the following asm code:
TEXT runtime∕internal∕atomic·Load64(SB), NOSPLIT, $0-12
MOVL ptr+0(FP), AX
TESTL $7, AX
JZ 2(PC)
MOVL 0, AX // crash with nil ptr deref
MOVQ (AX), M0
MOVQ M0, ret+4(FP)
EMMS
RET
I couldn't find the declaration of this variable, and any documents about this variable, can anyone tell me about it?

A Quick Guide to Go's Assembler
Symbols
The FP pseudo-register is a virtual frame pointer used to refer to
function arguments. The compilers maintain a virtual frame pointer and
refer to the arguments on the stack as offsets from that
pseudo-register. Thus 0(FP) is the first argument to the function,
8(FP) is the second (on a 64-bit machine), and so on. However, when
referring to a function argument this way, it is necessary to place a
name at the beginning, as in first_arg+0(FP) and second_arg+8(FP).
(The meaning of the offset—offset from the frame pointer—distinct from
its use with SB, where it is an offset from the symbol.) The
assembler enforces this convention, rejecting plain 0(FP) and
8(FP). The actual name is semantically irrelevant but should be used
to document the argument's name. It is worth stressing that FP is
always a pseudo-register, not a hardware register, even on
architectures with a hardware frame pointer.
ptr, in ptr+0(FP), is a name for the first argument to the function. The argument is probably a pointer.

Related

Position of GCC stack canaries

Unless I am misunderstanding something, it seems the position of the canary value can be before or after ebp, therefore in the second case the attacker can overwrite the frame pointer without touching the canary.
For example in this snippet, the canary is located at a lower address (ebp-0xc) than ebp therefore protecting it (an attacker must overwrite the canary to overwrite ebp):
0x080484e0 <+52>: mov eax,DWORD PTR [ebp-0xc]
0x080484e3 <+55>: xor eax,DWORD PTR gs:0x14
0x080484ea <+62>: je 0x80484f1 <func+69>
0x080484ec <+64>: call 0x8048360 <__stack_chk_fail#plt>
However looking at other code the canary is after rbp+8:
How should I interpret this? Does this depend on GCC version or something else?
The canary is always below the frame pointer, with every version of gcc I've tried. You can see that confirmed in the gdb disassembly immediately below the IDA disassembly in the blog post you linked, which has mov rax, QWORD PTR [rbp-0x8].
I think this is just an artifact of IDA's disassembler. Instead of displaying the numerical offset for rbp-relative addresses, it assigns a name to each stack slot, and displays the name instead; basically assuming that every rbp-relative access is to a local variable or argument. And it looks like it always displays that name with a + regardless of whether the offset is positive or negative. Note that buf and fd also get a + sign even though they are local variables which are clearly below the frame pointer.
In this example, it has named the canary var_8 as if it were a local variable. So I suppose to translate this properly, you have to think of var_8 as having the value -8.

Translating Go assembler to NASM

I came across the following Go code:
type Element [12]uint64
//go:noescape
func CSwap(x, y *Element, choice uint8)
//go:noescape
func Add(z, x, y *Element)
where the CSwap and Add functions are basically coming from an assembly, and look like the following:
TEXT ·CSwap(SB), NOSPLIT, $0-17
MOVQ x+0(FP), REG_P1
MOVQ y+8(FP), REG_P2
MOVB choice+16(FP), AL // AL = 0 or 1
MOVBLZX AL, AX // AX = 0 or 1
NEGQ AX // RAX = 0x00..00 or 0xff..ff
MOVQ (0*8)(REG_P1), BX
MOVQ (0*8)(REG_P2), CX
// Rest removed for brevity
TEXT ·Add(SB), NOSPLIT, $0-24
MOVQ z+0(FP), REG_P3
MOVQ x+8(FP), REG_P1
MOVQ y+16(FP), REG_P2
MOVQ (REG_P1), R8
MOVQ (8)(REG_P1), R9
MOVQ (16)(REG_P1), R10
MOVQ (24)(REG_P1), R11
// Rest removed for brevity
What I try to do is that translate the assembly to a syntax that is more familiar to me (I think mine is more like NASM), while the above syntax is Go assembler. Regarding the Add method I didn't have much problem, and translated it correctly (according to test results). It looks like this in my case:
.text
.global add_asm
add_asm:
push r12
push r13
push r14
push r15
mov r8, [reg_p1]
mov r9, [reg_p1+8]
mov r10, [reg_p1+16]
mov r11, [reg_p1+24]
// Rest removed for brevity
But, I have a problem when translating the CSwap function, I have something like this:
.text
.global cswap_asm
cswap_asm:
push r12
push r13
push r14
mov al, 16
mov rax, al
neg rax
mov rbx, [reg_p1+(0*8)]
mov rcx, [reg_p2+(0*8)]
But this doesn't seem to be quite correct, as I get error when compiling it. Any ideas how to translate the above CSwap assembly part to something like NASM?
EDIT (SOLUTION):
Okay, after the two answers below, and some testing and digging, I found out that the code uses the following three registers for parameter passing:
#define reg_p1 rdi
#define reg_p2 rsi
#define reg_p3 rdx
Accordingly, rdx has the value of the choice parameter. So, all that I had to do was use this:
movzx rax, dl // Get the lower 8 bits of rdx (reg_p3)
neg rax
Using byte [rdx] or byte [reg_3] was giving an error, but using dl seems to work fine for me.
Basic docs about Go's asm: https://golang.org/doc/asm. It's not totally equivalent to NASM or AT&T syntax: FP is a pseudo-register name for whichever register it decides to use as the frame pointer. (Typically RSP or RBP). Go asm also seems to omit function prologue (and probably epilogue) instructions. As #RossRidge comments, it's a bit more like a internal representation like LLVM IR than truly asm.
Go also has its own object-file format, so I'm not sure you can make Go-compatible object files with NASM.
If you want to call this function from something other than Go, you'll also need to port the code to a different calling convention. Go appears to be using a stack-args calling convention even for x86-64, unlike the normal x86-64 System V ABI or the x86-64 Windows calling convention. (Or maybe those mov function args into REG_P1 and so on instructions disappear when Go builds this source for a register-arg calling convention?)
(This is why you could you had to use movzx eax, dl instead of loading from the stack at all.)
BTW, rewriting this code in C instead of NASM would probably make even more sense if you want to use it with C. Small functions are best inlined and optimized away by the compiler.
It would be a good idea to check your translation, or get a starting point, by assembling with the Go assembler and using a disassembler.
objdump -drwC -Mintel or Agner Fog's objconv disassembler would be good, but they don't understand Go's object-file format. If Go has a tool to extract the actual machine code or get it in an ELF object file, do that.
If not, you could use ndisasm -b 64 (which treats input files as flat binaries, disassembling all the bytes as if they were instructions). You can specify an offset/length if you can find out where the function starts. x86 instructions are variable length, and disassembly will likely be "out of sync" at the start of the function. You might want to add a bunch of single-byte NOP instructions (kind of a NOP sled) for the disassembler, so if it decodes some 0x90 bytes as part of an immediate or disp32 for a long instruction that was really not part of the function, it will be in sync. (But the function prologue will still be messed up).
You might add some "signpost" instructions to your Go asm functions to make it easy to find the right place in the mess of crazy asm from disassembling metadata as instructions. e.g. put a pmuludq xmm0, xmm0 in there somewhere, or some other instruction with a unique mnemonic that you can search for which the Go code doesn't include. Or an instruction with an immediate that will stand out, like addq $0x1234567, SP. (An instruction that will crash so you don't forget to take it out again is good here.)
Or you could use gdb's built-in disassembler: add an instruction that will segfault (like a load from a bogus absolute address (movl 0, AX null-pointer deref), or a register holding a non-pointer value e.g. movl (AX), AX). Then you'll have an instruction-pointer value for the instructions in memory, and can disassemble from some point behind that. (Probably the function start will be 16-byte aligned.)
Specific instructions.
MOVBLZX AL, AX reads AL, so that's definitely an 8-bit operand. The size for AX is given by the L part of the mnemonic, meaning long for 32 bit, like in GAS AT&T syntax. (The gas mnemonic for that form of movzx is movzbl %al, %eax). See What does cltq do in assembly? for a table of cdq / cdqe and the AT&T equivalent, and the AT&T / Intel mnemonic for the equivalent MOVSX instruction.
The NASM instruction you want is movzx eax, al. Using rax as the destination would be a waste of a REX prefix. Using ax as the destination would be a mistake: it wouldn't zero-extend into the full register, and would leave whatever high garbage. Go asm syntax for x86 is very confusing when you're not used to it, because AX can mean AX, EAX, or RAX depending on the operand size.
Obviously mov rax, al isn't a possibility: Like most instructions, mov requires both its operands to be the same size. movzx is one of the rare exceptions.
MOVB choice+16(FP), AL is a byte load into AL, not an immediate move. choice+16 is a an offset from FP. This syntax is basically the same as AT&T addressing modes, with FP as a register and choice as an assemble-time constant.
FP is a pseudo-register name. It's pretty clear that it should simply be loading the low byte of the 3rd arg-passing slot, because choice is the name of a function arg. (In Go asm, choice is just syntactic sugar, or a constant defined as zero.)
Before a call instruction, rsp points at the first stack arg, so that + 16 is the 3rd arg. It appears that FP is that base address (and might actually be rsp+8 or something). After a call (which pushes an 8 byte return address), the 3rd stack arg is at rsp + 24. After more pushes, the offset will be even larger, so adjust as necessary to reach the right location.
If you're porting this function to be called with a standard calling convention, the 3 integer args will be passed in registers, with no stack args. Which 3 registers depends on whether you're building for Windows vs. non-Windows. (See Agner Fog's calling conventions doc: http://agner.org/optimize/)
BTW, a byte load into AL and then movzx eax, al is just dumb. Much more efficient on all modern CPUs to do it in one step with
movzx eax, byte [rsp + 24] ; or rbp+32 if you made a stack frame.
I hope the source in the question is from un-optimized Go compiler output? Or the assembler itself makes such optimizations?
I think you can translate these as just
mov rbx, [reg_p1]
mov rcx, [reg_p2]
Unless I'm missing some subtlety, the offsets which are zero can just be ignored. The *8 isn't a size hint since that's already in the instruction.
The rest of your code looks wrong though. The MOVB choice+16(FP), AL in the original is supposed to be fetching the choice argument into AL, but you're setting AL to a constant 16, and the code for loading the other arguments seems to be completely missing, as is the code for all of the arguments in the other function.

The different between with GCC inline assembly and VC

I am migrate VC inline assembly code into GCC inline assembly code.
#ifdef _MSC_VER
// this is raw code.
__asm
{
cmp cx, 0x2e
mov dword ptr ds:[esi*4+0x57F0F0], edi
jmp BW::BWFXN_RefundMin4ReturnAddress
}
#else
// this is my code.
asm
(
"cmp $0x2e, %%cx\n\t"
"movl %%edi, $ds:0x57F0F0(0, %%esi, 4)\n\t"
"jmp %0"
: /* no output */
: "i"(BW::BWFXN_RefundGas3ReturnAddress)
: "cx"
);
#endif
But I got error
Error:junk `:0x57F0F0(0,%esi,4)' after expression
/var/.../cckr7pkp.s:3034: Error:operand type mismatch for `mov'
/var/.../cckr7pkp.s:3035: Error:operand type mismatch for `jmp'
Refer the Address operand syntax
segment:displacement(base register, offset register, scalar multiplier)
is equivalent to
segment:[base register + displacement + offset register * scalar multiplier]
in Intel syntax.
I don't know where is the issue.
This is highly unlikely to work just from getting the syntax correct, because you're depending on values in registers being set to something before the asm statement, and you aren't using any input operands to make that happen. (And for some reason you need to set flags with cmp before jumping?)
If that fragment worked on its own somehow in MSVC, then your code depends on the choices made by MSVC's optimizer (as far as which C value is in which register), which seems insane.
Anyway, the first answer to any inline asm question is https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Now might be a good time to rewrite your thing in C (maybe with some __builtin functions if needed).
You should use asm volatile and a "memory" clobber at the very least. The compiler assumes that execution continues after the asm statement, but at least this will make sure it stores everything to memory before the asm, i.e. it's a full memory barrier (against compile-time reordering). But any destructors at the end of the function (or in the callers) won't run, and no stack cleanup will happen; there's really no way to make this safe.
You might be able to use asm goto, but that might only work for labels within the same function.
As far as syntax goes, leave out %%ds: because it's the default segment anyway. (Everything after $ds was considered junk because $ds is the address of the symbol ds. Register names start with %.) Also, just leave out the base entirely, instead of using zero. Use
"movl %%edi, 0x57F0F0( ,%%esi, 4) \n\t"
You could have got a disassembler to tell you how to write that, by assembling the Intel version and disassembling in AT&T syntax.
You can probably implement that store in pure C easily enough, e.g. int32_t *p = (int32_t *)0x57F0F0; p[foo]=bar;.
For the jmp operand, use %c0 to get the address with no $ so the compiler's asm output is jmp 0x12345 instead of jmp $0x12345. See also https://stackoverflow.com/tags/inline-assembly/info for more links to guides + docs.
You can and should look at the gcc -O2 -S output to see what the compiler is feeding to the assembler. i.e. how exactly it's filling in the asm template.
I tested this on Godbolt to make sure it compiles, and see the asm output + disassembly output
void ext(void);
long foo(int a, int b) { return 0; }
static const unsigned my_addr = 0x000045678;
//__attribute__((noinline))
void testasm(void)
{
asm volatile( // and still not safe in general
"movl %%edi, 0x57F0F0( ,%%esi, 4) \n\t"
"jmp %c[foo] \n\t"
"jmp foo \n\t"
"jmp 0x12345 \n\t"
"jmp %c[addr] "
: // no outputs
: // "S" (value_for_esi), "D" (value_for_edi)
[foo] "i" (foo),
[addr] "i" (my_addr)
: "memory" // prevents stores from sinking past this
);
// make sure gcc doesn't need to call any destructors here
// or in our caller
// because jumping away will mean they don't run
}
Notice that an "i" (foo) constraint and a %c[operand] (or %c0) will produce a jmp foo in the asm output, so you can emit a direct jmp by pretending you're using a function pointer.
This also works for absolute addresses. x86 machine code can't encode a direct jump, but GAS asm syntax will let you write a jump target as an absolute numeric address. The linker will fill in the right rel32 offset to reach the absolute address from wherever the jmp ends up.
So your inline asm template just needs to produce jmp 0x12345 as input to the assembler to get a direct jump.
asm output for testasm:
movl %edi, 0x57F0F0( ,%esi, 4)
jmp foo #
jmp foo
jmp 0x12345
jmp 284280 # constant substituted by the compiler from a static const unsigned C variable
ret
disassembly output:
mov %edi,0x57f0f0(,%esi,4)
jmp 80483f0 <foo>
jmp 80483f0 <foo>
jmp 12345 <_init-0x8035f57>
jmp 45678 <_init-0x8002c24>
ret
Note that the jump targets decoded to absolute addresses in hex. (Godbolt doesn't give easy access to copy/paste the raw machine code, but you can see it on mouseover of the left column.)
This only works in position-dependent code (not PIC), otherwise absolute relocations aren't possible. Note that many recent Linux distros ship gcc set to use -pie by default to enable ASLR for 64-bit executables, so you may need -no-pie -fno-pie to make this work, or else ask for the address in a register (r constraint and jmp *%[addr]) to actually do an indirect jump to an absolute address instead of a relative jump.

gcc, __atomic_exchange seems to produce non-atomic asm, why?

I am working on a nice tool, which requires the atomic swap of two different 64-bit values. On the amd64 architecture it is possible with the XCHGQ instruction (see here in doc, warning: it is a long pdf).
Correspondingly, gcc has some atomic builtins which would ideally do the same, as it is visible for example here.
Using these 2 docs I produced the following simple C function, for the atomic swapping of two, 64-bit values:
void theExchange(u64* a, u64* b) {
__atomic_exchange(a, b, b, __ATOMIC_SEQ_CST);
};
(Btw, it wasn't really clear to me, why needs an "atomic exchange" 3 operands.)
It was to me a little bit fishy, that the gcc __atomic_exchange macro uses 3 operands, so I tested its asm output. I compiled this with a gcc -O6 -masm=intel -S and I've got the following output:
.LHOTB0:
.p2align 4,,15
.globl theExchange
.type theExchange, #function
theExchange:
.LFB16:
.cfi_startproc
mov rax, QWORD PTR [rsi]
xchg rax, QWORD PTR [rdi] /* WTF? */
mov QWORD PTR [rsi], rax
ret
.cfi_endproc
.LFE16:
.size theExchange, .-theExchange
.section .text.unlikely
As we can see, the result function contains not only a single data move, but three different data movements. Thus, as I understood this asm code, this function won't be really atomic.
How is it possible? Maybe I misunderstood some of the docs? I admit, the gcc builtin doc wasn't really clear to me.
This is the generic version of __atomic_exchange_n (type *ptr, type val, int memorder) where only the exchange operation on ptr is atomic, the reading of val is not. In the generic version, val is accessed via pointer, but the atomicity still does not apply to it. The pointer is so that it will work with multiple sizes, when the compiler has to call an external helper:
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in
function if the specific data type size makes that possible;
otherwise, an external call is left to be resolved at run time. This
external call is the same format with the addition of a ‘size_t’
parameter inserted as the first parameter indicating the size of the
object being pointed to. All objects must be the same size.

cedecl calling convention -- compiled asm instructions cause crash

Treat this more as pseudocode than anything. If there's some macro or other element that you feel should be included, let me know.
I'm rather new to assembly. I programmed on a pic processor back in college, but nothing since.
The problem here (segmentation fault) is the first instruction after "Compile function entrance, setup stack frame." or "push %ebp". Here's what I found out about those two instructions:
http://unixwiz.net/techtips/win32-callconv-asm.html
Save and update the %ebp :
Now that we're in the new function, we need a new local stack frame pointed to by %ebp, so this is done by saving the current %ebp (which belongs to the previous function's frame) and making it point to the top of the stack.
push ebp
mov ebp, esp // ebp « esp
Once %ebp has been changed, it can now refer directly to the function's arguments as 8(%ebp), 12(%ebp). Note that 0(%ebp) is the old base pointer and 4(%ebp) is the old instruction pointer.
Here's the code. This is from a JIT compiler for a project I'm working on. I'm doing this more for the learning experience than anything.
IL_CORE_COMPILE(avs_x86_compiler_compile)
{
X86GlobalData *gd = X86_GLOBALDATA(ctx);
ILInstruction *insn;
avs_debug(print("X86: Compiling started..."));
/* Initialize X86 Assembler opcode context */
x86_context_init(&gd->ctx, 4096, 1024*1024);
/* Compile function entrance, setup stack frame*/
x86_emit1(&gd->ctx, pushl, ebp);
x86_emit2(&gd->ctx, movl, esp, ebp);
/* Setup floating point rounding mode to integer truncation */
x86_emit2(&gd->ctx, subl, imm(8), esp);
x86_emit1(&gd->ctx, fstcw, disp(0, esp));
x86_emit2(&gd->ctx, movl, disp(0, esp), eax);
x86_emit2(&gd->ctx, orl, imm(0xc00), eax);
x86_emit2(&gd->ctx, movl, eax, disp(4, esp));
x86_emit1(&gd->ctx, fldcw, disp(4, esp));
for (insn=avs_il_tree_base(tree); insn != NULL; insn = insn->next) {
avs_debug(print("X86: Compiling instruction: %p", insn));
compile_opcode(gd, obj, insn);
}
/* Restore floating point rounding mode */
x86_emit1(&gd->ctx, fldcw, disp(0, esp));
x86_emit2(&gd->ctx, addl, imm(8), esp);
/* Cleanup stack frame */
x86_emit0(&gd->ctx, emms);
x86_emit0(&gd->ctx, leave);
x86_emit0(&gd->ctx, ret);
/* Link machine */
obj->run = (AvsRunnableExecuteCall) gd->ctx.buf;
return 0;
}
And when obj->run is called, it's called with obj as its only argument:
obj->run(obj);
If it helps, here are the instructions for the entire function call. It's basically an assignment operation: foo=3*0.2;. foo is pointing to a float in C.
0x8067990: push %ebp
0x8067991: mov %esp,%ebp
0x8067993: sub $0x8,%esp
0x8067999: fnstcw (%esp)
0x806799c: mov (%esp),%eax
0x806799f: or $0xc00,%eax
0x80679a4: mov %eax,0x4(%esp)
0x80679a8: fldcw 0x4(%esp)
0x80679ac: flds 0x806793c
0x80679b2: fsts 0x805f014
0x80679b8: fstps 0x8067954
0x80679be: fldcw (%esp)
0x80679c1: add $0x8,%esp
0x80679c7: emms
0x80679c9: leave
0x80679ca: ret
Edit: Like I said above, in the first instruction in this function, %ebp is void. This is also the instruction that causes the segmentation fault. Is that because it's void, or am I looking for something else?
Edit: Scratch that. I keep typing edp instead of ebp. Here are the values of ebp and esp.
(gdb) print $esp
$1 = (void *) 0xbffff14c
(gdb) print $ebp
$3 = (void *) 0xbffff168
Edit: Those values above are wrong. I should have used the 'x' command, like below:
(gdb) x/x $ebp
0xbffff168: 0xbffff188
(gdb) x/x $esp
0xbffff14c: 0x0804e481
Here's a reply from someone on a mailing list regarding this. Anyone care to illuminate what he means a bit? How do I check to see how the stack is set up?
An immediate problem I see is that the
stack pointer is not properly aligned.
This is 32-bit code, and the Intel
manual says that the stack should be
aligned at 32-bit addresses. That is,
the least significant digit in esp
should be 0, 4, 8, or c.
I also note that the values in ebp and
esp are very far apart. Typically,
they contain similar values --
addresses somewhere in the stack.
I would look at how the stack was set
up in this program.
He replied with corrections to the above comments. He was unable to see any problems after further input.
Another edit: Someone replied that the code page may not be marked executable. How can I insure it is marked as such?
The problem had nothing to do with the code. Adding -z execstack to the linker fixed the problem.
If push %ebp is causing a segfault, then your stack pointer isn't pointing at valid stack. How does control reach that point? What platform are you on, and is there anything odd about the runtime environment? At the entry to the function, %esp should point to the return address in the caller on the stack. Does it?
Aside from that, the whole function is pretty weird. You go out of your way to set the rounding bits in the fp control word, and then don't perform any operations that are affected by rounding. All the function does is copy some data, but uses floating-point registers to do it when you could use the integer registers just as well. And then there's the spurious emms, which you need after using MMX instructions, not after doing x87 computations.
Edit See Scott's (the original questioner) answer for the actual reason for the crash.

Resources