Call function in dynamic library in assembly? - macos

I'm probably way off, but this is what I did. I'm also trying to get this to work when cross-compiling from Linux to Mac.
I wrote a hello-world kind of program in C using write, malloc and realloc. I noticed that the generated assembly used adrp, but I couldn't figure out how to use that instruction: I kept getting a "label must be GOT relative" error. I was hoping I could use the section as the label, but ended up writing a label, which didn't help.
Essentially, the C stub function for write uses adrp, then ldr x16, [x16, #24]. Since I couldn't figure out adrp, I used mov and movk. That seemed to do the same thing, but I got a segmentation fault when I executed it. Stepping through in lldb, the code did what I expected, but the GOT section wasn't filled in at runtime like I thought it would be. Objdump leads me to believe I named the section right. I don't know whether figuring out adrp is all I need to get this to work, or whether I did everything completely wrong.
.global _main
.align 2
_main:
mov X0, #1
adr X1, hello
mov X2, #13
mov X16, #4
svc 0
mov x16, 16384
movk x16, 0x1, lsl 32
ldr x16, [x16, #24]
#adrp x16, HowGOTLabel
#ldr x16, [x16, #24]
br x16
mov X0, #0
mov X16, #1
svc 0
hello: .ascii "Hello\n"
.section __DATA_CONST,__got
.align 3
HowGOTLabel:
.word 0
.word 0x80100000
.word 1
.word 0x80100000
.word 2
.word 0x80100000
.word 3
.word 0x80000000

Darwin on arm64 forces all userland binaries to use ASLR, so you cannot hard-code absolute addresses with movz/movk; addresses have to be computed PC-relative.
Your adrp doesn't work because adrp can only refer to 0x1000-byte-aligned locations. For more granular targeting you'd use adr, but there you have the issue of being limited to ±1MB around the instruction. For Linux targets, the compiler seems to be more lenient here, but for Darwin targets, adr can only really be used for locations within the same section, and you're trying to refer to __DATA_CONST.__got from __TEXT.__text.
So how can you fix this? You use @PAGE and @PAGEOFF:
adrp x16, HowGOTLabel@PAGE
add x16, x16, HowGOTLabel@PAGEOFF
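To see why a page/page-offset pair survives ASLR where a hard-coded movz/movk address does not, here is a small Python model of the arithmetic. It is only a sketch: the function names, the example addresses and the slide value are made up for illustration, and the slide is assumed to be a multiple of the page size.

```python
PAGE_MASK = 0xFFF  # adrp works on 4 KiB (0x1000-byte) pages

def link_adrp_imm(pc, target):
    """Page delta the linker stores in the adrp instruction (the PAGE part)."""
    return (target >> 12) - (pc >> 12)

def link_pageoff(target):
    """Low 12 bits the linker stores in the add instruction (the PAGEOFF part)."""
    return target & PAGE_MASK

def run_adrp_add(pc, imm, pageoff):
    """What adrp + add compute at run time from the current pc."""
    page_base = (pc & ~PAGE_MASK) + (imm << 12)   # adrp
    return page_base + pageoff                    # add

# Hypothetical link-time addresses: code in __text, data in __got.
pc, target = 0x100003F80, 0x100004018
imm, off = link_adrp_imm(pc, target), link_pageoff(target)

slide = 0x2C000  # hypothetical ASLR slide, page-aligned
print(hex(run_adrp_add(pc + slide, imm, off)))   # 0x100030018, the slid target
print(hex(target))                               # 0x100004018, what a fixed constant would give
```

The two encoded values never change after linking, yet the computed address follows the code wherever it is loaded, because everything is relative to the run-time pc.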
You can even have this be fixed up to adr+nop at link-time if the target is in range, with some asm directives:
Lloh0:
adrp x16, HowGOTLabel@PAGE
Lloh1:
add x16, x16, HowGOTLabel@PAGEOFF
.loh AdrpAdd Lloh0, Lloh1
You can also do this with AdrpLdr if the second instruction is ldr rather than add.
But once you fix that, you've got two other issues in your code:
You use br x16. This means you won't return to the callsite. Use blr for function calls.
You don't actually have any imports? It's not clear how you think this would end up calling library functions, but really you can just do it like this:
bl _printf
And the compiler and linker will take care of imports.

Related

Translating Go assembler to NASM

I came across the following Go code:
type Element [12]uint64
//go:noescape
func CSwap(x, y *Element, choice uint8)
//go:noescape
func Add(z, x, y *Element)
where the CSwap and Add functions are basically coming from an assembly, and look like the following:
TEXT ·CSwap(SB), NOSPLIT, $0-17
MOVQ x+0(FP), REG_P1
MOVQ y+8(FP), REG_P2
MOVB choice+16(FP), AL // AL = 0 or 1
MOVBLZX AL, AX // AX = 0 or 1
NEGQ AX // RAX = 0x00..00 or 0xff..ff
MOVQ (0*8)(REG_P1), BX
MOVQ (0*8)(REG_P2), CX
// Rest removed for brevity
TEXT ·Add(SB), NOSPLIT, $0-24
MOVQ z+0(FP), REG_P3
MOVQ x+8(FP), REG_P1
MOVQ y+16(FP), REG_P2
MOVQ (REG_P1), R8
MOVQ (8)(REG_P1), R9
MOVQ (16)(REG_P1), R10
MOVQ (24)(REG_P1), R11
// Rest removed for brevity
What I'm trying to do is translate the assembly into a syntax that is more familiar to me (I think mine is closer to NASM); the syntax above is Go assembler. I didn't have much trouble with the Add function, and translated it correctly (according to test results). It looks like this in my case:
.text
.global add_asm
add_asm:
push r12
push r13
push r14
push r15
mov r8, [reg_p1]
mov r9, [reg_p1+8]
mov r10, [reg_p1+16]
mov r11, [reg_p1+24]
// Rest removed for brevity
But I have a problem translating the CSwap function. I have something like this:
.text
.global cswap_asm
cswap_asm:
push r12
push r13
push r14
mov al, 16
mov rax, al
neg rax
mov rbx, [reg_p1+(0*8)]
mov rcx, [reg_p2+(0*8)]
But this doesn't seem to be quite correct, as I get an error when compiling it. Any ideas how to translate the CSwap assembly above into something like NASM?
EDIT (SOLUTION):
Okay, after the two answers below, and some testing and digging, I found out that the code uses the following three registers for parameter passing:
#define reg_p1 rdi
#define reg_p2 rsi
#define reg_p3 rdx
Accordingly, rdx has the value of the choice parameter. So, all that I had to do was use this:
movzx rax, dl // Get the lower 8 bits of rdx (reg_p3)
neg rax
Using byte [rdx] or byte [reg_p3] was giving an error, but using dl seems to work fine for me.
Basic docs about Go's asm: https://golang.org/doc/asm. It's not totally equivalent to NASM or AT&T syntax: FP is a pseudo-register name for whichever register it decides to use as the frame pointer. (Typically RSP or RBP). Go asm also seems to omit function prologue (and probably epilogue) instructions. As @RossRidge comments, it's a bit more like an internal representation like LLVM IR than truly asm.
Go also has its own object-file format, so I'm not sure you can make Go-compatible object files with NASM.
If you want to call this function from something other than Go, you'll also need to port the code to a different calling convention. Go appears to be using a stack-args calling convention even for x86-64, unlike the normal x86-64 System V ABI or the x86-64 Windows calling convention. (Or maybe those mov function args into REG_P1 and so on instructions disappear when Go builds this source for a register-arg calling convention?)
(This is why you had to use movzx eax, dl instead of loading from the stack at all.)
BTW, rewriting this code in C instead of NASM would probably make even more sense if you want to use it with C. Small functions are best inlined and optimized away by the compiler.
It would be a good idea to check your translation, or get a starting point, by assembling with the Go assembler and using a disassembler.
objdump -drwC -Mintel or Agner Fog's objconv disassembler would be good, but they don't understand Go's object-file format. If Go has a tool to extract the actual machine code or get it in an ELF object file, do that.
If not, you could use ndisasm -b 64 (which treats input files as flat binaries, disassembling all the bytes as if they were instructions). You can specify an offset/length if you can find out where the function starts. x86 instructions are variable length, and disassembly will likely be "out of sync" at the start of the function. You might want to add a bunch of single-byte NOP instructions (kind of a NOP sled) for the disassembler, so if it decodes some 0x90 bytes as part of an immediate or disp32 for a long instruction that was really not part of the function, it will be in sync. (But the function prologue will still be messed up).
You might add some "signpost" instructions to your Go asm functions to make it easy to find the right place in the mess of crazy asm from disassembling metadata as instructions. e.g. put a pmuludq xmm0, xmm0 in there somewhere, or some other instruction with a unique mnemonic that you can search for which the Go code doesn't include. Or an instruction with an immediate that will stand out, like addq $0x1234567, SP. (An instruction that will crash so you don't forget to take it out again is good here.)
Or you could use gdb's built-in disassembler: add an instruction that will segfault (like a load from a bogus absolute address (movl 0, AX null-pointer deref), or a register holding a non-pointer value e.g. movl (AX), AX). Then you'll have an instruction-pointer value for the instructions in memory, and can disassemble from some point behind that. (Probably the function start will be 16-byte aligned.)
Specific instructions.
MOVBLZX AL, AX reads AL, so that's definitely an 8-bit operand. The size for AX is given by the L part of the mnemonic, meaning long for 32 bit, like in GAS AT&T syntax. (The gas mnemonic for that form of movzx is movzbl %al, %eax). See What does cltq do in assembly? for a table of cdq / cdqe and the AT&T equivalent, and the AT&T / Intel mnemonic for the equivalent MOVSX instruction.
The NASM instruction you want is movzx eax, al. Using rax as the destination would be a waste of a REX prefix. Using ax as the destination would be a mistake: it wouldn't zero-extend into the full register, and would leave whatever garbage was in the upper bits. Go asm syntax for x86 is very confusing when you're not used to it, because AX can mean AX, EAX, or RAX depending on the operand size.
Obviously mov rax, al isn't a possibility: Like most instructions, mov requires both its operands to be the same size. movzx is one of the rare exceptions.
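The effect of the zero-extend and negate steps can be modeled in a few lines of Python. This is only a sketch: the helper names are made up, and because the rest of CSwap is elided in the question, the XOR-based swap loop below is an assumption about how such a mask is commonly used, not the original code.

```python
MASK64 = (1 << 64) - 1

def movzx_from_low_byte(reg):
    """movzx eax, dl: keep the low 8 bits, zero everything above."""
    return reg & 0xFF

def neg64(value):
    """neg rax on a 64-bit register (two's complement negate)."""
    return (-value) & MASK64

def cswap(x, y, choice_reg):
    """Swap x and y in place if the low byte of choice_reg is 1.

    Assumed idiom: the mask is all-zeros or all-ones, so the same
    operations run either way and no branch depends on choice.
    """
    mask = neg64(movzx_from_low_byte(choice_reg))
    for i in range(len(x)):
        t = mask & (x[i] ^ y[i])
        x[i] ^= t
        y[i] ^= t

# Upper bits of the argument register may hold garbage; only the low byte matters.
x, y = [1, 2, 3], [7, 8, 9]
cswap(x, y, 0xDEADBEEF00000001)
print(x, y)   # [7, 8, 9] [1, 2, 3]
```

Masking to the low byte first is what makes the negate produce exactly 0 or all-ones; negating a register with stray upper bits would give an unusable mask.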
MOVB choice+16(FP), AL is a byte load into AL, not an immediate move. choice+16 is an offset from FP. This syntax is basically the same as AT&T addressing modes, with FP as a register and choice as an assemble-time constant.
FP is a pseudo-register name. It's pretty clear that it should simply be loading the low byte of the 3rd arg-passing slot, because choice is the name of a function arg. (In Go asm, choice is just syntactic sugar, or a constant defined as zero.)
Before a call instruction, rsp points at the first stack arg, so that + 16 is the 3rd arg. It appears that FP is that base address (and might actually be rsp+8 or something). After a call (which pushes an 8 byte return address), the 3rd stack arg is at rsp + 24. After more pushes, the offset will be even larger, so adjust as necessary to reach the right location.
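The offset bookkeeping described here is simple enough to write down as a formula. A minimal Python sketch, assuming 8-byte stack slots and stack args numbered from 0; the helper name stack_arg_offset is invented for this example:

```python
SLOT = 8  # bytes per stack slot on x86-64

def stack_arg_offset(arg_index, pushes_after_call=0):
    """Offset from rsp to a stack arg, as seen inside the callee.

    arg_index is 0-based; the return address pushed by call takes one
    slot, and every push in the callee moves rsp down by one more slot.
    """
    return arg_index * SLOT + SLOT + pushes_after_call * SLOT

print(stack_arg_offset(2))      # 24: 3rd arg right after the call
print(stack_arg_offset(2, 3))   # 48: after three pushes (e.g. r12, r13, r14)
```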
If you're porting this function to be called with a standard calling convention, the 3 integer args will be passed in registers, with no stack args. Which 3 registers depends on whether you're building for Windows vs. non-Windows. (See Agner Fog's calling conventions doc: http://agner.org/optimize/)
BTW, a byte load into AL and then movzx eax, al is just dumb. Much more efficient on all modern CPUs to do it in one step with
movzx eax, byte [rsp + 24] ; or rbp+32 if you made a stack frame.
I hope the source in the question is from un-optimized Go compiler output? Or the assembler itself makes such optimizations?
I think you can translate these as just
mov rbx, [reg_p1]
mov rcx, [reg_p2]
Unless I'm missing some subtlety, the offsets which are zero can just be ignored. The *8 isn't a size hint since that's already in the instruction.
The rest of your code looks wrong though. The MOVB choice+16(FP), AL in the original is supposed to be fetching the choice argument into AL, but you're setting AL to a constant 16, and the code for loading the other arguments seems to be completely missing, as is the code for all of the arguments in the other function.

Stack problems when calling printf from an ARM assembly function

I have an ARM assembly function that is called from a C function.
At some point, I do something like this:
.syntax unified
.arm
.text
.globl myfunc
.extern printf
myfunc:
stmdb sp!, {r4-r11} // save stack from C call
... do stuff ...
// (NOT SHOWN): Load values into r1 and r2 to be printed by format string above
ldr r0, =message // Load format string above
push {lr} // me attempting to preserve my stack
bl printf // actual call to printf
pop {lr} // me attempting to recover my stack
ldmia sp!, {r4-r11} // recover stack from C call
mov r0, r2 // Move return value into r0
mov pc, lr // Return to C
.section data
message:
.asciz "Output: %d, %d\n"
.end
This runs sometimes, crashes sometimes, runs a few times then crashes, etc. It actually runs in a quasi bare-metal context, so I can't run a debugger. I'm 99% sure it's a stack -- or alignment? -- thing, as in these two questions: "Printf Change values in registers, ARM Assembly" and "Call C function from Assembly -- the application freezes at 'call printf' and I have no idea why".
Can anyone provide some specific ideas for how to get the above chunk of code running, and perhaps general ideas for best practices here? Ideally I'd like to be able to call the same output function multiple times in my assembly file, to debug things as I go.
Thanks in advance!
I can see the following issues in that code:
There is no .align 2 (it could be 3 or any higher value) before the function entry point (myfunc:):
.align 2 // guarantee that instruction address is 4B aligned
myfunc:
As was mentioned in the comments, the stack is expected to be 8B aligned. push {lr} breaks that.
message: doesn't need to be in the 'data' section. It could be placed in the code section after 'myfunc'. Check the linker map to confirm that the data is actually present and that the address loaded into r0 is correct.
Since this is bare metal, check that the stack is set up properly and that enough room is reserved for it.
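The alignment point can be checked with a little arithmetic. The Python sketch below models only the stack pointer; the starting sp value is arbitrary and the function name is made up. It assumes sp is 8-byte aligned on entry, which the ARM procedure call standard requires at public interfaces.

```python
WORD = 4  # each pushed ARM register takes 4 bytes

def sp_after_push(sp, n_regs):
    """Full-descending stack: push/stmdb move sp down by 4 per register."""
    return sp - n_regs * WORD

entry_sp = 0x20008000                 # arbitrary 8-byte-aligned entry value
sp = sp_after_push(entry_sp, 8)       # stmdb sp!, {r4-r11}
print(sp % 8)                         # 0: still aligned
sp_bad = sp_after_push(sp, 1)         # push {lr}
print(sp_bad % 8)                     # 4: misaligned at the call to printf

# Pushing an even number of registers in total keeps the alignment,
# e.g. r4-r11 and lr plus one extra scratch register (10 registers).
print(sp_after_push(entry_sp, 10) % 8)   # 0
```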

How to make gcc compiler reserve registers when building intel-style inline assembly code?

I am building some Intel-style inline assembly code using the gcc compiler in Xcode 4.
Part of the inline assembly code is listed below:
_asm
{
mov eax, esp
sub esp, 116
and esp, ~15
mov [esp+112], eax
}
Under ship mode, GCC compiles the above 4 lines asm code to:
mov %esp,%eax
sub $0x74,%esp
and $0xfffffff0,%esp
mov %eax,0x70(%esp)
which are exactly what I want.
However, under debug mode GCC compiles that code to
mov %esp,%eax
mov %eax,%esp
mov %esp,%eax
mov %eax,-0x28(%ebp)
mov %esp,%eax
mov %eax,%esp
sub $0x74,%esp
mov %esp,%eax
mov %eax,-0x24(%ebp)
mov %esp,%eax
mov %eax,%esp
and $0xfffffff0,%esp
mov %esp,%eax //changing the value of “eax”
mov %eax,-0x24(%ebp)
mov %esp,%ecx
mov %ecx,%esp
mov %eax,0x70(%esp) //store a “dirty” value to address 0x70(%esp), which is not what we want
One way to solve the above problem is to rewrite the inline asm code using AT&T-style instructions and add the register to the clobber list. But that would be very time-consuming, since the code to rewrite is so long.
Are there any other efficient ways to solve the problem, i.e. to make the gcc compiler know that register “eax” should be reserved?
There are 2 ways:
The best way to solve it is to use gcc's assembly template capabilities. Then you can tell the compiler WHAT you're doing, and the register allocator will not use your registers for anything else.
A quick hack would be to just use "asm volatile" instead of "asm"; that way gcc will not reschedule any instructions inside that block. You'll still have to tell GCC that you're using the register, so it's not going to store anything in there. You should also list "memory" in the clobber list, so gcc knows that it can't trust values it might have loaded before your code block.
asm volatile(
"Code goes here"
: : : "eax", "esp", "memory"
);
Btw: Your code is doing some "bad things" like moving esp around, which might cause trouble down the line, unless you know exactly what you're doing.
An empty asm block after the intel-style block solves the problem, like this:
__asm volatile {
mov eax, esp
sub esp, 116
and esp, ~15
mov [esp+112], eax
};
__asm__ __volatile__ ("":::"eax", "memory");
However, if you don't restore %esp, it's going to wreak havoc.
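For reference, this is the arithmetic the four instructions in the question perform, modeled in Python. It is an illustrative sketch only (the function name and the sample esp value are made up). It shows where the old esp is meant to be saved, which is why a compiler-inserted overwrite of eax between the statements stores the wrong value there.

```python
MASK32 = 0xFFFFFFFF

def reserve_aligned(esp):
    """Model of: mov eax, esp / sub esp, 116 / and esp, ~15 / mov [esp+112], eax."""
    saved = esp                       # eax = old esp
    esp = (esp - 116) & MASK32        # sub esp, 116
    esp &= ~15 & MASK32               # and esp, ~15 (round down to 16 bytes)
    save_slot = esp + 112             # address that receives eax
    return esp, save_slot, saved

new_esp, slot, old_esp = reserve_aligned(0xBFFFF6AC)
print(hex(new_esp), hex(slot))                   # 0xbffff630 0xbffff6a0
print(new_esp % 16 == 0, slot + 4 <= old_esp)    # True True
```

Restoring the stack pointer afterwards amounts to loading the value at that save slot back into esp.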

ARM/Thumb code for firmware patches...How to tell gcc assembler / linker to BL to absolute addr?

I'm trying to write a firmware mod (to existing firmware, for which i don't have source code)
All Thumb code.
Does anybody have any idea how to do this in the gcc as (GAS) assembler:
Use BL without having to manually calculate offsets, when BL'ing to some existing function (not in my code, but I know its address).
Currently, if I want to use BL, I have to:
-go back in my code
-figure out and add up all the bytes that would result from assembling all the previous instructions in the function I'm writing
-add the beginning address of my function to that (I specify the starting address of what I'm writing in the linker script)
-and then subtract the address of the firmware function I want to call
All this, just to calculate the offset, to be able to write a bl offset that calls an existing firmware function?
And if I change any code before that BL, I have to do it all over again manually!
See, this is why I want to learn to use BX properly instead of BL.
Also, I don't quite understand BX. If I use BX to jump to an absolute address, do I have to increase the actual address by 1 when calling Thumb code from Thumb code (to keep the lsb set to 1), so that the CPU will know it's Thumb code?
BIG EDIT:
Changing the answer based on what I have learned recently and a better understanding of the question
First off, I don't know how to tell the linker to generate a bl to a hardcoded address that is not actually in this code. You might try to rig up an ELF file that has labels and such but dummy or no code; I don't know whether that will fool the linker or not. You would have to modify the linker script as well. Not worth it.
your other question that was spawned from this one:
Arm/Thumb: using BX in Thumb code, to call a Thumb function, or to jump to a Thumb instruction in another function
For branching this works just fine:
LDR R6, =0x24000
ADD R6, #1 @ (set lsb to 1)
BX R6
or save an instruction and just do this
LDR R6, =0x24001
BX R6
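Two pieces of arithmetic are in play here: the Thumb bit that BX looks at, and the PC-relative offset the asker was computing by hand for BL. A small Python sketch of both, with invented helper names and example addresses; it relies on the fact that in Thumb state the PC reads as the instruction address plus 4.

```python
def thumb_bx_target(addr):
    """Value to put in the register for BX: the same address with bit 0 set."""
    return addr | 1

def thumb_bl_offset(bl_addr, target):
    """Byte offset a Thumb BL located at bl_addr must encode to reach target."""
    return (target & ~1) - (bl_addr + 4)

print(hex(thumb_bx_target(0x24000)))       # 0x24001
print(hex(thumb_bx_target(0x24001)))       # 0x24001 (OR is safe to apply twice)
print(thumb_bl_offset(0x30010, 0x24000))   # -49172: the target lies below the BL
```

Using OR rather than ADD for the Thumb bit avoids accidentally clearing it when the address already has bit 0 set.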
if you want to branch link and you know the address and you are in thumb mode and want to get to thumb code then
ldr r6,=0x24001
bl thumb_trampoline
;@returns here
...
.thumb_func
thumb_trampoline:
bx r6
And almost the exact same if you are starting in arm mode, and want to get to thumb code at an address you already know.
ldr r6,=0x24001
bl arm_trampoline
;@returns here
...
arm_trampoline:
bx r6
You have to know that you can trash r6 in this way (make sure r6 isn't holding some value being used by some code that called this code).
Very sorry for misleading you with the other answer; I could swear that mov lr,pc pulled in the lsbit as a mode, but it doesn't.
The accepted answer achieves the desired goal, but to address the question exactly as asked, you can use the .equ directive to associate a constant value with a symbol, which can then be used as an operand to instructions. This has the toolchain synthesise the trampoline if/when necessary:
.equ myFirmwareFunction, 0x12346570
.globl _start
_start:
mov r0, #42
b myFirmwareFunction
Which generates the following assembly[1]
01000000 <_start>:
1000000: e3a0002a mov r0, #42 ; 0x2a
1000004: eaffffff b 1000008 <__*ABS*0x12346570_veneer>
01000008 <__*ABS*0x12346570_veneer>:
__*ABS*0x12346570_veneer():
1000008: e51ff004 ldr pc, [pc, #-4] ; 100000c <__*ABS*0x12346570_veneer+0x4>
100000c: 12346570 data: #0x12346570
If the immediate value is close enough to PC that the offset will fit in the immediate field, then the veneer (trampoline) is skipped and you will get a single branch instruction to the specified constant address.
[1] using the CodeSourcery (2009q1) toolchain with:
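Whether a trampoline is needed comes down to the reach of the ARM-state B instruction: a signed 24-bit word offset relative to the instruction address plus 8. The Python sketch below (helper names are made up) applies that rule to the addresses from the listing and also decodes the eaffffff branch word to confirm where it lands.

```python
def arm_b_offset(instr_addr, target):
    """Byte offset an ARM-state B at instr_addr needs to reach target."""
    return target - (instr_addr + 8)

def fits_arm_b(instr_addr, target):
    """True if the offset fits the signed 24-bit, word-scaled immediate."""
    off = arm_b_offset(instr_addr, target)
    return off % 4 == 0 and -(1 << 25) <= off < (1 << 25)

def decode_arm_b(instr_addr, word):
    """Branch target encoded in a B instruction word."""
    imm24 = word & 0xFFFFFF
    if imm24 & 0x800000:              # sign-extend the 24-bit field
        imm24 -= 1 << 24
    return instr_addr + 8 + (imm24 << 2)

print(fits_arm_b(0x1000004, 0x12346570))          # False: out of direct reach
print(hex(decode_arm_b(0x1000004, 0xEAFFFFFF)))   # 0x1000008, the next word
```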
arm-none-eabi-gcc -march=armv7-a -x assembler test.spp -o test.elf -Ttext=0x1000000 -nostdlib

How to write assembly language hello world program for 64 bit Mac OS X using printf?

I am trying to learn to write assembly language for 64-bit Mac OS. I have no problem with 32-bit Mac OS or with 32-bit and 64-bit Linux.
However, 64-bit Mac OS is different and I couldn't figure it out, so I am here to ask for help.
I have no problem using system calls to print. However, I would like to learn how to call C functions from 64-bit assembly language on Mac OS.
Please look at the following code
.data
_hello:
.asciz "Hello, world\n"
.text
.globl _main
_main:
movq $0, %rax
movq _hello(%rip), %rdi
call _printf
I use
$ gcc -arch x86_64 hello.s
to assemble and link.
It generates binary code. However, I got a segmentation fault when running it.
I tried adding "subq $8, %rsp" before calling _printf, still the same result as before.
What did I do wrong?
By the way, is there any way to debug this code on Mac? I tried adding -ggdb or -gstab or -gDWARF, and
$gdb ./a.out, and can't see the code and set break points.
You didn't say exactly what the problem you're seeing is, but I'm guessing that you're crashing at the point of the call to printf. This is because OS X (both 32- and 64-bit) requires that the stack pointer have 16-byte alignment at the point of any external function call.
The stack pointer was 16-byte aligned when _main was called; that call pushed an eight-byte return address onto the stack, so the stack is not 16-byte aligned at the point of the call to _printf. Subtract eight from %rsp before making the call in order to properly align it.
So I went ahead and debugged this for you (no magic involved, just use gdb, break main, display/5i $pc, stepi, etc). The other problem you're having is here:
movq _hello(%rip), %rdi
This loads the first eight bytes of your string into %rdi, which isn't what you want at all (in particular, the first eight bytes of your string are exceedingly unlikely to constitute a valid pointer to a format string, which results in a crash in printf). Instead, you want to load the address of the string. A debugged version of your program is:
.cstring
_hello: .asciz "Hello, world\n"
.text
.globl _main
_main:
sub $8, %rsp // align rsp to 16B boundary
mov $0, %rax
lea _hello(%rip), %rdi // load address of format string
call _printf // call printf
add $8, %rsp // restore rsp
ret
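To make the difference between the two instructions concrete, this Python sketch computes what movq _hello(%rip), %rdi actually puts in %rdi: the first eight bytes of the string read as a little-endian integer. The canonical-address check reflects the usual 48-bit x86-64 rule and is included only to illustrate why that value cannot be dereferenced; the helper name is made up.

```python
hello = b"Hello, world\n\x00"

# movq loads 8 bytes from the label; x86-64 is little-endian.
rdi = int.from_bytes(hello[:8], "little")
print(hex(rdi))   # 0x77202c6f6c6c6548, i.e. "Hello, w" read as a number

def is_canonical_48(addr):
    """With 48-bit virtual addresses, bits 63..47 must all be equal."""
    top = addr >> 47
    return top == 0 or top == (1 << 17) - 1

print(is_canonical_48(rdi))   # False: not even a well-formed pointer
```

lea, by contrast, performs the address computation without touching memory, so %rdi ends up holding the location of the string itself.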

Resources