I'm trying to capture two characters and a newline from user input.
The program prints the three hello-world strings to the screen, and then the user can type in some characters.
Everything seems to work, but it doesn't print the input back.
I suspect it is due to the way I operate on the X1 register in the _read function, or the way the buffer is allocated.
No errors are reported when running the code.
The code is assembled and linked using the following command. It should run on an M1 Mac:
as HelloWorld.s -o HelloWorld.o && ld -macosx_version_min 12.0.0 -o HelloWorld HelloWorld.o -lSystem -syslibroot `xcrun -sdk macosx --show-sdk-path` -e _start -arch arm64 && ./HelloWorld
//HelloWorld.s
.equ SYS_WRITE, 4
.equ SYS_READ, 3
.equ NEWLN, 10
.global _start // Provide program starting address to linker
.align 2
// Setup the parameters to print hello world
// and then call macOS to do it.
_start:
adr X4, helloworld1
mov X1, X4
bl _sizeof
bl _print
adr X4, helloworld2
mov X1, X4
bl _sizeof
bl _print
adr X4, helloworld3
mov X1, X4
bl _sizeof
bl _print
bl _read
//mov X2, 4
// bl _sizeof
bl _print
_exit:
mov X0, X2 // Use 0 return code
mov X16, #1 // Service command code 1 terminates this program
svc 0 // Call MacOS to terminate the program
_sizeof: //X1 = address, X2 = out length, string must terminate with \n
str LR, [SP, #-16]! //Store registers
//str W0, [SP, #-16]!
mov X2, #0
__loop:
ldrb W0, [X1, X2] //load a byte into W0 (32 bit)
add X2, X2, #1 //Add 1 offset
cmp W0, NEWLN //Compare byte with \n return
bne __loop
//ldr W0, [SP], #16
ldr LR, [SP], #16 //Load registers
ret
_print: //X2 = length, X1 = address
str LR, [SP, #-16]! //Store registers
mov X0, #1 // 1 = StdOut
// mov X1, X1 // string to print
// mov X2, X2 // length of string
mov X16, SYS_WRITE // MacOS write system call
svc 0 // Call kernel to output the string
ldr LR, [SP], #16 //Load registers
ret
_read:
//3 AUE_NULL ALL { user_ssize_t read(int fd, user_addr_t cbuf, user_size_t nbyte); }
str LR, [SP, #-16]! //Store registers
adr X1, msg
mov X0, #0 // 0 = StdIn
ldr X1, [x1] // address to store string
mov X2, #4 // length
mov X16, SYS_READ // MacOS read system call
svc 0 // Call system
ldr LR, [SP], #16 //Load registers
ret
msg: .ds 4 //memory buffer for keyboard input
helloworld1: .ascii "Hello World\n"
helloworld2: .ascii "Happy new year for 2022\n"
helloworld3: .ascii "Welcome to AARCH64 assembly on Mac Silicon\n"
First you need to move msg to a writeable segment:
.data
msg: .ds 4 //memory buffer for keyboard input
.text // keep everything else in __TEXT
Then, because segments may be moved around arbitrarily at link-time, Apple's toolchain will no longer allow you to use adr to get the address of that buffer - you will have to use adrp and add:
adrp x1, msg@page
add x1, x1, msg@pageoff
If you want, you can tell the linker to please optimise this back to an adr if possible:
Lloh0:
adrp x1, msg@page
Lloh1:
add x1, x1, msg@pageoff
.loh AdrpAdd Lloh0, Lloh1
Then you need to remove this line:
ldr X1, [x1]
That would load the contents of the buffer, which would just be null bytes.
And finally, you should change the x0 value to exit to a constant:
mov x0, 0
The value in x2 will have been clobbered at this point, and you don't need it anyway.
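To make the adrp/add pair above less opaque, here is a minimal C sketch of the address arithmetic involved. It is only an illustration: the function names are hypothetical, and it models the 4 KiB page split and the +/-1 MiB reach of adr, not what the Apple linker actually does when it relaxes the pair.

```c
#include <stdbool.h>
#include <stdint.h>

/* Page base that adrp materialises for `target`: the address with its
   low 12 bits cleared (the @page part). */
uint64_t adrp_page(uint64_t target)
{
    return target & ~UINT64_C(0xFFF);
}

/* Low 12 bits that the following add supplies (the @pageoff part). */
uint64_t page_offset(uint64_t target)
{
    return target & UINT64_C(0xFFF);
}

/* True when a single adr located at `pc` could reach `target`:
   adr encodes a signed 21-bit byte offset, i.e. -1 MiB .. +1 MiB - 1. */
bool adr_reachable(uint64_t pc, uint64_t target)
{
    int64_t delta = (int64_t)(target - pc);
    return delta >= -(INT64_C(1) << 20) && delta < (INT64_C(1) << 20);
}
```

Adding adrp_page(a) and page_offset(a) always gives back a, which is why the two-instruction sequence works no matter where the linker places the buffer.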
As a reference for anyone in the future looking for an example of reading from standard input on Apple Silicon (M1): this code (based on the information above) works. It reads a string of up to 20 characters and prints it back to standard output.
.global _start
.align 2
_start:
// READ IN FROM KEYBOARD
mov X16, 3 // System call number 3 = read
mov X0, 0 // File descriptor 0 = stdin (keyboard)
mov X2, 20 // Maximum number of bytes to read
adrp x1, msg@page // Load the address of the page containing msg
add x1, x1, msg@pageoff // Add the offset of msg within that page
svc 0 // Call kernel to perform the action
_write:
mov X16, 4 // System call number 4 = write
mov X0, 1 // File descriptor 1 = stdout (screen)
adrp x1, msg@page // Load the address of the page containing msg
add x1, x1, msg@pageoff // Add the offset of msg within that page
svc 0 // Call kernel to perform the action
_end:
mov X0, 0 // Return 0 (get a run error without this)
mov X16, 1 // System call to terminate this program
svc 0 // Call kernel to perform the action
.data
msg:
.ds 20 // 20 bytes of memory for keyboard input
Your makefile should look like this:
temp: temp.o
ld -o temp temp.o -lSystem -syslibroot `xcrun -sdk macosx --show-sdk-path` -e _start -arch arm64
temp.o: temp.s
as -arch arm64 -o temp.o temp.s
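For comparison, the same read-then-write sequence can be sketched in C using the POSIX read and write wrappers. This is an illustration under assumptions, not a translation of the assembly: echo_once is a hypothetical helper, and unlike the assembly it passes the byte count returned by read on to write instead of always writing the full buffer.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `max` bytes (capped at 20, like the .ds 20 buffer) from in_fd
   and write back exactly the bytes that were read.
   Returns what write reports, 0 at end of input, or -1 on a read error. */
ssize_t echo_once(int in_fd, int out_fd, size_t max)
{
    char buf[20];

    if (max > sizeof buf)
        max = sizeof buf;

    ssize_t n = read(in_fd, buf, max);
    if (n <= 0)
        return n;

    return write(out_fd, buf, (size_t)n);
}
```

Calling echo_once(0, 1, 20) echoes one line typed on the keyboard back to the terminal.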
I'm new to assembly programming, but I've been figuring a lot out by googling and trial and error. I'm trying to write a simple program that prompts the user to enter a number (with _printf), then reads in and saves that number (_scanf), then prints out a message using the stored number (_printf).
I was able to get the _printf code to work under aarch64 (Apple Silicon) assembly, but no matter what I do, I cannot seem to get _scanf to work. I have looked through the ARM Developer docs, looked at the HelloSilicon github page, and googled for hours, and I cannot come up with anything that works.
In my code (included below), if I comment out the "read_from_keyboard" branch in the following code, the printf functions work just fine. But when I include the "read_from_keyboard" code, I get a "Segmentation fault: 11" error.
Where is my mistake?
.global main
.align 4
main:
// PRINT MESSAGE
ADRP X0, message@PAGE
ADD X0, X0, message@PAGEOFF
BL _printf
// BL read_from_keyboard
// READ NUMBER FROM DATA AND MOVE TO STACK FOR PRINTING
ADRP X10, num@PAGE
ADD X10, X10, num@PAGEOFF
LDR X1, [X10]
STR X1, [SP, #-16]!
// LOAD THE PRINTF FORMATTED MESSAGE
ADRP X0, output_format@PAGE
ADD X0, X0, output_format@PAGEOFF
end:
BL _printf
mov X16, #1
svc 0
read_from_keyboard:
ADRP X0, input_format@PAGE
ADD X0, X0, input_format@PAGEOFF
ADRP X11, num@PAGE
ADD X11, X11, num@PAGEOFF
BL _scanf
ret
.data
.balign 4
message: .asciz "What is your favorite number?\n"
.balign 4
num: .word 32
.balign 4
input_format: .asciz "%d"
.balign 4
output_format: .asciz "Your favorite number is %d \n"
On the call to _printf, your variadic arg is in [sp]. On the call to _scanf, you put it in x11. Why? Just do the same str xN, [sp, #-16]! that you do on _printf, that'll fix your segfault.
In addition though, you also need a stack frame for read_from_keyboard. The bl _scanf clobbers x30, so the following ret would just get stuck in an infinite loop.
Fix these two issues and your code works:
.global _main
.align 4
_main:
// PRINT MESSAGE
ADRP X0, message@PAGE
ADD X0, X0, message@PAGEOFF
BL _printf
BL read_from_keyboard
// READ NUMBER FROM DATA AND MOVE TO STACK FOR PRINTING
ADRP X10, num@PAGE
ADD X10, X10, num@PAGEOFF
LDR X1, [X10]
STR X1, [SP, #-16]!
// LOAD THE PRINTF FORMATTED MESSAGE
ADRP X0, output_format@PAGE
ADD X0, X0, output_format@PAGEOFF
end:
BL _printf
mov X16, #1
svc 0
read_from_keyboard:
STP X29, X30, [SP, #-16]!
ADRP X0, input_format@PAGE
ADD X0, X0, input_format@PAGEOFF
ADRP X11, num@PAGE
ADD X11, X11, num@PAGEOFF
STR X11, [SP, #-16]!
BL _scanf
ADD SP, SP, #16
LDP X29, X30, [SP], #16
ret
.data
.balign 4
message: .asciz "What is your favorite number?\n"
.balign 4
num: .word 32
.balign 4
input_format: .asciz "%d"
.balign 4
output_format: .asciz "Your favorite number is %d \n"
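As a point of comparison, here is a rough C sketch of what the assembly above asks the C library to do. The helper names are hypothetical, and sscanf and snprintf stand in for scanf and printf so the behaviour can be checked without a terminal. The important detail is the same one the answer fixes: the scanning function receives the address of num, while the printing function receives its value.

```c
#include <stdio.h>

/* Parse one decimal integer from `text` into *num, the way
   scanf("%d", &num) would from standard input.
   Returns the number of conversions performed (1 on success). */
int parse_number(const char *text, int *num)
{
    return sscanf(text, "%d", num);
}

/* Format the reply into `out` (at most `cap` bytes, including the
   terminator). Returns the length that snprintf reports. */
int format_reply(char *out, size_t cap, int num)
{
    return snprintf(out, cap, "Your favorite number is %d \n", num);
}
```

In the assembly, the pointer passed to parse_number corresponds to the address of num stored at [sp] before BL _scanf, and the int passed to format_reply corresponds to the value loaded with LDR and stored at [sp] before BL _printf.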
With arm64, a literal for a nearby address can be loaded into a register with the adr instruction. According to the ARMv8 Architecture Reference Manual, the adr instruction:
ADR <Xd>, <label> Address of label at a PC-relative offset
can reference labels within +/-1MB. There is a page version with bit 31 set, adrp, for constructing larger offsets.
What I don't understand is why neither gcc 8.2 nor clang 7.0 for ARM64 use adr rather than an adrp and add pair for nearby variables. Optimization levels don't change this.
int write(int fd, const void *buf, int count);
void xyz(void)
{
write(2, "abc", 4);
}
xyz(): // @xyz()
adrp x1, .L.str
add x1, x1, :lo12:.L.str
orr w0, wzr, #0x2
orr w2, wzr, #0x4
b write(int, void const*, int)
.L.str:
.asciz "abc"
Can they not reason that this string literal is within +/-1MB? Is there a compiler attribute/switch to tell them this?
GCC will generate such code with -mcmodel=tiny:
.global xyz
.type xyz, %function
xyz:
.LFB0:
.cfi_startproc
mov w2, 4
adr x1, .LC0
mov w0, 2
b write
.cfi_endproc
.LFE0:
.size xyz, .-xyz
.section .rodata.str1.8,"aMS",@progbits,1
.align 3
.LC0:
.string "abc"
I am trying to compile a binary with the -nostdlib flag on an AArch64 platform.
I've dealt with it successfully on the x86-64 platform this way:
void _start() {
/* main body of program: call main(), etc */
/* exit system call */
asm("movl $1,%eax;"
"xorl %ebx,%ebx;"
"int $0x80"
);
}
Is there an analogue for doing the same thing on the AArch64 platform (specifically, the exit system call)?
The example below should work on an aarch64-linux-gnu system. It does work when run under qemu-aarch64 3.0 on my x86_64 Linux system.
In my humble opinion, the most concise and loosely coupled source of information for learning purposes is the musl libc source code:
syscall_arch.h contains the __syscall functions to be used depending on the number of arguments required by a given syscall,
syscall.h.in contains defines for all system calls.
We should then use:
static inline long __syscall1(long n, long a)
{
register long x8 __asm__("x8") = n;
register long x0 __asm__("x0") = a;
__asm_syscall("r"(x8), "0"(x0));
}
and __NR_exit:
#define __NR_exit 93
#define __NR_exit_group 94
A basic example in C would be exit-syscall.c:
#include "syscall_arch.h"
#include "syscall.h.in"
int main(void)
{
// exiting with return code 1.
__syscall1(__NR_exit, 1);
// we should have exited.
for (;;);
}
Compiling/executing/checking return code:
/opt/linaro/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc -static -O0 -o exit-syscall exit-syscall.c
qemu-aarch64 exit-syscall
echo $?
1
The generated code for main() and __syscall1() can be inspected using:
/opt/linaro/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-objdump -D exit-syscall > exit-syscall.lst
It looks like this:
0000000000400554 <main>:
400554: a9bf7bfd stp x29, x30, [sp, #-16]!
400558: 910003fd mov x29, sp
40055c: d2800021 mov x1, #0x1 // #1
400560: d2800ba0 mov x0, #0x5d // #93
400564: 97fffff4 bl 400534 <__syscall1>
0000000000400534 <__syscall1>:
400534: d10043ff sub sp, sp, #0x10
400538: f90007e0 str x0, [sp, #8]
40053c: f90003e1 str x1, [sp]
400540: f94007e8 ldr x8, [sp, #8]
400544: f94003e0 ldr x0, [sp]
400548: d4000001 svc #0x0
40054c: 910043ff add sp, sp, #0x10
400550: d65f03c0 ret
See document "Procedure Call Standard for the ARM 64-bit Architecture(AArch64)" for more information.
Therefore, an AArch64 equivalent of your x86_64 code would be example.c:
void main(void) {
/* exit system call - calling NR_exit with 1 as the return code*/
asm("mov x0, #1;"
"mov x8, #93;"
"svc #0x0;"
);
for (;;);
}
/opt/linaro/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc -static -o example example.c
qemu-aarch64 example
echo $?
1
Please note that the glibc implementation of exit() calls __NR_exit_group before calling __NR_exit.
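If inline assembly is not a hard requirement, the same raw system call can also be issued through the C library's generic syscall() wrapper on Linux. The sketch below is an illustration only: raw_exit_status is a hypothetical helper that forks a child, lets the child terminate via the raw exit system call, and reports the status the parent sees, which makes the effect observable without ending the calling process.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that terminates through the raw exit system call
   (no libc exit handlers run) and return the exit status observed
   by the parent, or -1 if anything goes wrong. */
int raw_exit_status(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        syscall(SYS_exit, code);   /* SYS_exit is 93 on aarch64 Linux */
        for (;;) { }               /* not reached: exit does not return */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

Because SYS_exit comes from the system headers, the same source builds on other Linux architectures, where the number differs but the call is the same.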
I recently wrote this code in assembly (ARM LEGv8) which uses recursion to calculate the gcd of 64-bit random numbers. First I generate 2 random numbers using random(), and then I check whether they are relatively prime (gcd(num1,num2) == 1). If not, I generate new numbers, repeating until relatively prime numbers are found. The random() and gcd() functions are working fine (checked). I have an issue with the stack (segmentation fault) when I try to store an updated value on it. I get the error the third time through the loop, on the line str x22,[x29,24]. Here is my code:
.data //Start of the datasegment
q_initialState: .string "Please enter the intial state : "
scanner_hex: .string "%p"
output_num: .string "%llu\n"
gcd_result: .string "GCD is : %d\n"
message: .string "They are prime numbers \n"
.text //Start of the .text segment
.global main
main:
stp x29,x30,[sp,-32]!
add x29,sp,0
adr x0,q_initialState
bl printf
add x1,x29,28
adr x0,scanner_hex
bl scanf
ldr x19,[x29,28] //this is the initial state
loop:
mov x2,x19
bl random
mov x22,x5
mov x1,x22
adr x0,output_num
bl printf
mov x2,x19
bl random
mov x23,x5
mov x1,x23
adr x0,output_num
bl printf
str x22,[x29,24] // <----------- ERROR HERE ---------
str x23,[x29,28]
bl gcd
cmp x24,1
bne loop
ldp x29,x30,[sp],32
ret
random:
stp x29,x30,[sp,-32]!
add x29,sp,0
eor x2,x2,x2,lsr 12
eor x2,x2,x2,lsl 25
eor x2,x2,x2,lsr 27
ldr x4,=0x2545f4914f6cdd1d
mul x5,x2,x4
mov x19,x2
ldp x29,x30,[sp],32
ret
gcd:
stp x29,x30,[sp,-32]!
add x29,sp,0
ldr x9,[x29,56] //Value for variable m
ldr x10,[x29,60] //Value for variable n
udiv x11,x9,x10
mul x11,x11,x10
sub x1,x9,x11
cbz x1,return_n
str x10,[x29,24]
str x1,[x29,28]
bl gcd
ldp x29,x30,[sp],32
ret
return_n:
ldr x1,[x29,60]
mov x24,x1
ldp x29,x30,[sp],32
ret
Here is Go's undocumented Syscall function:
func Syscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err Errno)
And here is the C definition:
long syscall(long number, ...);
Pretty different. So it's fairly obvious that trap is number, and a1, a2, and a3 allow for three arguments. I also worked out that r1 is the return value, and err is errno. But what is r2? The syscall man page doesn't mention multiple return values.
It does give the actual calling conventions (still only one retval):
arch/ABI    instruction                 syscall #  retval  error    Notes
──────────────────────────────────────────────────────────────────────────
alpha       callsys                     v0         a0      a3       [1]
arc         trap0                       r8         r0      -
arm/OABI    swi NR                      -          a1      -        [2]
arm/EABI    swi 0x0                     r7         r0      -
arm64       svc #0                      x8         x0      -
blackfin    excpt 0x0                   P0         R0      -
i386        int $0x80                   eax        eax     -
ia64        break 0x100000              r15        r8      r10      [1]
m68k        trap #0                     d0         d0      -
microblaze  brki r14,8                  r12        r3      -
mips        syscall                     v0         v0      a3       [1]
nios2       trap                        r2         r2      r7
parisc      ble 0x100(%sr2, %r0)        r20        r28     -
powerpc     sc                          r0         r3      r0       [1]
s390        svc 0                       r1         r2      -        [3]
s390x       svc 0                       r1         r2      -        [3]
superh      trap #0x17                  r3         r0      -        [4]
sparc/32    t 0x10                      g1         o0      psr/csr  [1]
sparc/64    t 0x6d                      g1         o0      psr/csr  [1]
tile        swint1                      R10        R00     R01      [1]
x86_64      syscall                     rax        rax     -        [5]
x32         syscall                     rax        rax     -        [5]
xtensa      syscall                     a2         a2      -
But on x86 this is the implementation:
#define INVOKE_SYSCALL INT $0x80
TEXT ·Syscall(SB),NOSPLIT,$0-28
CALL runtime·entersyscall(SB)
MOVL trap+0(FP), AX // syscall entry
MOVL a1+4(FP), BX
MOVL a2+8(FP), CX
MOVL a3+12(FP), DX
MOVL $0, SI
MOVL $0, DI
INVOKE_SYSCALL
CMPL AX, $0xfffff001
JLS ok
MOVL $-1, r1+16(FP)
MOVL $0, r2+20(FP)
NEGL AX
MOVL AX, err+24(FP)
CALL runtime·exitsyscall(SB)
RET
ok:
MOVL AX, r1+16(FP)
MOVL DX, r2+20(FP)
MOVL $0, err+24(FP)
CALL runtime·exitsyscall(SB)
RET
Now, I don't read assembly too well, but I'm pretty sure it is returning EDX in r2. Why?
I think they have multiple return values for consistency. As you can see from that table, some architectures return multiple values and if you check a few of the other assembly files from that directory you'll see they move register values to r2.
But why DX? This part is still puzzling. Scattered across the web are docs mentioning that on i386 a function is allowed to use both EAX and EDX for return values. For example, the System V Application Binary Interface Intel386 Architecture Processor Supplement:
%edx scratch register; also used to return the upper 32bits of some
64bit return types
Later it goes on to say:
unsigned long long: The most significant 32 bits are returned in %edx. The least
significant 32 bits are returned in %eax.
Let's try this:
uint64_t some_function() {
return 18446744073709551614LLU;
}
Clang ends up producing:
pushl %ebp
movl %esp, %ebp
movl $-2, %eax
movl $-1, %edx
popl %ebp
ret
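The split that Clang performs here can be reproduced with a short C sketch. The helper names are hypothetical; the code only shows the arithmetic of dividing a 64-bit value into the two 32-bit halves that the i386 ABI assigns to %eax and %edx.

```c
#include <stdint.h>

/* Low half of a 64-bit value: the part the i386 ABI returns in %eax. */
uint32_t low32(uint64_t v)
{
    return (uint32_t)v;
}

/* High half of a 64-bit value: the part the i386 ABI returns in %edx. */
uint32_t high32(uint64_t v)
{
    return (uint32_t)(v >> 32);
}
```

For 18446744073709551614 (2^64 - 2), low32 gives 0xFFFFFFFE and high32 gives 0xFFFFFFFF, which are the bit patterns of the $-2 and $-1 immediates in the listing.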
Interestingly, asm_linux_amd64.s seems to do the same thing, giving us a pretext to look at the System V ABI for AMD64. This doc also mentions in passing, about RDX:
used to pass 3rd argument to functions; 2nd return register
But Appendix A deals with Linux Conventions specifically.
The interface between the C library and the Linux kernel is the same
as for the user-level applications with the following differences:
Returning from the syscall, register %rax contains the result of the
system-call. A value in the range between -4095 and -1 indicates an error,
it is -errno.
No mention of RDX for the system call.
I won't put my hand in the fire for this (or in general), but I suspect taking DX is not necessary for Linux, which doesn't make use of return values so large that they spill out of AX.
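The error convention quoted from Appendix A can be written down as a small C sketch. The function names are hypothetical, and the code implements only the -4095 to -1 range stated in the ABI text; it is not a transcription of the comparison in Go's assembly stub.

```c
#include <stdbool.h>

/* True when a raw Linux system call return value lies in -4095..-1,
   the range the AMD64 ABI appendix describes as "-errno". */
bool is_syscall_error(long raw)
{
    return raw >= -4095L && raw <= -1L;
}

/* errno value for a failing raw return, or 0 for a normal result. */
long syscall_errno(long raw)
{
    return is_syscall_error(raw) ? -raw : 0;
}
```

A wrapper in the style of Syscall would report -1 together with syscall_errno(raw) when is_syscall_error(raw) holds, and otherwise pass the raw value through unchanged as the result.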