How to use the monitor / mwait instructions in x86-64 assembly on Mac or baremetal - macos

Originally I asked about umonitor and umwait, but it turns out as #harold suggested, that you can't even buy a processor that has those instructions yet. So this question is about monitor and mwait, because I am interested how to use those on baremetal.
I just learned of monitor/mwait.
I would have thought that there would be no evaluation of any instructions once mwait is called, so I don't understand how other parts of memory could be written to. Unless perhaps this is some multithreaded stuff with shared memory of some sort, which I don't fully think I understand.
Wondering if one could whip up a quick hello world program to demonstrate how to use these instructions. My attempt at it is this.
global _main
section .text
_main:
call print1
; watch when address 1000
; (randomly chosen)
; is written to.
mov eax, 1000
monitor eax
; wait for 100 ms, not sure
; or some interrupt
mov eax, 100
mwait eax
call print2
call exit
print2:
mov rdx, msg2.len
mov rsi, msg2
mov rdi, 1
mov rax, 0x2000004
syscall
ret
print1:
mov rdx, msg1.len
mov rsi, msg1
mov rdi, 1
mov rax, 0x2000004
syscall
ret
exit:
mov rdi, 0 ; exit status
mov rax, 0x2000001 ;: exit
syscall
section .data
msg1: db "start", 0xa, 0
.len: equ $ - msg1
msg2: db "end", 0xa, 0
.len: equ $ - msg2
What I'm wondering is what an example usage looks like for (a) a timespan like 100ms delay, and/or (b) an "event" of writing to a specific part of memory to trigger the callback, and/or (c) an external event like keyboard interrupt or ctrl+c interrupt if there is such a thing. Or perhaps the timing thing is done with tpause.
Trying the following with tpause I get error: invalid combination of opcode and operands:
mov eax, 1000
mov edx, 1000
mov rdi, 0
tpause rdi, edx, eax
The few resources I've found:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-303.html
https://software.intel.com/en-us/articles/how-to-use-the-monitor-and-mwait-streaming-simd-extensions-3-instructions
http://blog.andy.glew.ca/2010/11/httpsemipublic.html

Related

Reversing an array and printing it in x86-64

I am trying to print an array, reverse it, and then print it again. I manage to print it once. I can also make 2 consecutive calls to _printy and it works. But the code breaks with the _reverse function. It does not segfault, it exits with code 24 (I looked online but this seems to mean that the maximum number of file descriptors has been exceeded, and I cannot get what this means in this context). I stepped with a debugger and the loop logic seems to make sense.
I am not passing the array in RDI, because _printy restores the content of that register when it exits. I also tried to load it directly into RDI before calling _reverse but that does not solve the problem.
I cannot figure out what the problem is. Any idea?
BITS 64
DEFAULT REL
; -------------------------------------
; -------------------------------------
; PRINT LIST
; -------------------------------------
; -------------------------------------
%define SYS_WRITE 0x02000004
%define SYS_EXIT 0x02000001
%define SYS_OPEN 0x02000005
%define SYS_CLOSE 0x02000006
%define SYS_READ 0x02000003
%define EXIT_SUCCESS 0
%define STDOUT 1
%define LF 10
%define INT_OFFSET 48
section .text
extern _printf
extern _puts
extern _exit
global _main
_main:
push rbp
lea rdi, [rel array]
call _printy
call _reverse
call _printy
pop rbp
call _exit
_reverse:
push rbp
lea rsi, [rdi + 4 * (length - 1) ]
.LOOP2:
cmp rdi, rsi
jge .DONE2
mov r8, [rdi]
mov r9, [rsi]
mov [rdi], r9
mov [rsi], r8
add rdi,4
sub rsi,4
jmp .LOOP2
.DONE2:
xor rax, rax
lea rdi, [rel array]
pop rbp
ret
_printy:
push rbp
xor rcx, rcx
mov r8, rdi
.loop:
cmp rcx, length
jge .done
push rcx
push r8
lea rdi, [rel msg]
mov rsi, [r8 + rcx * 4]
xor rax, rax
call _printf
pop r8
pop rcx
add rcx, 1
jmp .loop
.done:
xor rax, rax
lea rdi, [rel array]
pop rbp
ret
section .data
array: dd 78, 2, 3, 4, 5, 6
length: equ ($ - array) / 4
msg: db "%d => ", 0
Edit with some info from the debugger
Stepping into the _printy function gives the following msg, once reaching the call to _printf.
* thread #1, queue = 'com.apple.main-thread', stop reason = step over failed (Could not create return address breakpoint.)
frame #0: 0x0000000100003f8e a.out`printf
a.out`printf:
-> 0x100003f8e <+0>: jmp qword ptr [rip + 0x4074] ; (void *)0x00007ff80258ef0b: printf
0x100003f94: lea r11, [rip + 0x4075] ; _dyld_private
0x100003f9b: push r11
0x100003f9d: jmp qword ptr [rip + 0x5d] ; (void *)0x00007ff843eeb520: dyld_stub_binder
I am not an expert, but a quick research online led to the following
During the 'thread step-out' command, check that the memory we are about to place a breakpoint in is executable. Previously, if the current function had a nonstandard stack layout/ABI, and had a valid data pointer in the location where the return address is usually located, data corruption would occur when the breakpoint was written. This could lead to an incorrectly reported crash or silent corruption of the program's state. Now, if the above check fails, the command safely aborts.
So after all this might not be a problem (I am also able to track the execution of the printf call). But this is really the only understandable piece of information I am able to extract from the debugger. Deep in some quite obscure (to me) function calls I reach this
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
frame #0: 0x00007ff80256db7f libsystem_c.dylib`flockfile + 10
libsystem_c.dylib`flockfile:
-> 0x7ff80256db7f <+10>: call 0x7ff8025dd480 ; symbol stub for: __error
0x7ff80256db84 <+15>: mov r14d, dword ptr [rax]
0x7ff80256db87 <+18>: mov rdi, qword ptr [rbx + 0x68]
0x7ff80256db8b <+22>: add rdi, 0x8
Target 0: (a.out) stopped.
(lldb)
Process 61913 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
frame #0: 0x00007ff8025dd480 libsystem_c.dylib`__error
This is one of the function calls happening in _printf.
Ask further questions if there is something more I can do.
Your array consists of int32 numbers aka dd in nasm terminology, but your swap operates on 64 bit numbers:
mov r8, [rdi]
mov r9, [rsi]
mov [rdi], r9
mov [rsi], r8
Assuming you were not after some crazy optimizations where you swap a pair of elements simultaneously you want this to remain in 32 bits:
mov r8d, [rdi]
mov r9d, [rsi]
mov [rdi], r9d
mov [rsi], r8d

Segmentation fault when adding 2 digits - nasm MacOS x86_64

I am trying to write a program that accepts 2 digits as user input, and then outputs their sum. I keep getting segmentation error when trying to run program(I am able to input 2 digits, but then the program crashes). I already check answers to similar questions and many of them pointed out to clear the registers, which I did, but I am still getting a segmentation fault.
section .text
global _main ;must be declared for linker (ld)
default rel
_main: ;tells linker entry point
call _readData
call _readData1
call _addData
call _displayData
mov RAX, 0x02000001 ;system call number (sys_exit)
syscall
_addData:
mov byte [sum], 0 ; init sum with 0
lea EAX, [buffer] ; load value from buffer to register
lea EBX, [buffer1] ; load value from buffer1 to register
sub byte [EAX], '0' ; transfrom to digit
sub byte [EBX], '0' ; transform to digit
add [sum], EAX ; increment value of sum by value from register
add [sum], EBX ; increment value of sum by value from 2nd register
add byte [sum], '0' ; convert to ASCI
xor EAX, EAX ; clear registers
xor EBX, EBX ; clear registers
ret
_readData:
mov RAX, 0x02000003
mov RDI, 2
mov RSI, buffer
mov RDX, SIZE
syscall
ret
_readData1:
mov RAX, 0x02000003
mov RDI, 2
mov RSI, buffer1
mov RDX, SIZE
syscall
ret
_displayData:
mov RAX, 0x02000004
mov RDI, 1
mov RSI, sum
mov RDX, SIZE
syscall
ret
section .bss
SIZE equ 4
buffer: resb SIZE
buffer1: resb SIZE
sum: resb SIZE
I see that, unlike other languages I learned, it is quite difficult to find a good source /tutorial about programming assembly using nasm on x86_64 architecture. Is there any kind of walkthrough for beginners(so I do not need to ask on SO everytime I am stuck :D)

x86 Assembly; overwriting .bss values?

I'm currently trying to write a small program in ASM, and I have the following issue. I take input from the user as a string which I store in a variable I've declared in the .bss section of my code; I then re-prompt and overwrite the previously stored answer and do this multiple times. My issue is if someone has entered an answer that was shorter than the last (i.e. "James" then "Jim") I get the following output:
"Hi, James"
"What's your name?"
"Jim"
"Hi, Jimes"
What's happening here is the characters that weren't overwritten remain and get printed, as expected. What I'm wondering is how I may go about wiping the data in the .bss db between prompts?
Here is the code so far:
section .data
question: db "What's your name?", 10
answer: db "Hello, "
ln db 10
section .bss
name resb 16
section .text
global start
start:
call prompt
call getName
mov rsi, answer
mov rdx, 7
call print
mov rsi, name
mov rdx, 10
call print
mov rsi, ln
mov rdx, 1
call print
call loop_name
mov rax, 0x02000001
mov rdi, 0
syscall
reset_name:
loop_name:
mov cx, 3
startloop:
cmp cx, 0
jz endofloop
push cx
loopy:
call getName
mov rsi, answer
mov rdx, 7
call print
mov rsi, name
mov rdx, 10
call print
pop cx
dec cx
jmp startloop
endofloop:
; Loop ended
; Do what ever you have to do here
ret
prompt:
mov rax, 0x02000004
mov rdi, 1
mov rsi, question
mov rdx, 18
syscall
print:
mov rax, 0x02000004
mov rdi, 1
syscall
ret
getName:
mov rax, 0x02000003 ; read
mov rdi, 0
mov rsi, name
mov rdx, 37
syscall
ret
Any ideas? (Variable in question is name)
While I don't know the system calls you're using, we can do one of three things:
clear the entire variable before reusing it.
use and share an explicit length value to indicate how many bytes of it are valid
null terminate the string right after it is input
Using an explicit length value may involve someone placing a null terminator at the right point in time (e.g. just before printing).
The read operation should return to you a length that you can pass to someone else (e.g. as a pair pointer & length), or otherwise use immediately to null terminate the string.  If it doesn't, then use the first approach of clearing the entire variable before reusing it.
Typically, syscalls have return values, that indicate length on success or else negative values for failure.  In such case, you are ignoring both.

Why doesn't this assembly code print the top of the stack?

After successfully making a "Hello, World!" program in x86-64, I wanted
to make a program that can peek at the top of the stack (without popping it, and using the esp register so I can learn how it works). This is the program in NASM:
extern GetStdHandle, WriteConsoleA, ExitProcess
section .bss
dummy resd 1
section .text
%macro print 3
mov rcx, %1
mov rdx, %2
mov r8, %3
mov r9, dummy
push NULL
call WriteConsoleA
%endmacro
_start:
mov rcx, STD_OUTPUT_HANDLE
call GetStdHandle
push 65
print rax, [x], 1
mov rcx, 0
call ExitProcess
NULL equ 0
STD_OUTPUT_HANDLE equ -11
At the print rax, [x], 1 line, x is replaced by something. I tried a variety of things, like rsp, esp, rsi, esi, rsp+1, rsp+4, etc. None of them worked. They either don't compile or don't print anything.
What is the correct way to do it? (note: this is solely for experimental purposes. I know I could use push/pop in this case, but I want to learn how to do it this way.)
mov rdx, [rsp] will load 65 into rdx. But WriteConsole expects the address of the string to print. So you want mov rdx, rsp.
One other thing that should be fixed: the stack should be aligned to 16 bytes before the call and there should be 32 bytes of empty space at the top of the stack. After the push, put sub rsp, 40. Then use rsp+40 as the address to print.

Open file, delete zeros, sort it - NASM

I am currently working on some problems and this is the one I am having trouble with. To make it all clear, I am a beginner, so any help is more than welcome.
Problem:
Sort the content of a binary file in descending order. The name of the file is passed as a command line argument. File content is interpreted as four-byte positive integers, where value 0, when found, is not written into the file. The result must be written in the same file that has been read.
The way I understand is that I have to have a binary file. Open it. Get its content. Find all characters while keeping in mind those are positive, four-byte integers, find zeros, get rid of zeros, sort the rest of the numbers.
We are allowed to use glibc, so this was my attempt:
section .data
warning db 'File does not exist!', 10, 0
argument db 'Enter your argument.', 10, 0
mode dd 'r+'
opened db 'File is open. Time to read.', 10, 0
section .bss
content resd 10
counter resb 1
section .text
extern printf, fopen, fgets, fputc
global main
main:
push rbp
mov rbp, rsp
push rsi
push rdi
push rbx
;location of argument's address
push rsi
cmp rdi, 2
je .openfile
mov rdi, argument
mov rax, 0
call printf
jmp .end
.openfile:
pop rbx
;First real argument of command line
mov rdi, [rbx + 8]
mov rsi, mode
mov rax, 0
call fopen
cmp al, 0
je .end
push rax
mov rdi, opened
mov rax, 0
call printf
.readfromfile:
mov rdi, content
mov rsi, 12 ;I wrote 10 numbers in my file
pop rdx
mov rax, 0
call fgets
cmp al, 0
je .end
push rax
mov rsi, tekst
pop rdi
.loop:
lodsd
inc byte[counter]
cmp eax, '0'
jne .loop
;this is the part where I am not sure what to do.
;I am trying to delete the zero with backspace, then use space and
;backspace again - I saw it here somewhere as a solution
mov esi, 0x08
call fputc
mov esi, 0x20
call fputc
mov esi, 0x08
call fputc
cmp eax, 0
je .end
jmp .loop
.end:
pop rdi
pop rsi
pop rbx
mov rsp, rbp
pop rbp
ret
So, my idea was to open the file, find zero, delete it by using backspace and space, then backspace again; Continue until I get to the end of the file, then sort it. As it can be seen I did not attempt to sort the content because I cannot get program to do the first part for me. I have been trying this for couple of days now and everything is getting foggy.
If someone can help me out, I would be very grateful. If there is something similar to this problem, feel free to link it to me. Anything that could help, I am ready to read and learn.
I am also unsure about how much information do I have to give. If something is unclear, please point it out to me.
Thank you
For my own selfish fun, an example of memory area being "collapsed" when dword zero value is detected:
to build in linux with NASM for target ELF64 executable:
nasm -f elf64 so_64b_collapseZeroDword.asm -l so_64b_collapseZeroDword.lst -w+all
ld -b elf64-x86-64 -o so_64b_collapseZeroDword so_64b_collapseZeroDword.o
And for debugger I'm using edb (built from sources) (the executable doesn't do anything observable by user, when it works correctly, it's supposed to be run in debugger single-stepping over instructions and having memory view over the .data segment to see how the values are moved around in memory).
source file so_64b_collapseZeroDword.asm
segment .text
collapseZeroDwords:
; input (custom calling convention, suitable only for calls from assembly):
; rsi - address of first element
; rdx - address beyond last element ("vector::end()" pointer)
; return: rdi - new "beyond last element" address
; modifies: rax, rsi, rdi
; the memory after new end() is not cleared (the zeroes are just thrown away)!
; search for first zero (up till that point the memory content will remain same)
cmp rsi, rdx
jae .noZeroFound ; if the (rsi >= end()), no zero was in the memory
lodsd ; eax = [rsi], rsi += 4
test eax, eax ; check for zero
jne collapseZeroDwords
; first zero found, from here on, the non-zero values will be copied to earlier area
lea rdi, [rsi-4] ; address where the non-zero values should be written
.moveNonZeroValues:
cmp rsi, rdx
jae .wholeArrayCollapsed ; if (rsi >= end()), whole array is collapsed
lodsd ; eax = [rsi], rsi += 4
test eax, eax ; check for zero
jz .moveNonZeroValues ; zero detected, skip the "store" value part
stosd ; [rdi] = eax, rdi += 4 (pointing beyond last element)
jmp .moveNonZeroValues
.noZeroFound:
mov rdi, rdx ; just return the original "end()" pointer
.wholeArrayCollapsed: ; or just return when rdi is already set as new end()
ret
global _start
_start: ; run some hardcoded simple tests, verify in debugger
lea rsi, [test1]
lea rdx, [test1+4*4]
call collapseZeroDwords
cmp rdi, test1+4*4 ; no zero collapsed
lea rsi, [test2]
lea rdx, [test2+4*4]
call collapseZeroDwords
cmp rdi, test2+3*4 ; one zero
lea rsi, [test3]
lea rdx, [test3+4*4]
call collapseZeroDwords
cmp rdi, test3+3*4 ; one zero
lea rsi, [test4]
lea rdx, [test4+4*4]
call collapseZeroDwords
cmp rdi, test4+2*4 ; two zeros
lea rsi, [test5]
lea rdx, [test5+4*4]
call collapseZeroDwords
cmp rdi, test5+2*4 ; two zeros
lea rsi, [test6]
lea rdx, [test6+4*4]
call collapseZeroDwords
cmp rdi, test6+0*4 ; four zeros
; exit back to linux
mov eax, 60
xor edi, edi
syscall
segment .data
; all test arrays are 4 elements long for simplicity
dd 0xCCCCCCCC ; debug canary value to detect any over-read or over-write
test1 dd 71, 72, 73, 74, 0xCCCCCCCC
test2 dd 71, 72, 73, 0, 0xCCCCCCCC
test3 dd 0, 71, 72, 73, 0xCCCCCCCC
test4 dd 0, 71, 0, 72, 0xCCCCCCCC
test5 dd 71, 0, 72, 0, 0xCCCCCCCC
test6 dd 0, 0, 0, 0, 0xCCCCCCCC
I tried to comment it extensively to show what/why/how it is doing, but feel free to ask about any particular part. The code was written with simplicity on mind, so it doesn't use any aggressive performance optimizations (like vectorized search for first zero value, etc).

Resources