What does FSTP DWORD PTR DS:[ESI+1224] do?

What does FSTP DWORD PTR DS:[ESI+1224] do? - debugging

I am trying to learn more about assembly and disassembly.
My goal is to modify the way a specific address is being written using a debugger (olly). Preferably by incrementing it by a number (20, 50, etc..) I can identify the address of the floating point number (in this case located at 33B7420C).
When I set a breakpoint on memory access write it brings me to 00809B2E which has the following assembly:
FSTP DWORD PTR DS:[ESI+1224]
What exactly is it doing in this address? I know that the FPU register has the number i'm looking for but not sure what all this address is doing.
The closest I come to googling is:
What does MOV EAX, DWORD PTR DS:[ESI] mean and what does it do?
A copy of the registers shows the following:
EAX 00000000
ECX 00A16E40 EZ.00A16E40
EDX FFFFFFFF
EBX 33B74578
ESP 0018FA90
EBP 00000000
ESI 33B72FE8
EDI 33B74578
EIP 00809B2E <EZ.Breakpoint for time>
C 0 ES 002B 32bit 0(FFFFFFFF)
P 0 CS 0023 32bit 0(FFFFFFFF)
A 0 SS 002B 32bit 0(FFFFFFFF)
Z 0 DS 002B 32bit 0(FFFFFFFF)
S 0 FS 0053 32bit 7EFDD000(FFF)
T 0 GS 002B 32bit 0(FFFFFFFF)
D 0
O 0 LastErr ERROR_SUCCESS (00000000)
EFL 00210202 (NO,NB,NE,A,NS,PO,GE,G)
ST0 valid 1150.0000000000000000
ST1 zero 0.0
ST2 zero 0.0
ST3 empty 64.951911926269531250
ST4 empty -13.250000000000000000
ST5 empty 64.951911926269531250
ST6 empty 64.951911926269531250
ST7 empty 0.0239995196461677551
3 2 1 0 E S P U O Z D I
FST 2927 Cond 0 0 0 1 Err 0 0 1 0 0 1 1 1 (LT)
FCW 027F Prec NEAR,53 Mask 1 1 1 1 1 1
Any help would be appreciated, Thanks!

FSTP stores a floating point number from the top of the floating-point register stack (ST0) to the designated memory region. Using the DWORD modifier means that a 32-bit float will be written. The P suffix indicates that the floating-point register stack will be popped after the operation.
So, in effect, this instruction puts 1150.0 (as a 32-bit float) at DS:[ESI+1224], then pops the register stack (which causes ST0 = 0.0, ST1 = 0.0, ST2 = <empty>, etc.).

It's storing ST0 (1150.0) in single-precision to your address. And popping said value from the FPU stack.

To add 50 (0x32 being hex for 50):
mov eax, dword[ds:esi+0x1224]
add eax, 0x32
mov dword[ds:esi+0x1224], eax

Related

How to print signed integer in x86 assembly (NASM) on Mac

I found an implementation of unsigned integer conversion in x86 assembly, and I tried plugging it in but being new to assembly and not having a debugging env there yet, it's difficult to understand why it's not working. I would also like it to work with signed integers so it can capture error messages from syscalls.
Wondering if one could show how to fix this code to get the signed integer to print, without using printf but using strprn provided by this answer.
%define a rdi
%define b rsi
%define c rdx
%define d r10
%define e r8
%define f r9
%define i rax
%define EXIT 0x2000001
%define EXIT_STATUS 0
%define READ 0x2000003 ; read
%define WRITE 0x2000004 ; write
%define OPEN 0x2000005 ; open(path, oflag)
%define CLOSE 0x2000006 ; CLOSE
%define MMAP 0x2000197 ; mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t offset)
; szstr computes the lenght of a string.
; rdi - string address
; rdx - contains string length (returned)
strsz:
xor rcx, rcx ; zero rcx
not rcx ; set rcx = -1 (uses bitwise id: ~x = -x-1)
xor al,al ; zero the al register (initialize to NUL)
cld ; clear the direction flag
repnz scasb ; get the string length (dec rcx through NUL)
not rcx ; rev all bits of negative -> absolute value
dec rcx ; -1 to skip the null-term, rcx contains length
mov rdx, rcx ; size returned in rdx, ready to call write
ret
; strprn writes a string to the file descriptor.
; rdi - string address
; rdx - contains string length
strprn:
push rdi ; push string address onto stack
call strsz ; call strsz to get length
pop rsi ; pop string to rsi (source index)
mov rax, WRITE ; put write/stdout number in rax (both 1)
mov rdi, 1 ; set destination index to rax (stdout)
syscall ; call kernel
ret
; mov ebx, 0xCCCCCCCD
itoa:
xor rdi, rdi
call itoal
ret
; itoa loop
itoal:
mov ecx, eax ; save original number
mul ebx ; divide by 10 using agner fog's 'magic number'
shr edx, 3 ;
mov eax, edx ; store quotient for next loop
lea edx, [edx*4 + edx] ; multiply by 10
shl rdi, 8 ; make room for byte
lea edx, [edx*2 - '0'] ; finish *10 and convert to ascii
sub ecx, edx ; subtract from original number to get remainder
lea rdi, [rdi + rcx] ; store next byte
test eax, eax
jnz itoal
exit:
mov a, EXIT_STATUS ; exit status
mov i, EXIT ; exit
syscall
_main:
mov rdi, msg
call strprn
mov ebx, -0xCCCCCCCD
call itoa
call strprn
jmp exit
section .text
msg: db 0xa, " Hello StackOverflow!!!", 0xa, 0xa, 0
With this working it will be possible to properly print signed integers to STDOUT, so you can log the registers values.
https://codereview.stackexchange.com/questions/142842/integer-to-ascii-algorithm-x86-assembly
How to print a string to the terminal in x86-64 assembly (NASM) without syscall?
How do I print an integer in Assembly Level Programming without printf from the c library?
https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.html
How to get length of long strings in x86 assembly to print on assertion

My answer on How do I print an integer in Assembly Level Programming without printf from the c library? which you already linked shows that serializing an integer into memory as ASCII decimal gives you a length, so you have no use for (a custom version of) strlen here.
(Your msg has an assemble-time constant length, so it's silly not to use that.)
To print a signed integer, implement this logic:
if (x < 0) {
print('-'); // or just was_negative = 1
x = -x;
}
unsigned_intprint(x);
Unsigned covers the abs(most_negative_integer) case, e.g. in 8-bit - (-128) overflows to -128 signed. But if you treat the result of that conditional neg as unsigned, it's correct with no overflow for all inputs.
Instead of actually printing a - by itself, just save the fact that the starting number was negative and stick the - in front of the other digits after generating the last one. For bases that aren't powers of 2, the normal algorithm can only generate digits in reverse order of printing,
My x86-64 print integer with syscall answer treats the input as unsigned, so you should simply use that with some sign-handling code around it. It was written for Linux, but replacing the write system call number will make it work on Mac. They have the same calling convention and ABI.
And BTW, xor al,al is strictly worse than xor eax,eax unless you specifically want to preserve the upper 7 bytes of RAX. Only xor-zeroing of full registers is handled efficiently as a zeroing idiom.
Also, repnz scasb is not fast; about 1 compare per clock for large strings.
For strings up to 16 bytes, you can use a single XMM vector with pcmpeqb / pmovmskb / bsf to find the first zero byte, with no loop. (SSE2 is baseline for x86-64).

Assembly program crashes on call or exit

Debugging my code in VS2015, I get to the end of the program. The registers are as they should be, however, on call ExitProcess, or any variation of that, causes an "Access violation writing location 0x00000004." I am utilizing Irvine32.inc from Kip Irvine's book. I have tried using call DumpRegs, but that too throws the error.
I have tried using other variations of call ExitProcess, such as exit and invoke ExitProcess,0 which did not work either, throwing the same error. Before, when I used the same format, the code worked fine. The only difference between this code and the last one is utilizing the general purpose registers.
include Irvine32.inc
.data
;ary dword 100, -30, 25, 14, 35, -92, 82, 134, 193, 99, 0
ary dword -24, 1, -5, 30, 35, 81, 94, 143, 0
.code
main PROC
;ESI will be used for the array
;EDI will be used for the array value
;ESP will be used for the array counting
;EAX will be used for the accumulating sum
;EBX will be used for the average
;ECX will be used for the remainder of avg
;EBP will be used for calculating remaining sum
mov eax,0 ;Set EAX register to 0
mov ebx,0 ;Set EBX register to 0
mov esp,0 ;Set ESP register to 0
mov esi,OFFSET ary ;Set ESI register to array
sum: mov edi,[esi] ;Set value to array value
cmp edi,0 ;Check value to temination value 0
je finsum ;If equal, jump to finsum
add esp,1 ;Add 1 to array count
add eax,edi ;Add value to sum
add esi,4 ;Increment to next address in array
jmp sum ;Loop back to sum array
finsum: mov ebp,eax ;Set remaining sum to the sum
cmp ebp,0 ;Compare rem sum to 0
je finavg ;Jump to finavg if sum is 0
cmp ebp,esp ;Check sum to array count
jl finavg ;Jump to finavg if sum is less than array count
avg: add ebx,1 ;Add to average
sub ebp,esp ;Subtract array count from rem sum
cmp ebp,esp ;Compare rem sum to array count
jge avg ;Jump to avg if rem sum is >= to ary count
finavg: mov ecx,ebp ;Set rem sum to remainder of avg
call ExitProcess
main ENDP
END MAIN
Registers before call ExitProcess
EAX = 00000163 EBX = 0000002C ECX = 00000003 EDX = 00401055
ESI = 004068C0 EDI = 00000000 EIP = 0040366B ESP = 00000008
EBP = 00000003 EFL = 00000293
OV = 0 UP = 0 EI = 1 PL = 1 ZR = 0 AC = 1 PE = 0 CY = 1

mov esp,0 sets the stack pointer to 0. Any stack instructions like push/pop or call/ret will crash after you do that.
Pick a different register for your array-count temporary, not the stack pointer! You have 7 other choices, looks like you still have EDX unused.
In the normal calling convention, only EAX, ECX, and EDX are call-clobbered (so you can use them without preserving the caller's value). But you're calling ExitProcess instead of returning from main, so you can destroy all the registers. But ESP has to be valid when you call.
call works by pushing a return address onto the stack, like sub esp,4 / mov [esp], next_instruction / jmp ExitProcess. See https://www.felixcloutier.com/x86/CALL.html. As your register-dump shows, ESP=8 before the call, which is why it's trying to store to absolute address 4.
Your code has 2 sections: looping over the array and then finding the average. You can reuse a register for different things in the 2 sections, often vastly reducing register pressure. (i.e. you don't run out of registers.)
Using implicit-length arrays (terminated by a sentinel element like 0) is unusual outside of strings. It's much more common to pass a function a pointer + length, instead of just a pointer.
But anyway, you have an implicit-length array so you have to find its length and remember that when calculating the average. Instead of incrementing a size counter inside the loop, you can calculate it from the pointer you're also incrementing. (Or use the counter as an array index like ary[ecx*4], but pointer-increments are often more efficient.)
Here's what an efficient (scalar) implementation might look like. (With SSE2 for SIMD you could add 4 elements with one instruction...)
It only uses 3 registers total. I could have used ECX instead of ESI (so main could ret without having destroyed any of the registers the caller expected it to preserve, only EAX, ECX, and EDX), but I kept ESI for consistency with your version.
.data
;ary dword 100, -30, 25, 14, 35, -92, 82, 134, 193, 99, 0
ary dword -24, 1, -5, 30, 35, 81, 94, 143, 0
.code
main PROC
;; inputs: static ary of signed dword integers
;; outputs: EAX = array average, EDX = remainder of sum/size
;; ESI = array count (in elements)
;; clobbers: none (other than the outputs)
; EAX = sum accumulator
; ESI = array pointer
; EDX = array element temporary
xor eax, eax ; sum = 0
mov esi, OFFSET ary ; incrementing a pointer is usually efficient, vs. ary[ecx*4] inside a loop or something. So this is good.
sumloop: ; do {
mov edx, [esi]
add edx, 4
add eax, edx ; sum += *p++ without checking for 0, because + 0 is a no-op
test edx, edx ; sets FLAGS the same as cmp edx,0
jnz sumloop ; }while(array element != 0);
;;; fall through if the element is 0.
;;; esi points to one past the terminator, i.e. two past the last real element we want to count for the average
sub esi, OFFSET ary + 4 ; (end+4) - (start+4) = array size in bytes
shr esi, 2 ; esi = array length = (end-start)/element_size
cdq ; sign-extend sum into EDX:EAX as an input for idiv
idiv esi ; EAX = sum/length EDX = sum%length
call ExitProcess
main ENDP
I used x86's hardware division instruction, instead of a subtraction loop. Your repeated-subtraction loop looked pretty complicated, but manual signed division can be tricky. I don't see where you're handling the possibility of the sum being negative. If your array had a negative sum, repeated subtraction would make it grow until it overflowed. Or in your case, you're breaking out of the loop if sum < count, which will be true on the first iteration for a negative sum.
Note that comments like Set EAX register to 0 are useless. We already know that from reading mov eax,0. sum = 0 describes the semantic meaning, not the architectural effect. There are some tricky x86 instructions where it does make sense to comment about what it even does in this specific case, but mov isn't one of them.
If you just wanted to do repeated subtraction with the assumption that sum is non-negative to start with, it's as simple as this:
;; UNSIGNED division (or signed with non-negative dividend and positive divisor)
; Inputs: sum(dividend) in EAX, count(divisor) in ECX
; Outputs: quotient in EDX, remainder in EAX (reverse of the DIV instruction)
xor edx, edx ; quotient counter = 0
cmp eax, ecx
jb subloop_end ; the quotient = 0 case
repeat_subtraction: ; do {
inc edx ; quotient++
sub eax, ecx ; dividend -= divisor
cmp eax, ecx
jae repeat_subtraction ; while( dividend >= divisor );
; fall through when eax < ecx (unsigned), leaving EAX = remainder
subloop_end:
Notice how checking for special cases before entering the loop lets us simplify it. See also Why are loops always compiled into "do...while" style (tail jump)?
sub eax, ecx and cmp eax, ecx in the same loop seems redundant: we could just use sub to set flags, and correct for the overshoot.
xor edx, edx ; quotient counter = 0
cmp eax, ecx
jb division_done ; the quotient = 0 case
repeat_subtraction: ; do {
inc edx ; quotient++
sub eax, ecx ; dividend -= divisor
jnc repeat_subtraction ; while( dividend -= divisor doesn't wrap (carry) );
add eax, ecx ; correct for the overshoot
dec edx
division_done:
(But this isn't actually faster in most cases on most modern x86 CPUs; they can run the inc, cmp, and sub in parallel even if the inputs weren't the same. This would maybe help on AMD Bulldozer-family where the integer cores are pretty narrow.)
Obviously repeated subtraction is total garbage for performance with large numbers. It is possible to implement better algorithms, like one-bit-at-a-time long-division, but the idiv instruction is going to be faster for anything except the case where you know the quotient is 0 or 1, so it takes at most 1 subtraction. (div/idiv is pretty slow compared to any other integer operation, but the dedicated hardware is much faster than looping.)
If you do need to implement signed division manually, normally you record the signs, take the unsigned absolute value, then do unsigned division.
e.g. xor eax, ecx / sets dl gives you dl=0 if EAX and ECX had the same sign, or 1 if they were different (and thus the quotient will be negative). (SF is set according to the sign bit of the result, and XOR produces 1 for different inputs, 0 for same inputs.)

Mysterious side-effect of calling printf or not in C before calling an asm function?

This program has to calculate pi with accuracy provided by the user.
The calculate_pi () function is written in NASM.
Can someone explain to me why if this line is commented:
//printf("accuracy: %.15f\n", precision); //<- This line
The program does not work correctly. Send strange numbers to the calcuta_pi() function? If this line is commented, a very small value is sent to the function and the program runs infinitely.
But if it is not a commented program works correctly.
#include <stdio.h>
#include <math.h>
extern double calculate_pi(double precision); /* external function declaration */
double calculate_pi(double precision); /* function prototype */
int main()
{
double precision = 1;
printf("A program that calculates pi, with accuracy provided by the user\n");
printf("Give me accuracy\n");
while(1)
{
if (scanf("%lf", &precision) != 1)
{
printf("reading error\n");
fseek(stdin,0,SEEK_END);
continue;
}
if(precision<0)
precision = fabs(precision);
//printf("accuracy: %.15f\n", precision); //<- This line
printf("pi: %.15f\n", calculate_pi(precision));
}
return 0;
}
This is my assembly code:
; arctg(1)=a
; tg(arctg(1))=tg(a)
; atan(x) = x - x^3/3 + x^5/5 - x^7/7 + x^9/9..
; PI/4 = atan(1) = 1 - 1/3 + 1/5 - 1/7 + 1/9...
; PI = (4/1) - (4/3) + (4/5) - (4/7) + (4/9) - (4/11) + (4/13) - (4/15) ...
section .text use32
global _calculate_pi
_calculate_pi:
%idefine a [ebp+12]
;ramka stosu
push ebp
mov ebp, esp
;ustawianie zmiennych
fld qword [const_wynik]
fstp qword [wynik]
fld qword [const_licznik]
fstp qword [licznik]
fld qword [const_mianownik]
fstp qword [mianownik]
.loop:
finit ; inicjalizacja stosu FPU
fld qword [licznik] ;licznik na stos
fld qword [mianownik] ;mianownik na stos
fdiv ;wynik dzielenia st1/st0
fadd qword [wynik] ;st0 = wynik dzielenia + [wynik]
fstp qword [wynik] ;wywalamy z st0 do [wynik]
;zmieniamy mianownik + 2
fld qword [mianownik] ;mianownik na stos
fadd qword [zwiekszmian] ;st0 = mianownik + 2
fstp qword [mianownik] ;wywalamy z st0 do [mianownik]
;zmieniamy licznik *(-1)
fld qword [licznik] ;licznik na stos
fchs ;st0 = -st0 = -licznik
fstp qword [licznik] ;wywalamy z st0 do [licznik]
;sprawdzanie dokladnosci
fld qword[wynik] ;wynik na stos
fldpi ;pi na stos
fsub ;st0 = wynik-pi = st1 - st0
fabs ;st0 = |wynik-pi|
fld qword a ;st0 = zadana dokladnosc
;(Unordered Compare ST(0) to ST(i) and set CPU flags and Pop ST(0))
;Przyrostek p oznacza obniżenie stosu rejestrów koprocesora, przyrostek i oznacza zapisywanie wyników bezpośrednio do flag procesora a nie flag koprocesora
fucomip st0, st1 ;porownanie z dokladnoscia if(zadana dokladnosc > uzyskana)
jb .loop ;only the C0 bit (CF flag) would be set if no error
fld qword [wynik]
;zwraca to co w st0
leave ; LEAVE = mov esp, ebp / pop ebp
ret
section .data:
wynik dq 4.0
licznik dq -4.0
mianownik dq 3.0
zwiekszmian dq 2.0
const_wynik dq 4.0
const_licznik dq -4.0
const_mianownik dq 3.0
Sample output:
I'm using:
NASM version 2.11.06 compiled on Oct 20 2014
gcc (MinGW.org GCC-6.3.0-1) 6.3.0
Compilation and assembler commands:
nasm -o pi.o -f coff pi.asm
gcc pi.o pi_interface.c -o projekt.exe -Wall -Wextra

I think you're accessing the function arg incorrectly, offset by 4 bytes. When you make a stack frame, the first arg is at [ebp+8], but you're loading from [ebp+12]. (This applies to all calling conventions that pass args on the stack. I think 32-bit mingw does that for double by default.)
This means the double value you're using as precision has its high 4 bytes from whatever the caller happened to leave on the stack above the 8-byte arg slot. This explains why changes in the caller affect the behaviour of your function, and why you can get an infinite loop: if the bytes you load happen to represent a very small double, your loop never exits.
The low 4 bytes (32 least-significant bits of the mantissa) come from the top 4 bytes of what the caller passed.
You would have easily found this with a debugger by looking at registers and noticing the value you load wasn't the value the caller passed. Also, #Ped7g's suggestion to just try returning precision in a trivial asm function would have found the problem, too.

Triple fault on stack in C code [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 3 years ago.
Improve this question
I am writing a toy kernel for learning purposes, and I am having a bit of trouble with it. I have made a simple bootloader that loads a segment from a floppy disk (which is written in 32 bit code), then the bootloader enables the A20 gate and turns on protected mode. I can jump to the 32 bit code fine if I write it in assembler, but if I write it in C, I get a triple fault. When I disassemble the C code I can see that the first two instructions involve setting up a new stack frame. This is the key difference between the working ASM code and the failing C code.
I am using NASM v2.10.05 for the ASM code, and GCC from the DJGPP 4.72 collection for the C code.
This is the bootloader code:
org 7c00h
BITS 16
entry:
mov [drive], dl ;Save the current drive
cli
mov ax,cs ; Setup segment registers
mov ds,ax ; Make DS correct
mov ss,ax ; Make SS correct
mov bp,0fffeh
mov sp,0fffeh ;Setup a temporary stack
sti
;Set video mode to text
;===================
mov ah, 0
mov al, 3
int 10h
;===================
;Set current page to 0
;==================
mov ah, 5
mov al, 0
int 10h
;==================
;Load the sector
;=============
call load_image
;=============
;Clear interrupts
;=============
cli
;=============
;Disable NMIs
;============
in ax, 70h
and ax, 80h ;Set the high bit to 1
out 70h, ax
;============
;Enable A20:
;===========
mov ax, 02401h
int 15h
;===========
;Load the GDT
;===========
lgdt [gdt_pointer]
;===========
;Clear interrupts
;=============
cli
;=============
;Enter protected mode
;==================
mov eax, cr0
or eax, 1 ;Set the low bit to 1
mov cr0, eax
;==================
jmp 08h:clear_pipe ;Far jump to clear the instruction queue
;======================================================
load_image:
reset_drive:
mov ah, 00h
; DL contains *this* drive, given to us by the BIOS
int 13h
jc reset_drive
read_sectors:
mov ah, 02h
mov al, 01h
mov ch, 00h
mov cl, 02h
mov dh, 00h
; DL contains *this* drive, given to us by the BIOS
mov bx, 7E0h
mov es, bx
mov bx, 0
int 13h
jc read_sectors
ret
;======================================================
BITS 32 ;Protected mode now!
clear_pipe:
mov ax, 10h ; Save data segment identifier
mov ds, ax ; Move a valid data segment into the data segment register
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax ; Move a valid data segment into the stack segment register
mov esp, 90000h ; Move the stack pointer to 90000h
mov ebp, esp
jmp 08h:7E00h ;Jump to the kernel proper
;===============================================
;========== GLOBAL DESCRIPTOR TABLE ==========
;===============================================
gdt: ; Address for the GDT
gdt_null: ; Null Segment
dd 0
dd 0
gdt_code: ; Code segment, read/execute, nonconforming
dw 0FFFFh ; LIMIT, low 16 bits
dw 0 ; BASE, low 16 bits
db 0 ; BASE, middle 8 bits
db 10011010b ; ACCESS byte
db 11001111b ; GRANULARITY byte
db 0 ; BASE, low 8 bits
gdt_data: ; Data segment, read/write, expand down
dw 0FFFFh
dw 0
db 0
db 10010010b
db 11001111b
db 0
gdt_end: ; Used to calculate the size of the GDT
gdt_pointer: ; The GDT descriptor
dw gdt_end - gdt - 1 ; Limit (size)
dd gdt ; Address of the GDT
;===============================================
;===============================================
drive: db 00 ;A byte to store the current drive in
times 510-($-$$) db 00
db 055h
db 0AAh
And this is the kernel code:
void main()
{
asm("mov byte ptr [0x8000], 'T'");
asm("mov byte ptr [0x8001], 'e'");
asm("mov byte ptr [0x8002], 's'");
asm("mov byte ptr [0x8003], 't'");
}
The kernel simply inserts those four bytes into memory, which I can check as I am running the code in a VMPlayer virtual machine. If the bytes appear, then I know the code is working. If I write code in ASM that looks like this, then the program works:
org 7E00h
BITS 32
main:
mov byte [8000h], 'T'
mov byte [8001h], 'e'
mov byte [8002h], 's'
mov byte [8003h], 't'
hang:
jmp hang
The only differences are therefore the two stack operations I found in the disassembled C code, which are these:
push ebp
mov ebp, esp
Any help on this matter would be greatly appreciated. I figure I am missing something relatively minor, but crucial, here, as I know this sort of thing is possible to do.

Try using the technique here:
Is there a way to get gcc to output raw binary?
to produce a flat binary of the .text section from your object file.

I'd try moving the address of your protected mode stack to something else - say 0x80000 - which (according to this: http://wiki.osdev.org/Memory_Map_(x86)) is RAM which is guaranteed free for use (as opposed to the address you currently use - 0x90000 - which apparently may not be present - depending a bit on what you're using as a testing environment).

NASM jmp wonkiness

I'm writing a Forth inner interpreter and getting stuck at what should be the simplest bit. Using NASM on Mac (macho)
msg db "k thx bye",0xA ; string with carriage return
len equ $ - msg ; string length in bytes
xt_test:
dw xt_bye ; <- SI Starts Here
dw 0
db 3,'bye'
xt_bye dw $+2 ; <- Should point to...
push dword len ; <-- code here
push dword msg ; <--- but it never gets here
push dword 1
mov eax, 0x4 ; print the msg
int 80h
add esp, 12
push dword 0
mov eax, 0x1 ; exit(0)
int 80h
_main:
mov si,xt_test ; si points to the first xt
lodsw ; ax now points to the CFA of the first word, si to the next word
mov di,ax
jmp [di] ; jmp to address in CFA (Here's the segfault)
I get Segmentation Fault: 11 when it runs. As a test, I can change _main to
_main:
mov di,xt_bye+2
jmp di
and it works
EDIT - Here's the simplest possible form of what I'm trying to do, since I think there are a few red herrings up there :)
a dw b
b dw c
c jmp _my_actual_code
_main:
mov si,a
lodsw
mov di,ax
jmp [di]
EDIT - After hexdumping the binary, I can see that the value in b above is actually 0x1000 higher than the address where label c is compiled. c is at 0x00000f43, but b contains 0x1f40

First, it looks extremely dangerous to use the 'si' and 'di' 16-bit registers on a modern x86 machine which is at least 32-bit.
Try using 'esi' and 'edi'. You might be lucky to avoid some of the crashes when 'xt_bye' is not larger than 2^16.
The other thing: there is no 'RET' at the end of xt_bye.
One more: see this linked question Help with Assembly. Segmentation fault when compiling samples on Mac OS X
Looks like you're changing the ESP register to much and it becomes unaligned by 16 bytes. Thus the crash.
One more: the
jmp [di]
may not load the correct address because the DS/ES regs are not used thus the 0x1000 offset.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio