Multicore in NASM Windows: threads execute randomly - windows

I have code in NASM (64 bit) in Windows to run four simultaneous threads (each assigned to a separate core) on a four-core Windows x86-64 machine.
The threads are created in a loop. After thread creation, it calls WaitForMultipleObjects to coordinate the threads. The function to call is Test_Function (see code below).
Each thread (core) executes Test_Function across a large array. The first core starts at data element zero, the second core starts at 1, the third core starts at 2, the fourth core starts at 3, and each core increments by four (e.g., 0, 4, 8, 12).
In Test_Function I created a small test program that writes one of the input data values to the location corresponding to its startbyte, to verify that I have successfully created four threads and they return the correct data.
Each thread should write the stride value (32), but the test shows that the four fields are filled in randomly, with some fields showing as zero. If I repeat the test multiple times, I see there is no consistency to which fields will have the value 32 (the others always show as 0). That could be a side effect of WaitForMultipleObjects, but I haven't seen anything in the docs to confirm that.
Also, WaitForMultipleObjects waits on the ThreadHandles returned by CreateThread; when I examine the ThreadHandles array, it always shows like this: 268444374, 32, 1652, 1584. Only the first element looks like the size of a handle, the others do not look like handle values.
One possibility is that the two parameters passed on the stack may not be in the correct locations:
mov rax,0
mov [rsp+40],rax ; use default creation flags
mov rax,[ThreadCount]
mov [rsp+32],rax ; ThreadID
According to the docs, ThreadCount should be a pointer. When I change the line to mov rax,ThreadCount (the pointer value), the program crashes. When I change it to:
mov rax,0
mov [rsp+32],rax ; use default creation flags
mov rax,ThreadCount
mov [rsp+40],rax ; ThreadID
now it reliably processes the first thread, but not threads 2-4.
So the bottom line is the threads are being created but they execute randomly, with some threads not executing at all, in no particular order. When I change the CreateThread parameters (as shown above) the first thread executes, but not threads 2-4.
Here is the test code showing the relevant parts. If a reproducible example is needed, I can prepare one.
Thanks for any ideas.
Init_Cores_fn:
; EACH OF THE CORES CALLS Test_Function AND EXECUTES THE WHOLE PROGRAM.
; WE PASS THE STARTING BYTE (0, 8, 16, 24) AND THE "STRIDE" = NUMBER OF CORES.
; ON RETURN, WE SYNCHRONIZE ANY DATA. ON ENTRY TO EACH CORE, SET THE REGISTERS
; Populate the ThreadInfo array with vars to pass
; ThreadInfo: length, startbyte, stride, vars into registers on entry to each core
mov rdi,ThreadInfo
mov rax,ThreadInfoLength
mov [rdi],rax
mov rax,[stride]
mov [rdi+16],rax ; 8 x number of cores (32 in this example)
; Register Vars
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
mov rbp,rsp ; preserve caller's stack frame
sub rsp,56 ; Shadow space
; _____
label_0:
mov rdi,ThreadInfo
mov rax,[FirstByte]
mov [rdi+8],rax ; 0, 8, 16, or 24
; _____
; Create Threads
mov rcx,0 ; lpThreadAttributes (Security Attributes)
mov rdx,0 ; dwStackSize
mov r8,Test_Function ; lpStartAddress (function pointer)
mov r9,ThreadInfo ; lpParameter (array of data passed to each core)
mov rax,0
mov [rsp+40],rax ; use default creation flags
mov rax,[ThreadCount]
mov [rsp+32],rax ; ThreadID
call CreateThread
; Move the handle into ThreadHandles array (returned in rax)
mov rdi,ThreadHandles
mov rcx,[FirstByte]
mov [rdi+rcx],rax
mov rax,[FirstByte]
add rax,8
mov [FirstByte],rax
mov rax,[ThreadCount]
add rax,1
mov [ThreadCount],rax
mov rbx,4
cmp rax,rbx
jl label_0
; _____
; Wait
mov rcx,rax ; number of handles
mov rdx,ThreadHandles ; pointer to handles array
mov r8,1 ; wait for all threads to complete
mov r9,1000 ; milliseconds to wait
call WaitForMultipleObjects
; _____
;[ Code HERE to do cleanup if needed after the four threads finish ]
mov rsp,rbp
jmp label_900
; __________________
; The function for all threads to call
Test_Function:
; Populate registers
mov rdi,rcx
mov rax,[rdi]
mov r15,[rdi+24]
mov rax,[rdi+8] ; start byte
mov r13,[rdi+40]
mov r12,[rdi+48]
mov r10,[rdi+56]
xor r11,r11
xor r9,r9
pxor xmm15,xmm15
pxor xmm15,xmm14
pxor xmm15,xmm13
; Now test it - BUT the first thread does not write data
mov rcx,[rdi+8] ; start byte
mov rax,[rdi+16] ; stride
cvtsi2sd xmm0,rax
movsd [r15+rcx],xmm0
ret

I solved this problem, and here is the solution. Raymond Chen alluded to this in the comments above before urging me to use a higher level language, but I didn't understand it until today. I am posting this answer so it's easily accessible and understood by anyone who has the same problem in assembly language (or any other language) in the future because Raymond's comment (which I just upvoted) is now buried in the other comments above.
The ThreadInfo array, passed here as the fourth parameter to CreateThread (in r9 for Windows). Each core must have its own separate copy of ThreadInfo. In my application, the data in ThreadInfo are all the same except for the StartByte parameter (at rdi+8). Instead, I created a separate ThreadInfo array for each core (ThreadInfo1, 2, 3, and 4) and pass a pointer to the corresponding ThreadInfo array.
I implemented it in my application as a call to the following dup function but it could be implemented other ways as well:
DupThreadInfo:
mov rdi,ThreadInfo2
mov rax,8
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
; _____
mov rdi,ThreadInfo3
mov rax,0
mov [rdi],rax ; length (number of vars into registers plus 3 elements)
mov rax,16
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
mov rdi,ThreadInfo4
mov rax,0
mov [rdi],rax ; length (number of vars into registers plus 3 elements)
mov rax,24
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
ret
Because all data in the ThreadInfo arrays are the same except the second element, a more efficient way to do this would be to pass a 2-element array where the first element is the StartByte and the second element is a pointer to the static ThreadInfo array. That's especially important when we are working with more than four cores because the DupThreadInfo section would be needlessly long. That solution would avoid a call, but I haven't implemented that yet.

Related

Calling FormatMessageA from a 64 bit assembler program

I've been working for days on an assembler subroutine to call Windows API function FormatMessageA, and I think I must have some systematic misunderstanding. My routine is shown below. I have traced it in debug mode, and I know that at the time the function is called: rcx has hex 1B00, which is the flag values I specified (dwFlags); rdx has hex 33, which is the bogus initial handle value that I plugged in to provoke the error (lpSource); r8 has 6, which is the error message number that was returned by GetLastError, which equates to ERROR_INVALID_HANDLE (dwMessageId); r9 has 0, for default language (dwLanguageId); rsp+32 has the address of msgptrp, which is my area to receive the address of the error message as allocated by the called function (lpBuffer); rsp+40 has hex 28, or decimal 40, which is the minimum number of characters I arbitrarily specified for the error message (nSize); and rsp+48 has the address of argmntsp, which according to the documentation should be ignored with the flags I specified (*Arguments).
If there's a problem, and obviously there is since rax is returned as zero, I suspect that it has to do with lpBuffer, which the documentation says has data type LPTSTR, whatever that is. Sometimes I want to shout, "Okay, that's what it is in terms of C++, but what is it really? I hope someone can easily spot where I ran off the rails, because I'm at my wits' end with this, and it's something I need to get working in order to do effective error checking for my future endeavors.
goterr PROC
;
.data
; flag values used:
; hex 00000100 FORMAT_MESSAGE_ALLOCATE_BUFFER
; hex 00000200 FORMAT_MESSAGE_IGNORE_INSERTS
; hex 00000800 FORMAT_MESSAGE_FROM_HMODULE
; hex 00001000 FORMAT_MESSAGE_FROM_SYSTEM
flagsp dd 00001B00h ; those flags combined
saveinitp dq ?
bmaskp dq 0fffffffffffffff0h
savshadp dq ?
msgptrp dq ?
handlep dd ?
argmntsp dd ?
;
.code
mov saveinitp, rsp ; save initial contents of stack pointer in this proc
sub rsp, 56 ; shadow space (max of 7 parameters * 8)
and rsp, bmaskp ; make sure it's 16 byte aligned
mov savshadp, rsp ; save address of aligned shadow area for this proc
;
mov handlep, ecx ; save passed I/O handle value locally
;
call GetLastError ; get the specific error
;
mov rsp, savshadp ; shadow area for this proc
mov ecx, flagsp ; flags for Windows error routine
mov edx, handlep ; handle passed from caller
mov r8d, eax ; msg id from GetLastError in low order dword
mov r9, 0 ; default language id
lea rax, msgptrp ; pointer to receive address of msg buffer
mov [rsp+32], rax ; put it on the stack
mov rax, 40 ; set lower doubleword to minimum size for msg buffer
mov [rsp+40], rax ; put it on the stack
lea rax, argmntsp ; variable arguments parameter (not used)
mov [rsp+48], rax ; put it on the stack
call FormatMessageA ; if rax eq 0 the called failed, otherwise rax is no chars.
mov rsp, saveinitp ; restore initial SP for this proc
ret
goterr ENDP

InitializeCriticalSection fails in NASM

UPDATE: based on comments below, I revised the code below to add a struc and a pointer (new or revised code has "THIS IS NEW" or "THIS IS UPDATED" beside the code). Now the program does not crash, so the pointer is initialized, but the programs hangs at EnterCriticalSection. I suspect that in translating the sample MASM code below into NASM syntax, I did not declare the struc correctly. Any ideas? Thanks very much.
ORIGINAL QUESTION:
Below is a simple test program in 64-bit NASM, to test a critical section in
Windows. This is a dll and the entry point is Main_Entry_fn, which calls Init_Cores_fn, where we initialize four threads (cores) to call Test_fn.
I suspect that the problem is the pointer to the critical section. None of the online resources specifies what that pointer is. The doc "Using Critical Section Objects" at https://learn.microsoft.com/en-us/windows/desktop/sync/using-critical-section-objects shows a C++ example where the pointer appears to be relevant only to EnterCriticalSection and LeaveCriticalSection, but it's not a pointer to an independent object.
For those not familiar with NASM, the first parameter in a C++ signature goes into rcx and the second parameter goes into rds, but otherwise it should function the same as in C or C++. It's the same thing as InitializeCriticalSectionAndSpinCount(&CriticalSection,0x00000400) in C++.
Here's the entire program:
; Header Section
[BITS 64]
[default rel]
extern malloc, calloc, realloc, free
global Main_Entry_fn
export Main_Entry_fn
extern CreateThread, CloseHandle, ExitThread
extern WaitForMultipleObjects, GetCurrentThreadId
extern InitializeCriticalSectionAndSpinCount, EnterCriticalSection
extern LeaveCriticalSection, DeleteCriticalSection, InitializeCriticalSection
struc CRITICAL_SECTION ; THIS IS NEW
.cs_quad: resq 5
endstruc
section .data align=16
const_1000000000: dq 1000000000
ThreadID: dq 0
TestInfo: times 20 dq 0
ThreadInfo: times 3 dq 0
ThreadInfo2: times 3 dq 0
ThreadInfo3: times 3 dq 0
ThreadInfo4: times 3 dq 0
ThreadHandles: times 4 dq 0
Division_Size: dq 0
Start_Byte: dq 0
End_Byte: dq 0
Return_Data_Array: times 4 dq 0
Core_Number: dq 0
const_inf: dq 0xFFFFFFFF
SpinCount: dq 0x00000400
CriticalSection: ; THIS IS NEW
istruc CRITICAL_SECTION
iend
section .text
; ______________________________________
Init_Cores_fn:
; Calculate the data divisions
mov rax,[const_1000000000]
mov rbx,4 ;cores
xor rdx,rdx
div rbx
mov [End_Byte],rax
mov [Division_Size],rax
mov rax,0
mov [Start_Byte],rax
; Populate the ThreadInfo arrays to pass for each core
; ThreadInfo: (1) startbyte; (2) endbyte; (3) Core_Number
mov rdi,ThreadInfo
mov rax,[Start_Byte]
mov [rdi],rax
mov rax,[End_Byte]
mov [rdi+8],rax
mov rax,[Core_Number]
mov [rdi+16],rax
call DupThreadInfo ; Create ThreadInfo arrays for cores 2-4
mov rbp,rsp ; preserve caller's stack frame
sub rsp,56 ; Shadow space (was 32)
; _____
; Create four threads
label_0:
mov rax,[Core_Number]
cmp rax,0
jne sb2
mov rdi,ThreadInfo
jmp sb5
sb2:cmp rax,8
jne sb3
mov rdi,ThreadInfo2
jmp sb5
sb3:cmp rax,16
jne sb4
mov rdi,ThreadInfo3
jmp sb5
sb4:cmp rax,24
jne sb5
mov rdi,ThreadInfo4
sb5:
; _____
; Create Threads
mov rcx,0 ; lpThreadAttributes (Security Attributes)
mov rdx,0 ; dwStackSize
mov r8,Test_fn ; lpStartAddress (function pointer)
mov r9,rdi ; lpParameter (array of data passed to each core)
mov rax,0
mov [rsp+32],rax ; use default creation flags
mov rdi,ThreadID
mov [rsp+40],rdi ; ThreadID
call CreateThread
; Move the handle into ThreadHandles array (returned in rax)
mov rdi,ThreadHandles
mov rcx,[Core_Number]
mov [rdi+rcx],rax
mov rdi,TestInfo
mov [rdi+rcx],rax
mov rax,[Core_Number]
add rax,8
mov [Core_Number],rax
mov rbx,32 ; Four cores
cmp rax,rbx
jl label_0
mov rcx,CriticalSection ; THIS IS REVISED
mov rdx,[SpinCount]
call InitializeCriticalSectionAndSpinCount
; _____
; Wait
mov rcx,4 ;rax ; number of handles
mov rdx,ThreadHandles ; pointer to handles array
mov r8,1 ; wait for all threads to complete
mov r9,[const_inf] ;4294967295 ;0xFFFFFFFF
call WaitForMultipleObjects
; _____
mov rsp,rbp ; can we push rbp so we can use it internally?
jmp label_900
; ______________________________________
Test_fn:
mov rdi,rcx
mov r14,[rdi] ; Start_Byte
mov r15,[rdi+8] ; End_Byte
mov r13,[rdi+16] ; Core_Number
;______
; while(n < 1000000000)
label_401:
cmp r14,r15
jge label_899
mov rcx,CriticalSection
call EnterCriticalSection
; n += 1
add r14,1
mov rcx,CriticalSection
call LeaveCriticalSection
jmp label_401
;______
label_899:
mov rdi,Return_Data_Array
mov [rdi+r13],r14
mov rbp,ThreadHandles
mov rax,[rbp+r13]
call ExitThread
ret
; __________
label_900:
mov rcx,CriticalSection
call DeleteCriticalSection
mov rdi,Return_Data_Array
mov rax,rdi
ret
; __________
; Main Entry
Main_Entry_fn:
push rdi
push rbp
call Init_Cores_fn
pop rbp
pop rdi
ret
DupThreadInfo:
mov rdi,ThreadInfo2
mov rax,8
mov [rdi+16],rax ; Core Number
mov rax,[Start_Byte]
add rax,[Division_Size]
mov [rdi],rax
mov rax,[End_Byte]
add rax,[Division_Size]
mov [rdi+8],rax
mov [Start_Byte],rax
mov rdi,ThreadInfo3
mov rax,16
mov [rdi+16],rax ; Core Number
mov rax,[Start_Byte]
mov [rdi],rax
add rax,[Division_Size]
mov [rdi+8],rax
mov [Start_Byte],rax
mov rdi,ThreadInfo4
mov rax,24
mov [rdi+16],rax ; Core Number
mov rax,[Start_Byte]
mov [rdi],rax
add rax,[Division_Size]
mov [rdi+8],rax
mov [Start_Byte],rax
ret
The code above shows the functions in three separate places, but of course we test them one at a time (but they all fail).
To summarize, my question is why do InitializeCriticalSection and InitializeCriticalSectionAndSpinCount both fail in the code above? The inputs are dead simple, so I don't understand why it should not work.
InitializeCriticalSection take pointer to critical section object
The process is responsible for allocating the memory used by a
critical section object, which it can do by declaring a variable of
type CRITICAL_SECTION.
so code can be something like (i use masm syntax)
CRITICAL_SECTION STRUCT
DQ 5 DUP(?)
CRITICAL_SECTION ends
extern __imp_InitializeCriticalSection:QWORD
extern __imp_InitializeCriticalSectionAndSpinCount:QWORD
.DATA?
CriticalSection CRITICAL_SECTION {}
.CODE
lea rcx,CriticalSection
;mov edx,400h
;call __imp_InitializeCriticalSectionAndSpinCount
call __imp_InitializeCriticalSection
also you need declare all imported functions as
extern __imp_funcname:QWORD
instead
extern funcname

Multicore in NASM Windows: lpParameter data are wrong on entry

I have code in NASM (64 bit) in Windows to run four simultaneous threads (each assigned to a separate core) on a four-core Windows x86-64 machine.
The lpParameter is passed in r9 (the data variables for each thread to pass to the function). I am passing a pointer to an 8-element internal array (ThreadInfo) which contains variables to put into registers on entry to the function (variables are stored in registers for optimization purposes). All four threads call the same function.
The problem is that on entry to the function, the data passed in the ThreadInfo array are all large (e.g. 128-bit) negative numbers. The array contains 64-bit integers. The array is read-only for each of the threads, so no thread can alter the memory. I don't understand why the values should be wrong. I tested them in the thread creation code and they are correct, but they return incorrect values when read from the called function.
Here is the test code showing the relevant parts. Showing a reproducible example may not be possible due to the register values, but if this question can't be answered without that, then I can see if it's feasible to post a completely reproducible program.
Init_Cores_fn:
; EACH OF THE CORES CALLS Test_Function AND EXECUTES THE WHOLE PROGRAM.
; WE PASS THE STARTING BYTE (0, 8, 16, 24) AND THE "STRIDE" = NUMBER OF CORES.
; ON RETURN, WE SYNCHRONIZE ANY DATA. ON ENTRY TO EACH CORE, SET THE REGISTERS
; Populate the ThreadInfo array with vars to pass
; ThreadInfo: length, startbyte, stride, vars into registers on entry to each core
mov rdi,ThreadInfo
mov rax,ThreadInfoLength
mov [rdi],rax
mov rax,[stride]
mov [rdi+16],rax ; 8 x number of cores (32 in this example)
; Register Vars
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
mov rbp,rsp ; preserve caller's stack frame
sub rsp,56 ; Shadow space
; _____
label_0:
mov rdi,ThreadInfo
mov rax,[FirstByte]
mov [rdi+8],rax ; 0, 8, 16, or 24
; _____
; Create Threads
mov rcx,0 ; lpThreadAttributes (Security Attributes)
mov rdx,0 ; dwStackSize
mov r8,Test_Function ; lpStartAddress (function pointer)
mov r9,ThreadInfo ; lpParameter (array of data passed to each core)
mov rax,0
mov [rsp+40],rax ; use default creation flags
mov rax,[ThreadCount]
mov [rsp+32],rax ; ThreadID
call CreateThread
; Move the handle into ThreadHandles array (returned in rax)
mov rdi,ThreadHandles
mov rcx,[FirstByte]
mov [rdi+rcx],rax
mov rax,[FirstByte]
add rax,8
mov [FirstByte],rax
mov rax,[ThreadCount]
add rax,1
mov [ThreadCount],rax
mov rbx,4
cmp rax,rbx
jl label_0
; _____
; Wait
mov rcx,rax ; number of handles
mov rdx,ThreadHandles ; pointer to handles array
mov r8,1 ; wait for all threads to complete
mov r9,1000 ; milliseconds to wait
call WaitForMultipleObjects
; _____
;[ Code HERE to do cleanup if needed after the four threads finish ]
mov rsp,rbp
jmp label_900
; __________________
; The function for all threads to call
Test_Function:
; Populate registers
mov rdi,r9
mov rax,[rdi]
mov r15,[rdi+24]
mov rax,[rdi+8] ; start byte
mov r13,[rdi+40]
mov r12,[rdi+48]
mov r10,[rdi+56]
xor r11,r11
xor r9,r9
pxor xmm15,xmm15
pxor xmm15,xmm14
pxor xmm15,xmm13
The pointer is passed in r9, but on entry to Test_Function, the values are all large (e.g., 128-bit) negative values. Why?
Thanks for any help on this.

Multicore in Windows: where to go after all threads are finished?

I am writing code in NASM (64 bit) in Windows to run four simultaneous threads (each assigned to a separate core) on a four-core Windows x86-64 machine. Below is my thread instantiation code. Each thread calls the same program.
The code is shown below in the exact sequence it appears in the source file (without the .data section, and some code abbreviated). Each thread (core) executes Test_Function across a large array. The first core starts at data element zero, the second core starts at 1, the third core starts at 2, the fourth core starts at 3, and each core increments by four (e.g., 0, 4, 8, 12). When finished, Test_Function exits at label_899. They write their results to the corresponding locations in a shared array created with malloc.
My question is: where do I go after all threads are finished at label_899? Do the threads then return to the final line of ThreadStart (after jl label_0), where cleanup is performed? Test_Function exits with ret, so I assume that returns control to the thread instantiation section (ThreadStart).
Unfortunately, all of the information I have read on this focuses on the creation of a single thread. Some resources talk about thread synchronization while running, but I have found nothing directly relevant about what happens after all threads are finished.
Thanks for any help. I would post a complete example, but I don't know yet how to complete it, which is my question.
ThreadStart:
mov rbp,rsp ; preserve caller's stack frame
mov rdi,ThreadInfo ; two element array
sub rsp,32 ; Shadow space
label_0:
mov rax,StartByte
mov [rdi],rax ; 0, 8, 16, or 24
mov rax,[JumpCount]
mov [rdi+8],rax ; number of cores (4 in this example)
push 0 ; Creation flags, start immediately (may want to delay until all created)
push 0 ; ThreadID
mov rcx,0 ; lpThreadAttributes (Security Attributes)
mov rdx,0 ; dwStackSize
mov r8,Test_Function ; lpStartAddress (function pointer)
mov r9,rdi ; lpParameter
call CreateThread
mov rax,8
add [StartByte],rax
mov rax,[TreadCount]
add rax,1
mov [TreadCount],rax
cmp rax,4
jl label_0
[ Code to do cleanup if needed after the four threads finish ]
mov rsp,rbp
jmp final_out
;______
Test_Function:
xor rcx,rcx
mov [loop_counter_401],rcx
label_401:
lea rdi,[rel numbers_ptr]
mov rbp,qword [rdi] ; Pointer
mov rcx,[loop_counter_401]
mov rax,[numbers_length]
cmp rcx,rax
jge label_899
movsd xmm0,qword[rbp+rcx]
movsd [num_float],xmm0
add rcx,32 ;8 -- add 32 for four loops, not 8
mov [loop_counter_401],rcx
[ MORE CODE]
jmp label_401
label_899:
ret
;______
final_out:
mov rdi,Return_Pointer_Array
mov rax,r15
mov [rdi+0],rax
mov rax,r14
mov [rdi+8],rax
mov rax,rdi
ret
Update 030119
Following is the code that creates threads as it is now. Earlier I though it failed at CreateThread, but it does not. It fails now due to a problem with the code I call, and I am debugging it now. I will post back with my results. The failure may be in how I wrote the code for WaitForMultipleObjects.
Shown is the thread instantiation code, not the code called by the threads.
mov rdi,ThreadInfo
mov rax,StartByte
mov [rdi+8],rax ; 0, 8, 16, or 24
; _____
; Create Threads
mov rcx,0 ; lpThreadAttributes (Security Attributes)
mov rdx,0 ; dwStackSize
mov r8,Test_Function ; lpStartAddress (function pointer)
mov rax,rdi
mov r9,rax ; lpParameter (array of data passed to each core)
mov rax,0
mov [rsp+40],rax ; use default creation flags
mov rax,[ThreadCount]
mov [rsp+32],rax ; ThreadID
call CreateThread
; Move the handle into ThreadHandles array (returned in rax)
mov rdi,ThreadHandles
mov rcx,[StartByte]
mov [rdi+rcx],rax
mov rax,8
add [StartByte],rax
mov rax,[ThreadCount]
add rax,1
mov [ThreadCount],rax
cmp rax,4
jl label_0
; _____
; Wait
mov rcx,rax ; number of handles
mov rdx,ThreadHandles ; pointer to handles array
mov r8,1 ; wait for all threads to complete
mov r9,1000 ; milliseconds to wait
call WaitForMultipleObjects
; _____
;[ Code to do cleanup if needed after the four threads finish ]
mov rsp,rbp
jmp label_900

Access violation when trying to modify a byte in assembly procedure under NASM Windows

I'm totally new to assembly and I've been coding for about 2 weeks in Linux and able to output certain information on the terminal under Linux. However, today when I began coding under Windows, I ran into a problem, i'm unable to modify a byte in a string that I pass as a parameter to a function. I debugged the program in OllyDbg and it tells me that there's an access violation when writing to [address] - use Shift F7/F8/F9 to pass exception to program. I can't pass the exception to the program since there's no exception handler in the program. Is it illegal to modify a character in a string passed to a procedure? I have posted the code below.
section .bss
var resb 6
section .text
global _start
_start:
xor eax, eax ; empty all registers
xor ebx, ebx
xor ecx, ecx
xor edx, edx
jmp short get_library
get_lib_addr: ; get_lib_addr(char *name)
mov ecx, [esp+8] ; get the first parameter
xor edx, edx ; zero out register
mov byte [ecx+6], dl ; add null terminating character to string
jmp short fin ; go to end
get_library:
xor ecx, ecx
mov ebx, var ; start address of var
jmp start_loop ; start looping
start_loop:
cmp ecx, 5
jge end_loop
mov byte [ebx+ecx], 'h'
inc ecx
jmp start_loop
end_loop:
push dword var ; at this point - var = "hhhhh"
call get_lib_addr ; call get_lib_addr("hhhhh")
fin:
xor eax, eax
mov ebx, 0x750b4b80 ; ExitProcess
push eax ; eax = 0
call ebx ; ExitProcess(0)
And a screenshot of the debugging in OllyDbg.
https://imgur.com/a/48RAXOw
2nd case
Part of the code from Vivid Machines Shellcoding for Linux & Windows is copied below. When I try to use this syntax for passing parameters (assuming that's what is happening), I get an access violation trying to edit the string and adding a null terminating character in place of the N. I'm positive that the string is passed and ecx gets the correct string. Any reason this doesn't work?
LibraryReturn:
pop ecx ;get the library string
mov [ecx + 10], dl ;insert NULL - ***Access violation here***
;the N at the end of each string signifies the location of the NULL
;character that needs to be inserted
GetLibrary:
call LibraryReturn
db 'user32.dllN'
Screenshot of the debug information in the second scenario showing that N is indeed the value being edited.
https://imgur.com/a/W5vdR7d

Resources