What does MOV EAX,DWORD PTR DS:[ESI+EBP*8] do? - debugging

When I step through a program in OllyDbg I see
MOV EAX,DWORD PTR DS:[ESI+EBP*8]
with the registers ESI = 0040855C and EBP = 00000000.
My problem is that I don't understand what the two registers and the * 8 inside the brackets mean.

MOV EAX,DWORD PTR DS:[ESI+EBP*8]
MOV - move
EAX - into EAX (the destination operand)
DWORD PTR - a 32-bit value read from the memory pointed at by
DS: - in the data segment
[ESI+EBP*8] - ESI plus 8 times EBP.
Move the 32-bit value stored at the address ESI + EBP*8 (ESI plus 8 times EBP, it means exactly how it's written) into EAX.
This is probably being used to load data from an array, where the 8 scales the counter (which is EBP) up to the size of the thing being stored (8 bytes), and ESI contains the address of the start of the array. So if EBP is zero, you load the data from ESI+0; if EBP=1, you end up loading from ESI+8, etc. With the values in the question (ESI = 0040855C, EBP = 0), the instruction reads the dword at 0040855C.

In normal INTEL syntax this instruction moves a value from memory into EAX.
MOV EAX,DWORD PTR DS:[ESI+EBP*8]
It is usually used to extract a value from an array.
The array is situated in memory at DS:ESI.
The elements are indexed through EBP.
A scale of 8 suggests that every element is 64 bits (8 bytes) long, in which case this instruction reads only the low dword of each element.

Related

Why does a function double dereference arguments stored on stack and how is that possible? [duplicate]

This question already has answers here:
Basic use of immediates vs. square brackets in YASM/NASM x86 assembly
(4 answers)
x86 Nasm assembly - push'ing db vars on stack - how is the size known?
(2 answers)
Referencing the contents of a memory location. (x86 addressing modes)
(2 answers)
Why do you have to dereference the label of data to store something in there: Assembly 8086 FASM
(1 answer)
Closed 7 months ago.
I am trying to understand how "lfunc" loads its stack arguments into "flist" in the following assembly code, which I found in a book. (The book doesn't explain it. The code compiles and runs without errors and gives the intended output, "The string is: ABCDEFGHIJ".) However, I can't grasp the legality or logic of the code. What I don't understand is listed below.
In lfunc:
The non-volatile (as per the Microsoft x64 calling convention) register RBX is not backed up before it is XORed. (But that is not what bugs me most.)
In the portion ";arguments on stack"
mov rax, qword [rbp+8+8+32]
mov bl,[rax]
Here [rbp+8+8+32] dereferences the corresponding address on the stack, so as I understand it RAX should be loaded with the value of 'fourth', which is the char 'D' (0x44). (Why qword?) And if so, what can dereferencing the char 'D' in the second line possibly mean? (There should be a memory address to dereference, but 'D' is a char.)
The original code is listed below:
%include "io64.inc"
; stack.asm
extern printf
section .data
first db "A"
second db "B"
third db "C"
fourth db "D"
fifth db "E"
sixth db "F"
seventh db "G"
eighth db "H"
ninth db "I"
tenth db "J"
fmt db "The string is: %s",10,0
section .bss
flist resb 14 ;length of string plus end 0
section .text
global main
main:
push rbp
mov rbp,rsp
sub rsp, 8
mov rcx, flist
mov rdx, first
mov r8, second
mov r9, third
push tenth ; now start pushing in
push ninth ; reverse order
push eighth
push seventh
push sixth
push fifth
push fourth
sub rsp,32 ; shadow
call lfunc
add rsp,32+8
; print the result
mov rcx, fmt
mov rdx, flist
sub rsp,32+8
call printf
add rsp,32+8
leave
ret
;––––––––––––––––––––––––-
lfunc:
push rbp
mov rbp,rsp
xor rax,rax ;clear rax (especially higher bits)
;arguments in registers
mov al,byte[rdx] ; move content argument to al
mov [rcx], al ; store al to memory (reserved in section .bss)
mov al, byte[r8]
mov [rcx+1], al
mov al, byte[r9]
mov [rcx+2], al
;arguments on stack
xor rbx,rbx
mov rax, qword [rbp+8+8+32] ; skip saved rbp (8) + return address (8) + shadow space (32)
mov bl,[rax]
mov [rcx+3], bl
mov rax, qword [rbp+48+8]
mov bl,[rax]
mov [rcx+4], bl
mov rax, qword [rbp+48+16]
mov bl,[rax]
mov [rcx+5], bl
mov rax, qword [rbp+48+24]
mov bl,[rax]
mov [rcx+6], bl
mov rax, qword [rbp+48+32]
mov bl,[rax]
mov [rcx+7], bl
mov rax, qword [rbp+48+40]
mov bl,[rax]
mov [rcx+8], bl
mov rax, qword [rbp+48+48]
mov bl,[rax]
mov [rcx+9], bl
mov bl,0 ; terminating zero
mov [rcx+10], bl
leave
ret
Additional info:
I cannot look at register values just after line 50, which corresponds to "XOR RAX, RAX" in lfunc, because the debugger skips single stepping and jumps to line 37 of the main function, which corresponds to "add RSP, 32+8". Even if I set breakpoints between those lines in the lfunc code, the debugger simply hangs, so I have to abort debugging manually.
In portion ";arguments on stack"
mov rax, qword [rbp+8+8+32]
mov bl,[rax]
I am mentioning this again to be more precise about what I am asking, because the question was marked as a duplicate and the linked answers don't address my specific issue. As I see it, [rbp+8+8+32] == 0x44, because mov with square brackets clearly dereferences the address rbp+30h (which I assume is 64 bits wide). So the size of 0x44 is a byte. That is why I ask "Why qword?": it seems to imply "lea [rbp+8+8+32]", which would yield a qword reference, rather than mov. So if [rbp+8+8+32] equals 0x44, then [rax] == [0x0000000000000044], which is a garbage address (not relevant to our code here).

How to replace a store of EAX with a store of an immediate constant?

In my previous question, I asked how to change the nation code to what I needed it to be. I explored the disassembly further and found exactly where the change needs to be made. In other files, the code is:
mov ds:dword_73A9C8, 1
whereas the file I'm trying to edit has
mov ds:dword_73A9C8, eax
I've tried hex-editing the file in IDA to match the first line of code; however, even after extending the function's length, the function seems to break each time I edit it.
My question is: how can I change it from moving eax to moving 1 without breaking the function?
sub_4A2B60 proc near
arg_0= dword ptr 4
mov eax, [esp+arg_0]
mov ds:dword_73A9C8, eax
retn
sub_4A2B60 endp
You could replace the 4-byte instruction mov eax, [esp + 4] with the sequence xor eax, eax / inc eax / nop, which is also 4 bytes long.
If 1 is what you want, then the return value in EAX should probably also be 1.

Multicore in NASM Windows: threads execute randomly

I have code in NASM (64 bit) in Windows to run four simultaneous threads (each assigned to a separate core) on a four-core Windows x86-64 machine.
The threads are created in a loop. After thread creation, it calls WaitForMultipleObjects to coordinate the threads. The function to call is Test_Function (see code below).
Each thread (core) executes Test_Function across a large array. The first core starts at data element zero, the second core starts at 1, the third core starts at 2, the fourth core starts at 3, and each core increments by four (e.g., 0, 4, 8, 12).
In Test_Function I created a small test program that writes one of the input data values to the location corresponding to its startbyte, to verify that I have successfully created four threads and they return the correct data.
Each thread should write the stride value (32), but the test shows that the four fields are filled in randomly, with some fields showing as zero. If I repeat the test multiple times, I see there is no consistency to which fields will have the value 32 (the others always show as 0). That could be a side effect of WaitForMultipleObjects, but I haven't seen anything in the docs to confirm that.
Also, WaitForMultipleObjects waits on the ThreadHandles returned by CreateThread; when I examine the ThreadHandles array, it always shows like this: 268444374, 32, 1652, 1584. Only the first element looks like the size of a handle, the others do not look like handle values.
One possibility is that the two parameters passed on the stack may not be in the correct locations:
mov rax,0
mov [rsp+40],rax ; use default creation flags
mov rax,[ThreadCount]
mov [rsp+32],rax ; ThreadID
According to the docs, ThreadCount should be a pointer. When I change the line to mov rax,ThreadCount (the pointer value), the program crashes. When I change it to:
mov rax,0
mov [rsp+32],rax ; use default creation flags
mov rax,ThreadCount
mov [rsp+40],rax ; ThreadID
now it reliably processes the first thread, but not threads 2-4.
So the bottom line is the threads are being created but they execute randomly, with some threads not executing at all, in no particular order. When I change the CreateThread parameters (as shown above) the first thread executes, but not threads 2-4.
Here is the test code showing the relevant parts. If a reproducible example is needed, I can prepare one.
Thanks for any ideas.
Init_Cores_fn:
; EACH OF THE CORES CALLS Test_Function AND EXECUTES THE WHOLE PROGRAM.
; WE PASS THE STARTING BYTE (0, 8, 16, 24) AND THE "STRIDE" = NUMBER OF CORES.
; ON RETURN, WE SYNCHRONIZE ANY DATA. ON ENTRY TO EACH CORE, SET THE REGISTERS
; Populate the ThreadInfo array with vars to pass
; ThreadInfo: length, startbyte, stride, vars into registers on entry to each core
mov rdi,ThreadInfo
mov rax,ThreadInfoLength
mov [rdi],rax
mov rax,[stride]
mov [rdi+16],rax ; 8 x number of cores (32 in this example)
; Register Vars
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
mov rbp,rsp ; preserve caller's stack frame
sub rsp,56 ; Shadow space
; _____
label_0:
mov rdi,ThreadInfo
mov rax,[FirstByte]
mov [rdi+8],rax ; 0, 8, 16, or 24
; _____
; Create Threads
mov rcx,0 ; lpThreadAttributes (Security Attributes)
mov rdx,0 ; dwStackSize
mov r8,Test_Function ; lpStartAddress (function pointer)
mov r9,ThreadInfo ; lpParameter (array of data passed to each core)
mov rax,0
mov [rsp+40],rax ; use default creation flags
mov rax,[ThreadCount]
mov [rsp+32],rax ; ThreadID
call CreateThread
; Move the handle into ThreadHandles array (returned in rax)
mov rdi,ThreadHandles
mov rcx,[FirstByte]
mov [rdi+rcx],rax
mov rax,[FirstByte]
add rax,8
mov [FirstByte],rax
mov rax,[ThreadCount]
add rax,1
mov [ThreadCount],rax
mov rbx,4
cmp rax,rbx
jl label_0
; _____
; Wait
mov rcx,rax ; number of handles
mov rdx,ThreadHandles ; pointer to handles array
mov r8,1 ; wait for all threads to complete
mov r9,1000 ; milliseconds to wait
call WaitForMultipleObjects
; _____
;[ Code HERE to do cleanup if needed after the four threads finish ]
mov rsp,rbp
jmp label_900
; __________________
; The function for all threads to call
Test_Function:
; Populate registers
mov rdi,rcx
mov rax,[rdi]
mov r15,[rdi+24]
mov rax,[rdi+8] ; start byte
mov r13,[rdi+40]
mov r12,[rdi+48]
mov r10,[rdi+56]
xor r11,r11
xor r9,r9
pxor xmm15,xmm15
pxor xmm15,xmm14
pxor xmm15,xmm13
; Now test it - BUT the first thread does not write data
mov rcx,[rdi+8] ; start byte
mov rax,[rdi+16] ; stride
cvtsi2sd xmm0,rax
movsd [r15+rcx],xmm0
ret
I solved this problem, and here is the solution. Raymond Chen alluded to this in the comments above before urging me to use a higher level language, but I didn't understand it until today. I am posting this answer so it's easily accessible and understood by anyone who has the same problem in assembly language (or any other language) in the future because Raymond's comment (which I just upvoted) is now buried in the other comments above.
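The ThreadInfo array is passed here as the fourth parameter to CreateThread (in r9 under the Windows x64 calling convention), so every thread receives the same pointer. Each thread must have its own separate copy of ThreadInfo; otherwise the creation loop overwrites the StartByte field (at rdi+8) while earlier threads may not have read it yet. In my application, the data in ThreadInfo are all the same except for the StartByte parameter. So instead of sharing one array, I created a separate ThreadInfo array for each core (ThreadInfo1, 2, 3, and 4) and pass a pointer to the corresponding ThreadInfo array.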
I implemented it in my application as a call to the following dup function but it could be implemented other ways as well:
DupThreadInfo:
mov rdi,ThreadInfo2
mov rax,8
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
; _____
mov rdi,ThreadInfo3
mov rax,0
mov [rdi],rax ; length (number of vars into registers plus 3 elements)
mov rax,16
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
mov rdi,ThreadInfo4
mov rax,0
mov [rdi],rax ; length (number of vars into registers plus 3 elements)
mov rax,24
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
ret
Because all data in the ThreadInfo arrays are the same except the second element, a more efficient way to do this would be to pass a 2-element array where the first element is the StartByte and the second element is a pointer to the static ThreadInfo array. That's especially important when we are working with more than four cores because the DupThreadInfo section would be needlessly long. That solution would avoid a call, but I haven't implemented that yet.

Flipping the first pixel in an image in asm

Hi, I am just doing this for practice before I create a loop that can flip a 3x3 image horizontally or vertically. I am using a variable called ap to store the address of the first pixel. I would also like to eventually use another variable called amp to store the address of the mirrored pixel, and a register to hold the calculated pixel offset, but for now I put it in manually. No matter what I do, the program doesn't swap them. Does anyone have an idea what the issue is? Thank you for reading.
mov ecx, dword ptr[eax + ecx * 4]
mov ap, ecx //temporary pixel address storage
mov ecx, 0
mov ecx, dword ptr[eax + ecx * 4 + 8] //offset by 8 bytes (2 pixels of 4 bytes each)
mov [ap], ecx
I am using a variable called ap to store the addresses of the first pixel
If the ap variable is supposed to contain an address, then you need to use the lea instruction (not the mov instruction).
; For the 1st line EAX is address of image = address of 1st pixel
mov ecx, 0 ;Index to 1st pixel
lea ecx, dword ptr[eax + ecx * 4] ;Address of 1st pixel
mov [ap], ecx
mov ecx, 2 ;Index to 3rd pixel
lea ecx, dword ptr[eax + ecx * 4] ;Address of 3rd pixel
mov [amp], ecx
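Now to swap these pixels, and thus flip the line, load both addresses into registers and exchange the values they point to:
mov esi, [ap] ;Address of 1st pixel
mov edi, [amp] ;Address of 3rd pixel
mov ecx, [esi] ;Value of 1st pixel
mov edx, [edi] ;Value of 3rd pixel
mov [esi], edx
mov [edi], ecx
To process the next lines of the image you could add the number of bytes per scanline to the EAX register each time. For a 3x3 image with 4-byte pixels that's probably 12.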
I don't get what this is supposed to do:
mov ecx, dword ptr[eax + ecx * 4]
What's in ecx? Is it a counter for the offset? You are overwriting it each time...
If you're trying to save the original value, I think you need to make sure you have the right value in ecx. Try
mov ecx, 0
first (you can also use xor ecx, ecx; it gets the job done and it's easier to read).

Code Optimization Tips:

I am using the following ASM routine to bubble sort an array. I want to know of the inefficiencies of my code:
.386
.model flat, c
option casemap:none
.code
public sample
sample PROC
;[ebp+0Ch]Length
;[ebp+08h]Array
push ebp
mov ebp, esp
push ecx
push edx
push esi
push eax
mov ecx,[ebp+0Ch]
mov esi,[ebp+08h]
_bubbleSort:
push ecx
push esi
cmp ecx,1
je _exitLoop
sub ecx,01h
_miniLoop:
push ecx
mov edx,DWORD PTR [esi+4]
cmp DWORD PTR [esi],edx
ja _swap
jmp _continueLoop
_swap:
lodsd
mov DWORD PTR [esi-4],edx
xchg DWORD PTR [esi],eax
jmp _skipIncrementESI
_continueLoop:
add esi,4
_skipIncrementESI:
pop ecx
loop _miniLoop
_exitLoop:
pop esi
pop ecx
loop _bubbleSort
pop eax
pop esi
pop edx
pop ecx
pop ebp
ret
sample ENDP
END
Basically I have two loops, as usual for the bubble sort algorithm. The value of ecx for the outer loop is 10, and for the inner loop it is [ecx-1]. I have tried the routine and it compiles and runs successfully, but I am not sure if it is efficient.
There are several things you can do to speed up your assembly code:
don't do things like ja label_1 ; jmp label_2. Just do jbe label_2 instead.
loop is a very slow instruction. dec ebx; jnz loopstart is much faster
use all registers instead of repeatedly push/pop ecx and esi. Use ebx and edi too.
jmp-targets should be well aligned. Use align 4 before the two loop-starts and after the jbe
Get yourself a manual for your cpu from Intel (you can download it as pdf), it has the timings for the opcodes, maybe it has other hints too.
Several simple tips:
1) Try to minimize the number of conditional jumps, because they are very expensive. Unroll if possible.
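2) Reorder instructions to minimize stalls because of data dependency, for example by placing independent work that leaves the flags alone between the compare and the jump:
mov edx,DWORD PTR [esi+4] ;// must stay first: cmp reads edx
cmp DWORD PTR [esi],edx ;// takes some time to compute,
mov eax,DWORD PTR [esi] ;// independent load, does not change the flags
ja _swap ;// waits for results of cmp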
3) Avoid old composite instructions (dec, jnz pair is faster than loop and is not bound to ecx register)
It would be quite difficult to write assembly code that is faster than the code generated by optimizing C compiler, because you should consider lots of factors: size of data and instruction caches, alignments, pipeline, instruction timings. You can find some good documentation about this here. I especially recommend the first book: Optimizing software in C++
A substitute for "add esi,4" when the flags set by that instruction are not needed (lea leaves the flags untouched):
_continueLoop:
lea esi,[esi+4]
