LOOP is done only one time when single stepping in Turbo Debugger - debugging

The code should output 'ccb', but it outputs only 'c'. The LOOP runs only once. I stepped through it in TD, but why does LOOP run only one time?
I thought I had to decrement STRING_LENGTH, so I wrote
DEC STRING_LENGTH
but that did not work, so I wrote this instead:
MOV SP,STRING_LENGTH
DEC SP
MOV STRING_LENGTH,SP
I know what you are thinking right now, that this is so incorrect, and you are right :)
I could use C++, but I want to do it only in assembly.
DOSSEG
.MODEL SMALL
.STACK 200H
.DATA
STRING DB 'cScbd$'
STRING_LENGTH EQU $-STRING
STRING1 DB STRING_LENGTH DUP (?) , '$'
.CODE
MOV AX,@DATA
MOV DS,AX
XOR SI,SI
XOR DI,DI
MOV CX,STRING_LENGTH
S:
MOV BL,STRING[DI]
AND STRING[DI],01111100B
CMP STRING[DI],01100000B
JNE L1
MOV AL,BL
MOV STRING1[SI],AL
ADD SI,2
L1:
ADD DI,2
LOOP S
MOV DL,STRING1
MOV AH,9
INT 21H
MOV AH,4CH
INT 21H
END

In Turbo Debugger (TD.EXE), F8 ("Step") executes the whole loop at once, until cx becomes zero (you can even create an infinite loop by setting cx back to some value inside the loop, preventing it from ever taking the 1 -> 0 step).
To single-step through the loop instruction, use F7 ("Trace") instead: cx will go from 6 to 5, and the code pointer will follow the jump back to the start of the loop.
Some other issues with your code:
MOV SP,STRING_LENGTH
DEC SP
MOV STRING_LENGTH,SP
sp is not a general-purpose register; don't use it for calculations like this. Whenever an instruction uses the stack implicitly (push, pop, call, ret, ...), values are written to and read from the memory addressed by the ss:sp register pair, so by changing sp you are moving the current "stack".
Also, in 16-bit x86 real mode, whenever an interrupt occurs (keyboard, timer, ...), the current flags register and code address are stored on the stack before control passes to the interrupt handler, which will usually push additional values of its own. So whatever is in memory at addresses below the current ss:sp is not safe: its content keeps changing "randomly" as interrupts fire in the meantime (TD.EXE itself uses part of this stack memory after every single step).
For arithmetic, use other registers, not sp. Once you know enough about the stack, you will understand which kinds of sp manipulation are common and why (like sub sp,40 at the beginning of a function that needs additional "local" memory), and how to restore the stack to its expected state.
One more thing about that:
MOV SP,STRING_LENGTH
DEC SP
MOV STRING_LENGTH,SP
STRING_LENGTH is defined by EQU, which makes it a compile-time constant, and only that. It is not a "variable" (a memory allocation), unlike something like someLabel dw 1345, which makes the assembler emit two bytes with the values 0100_0001B, 0000_0101B (read as a 16-bit little-endian word, that is the value 1345). The address of the first byte gets the symbolic name someLabel, which can be used in later instructions, like dec word ptr [someLabel], to decrement the value in memory from 1345 to 1344 at run time.
But EQU is different: it assigns the symbol STRING_LENGTH a final value, here 6 (the length of 'cScbd$').
So your code can be read as:
mov sp,6 ; almost makes sense (in practice it destroys the stack setup)
dec sp ; still valid
mov 6,sp ; doesn't make any sense, a constant can't be the destination of MOV
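The distinction between a compile-time constant and a run-time counter can be sketched in C. This is only a model of what the question's loop appears to be aiming for, not a translation of it: the enum plays the role of the EQU constant, the local cx plays the role of the CX register, and stepping one byte at a time is an assumption (the original adds 2 to SI and DI). The function name filter is hypothetical.

```c
#include <stddef.h>

static const char STRING[] = "cScbd$";

/* Compile-time constant, like STRING_LENGTH EQU $-STRING: it has no storage,
   so it can never be decremented or assigned to. The terminating NUL that C
   appends is excluded, which leaves 6. */
enum { STRING_LENGTH = sizeof STRING - 1 };

/* Copies every byte whose masked value matches the pattern into out and
   returns how many bytes were copied. out must hold STRING_LENGTH + 1 bytes. */
static size_t filter(char *out) {
    size_t si = 0;
    size_t cx = STRING_LENGTH;            /* run-time counter, like CX */
    for (size_t di = 0; cx != 0; ++di, --cx) {
        unsigned char c = (unsigned char)STRING[di];
        /* Test a copy of the byte, so the source string stays intact. */
        if ((c & 0x7C) == 0x60) {
            out[si++] = (char)c;
        }
    }
    out[si] = '\0';
    return si;
}
```

Only the counter changes while the loop runs; STRING_LENGTH stays 6 throughout, which is why there is nothing to decrement in the assembly version either.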

Related

Visual Studio (MASM) assembly - why does code in labels automatically execute, even if label not called

So I have this code, and both labels are being executed, even though I was under the impression they would only execute if called with a jmp instruction.
In other words, the output of this code is 15, i.e. 5 + 7 + 3, while I thought it should be 5, since the labels aren't being called via the jmp instruction.
.data
.code
TestNew proc
mov rax, 5
lbl1:
add rax, 7
lbl2:
add rax, 3
ret
TestNew endp
end
It seems the jmp instruction is working, since if I call it e.g. here, I get an infinite loop:
.data
.code
TestNew proc
mov rax, 5
lbl1:
add rax, 7
lbl2:
add rax, 3
jmp lbl1 ;causes infinite loop...so at least jmp is working
ret
TestNew endp
end
If anyone could give me any tips on how to get this working, I'd appreciate it.
Thanks
even though I was under the impression they would only execute if called with a jmp instruction
Sorry, your impression is mistaken. Labels are not analogous to functions or subroutines in a higher-level language. They are just, well, labels: they give a human-readable name to a particular address in memory, but do not have any effect on what is actually in that memory.
Labels can be used to mark the beginning of a subroutine or block of code, so that you can call or jump to it by name from elsewhere. But whatever code immediately precedes your subroutine will by default fall through into it, so you normally have to put a jump or return or some similar instruction there if fall-through is not what you want. Likewise, in order to get your subroutine to return, you code an actual ret instruction; merely inserting a label to start the next subroutine would again result in fall-through.
Execution in assembly always flows from one instruction to the next one that follows it in memory, unless the instruction is a jump or call or some other whose purpose is to redirect execution flow. Putting labels between two instructions does not alter that principle in any way.
So yes, your code is always going to execute the mov and then the two adds, since you have not coded any jump instruction that would alter this.
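C labels behave the same way, so the principle can be sketched there too. The function below is a hypothetical model of TestNew, not the asker's code: the entry parameter is an invention for illustration that selects where an explicit goto lands, and entry 0 performs no jump at all.

```c
/* Model of TestNew: labels only name a position, they never stop execution. */
static long test_new(int entry) {
    long rax = 5;
    if (entry == 1) goto lbl1;
    if (entry == 2) goto lbl2;
    if (entry == 3) goto done;   /* an explicit jump is the only way to skip the adds */
    /* entry == 0: no jump is taken, so control falls through into lbl1 */
lbl1:
    rax += 7;
lbl2:
    rax += 3;
done:
    return rax;
}
```

Calling test_new(0) gives 15, the same as the original procedure, while test_new(3) gives 5 because a jump past both additions was actually coded.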

Why does this indirect call (call rax) take >100 cycles average?

Edit: this question was a mistake. I misread the time spent in the subroutine and thought the statement in this routine was taking 15% of the overall time instead of the actual 3%.
I leave the question here because of the high quality of the answers. I am surprised that the computed branch seems to miss even though the destination is always the same address and I will use the suggestions there to investigate.
Original:
This is the key point in my C code, with the annotations from the profiler. The indirect call takes a third of the overall time of a few-hundred-line loop. Based on the overall run time this call statement must be taking at least 100 clock cycles.
Here is the generated code for the last 3 C statements:
00007FFE066FB5EA mov r12,qword ptr [r8+8]
00007FFE066FB5EE and r12d,30000h
00007FFE066FB5F5 dec r12
00007FFE066FB5F8 and r12,qword ptr [r8+20h]
00007FFE066FB5FC sar r12,3Fh
00007FFE066FB600 and r12,qword ptr [r8+10h]
00007FFE066FB604 lea rbx,[validitymask+60h (07FFE06B058E0h)]
00007FFE066FB60B cmove r12,rbx
A *tpopa=AZAPLOC(arg1); tpopa=(A*)((I)tpopa&REPSGN(AC(arg1)&((AFLAG(arg1)&(AFVIRTUAL|AFUNINCORPABLE))-1))); tpopa=tpopa?tpopa:ZAPLOC0; tpopw=(pline&2)?tpopw:tpopa; // monad: w fs dyad: a w if monad, change to w w
00007FFE066FB60F mov rsi,qword ptr [rdx+8]
00007FFE066FB613 and esi,30000h
00007FFE066FB619 dec rsi
00007FFE066FB61C and rsi,qword ptr [rdx+20h]
00007FFE066FB620 sar rsi,3Fh
00007FFE066FB624 and rsi,qword ptr [rdx+10h]
00007FFE066FB628 cmove rsi,rbx
00007FFE066FB62C test r15b,2
00007FFE066FB630 cmove r12,rsi
y=(*actionfn)((J)((I)jt+(REPSGN(SGNIF(pt0ecam,VJTFLGOK1X+(pline>>1)))&(pline|1))),arg1,arg2,fs); // set bit 0, and bit 1 if dyadic, if inplacing allowed by the verb
00007FFE066FB634 mov bl,28h
00007FFE066FB636 sub bl,cl
00007FFE066FB638 shlx rcx,rdi,rbx
00007FFE066FB63D sar rcx,3Fh
00007FFE066FB641 or r15d,1
00007FFE066FB645 and ecx,r15d
00007FFE066FB648 add rcx,r10
00007FFE066FB64B vzeroupper
00007FFE066FB64E call rax
As you can see, there is nothing exotic about the offending statement, except for the call rax. I expect that call to create a pipeline break of maybe 15 clocks, but nothing like what I am seeing. In fact, this code is executed in a loop and most of the time the function address is the same, so I would expect that most of the indirect calls would predict properly. But that's sure not what I am seeing.
The instruction that loads the branch address (which is in the first C statement shown) is executed long before the branch address is used; that is, the address is settled.
Why is this statement/instruction so slow?
Second question: when does the statement end for profiling purposes? If the call rax requires a pipeline break, do all interrupts occurring during that break count against the call rax? That would make sense, but other instructions are still executing then, right?

x86 assembler pushf causes program to exit

I think my real problem is I don't completely understand the stack frame mechanism so I am looking to understand why the following code causes the program execution to resume at the end of the application.
This code is called from a C function which is several call levels deep and the pushf causes program execution to revert back several levels through the stack and completely exit the program.
Since my work around works as expected I would like to know why using the pushf instruction appears to be (I assume) corrupting the stack.
In the routines I usually set up and clean up the stack with:
sub rsp, 28h
...
add rsp, 28h
However I noticed that this is only necessary when the assembly code calls a C function.
So I tried removing this from both routines but it made no difference. SaveFlagsCmb is an assembly function but could easily be a macro.
The code represents an emulated 6809 CPU Rora (Rotate Right Register A).
PUBLIC Rora_I_A ; Op 46 - Rotate Right through Carry A reg
Rora_I_A PROC
sub rsp, 28h
; Restore Flags
mov cx, word ptr [x86flags]
push cx
popf
; Rotate Right the byte and save the FLAGS
rcr byte ptr [q_s+AREG], 1
; rcr only affects Carry. Save the Carry first in dx then
; add 0 to result to trigger Zero and Sign/Neg flags
pushf ; this causes jump to end of program ????
pop dx ; this line never reached
and dx, CF ; Save only Carry Flag
add [q_s+AREG], 0 ; trigger NZ flags
mov rcx, NF+ZF+CF ; Flag Mask NZ
Call SaveFlagsCmb ; NZ from the add and CF saved in dx
add rsp, 28h
ret
Rora_I_A ENDP
However if I use this code it works as expected:
PUBLIC Rora_I_A ; Op 46 - Rotate Right through Carry A reg
Rora_I_A PROC
; sub rsp, 28h ; works with or without this!!!
; Restore Flags
mov ah, byte ptr [x86flags+LSB]
sahf
; Rotate Right the byte and save the FLAGS
rcr byte ptr [q_s+AREG], 1
; rcr only affects Carry. Save the Carry first in dx then
; add 0 to result to trigger Zero and Sign/Neg flags
lahf
mov dl, ah
and dx, CF ; Save only Carry Flag
add [q_s+AREG], 0 ; trigger NZ flags
mov rcx, NF+ZF+CF ; Flag Mask NZ
Call SaveFlagsCmb ; NZ from the add and CF saved in dx
; add rsp, 28h ; works with or without this!!!
ret
Rora_I_A ENDP
Your reported behaviour doesn't really make sense. Mostly this answer is just providing some background not a real answer, and a suggestion not to use pushf/popf in the first place for performance reasons.
Make sure your debugging tools work properly and aren't being fooled by something into falsely showing a "jump" to somewhere. (And jump where exactly?)
There's little reason to mess around with 16-bit operand size, but that's probably not your problem.
In Visual Studio / MASM, apparently (according to OP's comment) pushf assembles as pushfw, 66 9C which pushes 2 bytes. Presumably popf also assembles as popfw, only popping 2 bytes into FLAGS instead of the normal 8 bytes into RFLAGS. Other assemblers are different (see footnote 1).
So your code should work. Unless you're accidentally setting some other bit in FLAGS that breaks execution? There are bits in EFLAGS/RFLAGS other than condition codes, including the single-step TF Trap Flag: debug exception after every instruction.
We know you're in 64-bit mode, not 32-bit compat mode, otherwise rsp wouldn't be a valid register. And running 64-bit machine code in 32-bit mode wouldn't explain your observations either.
I'm not sure how that would explain pushf being a jump to anywhere. pushf itself can't fault or jump, and if popf set TF then the instruction after popf would have caused a debug exception.
Are you sure you're assembling 64-bit machine code and running it in 64-bit mode? The only thing that would be different if a CPU decoded your code in 32-bit mode should be the REX prefix on sub rsp, 28h, and the RIP-relative addressing mode on [x86flags] decoding as absolute (which would presumably fault). So I don't think that could explain what you're seeing.
Are you sure you're single-stepping by instructions (not source lines or C statements) with a debugger to test this?
Use a debugger to look at the machine code as you single-step. This seems really weird.
Anyway, it seems like a very low-performance idea to use pushf / popf at all, and also to be using 16-bit operand-size creating false dependencies.
e.g. you can set x86 CF with movzx ecx, word ptr [x86flags] / bt ecx, CF.
You can capture the output CF with setc cl
Also, if you're going to do multiple things to the byte from the guest memory, load it into an x86 register. A memory-destination RCR and a memory-destination ADD are unnecessarily slow vs. load / rcr / ... / test reg,reg / store.
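The load / rotate / test / store shape can be modelled in C as a reference for what the emulated instruction should compute. This is a sketch under assumptions, not the asker's design: the Cpu6809 struct is hypothetical, and it keeps the guest flags in the 6809's own CC layout (C = 0x01, Z = 0x04, N = 0x08) rather than in saved x86 flags as the original code does.

```c
#include <stdint.h>

/* Assumed 6809 condition-code masks (CC register layout E F H I N Z V C). */
enum { CC_C = 0x01, CC_Z = 0x04, CC_N = 0x08 };

/* Hypothetical emulator state: accumulator A and the guest CC register. */
typedef struct {
    uint8_t a;
    uint8_t cc;
} Cpu6809;

/* RORA: rotate A right through carry. The register is loaded once,
   all flag results are derived from the local copy, then stored once. */
static void rora(Cpu6809 *cpu) {
    uint8_t carry_in  = (uint8_t)(cpu->cc & CC_C);   /* old carry, 0 or 1 */
    uint8_t carry_out = (uint8_t)(cpu->a & 1u);      /* bit shifted out   */
    uint8_t result    = (uint8_t)((cpu->a >> 1) | (carry_in << 7));

    uint8_t cc = (uint8_t)(cpu->cc & ~(CC_N | CC_Z | CC_C));
    if (carry_out)     cc |= CC_C;
    if (result == 0)   cc |= CC_Z;
    if (result & 0x80) cc |= CC_N;

    cpu->a  = result;
    cpu->cc = cc;
}
```

Rotating 0x01 with carry clear leaves A = 0x00 with C and Z set; rotating 0x02 with carry set leaves A = 0x81 with only N set.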
LAHF/SAHF may be useful, but you can also do without them too for many cases. popf is quite slow (https://agner.org/optimize/) and it forces a round trip through memory. However, there is one condition-code outside the low 8 in x86 FLAGS: OF (signed overflow). asm-source compatibility with 8080 is still hurting x86 in 2019 :(
You can restore OF from a 0/1 integer with add al, 127: if AL was originally 1, it will overflow to 0x80, otherwise it won't. You can then restore the rest of the condition codes with SAHF. You can extract OF with seto al. Or you can just use pushf/popf.
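The arithmetic behind the add al, 127 trick can be checked with a small C model. It does not touch the real flags; it only computes whether the signed 8-bit addition overflows, which is the condition under which x86 sets OF. The function name is hypothetical.

```c
#include <stdint.h>

/* Returns 1 when al + 127 overflows as a signed 8-bit addition, which is
   the value OF would take after `add al, 127`, and 0 otherwise. */
static int of_after_add_127(uint8_t al) {
    int sum = (int)(int8_t)al + 127;      /* widen first so the sum is exact */
    return sum > INT8_MAX || sum < INT8_MIN;
}
```

A saved value of 0 gives 127, which still fits, so OF ends up clear; a saved value of 1 gives 128, which does not fit in a signed byte, so OF ends up set.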
; sub rsp, 28h ; works with or without this!!!
Yes of course. You have a leaf function that doesn't use any stack space.
You only need to reserve another 40 bytes (align the stack + 32 bytes of shadow space) if you were going to make another function call from this function.
Footnote 1: pushf/popf in other assemblers:
In NASM, pushf/popf default to the same width as other push/pop instructions: 8 bytes in 64-bit mode. You get the normal encoding without an operand-size prefix. (https://www.felixcloutier.com/x86/pushf:pushfd:pushfq)
Like for integer registers, both 16 and 64-bit operand-size for pushf/popf are encodeable in 64-bit mode, but 32-bit operand size isn't.
In NASM, your code would be broken because push cx / popf would push 2 bytes and pop 8, popping 6 bytes of your return address into RFLAGS.
But apparently MASM isn't like that. Probably a good idea to use explicit operand-size specifiers anyway, like pushfw and popfw if you use it at all, to make sure you get the 66 9C encoding, not just 9C pushfq.
Or better, use pushfq and pop rcx like a normal person: only write to 8 or 16-bit partial registers when you need to, and keep the stack qword-aligned. (16-byte alignment before call, 8-byte alignment always.)
I believe this is a bug in Visual Studio. I'm using 2022, so it's an issue that's been around for a while.
I don't know exactly what is triggering it, however stepping over one specific pushf in my code had the same symptoms, albeit with the code actually working.
Putting a breakpoint on the line after the pushf did break, and allowed further debugging of my app. Adding a push ax, pop ax before the pushf also seemed to fix the issue. So it must be a Visual Studio issue.
At this point I think MASM and debugging in Visual Studio has pretty much been abandoned. Any suggestions for alternatives for developing dlls on Windows would be appreciated!

Reverse Engineering: changing AL register without overwriting instructions

I am trying to learn more about reverse engineering by debugging and patching a 64 bit windows executable. I am using x64dbg (Much like ollydbg but with 64 bit support)
I have some assembly that looks roughly like this:
call test_exe.44AA9EB20
mov byte ptr ds:[44AB9DA15], al
[More instructions...]
[More instructions...]
[More instructions...]
The function call in the first line sets the rax register to 0. Therefore, the second line moves a value of 0 into the byte at 44AB9DA15.
I want to patch the code so that a value of 1 gets stored there instead.
I can do something like this:
call test_exe.44AA9EB20
mov byte ptr ds:[44AB9DA15], 1
However, since al is only an 8 bit register, assembling the code to the above seems to run over some of the subsequent instructions.
I know that I can solve this problem by stepping into the function call test_exe.44AA9EB20 and setting rax to have a value of 1 before the ret instruction, but I am curious if there is an easier way.
Is there some way I can give this pointer (44AB9DA15) a value of 1 without running over subsequent instructions?
You want to replace MOV [0x000000044AB9DA15],AL, which is encoded as 88042515DAB94A (7 bytes), with MOV BYTE PTR [0x000000044AB9DA15],1, which is encoded as C6042515DAB94A01 (one byte longer).
Try RIP-relative encoding instead. First calculate the difference between the target pointer and the offset of the following instruction ($+instruction_size). If it fits in a signed 32-bit displacement (within 2 GB), for instance 0x11223344, the encoding of MOV BYTE PTR REL [0x000000044AB9DA15-$-7] will be C6054433221101 (exactly 7 bytes).
Or, if test_exe doesn't have to be called, overwrite the CALL instruction with code which sets AL to 1, e.g. MOV AL,1 (2 bytes), and pad the remaining three bytes with NOP, assuming the usual 5-byte E8 rel32 encoding of CALL.
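The RIP-relative displacement calculation can be sketched as a small C helper. The function name and signature are hypothetical; the only thing taken from the answer is the C6 05 disp32 imm8 layout and the rule that the displacement is measured from the end of the 7-byte instruction.

```c
#include <stdint.h>

/* Encodes `mov byte ptr [rip+disp32], imm8` as C6 05 <disp32> <imm8>.
   instr_addr is the address where the patched instruction will start.
   Returns 0 on success, -1 if target is outside the signed 32-bit range. */
static int encode_mov_byte_rip(uint64_t instr_addr, uint64_t target,
                               uint8_t imm, uint8_t out[7]) {
    /* RIP points at the next instruction, 7 bytes past the start. */
    int64_t disp = (int64_t)(target - (instr_addr + 7));
    if (disp < INT32_MIN || disp > INT32_MAX) {
        return -1;
    }
    uint32_t d = (uint32_t)(int32_t)disp;
    out[0] = 0xC6;                        /* opcode: MOV r/m8, imm8 */
    out[1] = 0x05;                        /* ModRM: RIP + disp32    */
    out[2] = (uint8_t)(d & 0xFF);         /* disp32, little-endian  */
    out[3] = (uint8_t)((d >> 8) & 0xFF);
    out[4] = (uint8_t)((d >> 16) & 0xFF);
    out[5] = (uint8_t)((d >> 24) & 0xFF);
    out[6] = imm;
    return 0;
}
```

With an instruction address that puts the target 0x11223344 bytes past the end of the instruction and an immediate of 1, the helper produces C6 05 44 33 22 11 01, matching the bytes quoted in the answer.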

Malloc and Free multiple arrays in assembly

I'm trying to experiment with malloc and free in assembly code (NASM, 64 bit).
I have tried to malloc two arrays, each with space for 2 64 bit numbers. Now I would like to be able to write to their values (not sure if/how accessing them will work exactly) and then at the end of the whole program or in the case of an error at any point, free the memory.
What I have now works fine if there is one array but as soon as I add another, it fails on the first attempt to deallocate any memory :(
My code is currently the following:
extern printf, malloc, free
LINUX equ 80H ; interrupt number for entering Linux kernel
EXIT equ 60 ; Linux x86-64 system call number for exit()
segment .text
global main
main:
push dword 16 ; allocate 2 64 bit numbers
call malloc
add rsp, 4 ; Undo the push
test rax, rax ; Check for malloc failure
jz malloc_fail
mov r11, rax ; Save base pointer for array
; DO SOME CODE/ACCESSES/OPERATIONS HERE
push dword 16 ; allocate 2 64 bit numbers
call malloc
add rsp, 4 ; Undo the push
test rax, rax ; Check for malloc failure
jz malloc_fail
mov r12, rax ; Save base pointer for array
; DO SOME CODE/ACCESSES/OPERATIONS HERE
malloc_fail:
jmp dealloc
; Finish Up, deallocate memory and exit
dealloc:
dealloc_1:
test r11, r11 ; Check that the memory was originally allocated
jz dealloc_2 ; If not, try the next block of memory
push r11 ; push the address of the base of the array
call free ; Free this memory
add rsp, 4
dealloc_2:
test r12, r12
jz dealloc_end
push r12
call free
add rsp, 4
dealloc_end:
call os_return ; Exit
os_return:
mov rax, EXIT
mov rdi, 0
syscall
I'm assuming the above code is calling the C functions malloc() and free()...
If 1st malloc() fails, you arrive at dealloc_1 with whatever garbage is in r11 and r12 after returning from the malloc().
If 2nd malloc() fails, you arrive at dealloc_1 with whatever garbage is in r12 after returning from the malloc().
Therefore, you have to zero out r11 and r12 before doing the first allocation.
Since this is 64-bit mode, all pointers/addresses and sizes are normally 64-bit. When you pass one of those to a function, it has to be 64-bit. So, push dword 16 isn't quite right. It should be push qword 16 instead. Likewise, when you are removing these parameters from the stack, you have to remove exactly as many bytes as you've put there, so add rsp, 4 must change to add rsp, 8.
Finally, I don't know which registers malloc() and free() preserve and which they don't. You may need to save and restore the so-called volatile registers (see your C compiler documentation). The same holds for the code not shown. It must preserve r11 and r12 so they can be used for deallocation. EDIT: And I'd check if it's the right way of passing parameters through the stack (again, see your compiler documentation).
EDIT: you're testing r11 for 0 right before 2nd free(). It should be r12. But free() doesn't really mind receiving NULL pointers. So, these checks can be removed.
Pay attention to your code.
You have to obey the x86-64 calling convention: arguments are passed in registers, and in the case of malloc the size goes in RDI. And as already pointed out, you have to watch out for which registers are preserved by the called functions (in the System V AMD64 ABI, only RBX, RBP, RSP and R12-R15 are preserved across function calls).
There are at least two bugs: you test r11 again (the line test r11,r11 after dealloc_2:), but you presumably wanted to test r12 there. Additionally, you want to push a qword if you are in 64-bit mode.
The reason the deallocation doesn't work at all may be because you are changing the contents of r11 or r12.
Note that both tests are unnecessary, as it is perfectly safe to call free with a null pointer.
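The allocation and cleanup pattern both answers describe looks like this when written in C. It is a sketch with hypothetical names, showing the two points that matter: both pointers start out as NULL before the first allocation, and a single cleanup path frees them unconditionally because free(NULL) does nothing.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocates two arrays of two 64-bit numbers, uses them, and frees both.
   Returns 0 on success and -1 if an allocation failed. */
static int use_two_arrays(void) {
    uint64_t *first = NULL;    /* zeroed up front, like clearing r11 and r12 */
    uint64_t *second = NULL;
    int status = -1;

    first = malloc(2 * sizeof *first);
    if (first == NULL) goto cleanup;

    second = malloc(2 * sizeof *second);
    if (second == NULL) goto cleanup;

    first[0] = 1;
    first[1] = 2;
    second[0] = first[0] + first[1];
    second[1] = 0;
    status = (second[0] == 3) ? 0 : -1;

cleanup:
    free(second);              /* safe even when the pointer is still NULL */
    free(first);
    return status;
}
```

In the assembly version the same idea means zeroing the two saved pointers before the first call and keeping them in registers that malloc and free are required to preserve.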

Resources