Structured Exception Handler catches near-zero EIP trap differently on nearly identical machines? - windows

I have a rather complex, but extremely well-tested assembly language x86-32 application running on variety of x86-32 and x86-64 boxes. This is a runtime system for a language compiler, so it supports the execution of another compiled binary program, the "object code".
It uses Windows SEH to catch various kinds of traps: division by zero, illegal access, ... and prints a register dump using the context information provided by Windows, that shows the state of the machine at the time of the trap. (It does lots of other stuff irrelevant to the question, such as printing a function backtrace or recovering from the division by zero as appropriate). This allows the writer of the "object code" to get some idea what went wrong with his program.
It behaves differently on two Windows 7-64 systems, that are more or less identical, on what I think is an illegal memory access. The specific problem is that the "object code" (not the well-tested runtime system) somewhere stupidly loads 0x82 into EIP; that is a nonexistent page in the address space AFAIK. I expect a Windows trap though the SEH, and expect to a register dump with EIP=00000082 etc.
On one system, I get exactly that register dump. I could show it here, but it doesn't add anything to my question. So, it is clear the SEH in my runtime system can catch this, and display the situation. This machine does not have any MS development tools on it.
On the other ("mystery") system, with the same exact binaries for runtime system and object code, all I get is the command prompt. No further output. FWIW, this machine has MS Visual Studio 2010 on it. The mystery machine is used heavily for other purposes, and shows no other funny behaviors in normal use.
I assume the behavior difference is caused by a Windows configuration somewhere, or something that Visual Studio controls. It isn't the DEP configuration the system menu; they are both configured (vanilla) as "DEP for standard system processes". And my runtime system executable has "No (/NXCOMPAT:NO)" configured.
Both machines are i7 but different chips, 4 cores, lots of memory, different motherboards. I don't think this is relevant; surely both of these CPUs take traps the same way.
The runtime system includes the following line on startup:
SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX); // stop Windows pop-up on crashes
This was recently added to prevent the "mystery" system from showing a pop-up window, "xxx.exe has stopped working" when the crash occurs. The pop-up box behaviour doesn't happen on the first system, so all this did was push the problem into a different corner on the "mystery" machine.
Any clue where I look to configure/control this?
I provide here the SEH code I am using. It has been edited
to remove a considerable amount of sanity-checking code
that I claim has no effect on the apparant state seen
in this code.
The top level of the runtime system generates a set of worker
threads (using CreateThread) and points to execute ASMGrabGranuleAndGo;
each thread sets up its own SEH, and branches off to a work-stealing scheduler, RunReadyGranule. To the best of my knowledge, the SEH is not changed
after that; at least, the runtime system and the "object code" do
not do this, but I have no idea what the underlying (e.g, standard "C")
libraries might do.
Further down I provide the trap handler, TopLevelEHFilter.
Yes, its possible the register printing machinery itself blows
up causing a second exception. I'll try to check into this again soon,
but IIRC my last attempt to catch this in the debugger on the
mystery machine, did not pass control to the debugger, just
got me the pop up window.
public ASMGrabGranuleAndGo
ASSUME FS:NOTHING ; cancel any assumptions made for this register
ASMGrabGranuleAndGo:
;Purpose: Entry for threads as workers in PARLANSE runtime system.
; Each thread initializes as necessary, just once,
; It then goes and hunts for work in the GranulesQ
; and start executing a granule whenever one becomes available
; install top level exception handler
; Install handler for hardware exceptions
cmp gCompilerBreakpointSet, 0
jne HardwareEHinstall_end ; if set, do not install handler
push offset TopLevelEHFilter ; push new exception handler on Windows thread stack
mov eax, [TIB_SEH] ; expected to be empty
test eax, eax
BREAKPOINTIF jne
push eax ; save link to old exception handler
mov fs:[TIB_SEH], esp ; tell Windows that our exception handler is active for this thread
HardwareEHinstall_end:
;Initialize FPU to "empty"... all integer grains are configured like this
finit
fldcw RTSFPUStandardMode
lock sub gUnreadyProcessorCount, 1 ; signal that this thread has completed its initialization
##: push 0 ; sleep for 0 ticks
call MySleep ; give up CPU (lets other threads run if we don't have enuf CPUs)
lea esp, [esp+4] ; pop arguments
mov eax, gUnreadyProcessorCount ; spin until all other threads have completed initialization
test eax, eax
jne #b
mov gThreadIsAlive[ecx], TRUE ; signal to scheduler that this thread now officially exists
jmp RunReadyGranule
ASMGrabGranuleAndGo_end:
;-------------------------------------------------------------------------------
TopLevelEHFilter: ; catch Windows Structured Exception Handling "trap"
; Invocation:
; call TopLevelEHFilter(&ReportRecord,&RegistrationRecord,&ContextRecord,&DispatcherRecord)
; The arguments are passed in the stack at an offset of 8 (<--NUMBER FROM MS DOCUMENT)
; ESP here "in the stack" being used by the code that caused the exception
; May be either grain stack or Windows thread stack
extern exit :proc
extern syscall #RTSC_PrintExceptionName#4:near ; FASTCALL
push ebp ; act as if this is a function entry
mov ebp, esp ; note: Context block is at offset ContextOffset[ebp]
IF_USING_WINDOWS_THREAD_STACK_GOTO unknown_exception, esp ; don't care what it is, we're dead
; *** otherwise, we must be using PARLANSE function grain stack space
; Compiler has ensured there's enough room, if the problem is a floating point trap
; If the problem is illegal memory reference, etc,
; there is no guarantee there is enough room, unless the application is compiled
; with -G ("large stacks to handle exception traps")
; check what kind of exception
mov eax, ExceptionRecordOffset[ebp]
mov eax, ExceptionRecord.ExceptionCode[eax]
cmp eax, _EXCEPTION_INTEGER_DIVIDE_BY_ZERO
je div_by_zero_exception
cmp eax, _EXCEPTION_FLOAT_DIVIDE_BY_ZERO
je float_div_by_zero_exception
jmp near ptr unknown_exception
float_div_by_zero_exception:
mov ebx, ContextOffset[ebp] ; ebx = context record
mov Context.FltStatusWord[ebx], CLEAR_FLOAT_EXCEPTIONS ; clear any floating point exceptions
mov Context.FltTagWord[ebx], -1 ; Marks all registers as empty
div_by_zero_exception: ; since RTS itself doesn't do division (that traps),
; if we get *here*, then we must be running a granule and EBX for granule points to GCB
mov ebx, ContextOffset[ebp] ; ebx = context record
mov ebx, Context.Rebx[ebx] ; grain EBX has to be set for AR Allocation routines
ALLOCATE_2TOK_BYTES 5 ; 5*4=20 bytes needed for the exception structure
mov ExceptionBufferT.cArgs[eax], 0
mov ExceptionBufferT.pException[eax], offset RTSDivideByZeroException ; copy ptr to exception
mov ebx, ContextOffset[ebp] ; ebx = context record
mov edx, Context.Reip[ebx]
mov Context.Redi[ebx], eax ; load exception into thread's edi
GET_GRANULE_TO ecx
; This is Windows SEH (Structured Exception Handler... see use of Context block below!
mov eax, edx
LOOKUP_EH_FROM_TABLE ; protected by DelayAbort
TRUST_JMP_INDIRECT_OK eax
mov Context.Reip[ebx], eax
mov eax, ExceptionContinueExecution ; signal to Windows: "return to caller" (we've revised the PC to go to Exception handler)
leave
ret
TopLevelEHFilter_end:
unknown_exception:
<print registers, etc. here>

"DEP for standard system processes" won't help you; it's internally known as "OptIn". What you need is the IMAGE_DLLCHARACTERISTICS_NX_COMPAT flag set in the PE header of your .exe file. Or call the SetProcessDEPPolicy function in kernel32.dll The SetProcessMitigationPolicy would be good also... but it isn't available until Windows 8.
There's some nice explanation on Ed Maurer's blog, which explains both how .NET uses DEP (which you won't care about) but also the system rules (which you do).
BIOS settings can also affect whether hardware NX is available.

Related

x86 assembler pushf causes program to exit

I think my real problem is I don't completely understand the stack frame mechanism so I am looking to understand why the following code causes the program execution to resume at the end of the application.
This code is called from a C function which is several call levels deep and the pushf causes program execution to revert back several levels through the stack and completely exit the program.
Since my work around works as expected I would like to know why using the pushf instruction appears to be (I assume) corrupting the stack.
In the routines I usually setup and clean up the stack with :
sub rsp, 28h
...
add rsp, 28h
However I noticed that this is only necessary when the assembly code calls a C function.
So I tried removing this from both routines but it made no difference. SaveFlagsCmb is an assembly function but could easily be a macro.
The code represents an emulated 6809 CPU Rora (Rotate Right Register A).
PUBLIC Rora_I_A ; Op 46 - Rotate Right through Carry A reg
Rora_I_A PROC
sub rsp, 28h
; Restore Flags
mov cx, word ptr [x86flags]
push cx
popf
; Rotate Right the byte and save the FLAGS
rcr byte ptr [q_s+AREG], 1
; rcr only affects Carry. Save the Carry first in dx then
; add 0 to result to trigger Zero and Sign/Neg flags
pushf ; this causes jump to end of program ????
pop dx ; this line never reached
and dx, CF ; Save only Carry Flag
add [q_s+AREG], 0 ; trigger NZ flags
mov rcx, NF+ZF+CF ; Flag Mask NZ
Call SaveFlagsCmb ; NZ from the add and CF saved in dx
add rsp, 28h
ret
Rora_I_A ENDP
However if I use this code it works as expected:
PUBLIC Rora_I_A ; Op 46 - Rotate Right through Carry A reg
Rora_I_A PROC
; sub rsp, 28h ; works with or without this!!!
; Restore Flags
mov ah, byte ptr [x86flags+LSB]
sahf
; Rotate Right the byte and save the FLAGS
rcr byte ptr [q_s+AREG], 1
; rcr only affects Carry. Save the Carry first in dx then
; add 0 to result to trigger Zero and Sign/Neg flags
lahf
mov dl, ah
and dx, CF ; Save only Carry Flag
add [q_s+AREG], 0 ; trigger NZ flags
mov rcx, NF+ZF+CF ; Flag Mask NZ
Call SaveFlagsCmb ; NZ from the add and CF saved in dx
; add rsp, 28h ; works with or without this!!!
ret
Rora_I_A ENDP
Your reported behaviour doesn't really make sense. Mostly this answer is just providing some background not a real answer, and a suggestion not to use pushf/popf in the first place for performance reasons.
Make sure your debugging tools work properly and aren't being fooled by something into falsely showing a "jump" to somewhere. (And jump where exactly?)
There's little reason to mess around with 16-bit operand size, but that's probably not your problem.
In Visual Studio / MASM, apparently (according to OP's comment) pushf assembles as pushfw, 66 9C which pushes 2 bytes. Presumably popf also assembles as popfw, only popping 2 bytes into FLAGS instead of the normal 8 bytes into RFLAGS. Other assemblers are different.1
So your code should work. Unless you're accidentally setting some other bit in FLAGS that breaks execution? There are bits in EFLAGS/RFLAGS other than condition codes, including the single-step TF Trap Flag: debug exception after every instruction.
We know you're in 64-bit mode, not 32-bit compat mode, otherwise rsp wouldn't be a valid register. And running 64-bit machine code in 32-bit mode wouldn't explain your observations either.
I'm not sure how that would explain pushf being a jump to anywhere. pushf itself can't fault or jump, and if popf set TF then the instruction after popf would have caused a debug exception.
Are you sure you're assembling 64-bit machine code and running it in 64-bit mode? The only thing that would be different if a CPU decoded your code in 32-bit mode should be the REX prefix on sub rsp, 28h, and the RIP-relative addressing mode on [x86flags] decoding as absolute (which would presumably fault). So I don't think that could explain what you're seeing.
Are you sure you're single-stepping by instructions (not source lines or C statements) with a debugger to test this?
Use a debugger to look at the machine code as you single-step. This seem really weird.
Anyway, it seems like a very low-performance idea to use pushf / popf at all, and also to be using 16-bit operand-size creating false dependencies.
e.g. you can set x86 CF with movzx ecx, word ptr [x86flags] / bt ecx, CF.
You can capture the output CF with setc cl
Also, if you're going to do multiple things to the byte from the guest memory, load it into an x86 register. A memory-destination RCR and a memory-destination ADD are unnecessarily slow vs. load / rcr / ... / test reg,reg / store.
LAHF/SAHF may be useful, but you can also do without them too for many cases. popf is quite slow (https://agner.org/optimize/) and it forces a round trip through memory. However, there is one condition-code outside the low 8 in x86 FLAGS: OF (signed overflow). asm-source compatibility with 8080 is still hurting x86 in 2019 :(
You can restore OF from a 0/1 integer with add al, 127: if AL was originally 1, it will overflow to 0x80, otherwise it won't. You can then restore the rest of the condition codes with SAHF. You can extract OF with seto al. Or you can just use pushf/popf.
; sub rsp, 28h ; works with or without this!!!
Yes of course. You have a leaf function that doesn't use any stack space.
You only need to reserve another 40 bytes (align the stack + 32 bytes of shadow space) if you were going to make another function call from this function.
Footnote 1: pushf/popf in other assemblers:
In NASM, pushf/popf default to the same width as other push/pop instructions: 8 bytes in 64-bit mode. You get the normal encoding without an operand-size prefix. (https://www.felixcloutier.com/x86/pushf:pushfd:pushfq)
Like for integer registers, both 16 and 64-bit operand-size for pushf/popf are encodeable in 64-bit mode, but 32-bit operand size isn't.
In NASM, your code would be broken because push cx / popf would push 2 bytes and pop 8, popping 6 bytes of your return address into RFLAGS.
But apparently MASM isn't like that. Probably a good idea to use explicit operand-size specifiers anyway, like pushfw and popfw if you use it at all, to make sure you get the 66 9C encoding, not just 9C pushfq.
Or better, use pushfq and pop rcx like a normal person: only write to 8 or 16-bit partial registers when you need to, and keep the stack qword-aligned. (16-byte alignment before call, 8-byte alignment always.)
I believe this is a bug in Visual Studio. I'm using 2022, so it's an issue that's been around for a while.
I don't know exactly what is triggering it, however stepping over one specific pushf in my code had the same symptoms, albeit with the code actually working.
Putting a breakpoint on the line after the pushf did break, and allowed further debugging of my app. Adding a push ax, pop ax before the pushf also seemed to fix the issue. So it must be a Visual Studio issue.
At this point I think MASM and debugging in Visual Studio has pretty much been abandoned. Any suggestions for alternatives for developing dlls on Windows would be appreciated!

Direction Flag DF in the Windows 64-bit calling convention?

According to the docs I can find on calling windows functions, the following applies:-
The Microsoft x64 calling convention[12][13] is followed on Windows
and pre-boot UEFI (for long mode on x86-64). It uses registers RCX,
RDX, R8, R9 for the first four integer or pointer arguments (in that
order), and additional arguments are pushed onto the stack (right to
left). Integer return values (similar to x86) are returned in RAX if
64 bits or less.
In the Microsoft x64 calling convention, it's the caller's
responsibility to allocate 32 bytes of "shadow space" on the stack
right before calling the function (regardless of the actual number of
parameters used), and to pop the stack after the call. The shadow
space is used to spill RCX, RDX, R8, and R9,[14] but must be made
available to all functions, even those with fewer than four
parameters.
The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile
(caller-saved).[15]
The registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are
considered nonvolatile (callee-saved).[15]
So, I have been happily calling kernel32 until a call to GetEnvironmentVariableA failed under certain circumstances. I finally traced it back to the fact that the direction flag DF was set and I needed to clear it.
I have not up till now been able to find any mention of this and wondered if it was prudent to always clear it before a call.
Or maybe that would cause other problems. Anyone aware of the conventions of calling in this instance?
Windows assumes that the direction flag is cleared. Despite in article said about C run-time only, this is true for whole windows (I think because windows code itself is primarily written in c/c++). So when your programme begins to execute - you can assume that DF is 0. Usually you do not need to change this flag. However if you temporarily change it (set it to 1) in some internal routine you must clear it by cld before calling any windows API or any external module (because it assumes that DF is 0).
All windows interrupts at very beginning of execution clear DF to 0 - so it is safe to temporarily set DF to 1 in own internal code, main - before any external call reset it back to 0.

DOS DEBUG trace command doesn't work as I would expect

I have ASM code which print abc using looping syntax. Here is my code
;abc.com
.model small
.code
org 100h
start:
mov ah, 02h
mov dl, 'a'
mov cx, 3h
ulang:
int 21h
inc dl
loop ulang
int 20h
end start
the COM program run normally
result of debug abc.com followed with -t looks like
The question is why it's NOP after INT 21, instead of INC dl? AFAIK it should INC dl then LOOP xxxx for three times then INT 20.
When I press -t continously it's go somewhere I don't know till crash, means can't find INT 20h
it's different with debug abc.com followed with -u
it's show INC dl and LOOP 0107 which indicate looping.
FYI:
Win 7 Ultimate SP 1 32 Bit
GUI Turbo ASM x86 3.0
Celeron Dual Core n2840
The Trace command in debug is the equivalent of the STEP INTO feature of modern day debuggers. The int instruction (like call) executes a series of instructions and then returns back to the caller. Trace will step into a software interrupt handler or a function and execute each instruction one at a time. The MSDN documentation for debug says this about Trace:
Executes one instruction and displays the contents of all registers, the status of all flags, and the decoded form of the instruction executed.
In your case you hit int 21h and jumped to the software interrupt handlers code at CS:IP 00A7:107C . If you trace through all the interrupt handler code you'd eventually reach CS:IP of 1400:0109 where the INC DL instruction is.
In order to execute a function or interrupt without stepping through each instruction associated with it, you can use the proceed command. Proceed is akin to the STEP OVER feature of modern day debuggers. The code of an interrupt handler or a function/subroutine will execute and then break on the instruction after the INT or CALL instruction.
The documentation says this about PROCEED:
When the p command transfers control from Debug to the program being tested, that program runs without interruption until the loop, repeated string instruction, software interrupt, or subroutine at the specified address is completed, or until the specified number of machine instructions have been executed. Control then returns to Debug.

CPU idle loop without Sleep

I am learning WIN32 ASM right now and I was wondering if there is something like an "idle" infinite loop that doesn't consume any resources at all. Basically I need a running process to experiment with that goes like this:
loop:
; alternative to sleep...
jmp loop
Is there something that may idle the process?
You can't have it both ways. You can either consume the CPU or you can let it do something else. You can conserve power and avoid depriving other cores of resources with rep; nop; (also known as pause), as Vlad Lazarenko suggested. But if you loop without yielding the core to another process, at least that virtual core cannot do anything else.
Note: You should never use empty loops to make the application idle. It will load your processor to 100%
There are several ways to make the application idle, when there is no GUI activity in Win32 environment. The main (documented in MSDN) is to use "GetMessage" function in the main loop in order to extract the messages from the message queue.
When the message queue is empty, this function will idle, consuming very low processor time, waiting for message to arrive in the message queue.
Below is an example, using FASM macro library:
msg_loop:
invoke GetMessage, msg, NULL, 0, 0
cmp eax, 1
jb end_loop
jne msg_loop
invoke TranslateMessage, msg
invoke DispatchMessage, msg
jmp msg_loop
Another approach is used when you want to catch the moment when the application goes to idle state and make some low priority, one-time processing (for example enabling/disabling the buttons on the toolbar, according to the state of the application).
In this case, a combination of PeekMessage and WaitMessage have to be used. PeekMessage function returns immediately, even when the message queue is empty. This way, you can detect this situation and provide some idle tasks to be done, then you have to call WaitMessage in order to idle the process waiting for incoming messages.
Here is an simplified example from my code (using FreshLib macros):
; Main message loop
Run:
invoke PeekMessageA, msg, 0, 0, 0, PM_REMOVE
test eax,eax
jz .empty
cmp [msg.message], WM_QUIT
je .terminate
invoke TranslateMessage, msg
invoke DispatchMessageA, msg
jmp Run
.empty:
call OnIdle
invoke WaitMessage
jmp Run
.terminate:
FinalizeAll
stdcall TerminateAll, 0
What do you mean by "consume resources"?
If you just want a command that does nothing? If so nop will do that, and you can loop as much as you want even: rep; nop. However, the CPU will actually be busy doing work: executing the "no operation" instruction.
If you want an instruction that will cause the CPU itself to stop, then you are sorta-kinda out of luck: although there are ways to do that, you cannot do it from userspace.
With ring 0 access level (like kernel driver), you could use the x86 HLT opcode, but you need to have a system programming skill to really understand how to use it. Using HLT this way requires interrupts to be enabled (not masked) and the guarantee of an interrupt occurring (e.g. system timer), because the return from the interrupt will execute the next instruction after the HLT.
Without ring 0 access you'll never find any x86 opcode to enter an "idle" mode...
You only could find some instructions which consume less power (no memory access, no cache access, no FPU access, low ALU usage...).
Yes, there are some architectures support intrinsic idle state AKA halt.
For example, in x86 hlt, opcode 0xf4. Probably can be called only on privileged mode.
CPU Switches from User mode to Kernel Mode : What exactly does it do? How does it makes this transition?
How to completely suspend the processor?
a Linux's userspace example I found here:
.section .rodata
greeting:
.string "Hello World\n"
.text
_start:
mov $12,%edx /* write(1, "Hello World\n", 12) */
mov $greeting,%ecx
mov $1,%ebx
mov $4,%eax /* write is syscall 4 */
int $0x80
xorl %ebx, %ebx /* Set exit status and exit */
mov $0xfc,%eax
int $0x80
hlt /* Just in case... */

Malloc and Free multiple arrays in assembly

I'm trying to experiment with malloc and free in assembly code (NASM, 64 bit).
I have tried to malloc two arrays, each with space for 2 64 bit numbers. Now I would like to be able to write to their values (not sure if/how accessing them will work exactly) and then at the end of the whole program or in the case of an error at any point, free the memory.
What I have now works fine if there is one array but as soon as I add another, it fails on the first attempt to deallocate any memory :(
My code is currently the following:
extern printf, malloc, free
LINUX equ 80H ; interupt number for entering Linux kernel
EXIT equ 60 ; Linux system call 1 i.e. exit ()
segment .text
global main
main:
push dword 16 ; allocate 2 64 bit numbers
call malloc
add rsp, 4 ; Undo the push
test rax, rax ; Check for malloc failure
jz malloc_fail
mov r11, rax ; Save base pointer for array
; DO SOME CODE/ACCESSES/OPERATIONS HERE
push dword 16 ; allocate 2 64 bit numbers
call malloc
add rsp, 4 ; Undo the push
test rax, rax ; Check for malloc failure
jz malloc_fail
mov r12, rax ; Save base pointer for array
; DO SOME CODE/ACCESSES/OPERATIONS HERE
malloc_fail:
jmp dealloc
; Finish Up, deallocate memory and exit
dealloc:
dealloc_1:
test r11, r11 ; Check that the memory was originally allocated
jz dealloc_2 ; If not, try the next block of memory
push r11 ; push the address of the base of the array
call free ; Free this memory
add rsp, 4
dealloc_2:
test r12, r12
jz dealloc_end
push r12
call free
add rsp, 4
dealloc_end:
call os_return ; Exit
os_return:
mov rax, EXIT
mov rdi, 0
syscall
I'm assuming the above code is calling the C functions malloc() and free()...
If 1st malloc() fails, you arrive at dealloc_1 with whatever garbage is in r11 and r12 after returning from the malloc().
If 2nd malloc() fails, you arrive at dealloc_1 with whatever garbage is in r12 after returning from the malloc().
Therefore, you have to zero out r11 and r12 before doing the first allocation.
Since this is 64-bit mode, all pointers/addresses and sizes are normally 64-bit. When you pass one of those to a function, it has to be 64-bit. So, push dword 16 isn't quite right. It should be push qword 16 instead. Likewise, when you are removing these parameters from the stack, you have to remove exactly as many bytes as you've put there, so add rsp, 4 must change to add rsp, 8.
Finally, I don't know which registers malloc() and free() preserve and which they don't. You may need to save and restore the so-called volatile registers (see your C compiler documentation). The same holds for the code not shown. It must preserve r11 and r12 so they can be used for deallocation. EDIT: And I'd check if it's the right way of passing parameters through the stack (again, see your compiler documentation).
EDIT: you're testing r11 for 0 right before 2nd free(). It should be r12. But free() doesn't really mind receiving NULL pointers. So, these checks can be removed.
Pay attention to your code.
You have to obey x86-64 calling conventions: arguments might be passed through registers, in the case of malloc that would be RDI for the size. And as already pointed out, you have to watch out which registers are preserved by the called functions. (afaik only RBP, RSP and R12-R15 are preserved across function calls)
There are at least two bugs, because you test r11 again (the line test r11,r11 after dealloc_2:, but you supposedly wanted to test r12 here. Additionally you want to push a qword, if you are in 64 bit mode.
The reason the deallocation doesn't work at all may be because you are changing the contents of r11 or r12.
Not that both tests are not needed, as it is perfectly safe to call free with an null pointer.

Resources