GlobalFree() causing user breakpoint... memory block is fixed, not locked, single module, no DLL

GlobalFree() causing user breakpoint... memory block is fixed, not locked, single module, no DLL - winapi

At one point in my program I am calling GlobalFree() to release a memory buffer I allocated with GlobalAlloc() using GMEM_FIXED flag. There is nothing that could be locking this block. However, when I call GlobalFree() after referencing the data (and all of the internal data is still the same as it was), the program stops and says it has encountered a user breakpoint from code in the GlobalFree() code.
Any ideas what could cause this?

The heap functions typically call DebugBreak() - which implements a user-breakpoint - when they detect that the heap structures have been damaged.
The implication is you have written past the end (or the beginning) of the allocated area.

Related

How to free all GPU memory from pytorch.load?

This code fills some GPU memory and doesn't let it go:
def checkpoint_mem(model_name):
checkpoint = torch.load(model_name)
del checkpoint
torch.cuda.empty_cache()
Printing memory with the following code:
print(torch.cuda.memory_reserved(0))
print(torch.cuda.memory_allocated(0))
shows BEFORE running checkpoint_mem:
0
0
and AFTER:
121634816
97332224
This is with torch.__version__ 1.11.0+cu113 on Google colab.
Does torch.load leak memory? How can I get the GPU memory completely cleared?

It probably doesn't. Also, it depends on what you call memory leak. In this case, after the program ends all memory should be freed, python has a garbage collector, so it might not happen immediately (your del or after leaving the scope) like it does in C++ or similar languages with RAII.
del
del is called by Python and only removes the reference (same as when the object goes out of scope in your function).
torch.nn.Module does not implement del, hence its reference is simply removed.
All of the elements within torch.nn.Module have their references removed recursively (so for each CUDA torch.Tensor instance their __del__ is called).
del on each tensor is a call to release memory
More about __del__
Caching allocator
Another thing - caching allocator occupies part of the memory so it doesn't have to rival other apps in need of CUDA when you are going to use it.
Also, I assume PyTorch is loaded lazily, hence you get 0 MB used at the very beginning, but AFAIK PyTorch itself, during startup, reserves some part of CUDA memory.
The short story is given here, longer one here in case you didn’t see it already.
Possible experiments
You may try to run time.sleep(5) after your function and measure afterwards.
You can get snapshot of the allocator state via torch.cuda.memory_snapshot to get more info about allocator’s reserved memory and inner workings.
You might set the environment variable PYTORCH_NO_CUDA_MEMORY_CACHING=1 and see whether and if anything changes.
Disclaimer
Not a CUDA expert by any means, so someone with more insight could probably expand (and/or correct) my current understanding as I am sure way more things happen under the hood.

It is not possible, see here for the same question and the response from PyTorch developer:
https://github.com/pytorch/pytorch/issues/37664

A DLL should free heap memory only if the DLL is unloaded dynamically?

Question Purpose: Reality check on the MS docs of DllMain.
It is "common" knowledge that you shouldn't do too much in DllMain, there are definite things you must never do, some best practises.
I now stumbled over a new gem in the docs, that makes little sense to me: (emph. mine)
When handling DLL_PROCESS_DETACH, a DLL should free resources such as
heap memory only if the DLL is being unloaded dynamically (the
lpReserved parameter is NULL). If the process is terminating (the
lpvReserved parameter is non-NULL), all threads in the process except
the current thread either have exited already or have been explicitly
terminated by a call to the ExitProcess function, which might leave
some process resources such as heaps in an inconsistent state. In this
case, it is not safe for the DLL to clean up the resources. Instead,
the DLL should allow the operating system to reclaim the memory.
Since global C++ objects are cleaned up during DllMain/DETACH, this would imply that global C++ objects must not free any dynamic memory, because the heap may be in an inconsistent state. / When the DLL is "linked statically" to the executable. / Certainly not what I see out there - global C++ objects (iff there are) of various (ours, and third party) libraries allocate and deallocate just fine in their destructors. (Barring other ordering bugs, o.c.)
So, what specific technical problem is this warning aimed at?
Since the paragraph mentions thread termination, could there be a heap corruption problem when some threads are not cleaned up correctly?

The ExitProcess API in general does the follwoing:
Enter Loader Lock critical section
lock main process heap (returned by GetProcessHeap()) via HeapLock(GetProcessHeap()) (ok, of course via RtlLockHeap) (this is very important step for avoid deadlock)
then terminate all threads in process, except current (by call NtTerminateProcess(0, 0) )
then call LdrShutdownProcess - inside this api loader walk by loaded module list and sends DLL_PROCESS_DETACH with lpvReserved nonnull.
finally call NtTerminateProcess(NtCurrentProcess(), ExitCode ) which terminates the process.
The problem here is that threads terminated in arbitrary place. For example, thread can allocate or free memory from any heap and be inside heap critical section, when it terminated. As a result, if code during DLL_PROCESS_DETACH tries to free a block from the same heap, it deadlocks when trying to enter this heap's critical section (if of course heap implementation use it).
Note that this does not affect the main process heap, because we call HeapLock for it before terminate all threads (except current). The purpose of this: We wait in this call until all another threads exit from process heap critical section and after we acquire the critical section, no other threads can enter it - because the main process heap is locked.
So, when we terminate threads after locking the main heap - we can be sure that no other threads that are killed are inside main heap critical section or heap structure in inconsistent state. Thanks to RtlLockHeap call. But this is related only to main process heap. Any other heaps in the process are not locked. So these can be in inconsistent state during DLL_PROCESS_DETACH or can be exclusively acquired by an already terminated thread.
So - using HeapFree for GetProcessHeap or saying LocalFree is safe (however not documented) here.
Using HeapFree for any other heaps is not safe if DllMain is called during process termination.
Also if you use another custom data structures by several threads - it can be in inconsistent state, because another threads (which can use it) terminated in arbitrary point.
So this note is warning that when lpvReserved parameter is non-NULL (what is mean DllMain is called during process termination) you need to be especially careful in clean up the resources. Anyway all internal memory allocations will be free by operation system when process died.

As an addendum to RbMm's excellent answer, I'll add a quote from ExitProcess that does a much better job - than the DllMain docs do - at explaining, why heap operation (or any operation, really) can be compromised:
If one of the terminated threads in the process holds a lock and the
DLL detach code in one of the loaded DLLs attempts to acquire the same
lock, then calling ExitProcess results in a deadlock. In contrast, if
a process terminates by calling TerminateProcess, the DLLs that the
process is attached to are not notified of the process termination.
Therefore, if you do not know the state of all threads in your
process, it is better to call TerminateProcess than ExitProcess. Note
that returning from the main function of an application results in a
call to ExitProcess.
So, it all boils down to: IFF you application has "runaway" threads that may hold any lock, the (CRT) heap lock being a prominent example, you have a big problem during shutdown, when you need to access the same structures (e.g. the heap), that your "runaway" threads are using.
Which just goes to show that you should shut down all your threads in a controlled way.

Is it valid to write below ESP?

For a 32-bit windows application is it valid to use stack memory below ESP for temporary swap space without explicitly decrementing ESP?
Consider a function that returns a floating point value in ST(0). If our value is currently in EAX we would, for example,
PUSH EAX
FLD [ESP]
ADD ESP,4 // or POP EAX, etc
// return...
Or without modifying the ESP register, we could just :
MOV [ESP-4], EAX
FLD [ESP-4]
// return...
In both cases the same thing happens except that in the first case we take care to decrement the stack pointer before using the memory, and then to increment it afterwards. In the latter case we do not.
Notwithstanding any real need to persist this value on the stack (reentrancy issues, function calls between PUSHing and reading the value back, etc) is there any fundamental reason why writing to the stack below ESP like this would be invalid?

TL:DR: no, there are some SEH corner cases that can make it unsafe in practice, as well as being documented as unsafe. #Raymond Chen recently wrote a blog post that you should probably read instead of this answer.
His example of a code-fetch page-fault I/O error that can be "fixed" by prompting the user to insert a CD-ROM and retry is also my conclusion for the only practically-recoverable fault if there aren't any other possibly-faulting instructions between store and reload below ESP/RSP.
Or if you ask a debugger to call a function in the program being debugged, it will also use the target process's stack.
This answer has a list of some things you'd think would potentially step on memory below ESP, but actually don't, which might be interesting. It seems to be only SEH and debuggers that can be a problem in practice.
First of all, if you care about efficiency, can't you avoid x87 in your calling convention? movd xmm0, eax is a more efficient way to return a float that was in an integer register. (And you can often avoid moving FP values to integer registers in the first place, using SSE2 integer instructions to pick apart exponent / mantissa for a log(x), or integer add 1 for nextafter(x).) But if you need to support very old hardware, then you need a 32-bit x87 version of your program as well as an efficient 64-bit version.
But there are other use-cases for small amounts of scratch space on the stack where it would be nice to save a couple instructions that offset ESP/RSP.
Trying to collect up the combined wisdom of other answers and discussion in comments under them (and on this answer):
It is explicitly documented as being not safe by Microsoft: (for 64-bit code, I didn't find an equivalent statement for 32-bit code but I'm sure there is one)
Stack Usage (for x64)
All memory beyond the current address of RSP is considered volatile: The OS, or a debugger, may overwrite this memory during a user debug session, or an interrupt handler.
So that's the documentation, but the interrupt reason stated doesn't make sense for the user-space stack, only the kernel stack. The important part is that they document it as not guaranteed safe, not the reasons given.
Hardware interrupts can't use the user stack; that would let user-space crash the kernel with mov esp, 0, or worse take over the kernel by having another thread in the user-space process modify return addresses while an interrupt handler was running. This is why kernels always configure things so interrupt context is pushed onto the kernel stack.
Modern debuggers run in a separate process, and are not "intrusive". Back in 16-bit DOS days, without a multi-tasking protected-memory OS to give each task its own address space, debuggers would use the same stack as the program being debugged, between any two instructions while single-stepping.
#RossRidge points out that a debugger might want to let you call a function in the context of the current thread, e.g. with SetThreadContext. This would run with ESP/RSP just below the current value. This could obviously have side-effects for the process being debugged (intentional on the part of the user running the debugger), but clobbering local variables of the current function below ESP/RSP would be an undesirable and unexpected side-effect. (So compilers can't put them there.)
(In a calling convention with a red-zone below ESP/RSP, a debugger could respect that red-zone by decrementing ESP/RSP before making the function call.)
There are existing program that intentionally break when being debugged at all, and consider this a feature (to defend against efforts to reverse-engineer them).
Related: the x86-64 System V ABI (Linux, OS X, all other non-Windows systems) does define a red-zone for user-space code (64-bit only): 128 bytes below RSP that is guaranteed not to be asynchronously clobbered. Unix signal handlers can run asynchronously between any two user-space instructions, but the kernel respects the red-zone by leaving a 128 byte gap below the old user-space RSP, in case it was in use. With no signal handlers installed, you have an effectively unlimited red-zone even in 32-bit mode (where the ABI does not guarantee a red-zone). Compiler-generated code, or library code, of course can't assume that nothing else in the whole program (or in a library the program called) has installed a signal handler.
So the question becomes: is there anything on Windows that can asynchronously run code using the user-space stack between two arbitrary instructions? (i.e. any equivalent to a Unix signal handler.)
As far as we can tell, SEH (Structured Exception Handling) is the only real obstacle to what you propose for user-space code on current 32 and 64-bit Windows. (But future Windows could include a new feature.)
And I guess debugging if you happen ask your debugger to call a function in the target process/thread as mentioned above.
In this specific case, not touching any other memory other than the stack, or doing anything else that could fault, it's probably safe even from SEH.
SEH (Structured Exception Handling) lets user-space software have hardware exceptions like divide by zero delivered somewhat similarly to C++ exceptions. These are not truly asynchronous: they're for exceptions triggered by instructions you ran, not for events that happened to come after some random instruction.
But unlike normal exceptions, one thing a SEH handler can do is resume from where the exception occurred. (#RossRidge commented: SEH handlers are are initially called in the context of the unwound stack and can choose to ignore the exception and continue executing at the point where the exception occurred.)
So that's a problem even if there's no catch() clause in the current function.
Normally HW exceptions can only be triggered synchronously. e.g. by a div instruction, or by a memory access which could fault with STATUS_ACCESS_VIOLATION (the Windows equivalent of a Linux SIGSEGV segmentation fault). You control what instructions you use, so you can avoid instructions that might fault.
If you limit your code to only accessing stack memory between the store and reload, and you respect the stack-growth guard page, your program won't fault from accessing [esp-4]. (Unless you reached the max stack size (Stack Overflow), in which case push eax would fault, too, and you can't really recover from this situation because there's no stack space for SEH to use.)
So we can rule out STATUS_ACCESS_VIOLATION as a problem, because if we get that on accessing stack memory we're hosed anyway.
An SEH handler for STATUS_IN_PAGE_ERROR could run before any load instruction. Windows can page out any page it wants to, and transparently page it back in if it's needed again (virtual memory paging). But if there's an I/O error, your Windows attempts to let your process handle the failure by delivering a STATUS_IN_PAGE_ERROR
Again, if that happens to the current stack, we're hosed.
But code-fetch could cause STATUS_IN_PAGE_ERROR, and you could plausibly recover from that. But not by resuming execution at the place where the exception occurred (unless we can somehow remap that page to another copy in a highly fault-tolerant system??), so we might still be ok here.
An I/O error paging in the code that wants to read what we stored below ESP rules out any chance of reading it. If you weren't planning to do that anyway, you're fine. A generic SEH handler that doesn't know about this specific piece of code wouldn't be trying to do that anyway. I think usually a STATUS_IN_PAGE_ERROR would at most try to print an error message or maybe log something, not try to carry on whatever computation was happening.
Accessing other memory in between the store and reload to memory below ESP could trigger a STATUS_IN_PAGE_ERROR for that memory. In library code, you probably can't assume that some other pointer you passed isn't going to be weird and the caller is expecting to handle STATUS_ACCESS_VIOLATION or PAGE_ERROR for it.
Current compilers don't take advantage of space below ESP/RSP on Windows, even though they do take advantage of the red-zone in x86-64 System V (in leaf functions that need to spill / reload something, exactly like what you're doing for int -> x87.) That's because MS says it isn't safe, and they don't know whether SEH handlers exist that could try to resume after an SEH.
Things that you'd think might be a problem in current Windows, and why they're not:
The guard page stuff below ESP: as long as you don't go too far below the current ESP, you'll be touching the guard page and trigger allocation of more stack space instead of faulting. This is fine as long as the kernel doesn't check user-space ESP and find out that you're touching stack space without having "reserved" it first.
kernel reclaim of pages below ESP/RSP: apparently Windows doesn't currently do this. So using a lot of stack space once ever will keep those pages allocated for the rest of your process lifetime, unless you manually VirtualAlloc(MEM_RESET) them. (The kernel would be allowed to do this, though, because the docs say memory below RSP is volatile. The kernel could effectively zero it asynchronously if it wants to, copy-on-write mapping it to a zero page instead of writing it to the pagefile under memory pressure.)
APC (Asynchronous Procedure Calls): They can only be delivered when the process is in an "alertable state", which means only when inside a call to a function like SleepEx(0,1). calling a function already uses an unknown amount of space below E/RSP, so you already have to assume that every call clobbers everything below the stack pointer. Thus these "async" callbacks are not truly asynchronous with respect to normal execution the way Unix signal handlers are. (fun fact: POSIX async io does use signal handlers to run callbacks).
Console-application callbacks for ctrl-C and other events (SetConsoleCtrlHandler). This looks exactly like registering a Unix signal handler, but in Windows the handler runs in a separate thread with its own stack. (See RbMm's comment)
SetThreadContext: another thread could change our EIP/RIP asynchronously while this thread is suspended, but the whole program has to be written specially for that to make any sense. Unless it's a debugger using it. Correctness is normally not required when some other thread is messing around with your EIP unless the circumstances are very controlled.
And apparently there are no other ways that another process (or something this thread registered) can trigger execution of anything asynchronously with respect to the execution of user-space code on Windows.
If there are no SEH handlers that could try to resume, Windows more or less has a 4096 byte red-zone below ESP (or maybe more if you touch it incrementally?), but RbMm says nobody takes advantage of it in practice. This is unsurprising because MS says not to, and you can't always know if your callers might have done something with SEH.
Obviously anything that would synchronously clobber it (like a call) must also be avoided, again same as when using the red-zone in the x86-64 System V calling convention. (See https://stackoverflow.com/tags/red-zone/info for more about it.)

in general case (x86/x64 platform) - interrupt can be executed at any time, which overwrite memory bellow stack pointer (if it executed on current stack). because this, even temporary save something bellow stack pointer, not valid in kernel mode - interrupt will be use current kernel stack. but in user mode situation another - windows build interrupt table (IDT) suchwise that when interrupt raised - it will be always executed in kernel mode and in kernel stack. as result user mode stack (below stack pointer) will be not affected. and possible temporary use some stack space bellow it pointer, until you not do any functions calls. if exception will be (say by access invalid address) - also space bellow stack pointer will be overwritten - cpu exception of course begin executed in kernel mode and kernel stack, but than kernel execute callback in user space via ntdll.KiDispatchExecption already on current stack space. so in general this is valid in windows user mode (in current implementation), but you need good understand what you doing. however this is very rarely i think used
of course, how correct noted in comments that we can, in windows user mode, write below stack pointer - is just the current implementation behavior. this not documented or guaranteed.
but this is very fundamental - unlikely will be changed: interrupts always will be executed in privileged kernel mode only. and kernel mode will be use only kernel mode stack. the user mode context not trusted at all. what will be if user mode program set incorrect stack pointer ? say by
mov rsp,1 or mov esp,1 ? and just after this instruction interrupt will be raised. what will be if it begin executed on such invalid esp/rsp ? all operation system just crashed. exactly because this interrupt will be executed only on kernel stack. and not overwrite user stack space.
also need note that stack is limited space (even in user mode), access it bellow 1 page (4Kb)already error (need do stack probing page by page, for move guard page down).
and finally really there is no need usually access [ESP-4], EAX - in what problem decrement ESP first ? even if we need access stack space in loop huge count of time - decrement stack pointer need only once - 1 additional instruction (not in loop) nothing change in performance or code size.
so despite formal this is will be correct work in windows user mode, better (and not need) use this
of course formal documentation say:
Stack Usage
All memory beyond the current address of RSP is considered volatile
but this is for common case, including kernel mode too. i wrote about user mode and based on current implementation
possible in future windows and add "direct" apc or some "direct" signals - some code will be executed via callback just after thread enter to kernel (during usual hardware interrupt). after this all below esp will be undefined. but until this not exist. until this code will be work always(in current builds) correct.

In general (not specifically related to any OS); it's not safe to write below ESP if:
It's possible for the code to be interrupted and the interrupt handler will run at the same privilege level. Note: This is typically very unlikely for "user-space" code, but extremely likely for kernel code.
You call any other code (where either the call or the stack used by the called routine can trash the data you stored below ESP)
Something else depends on "normal" stack use. This can include signal handling, (language based) exception unwinding, debuggers, "stack smashing protector"
It's safe to write below ESP if it's not "not safe".
Note that for 64-bit code, writing below RSP is built into the x86-64 ABI ("red zone"); and is made safe by support for it in tool chains/compilers and everything else.

When a thread gets created, Windows reserves a contiguous region of virtual memory of a configurable size (the default is 1 MB) for the thread's stack. Initially, the stack looks like this (the stack grows downwards):
--------------
| committed |
--------------
| guard page |
--------------
| . |
| reserved |
| . |
| . |
| |
--------------
ESP will be pointing somewhere inside the committed page. The guard page is used to support automatic stack growth. The reserved pages region ensures that the requested stack size is available in virtual memory.
Consider the two instructions from the question:
MOV [ESP-4], EAX
FLD [ESP-4]
There are three possibilities:
The first instruction executes successfully. There is nothing that uses the user-mode stack that can execute between the two instructions. So the second instruction will use the correct value (#RbMm stated this in the comments under his answer and I agree).
The first instruction raises an exception and an exception handler does not return EXCEPTION_CONTINUE_EXECUTION. As long as the second instruction is immediately after the first one (it is not in the exception handler or placed after it), then the second instruction will not execute. So you're still safe. Execution continues from stack frame where the exception handler exists.
The first instruction raises an exception and an exception handler returns EXCEPTION_CONTINUE_EXECUTION. Execution continues from the same instruction that raised the exception (potentially with a context modified by the handler). In this particular example, the first will be re-executed to write a value below ESP. No problem. If the second instruction raised an exception or there are more than two instructions, then the exception might occur a place after a value is written below ESP. When the exception handler gets called, it may overwrite the value and then return EXCEPTION_CONTINUE_EXECUTION. But when execution resumes, the value written is assumed to still be there, but it's not anymore. This is a situation where it's not safe to write below ESP. This applies even if all of the instructions are placed consecutively. Thanks to #RaymondChen for pointing this out.
In general, if the two instructions are not placed back-to-back, if you are writing to locations beyond ESP, there is no guarantee that the written values won't get corrupted or overwritten. One case that I can think of where this might happen is structured exception handling (SEH). If a hardware-defined exception (such as divide by zero) occurs, the kernel exception handler will be invoked (KiUserExceptionDispatcher) in kernel-mode, which will invoke the user-mode side of the handler (RtlDispatchException). When switching from user-mode to kernel-mode and then back to user-mode, whatever value was in ESP will be saved and restored. However, the user-mode handler itself uses the user-mode stack and will iterate over a registered list of exception handlers, each of which uses the user-mode stack. These functions will modify ESP as required. This may lead to losing the values you've written beyond ESP. A similar situation occurs when using software-define exceptions (throw in VC++).
I think you can deal with this by registering your own exception handler before any other exception handlers (so that it is called first). When your handler gets called, you can save your data beyond ESP elsewhere. Later, during unwinding, you get the cleanup opportunity to restore your data to the same location (or any other location) on the stack.
You need also to similarly watch out for asynchronous procedure calls (APCs) and callbacks.

Several answers here mention APCs (Asynchronous Procedure Calls), saying that they can only be delivered when the process is in an "alertable state", and are not truly asynchronous with respect to normal execution the way Unix signal handlers are
Windows 10 version 1809 introduces Special User APCs, which can fire at any moment just like Unix signals. See this article for low level details.
The Special User APC is a mechanism that was added in RS5 (and exposed through NtQueueApcThreadEx), but lately (in an insider build) was exposed through a new syscall - NtQueueApcThreadEx2. If this type of APC is used, the thread is signaled in the middle of the execution to execute the special APC.

How an assembler instruction could not read the memory it is placed at

Using some software in Windows XP that works as a Windows service and doing a restart from the logon screen I see an infamous error message
The instruction at "00x..." referenced memory at "00x...". The memory
could not be read.
I reported the problem to the developers, but looking at the message once again, I noticed that the addresses are the same. So
The instruction at "00xdf3251" referenced memory at "00xdf3251". The memory
could not be read.
Whether this is a bug in the program or not, but what is the state of the memory/access rights or something else that prevents an instruction from reading the memory it is placed. Is it something specific to services?

I would guess there was an attempt to execute an instruction at the address 0xdf3251 and that location wasn't backed up by a readable and executable page of memory (perhaps, completely unmapped).
If that's the case, the exception (page fault, in fact) originates from that instruction and the exception handler has its address on the stack (the location to return to, in case the exception can be somehow resolved and the faulting instruction restarted when the handler returns). And that's the first address you're seeing.
The CR2 register that the page fault handler reads, which is the second address you're seeing, also has the same address because it has to contain the address of an inaccessible memory location irrespective of whether the page fault has been caused by:
complete absence of mapping (there's no page mapped at all)
lack of write permission (the page is read-only)
lack of execute permission (the page has the no-execute bit set) OR
lack of kernel privilege (the page is marked as accessible only in the kernel)
and irrespective of whether it was during a data access or while fetching an instruction (the latter being our case).
That's how you can get the instruction and memory access addresses equal.
Most likely the code had a bug resulting in a memory corruption and some pointer (or a return address on the stack) was overwritten with a bogus value pointing to an inaccessible memory location. And then one way or the other the CPU was directed to continue execution there (most likely using one of these instructions: jmp, call, ret). There's also a chance of having a race condition somewhere.

This kind of crash is most typically caused by stack corruption. A very common kind is a stack buffer overflow. Write too much data in an array stored on the stack and it overwrites a function's return address with the data. When the function then returns, it jumps to the bogus return address and the program falls over because there's no code at the address. They'll have a hard time fixing the bug since there's no easy way to find out where the corruption occurred.
This is a rather infamous kind of bug, it is a major attack vector for malware. Since it can commandeer a program to jump to arbitrary code with data. You ought to have a sitdown with these devs and point this out, it is a major security risk. The cure is easy enough, they should update their tools. Countermeasures against buffer overflow are built into the compilers these days.

Can address space be recycled for multiple calls to MapViewOfFileEx without chance of failure?

Consider a complex, memory hungry, multi threaded application running within a 32bit address space on windows XP.
Certain operations require n large buffers of fixed size, where only one buffer needs to be accessed at a time.
The application uses a pattern where some address space the size of one buffer is reserved early and is used to contain the currently needed buffer.
This follows the sequence:
(initial run) VirtualAlloc -> VirtualFree -> MapViewOfFileEx
(buffer changes) UnMapViewOfFile -> MapViewOfFileEx
Here the pointer to the buffer location is provided by the call to VirtualAlloc and then that same location is used on each call to MapViewOfFileEx.
The problem is that windows does not (as far as I know) provide any handshake type operation for passing the memory space between the different users.
Therefore there is a small opportunity (at each -> in my above sequence) where the memory is not locked and another thread can jump in and perform an allocation within the buffer.
The next call to MapViewOfFileEx is broken and the system can no longer guarantee that there will be a big enough space in the address space for a buffer.
Obviously refactoring to use smaller buffers reduces the rate of failures to reallocate space.
Some use of HeapLock has had some success but this still has issues - something still manages to steal some memory from within the address space.
(We tried Calling GetProcessHeaps then using HeapLock to lock all of the heaps)
What I'd like to know is there anyway to lock a specific block of address space that is compatible with MapViewOfFileEx?
Edit: I should add that ultimately this code lives in a library that gets called by an application outside of my control

You could brute force it; suspend every thread in the process that isn't the one performing the mapping, Unmap/Remap, unsuspend the suspended threads. It ain't elegant, but it's the only way I can think of off-hand to provide the kind of mutual exclusion you need.

Have you looked at creating your own private heap via HeapCreate? You could set the heap to your desired buffer size. The only remaining problem is then how to get MapViewOfFileto use your private heap instead of the default heap.
I'd assume that MapViewOfFile internally calls GetProcessHeap to get the default heap and then it requests a contiguous block of memory. You can surround the call to MapViewOfFile with a detour, i.e., you rewire the GetProcessHeap call by overwriting the method in memory effectively inserting a jump to your own code which can return your private heap.
Microsoft has published the Detour Library that I'm not directly familiar with however. I know that detouring is surprisingly common. Security software, virus scanners etc all use such frameworks. It's not pretty, but may work:
HANDLE g_hndPrivateHeap;
HANDLE WINAPI GetProcessHeapImpl() {
return g_hndPrivateHeap;
}
struct SDetourGetProcessHeap { // object for exception safety
SDetourGetProcessHeap() {
// put detour in place
}
~SDetourGetProcessHeap() {
// remove detour again
}
};
void MapFile() {
g_hndPrivateHeap = HeapCreate( ... );
{
SDetourGetProcessHeap d;
MapViewOfFile(...);
}
}
These may also help:
How to replace WinAPI functions calls in the MS VC++ project with my own implementation (name and parameters set are the same)?
How can I hook Windows functions in C/C++?
http://research.microsoft.com/pubs/68568/huntusenixnt99.pdf

Imagine if I came to you with a piece of code like this:
void *foo;
foo = malloc(n);
if (foo)
free(foo);
foo = malloc(n);
Then I came to you and said, help! foo does not have the same address on the second allocation!
I'd be crazy, right?
It seems to me like you've already demonstrated clear knowledge of why this doesn't work. There's a reason that the documention for any API that takes an explicit address to map into lets you know that the address is just a suggestion, and it can't be guaranteed. This also goes for mmap() on POSIX.
I would suggest you write the program in such a way that a change in address doesn't matter. That is, don't store too many pointers to quantities inside the buffer, or if you do, patch them up after reallocation. Similar to the way you'd treat a buffer that you were going to pass into realloc().
Even the documentation for MapViewOfFileEx() explicitly suggests this:
While it is possible to specify an address that is safe now (not used by the operating system), there is no guarantee that the address will remain safe over time. Therefore, it is better to let the operating system choose the address. In this case, you would not store pointers in the memory mapped file, you would store offsets from the base of the file mapping so that the mapping can be used at any address.
Update from your comments
In that case, I suppose you could:
Not map into contiguous blocks. Perhaps you could map in chunks and write some intermediate function to decide which to read from/write to?
Try porting to 64 bit.

As the earlier post suggests, you can suspend every thread in the process while you change the memory mappings. You can use SuspendThread()/ResumeThread() for that. This has the disadvantage that your code has to know about all the other threads and hold thread handles for them.
An alternative is to use the Windows debug API to suspend all threads. If a process has a debugger attached, then every time the process faults, Windows will suspend all of the process's threads until the debugger handles the fault and resumes the process.
Also see this question which is very similar, but phrased differently:
Replacing memory mappings atomically on Windows

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio