Can you "allocate" stack space with VirtualAlloc? - winapi

I'm messing around with VirtualAlloc and dynamic code generation, and I've become curious about something.
The first parameter of VirtualAlloc specifies the start of the address range to be allocated, or more accurately, the page containing that address specifies the start of the page range to be allocated. Right?
I started wondering. Could you just make a bunch of space on the stack and "allocate" that memory with VirtualAlloc? For instance, to change its permissions to PAGE_EXECUTE_READWRITE?
(As an extension of the above, I'm curious where exactly the stack is in a Windows process. How is it set up? What sets it up?)
tl;dr Can you "allocate" stack space with VirtualAlloc?

Stack space is allocated by VirtualAlloc and the MEM_RESERVE flag (or perhaps directly using the underlying syscall) when a thread is created. This causes a chuck of the process's address space to be reserved for that thread stack.
A guard page is used to cause an access-violation when the stack grows past the region which is actually committed. The OS handles this automatically, by committing additional memory (if there is enough reserved space) or generating EXCEPTION_STACK_OVERFLOW to the process if the edge of the reserved area is reached. In the first case, a new guard page is set up. In the second, recreating the guard page is an important step if you try to handle that exception and recover.
You could use VirtualAlloc and VirtualProtect to precommit your thread's stack. But they don't touch the stack pointer, so they can't be used for stack allocation (code using the stack pointer would happily reuse "your" allocation for automatic variables, function parameters, etc). To allocate space from the stack, you need to adjust the stack pointer. Most C and C++ compilers provide an _alloca() intrinsic for doing this.
If you're doing dynamic code generation, don't use the stack for that. Non-executable stack is a valuable protection against remote execution vulnerabilities. You certainly can use VirtualAlloc for dynamic allocation in specialized cases like this, instead of the general-purpose allocators HeapAlloc and malloc and new[]. The general-purpose allocators all ultimately get their memory from VirtualAlloc, but then parcel it out in chunks that don't line up with page boundaries.

Related

How does the linux kernel avoid the stack overwriting the text (instructions)?

I was curious about how the kernel prevents the stack from growing too big, and I found this Q/A:
Q: how does the linux kernel enforce stack size limits?
A: The kernel can control this due to the virtual memory. The virtual
memory (also known as memory mapping), is basically a list of virtual
memory areas (base + size) and a target physically memory area that
the kernel can manipulate that is unique to each program. When a
program tries to access an address that is not on this list, an
exception happens. This exception will cause a context switch into
kernel mode. The kernel can look up the fault. If the memory is to
become valid, it will be put into place before the program can
continue (swap and mmap not read from disk yet for instance) or a
SEGFAULT can be generated.
In order to decide the stack size limit, the kernel simply manipulates
the virtual memory map. - Stian Skjelstad
But I didn't quite find this answer satisfactory. "When a program tries to access an address that is not on this list, an exception happens." - But wouldn't the text section (instructions) of the program be part of the virtual memory map?
I'm asking about how the kernel enforces the stack size of user programs.
There's a growth limit, set with ulimit -s for the main stack, that will stop the stack from getting anywhere near .text. (And the guard pages below that make sure there's a segfault if the stack does overflow past the growth limit.) See How is Stack memory allocated when using 'push' or 'sub' x86 instructions?. (Or for thread stacks (not the main thread), stack memory is just a normal mmap allocation with no growth; the only lazy allocation is physical pages to back the virtual ones.)
Also, .text is a read+exec mapping of the executable, so there's no way to modify it without calling mprotect first. (It's a private mapping, so doing so would only affect the pages in memory, not the actual file. This is how text relocations work: runtime fixups for absolute addresses, to be fixed up by the dynamic linker.)
The actual mechanism for limiting growth is by simply not extending the mapping and allocating a new page when the process triggers a hardware page fault with the stack pointer below the existing stack area. Thus the page fault is an invalid one, instead of a soft aka minor for the normal stack-growth case, so a SIGSEGV is delivered.
If a program used alloca or a C99 VLA with an unchecked size, malicious input could make it jump over any guard pages and into some other read/write mapping such as .data or stuff that's dynamically allocated.
To harden buggy code against that so it segfaults instead of actually allowing a stack clash attack, there are compiler options that make it touch every intervening page as the stack grows, so it's certain to set off the "tripwire" in the form of an unmapped guard page below the stack-growth limit. See Linux process stack overrun by local variables (stack guarding)
If you set ulimit -s unlimited could you maybe grow the stack into some other mapping, if Linux truly does allow unlimited growth in that case without reserving a guard page as you approach another mapping.

How does macOS allocate stack and heap for a process?

I want to know how macOS allocate stack and heap memory for a process, i.e. the memory layout of a process in macOS. I only know that the segments of a mach-o executable are loaded into pages, but I can't find a segment that correspond to stack or heap area of a process. Is there any document about that?
Stacks and heaps are just memory. The only think that makes a stack a stack or a heap or a heap is the way it is accessed. Stacks and heaps are allocated the same way all memory is: by mapping pages into the logical address space.
Let's take a step back - the Mach-o format describes mapping the binary segments into virtual memory. Importantly the memory pages you mentioned have read write and execute permissions. If it's an executable(i.e. not a dylib) it must contain the __PAGEZERO segment with no permissions at all. This is the safe guard area to prevent accessing low addresses of virtual memory by accident (here falls the infamous Null pointer exception and such if attempting to access zero memory address).
__TEXT read executable (typically without write) segment follows which in virtual memory will contain the file representation itself. This implies all the executable code lives here. Also immmutable data like string constants.
The order may vary, but usually next you will encounter __LINKEDIT read only segment. This is the segment dyld uses to setup externally loaded functions, this is too broad to cover here, but there are numerous answers on the topic.
Finally we have the readable writable __DATA segment the first place a process can actually write to. This is used for global/static variables, external addresses to calls populated by dyld.
We have roughly covered the process initial setup when it will launch through either LC_UNIXTHREAD or in modern MacOS (10.7+) LC_MAIN. This starts the process main thread. Each thread must contain it's own stack. The creation of it is handled by operating system (including allocating it). Notice so far the process has no awareness of the heap at all (it's the operating system that's doing the heavy lifting to prepare the stack).
So to sum up so far we have 2 independent sources of memory - the process memory representing the Mach-o structure (size is fixed and determined by the executable structure) and the main thread stack (also with predefined size). The process is about to run a C-like main function , any local variables declared would move the thread stack pointer, likewise any calls to functions (local and external) to at least setup the stack frame for return address. Accessing a global/static variable would reference the __DATA segment virtual memory directly.
Reserving stack space in x86-64 assembly would look like this:
sub rsp,16
There are some great SO anwers on System V / AMD64 ABI (which includes MacOS) requirements for stack alignment like this one
Any new thread created will have its own stack to allow setting up stack frames for local variables and calling functions.
Now we can cover heap allocation - which is mitigated by the libSystem (aka MacOS C standard library) delivering the malloc/free. Internally this is handled by mmap & munmap system calls - the kernel API for managing memory pages.
Using those system calls directly is possible, but might turned out inefficient, thus an internal memory pool is utilised by malloc/free to limit the number of system calls (which are costly to make).
The changing addresses you mentioned in the comment are caused by:
ASLR aka PIE (position independent code) for process memory , which is a security measure randomizing the start of virtual memory
Thread local stacks being prepared by the operating system

Does memory layout of a program depend on address binding technique?

I have learned that with run-time address binding, the program can be allocated frames in the physical memory non-contiguously. Also, as described here and here, every segment of the program in the logical address space is contiguous, but not all segments are placed together side-by-side. The text, data, BSS and heap segments are placed together, but the stack segment is not. In other words, there are pages between the heap and the stack segments (between the program break and stack top) in the logical address space that are not mapped to any frames in the physical address space, thus implying that the logical address space is non-contiguous in the case of run-time address binding.
But what about the memory layout in the case of compile-time or load-time binding ? Now that the logical address space in not an abstract address space but the actual physical address space, how is a program laid out in the physical memory ? More specifically, how is the stack segment placed in the physical address space of a program ? Is it placed together with the rest of the segments or separately just as in the case of run-time binding ?
To answer your quesitons, I first have to explain a bit about stack and heap allocation in modern operating systems.
Stack as the name suggest is the continues memory allocation, where cpu uses push and pop commands to add/remove data from top of the stack. I assume that you already know how stack works. process stores - return address, function arguments and local variables over stack. Every time a function is called, more data is pushed (it could ultimately lead to stack overflow is no data is popped ever - infinite recursion?). Stack size is fixed for a program when it is loaded in the memory. Most of the programming language lets you decide stack size during compilation. If not, they will decide a default. On Linux, maximum stack size(hard limit) is limited by ulimit. You can check and set the size by ulimit -s.
Heap space however, has no upper limit in *nix systems(depends, confirm it using ulimit -v), every program starts with a default/set amount of heap and can increase as much needed. Heap space in a process is actually two linked lists, free and used blocks. Whenever memory allocation is required from heap, one or more free blocks are combined to form a bigger block and allocated to the used list as a single block. Freeing up means removing a block from the used list to the free list. After Freeing the blocks, heap can have external fragmentation. Now if the the number of free blocks can't contain the whole data, process will request more memory from the OS, Generally the newer blocks are allocated from a higher address. Thus we show upward direction diagram for heap growth. I rephrase - Heap does not allocate memory continuously in a higher direction.
Now to answer your questions.
With compile-time or load-time address binding, how are the stack and
the heap segments placed in the physical address space of a program ?
Fixed stack is allocated at compile time, with some heap memory. How are placed have been explained above.
Is the space between the heap and the stack reserved for the program
or is it available for the OS to be used for other programs ?
Yes it is reserved for the program. Process however can request more memory to add free blocks in its heap. It is different than sharing it's own heap.
Note: There are lots of topic which can be covered here as the question is broad. Some of them are - garbage collection, block selection, shared memory etc. I will soon add the references here.
References:-
Memory Management in JVM
Stack vs Heap
Heap memory allocation strategies

windows memory management: Difference between VirtualAlloc() and heap functions [duplicate]

There are lots of method to allocate memory in Windows environment, such as VirtualAlloc, HeapAlloc, malloc, new.
Thus, what's the difference among them?
Each API is for different uses. Each one also requires that you use the correct deallocation/freeing function when you're done with the memory.
VirtualAlloc
A low-level, Windows API that provides lots of options, but is mainly useful for people in fairly specific situations. Can only allocate memory in (edit: not 4KB) larger chunks. There are situations where you need it, but you'll know when you're in one of these situations. One of the most common is if you have to share memory directly with another process. Don't use it for general-purpose memory allocation. Use VirtualFree to deallocate.
HeapAlloc
Allocates whatever size of memory you ask for, not in big chunks than VirtualAlloc. HeapAlloc knows when it needs to call VirtualAlloc and does so for you automatically. Like malloc, but is Windows-only, and provides a couple more options. Suitable for allocating general chunks of memory. Some Windows APIs may require that you use this to allocate memory that you pass to them, or use its companion HeapFree to free memory that they return to you.
malloc
The C way of allocating memory. Prefer this if you are writing in C rather than C++, and you want your code to work on e.g. Unix computers too, or someone specifically says that you need to use it. Doesn't initialise the memory. Suitable for allocating general chunks of memory, like HeapAlloc. A simple API. Use free to deallocate. Visual C++'s malloc calls HeapAlloc.
new
The C++ way of allocating memory. Prefer this if you are writing in C++. It puts an object or objects into the allocated memory, too. Use delete to deallocate (or delete[] for arrays). Visual studio's new calls HeapAlloc, and then maybe initialises the objects, depending on how you call it.
In recent C++ standards (C++11 and above), if you have to manually use delete, you're doing it wrong and should use a smart pointer like unique_ptr instead. From C++14 onwards, the same can be said of new (replaced with functions such as make_unique()).
There are also a couple of other similar functions like SysAllocString that you may be told you have to use in specific circumstances.
It is very important to understand the distinction between memory allocation APIs (in Windows) if you plan on using a language that requires memory management (like C or C++.) And the best way to illustrate it IMHO is with a diagram:
Note that this is a very simplified, Windows-specific view.
The way to understand this diagram is that the higher on the diagram a memory allocation method is, the higher level implementation it uses. But let's start from the bottom.
Kernel-Mode Memory Manager
It provides all memory reservations & allocations for the operating system, as well as support for memory-mapped files, shared memory, copy-on-write operations, etc. It's not directly accessible from the user-mode code, so I'll skip it here.
VirtualAlloc / VirtualFree
These are the lowest level APIs available from the user mode. The VirtualAlloc function basically invokes ZwAllocateVirtualMemory that in turn does a quick syscall to ring0 to relegate further processing to the kernel memory manager. It is also the fastest method to reserve/allocate block of new memory from all available in the user mode.
But it comes with two main conditions:
It only allocates memory blocks aligned on the system granularity boundary.
It only allocates memory blocks of the size that is the multiple of the system granularity.
So what is this system granularity? You can get it by calling GetSystemInfo. It is returned as the dwAllocationGranularity parameter. Its value is implementation (and possibly hardware) specific, but on many 64-bit Windows systems it is set at 0x10000 bytes, or 64K.
So what all this means, is that if you try to allocate, say just an 8 byte memory block with VirtualAlloc:
void* pAddress = VirtualAlloc(NULL, 8, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
If successful, pAddress will be aligned on the 0x10000 byte boundary. And even though you requested only 8 bytes, the actual memory block that you will get will be the entire page (or, something like 4K bytes. The exact page size is returned in the dwPageSize parameter.) But, on top of that, the entire memory block spanning 0x10000 bytes (or 64K in most cases) from pAddress will not be available for any further allocations. So in a sense, by allocating 8 bytes you could as well be asking for 65536.
So the moral of the story here is not to substitute VirtualAlloc for generic memory allocations in your application. It must be used for very specific cases, as is done with the heap below. (Usually for reserving/allocating large blocks of memory.)
Using VirtualAlloc incorrectly can lead to severe memory fragmentation.
HeapCreate / HeapAlloc / HeapFree / HeapDestroy
In a nutshell, the heap functions are basically a wrapper for VirtualAlloc function. Other answers here provide a pretty good concept of it. I'll add that, in a very simplistic view, the way heap works is this:
HeapCreate reserves a large block of virtual memory by calling VirtualAlloc internally (or ZwAllocateVirtualMemory to be specific). It also sets up an internal data structure that can track further smaller size allocations within the reserved block of virtual memory.
Any calls to HeapAlloc and HeapFree do not actually allocate/free any new memory (unless, of course the request exceeds what has been already reserved in HeapCreate) but instead they meter out (or commit) a previously reserved large chunk, by dissecting it into smaller memory blocks that a user requests.
HeapDestroy in turn calls VirtualFree that actually frees the virtual memory.
So all this makes heap functions perfect candidates for generic memory allocations in your application. It is great for arbitrary size memory allocations. But a small price to pay for the convenience of the heap functions is that they introduce a slight overhead over VirtualAlloc when reserving larger blocks of memory.
Another good thing about heap is that you don't really need to create one. It is generally created for you when your process starts. So one can access it by calling GetProcessHeap function.
malloc / free
Is a language-specific wrapper for the heap functions. Unlike HeapAlloc, HeapFree, etc. these functions will work not only if your code is compiled for Windows, but also for other operating systems (such as Linux, etc.)
This is a recommended way to allocate/free memory if you program in C. (Unless, you're coding a specific kernel mode device driver.)
new / delete
Come as a high level (well, for C++) memory management operators. They are specific for the C++ language, and like malloc for C, are also the wrappers for the heap functions. They also have a whole bunch of their own code that deals C++-specific initialization of constructors, deallocation in destructors, raising an exception, etc.
These functions are a recommended way to allocate/free memory and objects if you program in C++.
Lastly, one comment I want to make about what has been said in other responses about using VirtualAlloc to share memory between processes. VirtualAlloc by itself does not allow sharing of its reserved/allocated memory with other processes. For that one needs to use CreateFileMapping API that can create a named virtual memory block that can be shared with other processes. It can also map a file on disk into virtual memory for read/write access. But that is another topic.
VirtualAlloc is a specialized allocation of the OS virtual memory (VM) system. Allocations in the VM system must be made at an allocation granularity which (the allocation granularity) is architecture dependent. Allocation in the VM system is one of the most basic forms of memory allocation. VM allocations can take several forms, memory is not necessarily dedicated or physically backed in RAM (though it can be). VM allocation is typically a special purpose type of allocation, either because of the allocation has to
be very large,
needs to be shared,
must be aligned on a particular value (performance reasons) or
the caller need not use all of this memory at once...
etc...
HeapAlloc is essentially what malloc and new both eventually call. It is designed to be very fast and usable under many different types of scenarios of a general purpose allocation. It is the "Heap" in a classic sense. Heaps are actually setup by a VirtualAlloc, which is what is used to initially reserve allocation space from the OS. After the space is initialized by VirtualAlloc, various tables, lists and other data structures are configured to maintain and control the operation of the HEAP. Some of that operation is in the form of dynamically sizing (growing and shrinking) the heap, adapting the heap to particular usages (frequent allocations of some size), etc..
new and malloc are somewhat the same, malloc is essentially an exact call into HeapAlloc( heap-id-default ); new however, can [additionally] configure the allocated memory for C++ objects. For a given object, C++ will store vtables on the heap for each caller. These vtables are redirects for execution and form part of what gives C++ it's OO characteristics like inheritance, function overloading, etc...
Some other common allocation methods like _alloca() and _malloca() are stack based; FileMappings are really allocated with VirtualAlloc and set with particular bit flags which designate those mappings to be of type FILE.
Most of the time, you should allocate memory in a way which is consistent with the use of that memory ;). new in C++, malloc for C, VirtualAlloc for massive or IPC cases.
*** Note, large memory allocations done by HeapAlloc are actually shipped off to VirtualAlloc after some size (couple hundred k or 16 MB or something I forget, but fairly big :) ).
*** EDIT
I briefly remarked about IPC and VirtualAlloc, there is also something very neat about a related VirtualAlloc which none of the responders to this question have discussed.
VirtualAllocEx is what one process can use to allocate memory in an address space of a different process. Most typically, this is used in combination to get remote execution in the context of another process via CreateRemoteThread (similar to CreateThread, the thread is just run in the other process).
In outline:
VirtualAlloc, HeapAlloc etc. are Windows APIs that allocate memory of various types from the OS directly. VirtualAlloc manages pages in the Windows virtual memory system, while HeapAlloc allocates from a specific OS heap. Frankly, you are unlikely to ever need to use eiither of them.
malloc is a Standard C (and C++) library function that allocates memory to your process. Implementations of malloc will typically use one of the OS APIs to create a pool of memory when your app starts and then allocate from it as you make malloc requests
new is a Standard C++ operator which allocates memory and then calls constructors appropriately on that memory. It may be implemented in terms of malloc or in terms of the OS APIs, in which case it too will typically create a memory pool on application startup.
VirtualAlloc ===> sbrk() under UNIX
HeapAlloc ====> malloc() under UNIX
VirtualAlloc => Allocates straight into virtual memory, you reserve/commit in blocks. This is great for large allocations, for example large arrays.
HeapAlloc / new => allocates the memory on the default heap (or any other heap that you may create). This allocates per object and is great for smaller objects. The default heap is serializable therefore it has guarantee thread allocation (this can cause some issues on high performance scenarios and that's why you can create your own heaps).
malloc => uses the C runtime heap, similar to HeapAlloc but it is common for compatibility scenarios.
In a nutshell, the heap is just a chunk of virtual memory that is governed by a heap manager (rather than raw virtual memory)
The last model on the memory world is memory mapped files, this scenario is great for large chunk of data (like large files). This is used internally when you open an EXE (it does not load the EXE in memory, just creates a memory mapped file).

Where can one find detailed information about stack operation in x86 processors

I am interested in the layout of an executable and dynamic memory allocation using stack and how the processor and kernel together manage the stack region, like during function calls and other scenarios of using stack based memory allocation. Also how stack overflow and other hazards associated with this model occur, are their other designs of code execution that are not stack based and don't have such issues. A video or an animation would be of great help.
Typically (any processor, not just x86) there is one ram address space, and typically the program is in lower memory and grows upwards as you run. Say your program is 0x1000 bytes and is loaded at 0x0000 then you do a malloc of 0x3000 bytes the address returned would be 0x1000 in this hypothetical situation and now the lower 0x4000 bytes are being actively used by the program. Additional mallocs continue to grow in this way. Free()s do not necessary cause this consumption to go down, it depends on how the memory is managed and the programs mixture of malloc()s and free()s.
The stack though normally goes from the top down. Say 0x10000 is the address the stack pointer starts at. Say you have a function that has three 32 bit unsigned int variables, and no parameters are passed in, you need three stack locations to hold those variables (assuming no optimization has reduced that requirement) so upon entry of the function the stack pointer is reduced by 3*4 = 12 bytes, so the stack pointer is changed to 0xFFF4, one of your variables is at address 0xFFF4+0 one at 0xFFF4+4 and the third at 0xFFF4+8. If that function calls another function then the stack pointer continues to move toward zero in memory. And as you continue to malloc() your used program memory grows upward. Unchecked they will collide, and the code needed to do that checking is cost prohibitive enough that it is rarely used. This is why local variables are good for optimizing and a few other things but bad because stack consumption is often non-deterministic or at least the analysis is not done by the average programmer.
On ISAs (instruction set architectures) like x86 where there is a limited number of usable registers then functions often need to pass arguments on the stack as well. The rules governing where and how things are passed and returned is defined and well understood by the compiler, this is not some random thing. Anyway, in addition to leaving room for the local variables some of the arguments to the function are on the stack and sometimes the return value is on the stack. In particular with x86, each function call causes the stack to grow downward, and functions calling functions makes that worse. Think about what recursion can do to your stack.
What are your alternatives? Use an instruction set with more registers with a function calling spec that uses more registers and less stack. Use fewer arguments when calling functions. Use fewer local variables. Malloc less. Use a good compiler with a good optimizer as well as help the optimizer by using easy to optimize habits when coding.
Realistically though, to have a generically useful processor for which you write generically useful programs you have to have a stack and the possibility that the stack overflows and/or collides with the heap.
Now the segmented memory model of the x86 as well as mmus in general give you the opportunity to keep the program memory and stack well away from each other. Also protection mechanisms can be used that if either the heap or the stack venture outside their allocated space a protection fault occurs. Still an oversight by the programmer but is easier to know what happened and debug it than the random side effects that occur when the stack grows down into program memory space. Using a protection mechanism like this is much easier solution to help the programmer control the stack growth than building something into the code generated by the compiler to check for a collision on every function call and malloc.
Another pitfall which is often asked in job interviews is something along the lines of:
int * myfun ( int a )
{
int i;
i=a+7;
return(&i);
}
This can take many forms, the thing to understand is that the variable i is temporarily allocated on the stack and is only allocated while the function is executing, when the function returns the stack pointer frees up the memory allocated for i and the next function called may very well clobber that memory. So by returning the address to a variable stored on the stack is a bad idea. Code that does something like this may run properly for weeks, months, years before being detected.
Now this is acceptable even on stack based cpus (the zylin zpu for example).
int myfun ( int a )
{
int i;
i=a+7;
return(i);
}
Partly because there isnt much you can do other than use globals (yes this specific case does not require the additional variable i, but assume your code is complicated enough that you need that local return variable), the second is because in C, the calling code frees up its portion of the stack. Meaning on an x86 for example if you call a function with two parameters on the stack, lets say two 4 byte ints, the calling code moves the stack pointer down by 8 and places those two parameters in that memory (sp+0 and sp+4), then when the function returns the calling code is the one that unallocates those two variables by adding 8 to the stack pointer. So in the above code using i and returning i by value, the C calling convention for that processor knows where to get the return value, and once that value is captured the stack memory holding that value is no longer needed. My understanding is that pascal, say borland turbo pascal for example the calee cleaned up the stack. So the caller would put the two variables on the stack and the function being called would clean up the stack. Not a bad idea as far as stack management goes, you can nest much deeper this way. there are pros and cons to both approaches.

Resources