I want to move the stacks of my threads. I came across this passage in
https://en.wikipedia.org/wiki/Win32_Thread_Information_Block
A process should be free to move the stack of its threads as long as it updates the information stored in the TIB accordingly. A few fields are key to this matter: stack base, stack limit, deallocation stack, and guaranteed stack bytes, respectively stored at offsets 0x8, 0x10, 0x1478 and 0x1748 in 64 bits. Different Windows kernel functions read and write these values, specially to distinguish stack overflows from other read/write page faults (a read or write to a page guarded among the stack limits in guaranteed stack bytes will generate a stack-overflow exception instead of an access violation). The deallocation stack is important because Windows API allows to change the amount of guarded pages: the function SetThreadStackGuarantee allows both read the current space and to grow it. In order to read it, it reads the GuaranteedStackBytes field, and to grow it, it uses has to uncommit stack pages. Setting stack limits without setting DeallocationStack will probably cause odd behavior in SetThreadStackGuarantee. For example, it will overwrite the stack limits to wrong values. Different libraries call SetThreadStackGuarantee, for example the .NET CLR uses it for setting up the stack of their threads.
I tried the SetThreadStackGuarantee function, but it only modifies the GS:[0x1748] field. Is there any other Windows API that can modify GS:[0x08], GS:[0x10] and GS:[0x1478]?
When the operating system loads a program into main memory, it also attaches the static data, along with the stack and heap memory. I googled what the static data contains, and it is said to hold the global variables and static variables. But I am confused: both of these are already present in the text file of the program, so why do we add them separately?
The data in the executable is often referred to as the data segment. The CPU doesn't interact with the hard disk but only with RAM. The data segment must thus be loaded into RAM before the CPU can access it. The executable file is not really a text file. It is an executable, so it has a different extension. "Text file" usually refers to an actual file with a .txt extension.
With that said, you also asked another question not long ago (If the amount of stack memory provided to a program is fixed then why does it grow downwards in the process architecture? Or am I getting it wrong?) so I will try to give some insight for both of these in this same answer.
I don't know much about caching and low level inner CPU workings but, today mostly, the CPU doesn't even operate on RAM directly. It will load a bunch of RAM chunks into the cache and make operations on them and keep RAM-cache consistency by implementing complex mechanisms. The OS also has its role to play in RAM-cache consistency but, like I said, I am far from an expert here. Other than that, caching is mostly transparent to the OS. The CPU handles it and the OS simply provides instructions to the CPU which executes them.
Today, paging is used by most OSes and implemented on most CPU architectures. With paging, every process sees a full contiguous virtual address space. The virtual address space is accessed contiguously, and the hardware MMU translates those addresses to physical ones automatically by walking the page tables. The OS is responsible for keeping the page tables consistent, and the MMU does the rest of the job (for more info read: What is paging exactly? OSDEV). If you understand paging well, things become much clearer.
For a process, there are mainly three types of memory: the stack (often called automatic storage), the heap, and the static/global data. I will attempt to give details on each of these to give a global picture.
The stack is given a maximum size when the process begins. The OS handles that: it creates the page tables and places the proper address in the stack pointer register so that stack accesses reach the proper region of physical memory. The stack is automatic storage, which means that it isn't handled manually by the high-level programmer. For example, in C/C++, the stack is managed by the compiler which, at the entry of a function, will create a stack frame and place offsets from the stack base pointer in the instructions. Every local variable (within a function) will be accessed with a relative negative offset from the stack base pointer. What the compiler needs to do is create a stack frame of the proper size so that there is enough room for all local variables of a particular function (for more info on the stack see: Each program allocates a fixed stack size? Who defines the amount of stack memory for each application running?).
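To make the translation step a bit more concrete, here is a minimal sketch of how a virtual address is carved into page-table indices. It assumes x86-64 4-level paging with 4 KiB pages, which the text above does not specify; the names va_parts and split_virtual_address are made up for this illustration. The real MMU does this in hardware, this only mirrors the bit arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (x86-64, 4-level paging, 4 KiB pages):
   bits 47..39 PML4 index, 38..30 PDPT index, 29..21 PD index,
   bits 20..12 PT index, bits 11..0 byte offset inside the page. */
typedef struct {
    unsigned pml4, pdpt, pd, pt;  /* 9-bit indices, one per table level */
    unsigned offset;              /* 12-bit offset within the 4 KiB page */
} va_parts;

/* Hypothetical helper: split a virtual address into its lookup indices. */
va_parts split_virtual_address(uint64_t va)
{
    va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}
```

Each index selects one entry in one level of the tables; the entry found at the last level gives the physical page, and the offset is added to it unchanged.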
For the heap, the OS reserves a very large amount of virtual memory. Today, virtual address spaces are very large (2^48 bytes or more). The amount of heap available to each process is often limited only by the amount of physical memory available to back virtual memory allocations. For example, a process could use malloc() to allocate 4KB of memory in C. When libc, which is an implementation of the C standard library, has no suitable free memory left to hand out, it calls the OS with a system call. The OS will then reserve a page of the virtual memory available for the heap and change the page tables so that accessing that portion of virtual memory translates to somewhere in RAM (probably somewhere another process isn't already using).
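As a rough illustration of what sits underneath malloc(), here is a sketch that asks the kernel directly for one page. It assumes a POSIX system where anonymous mmap is available (Linux, macOS); the helper names map_one_page and unmap_one_page are invented for this example, and a real libc may use brk or a different strategy entirely.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANON on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel for one private, zero-filled,
   read/write page. Returns NULL on failure. */
void *map_one_page(void)
{
    long page = sysconf(_SC_PAGESIZE);   /* commonly 4096, not guaranteed */
    assert(page > 0);
    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: give the page back. Returns 0 on success. */
int unmap_one_page(void *p)
{
    return munmap(p, (size_t)sysconf(_SC_PAGESIZE));
}
```

The kernel only sets up the virtual mapping here; a physical frame is typically attached on the first access to the page, through a page fault.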
The static/global data are simply placed in the executable in the data segment. The data segment is loaded in the virtual memory alongside the text segment. The text segment will thus be able to access this data often using RIP-relative addressing.
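A small sketch of the three storage classes side by side may help. The section names in the comments (.data, .bss) are what common ELF toolchains typically do, not something the C language guarantees, and the identifiers are made up for the example.

```c
#include <assert.h>

int initialized_global = 42;  /* typically stored in the file's data segment */
int zeroed_global;            /* typically in .bss: zero-filled at load time */

/* Hypothetical function mixing static and automatic storage. */
int next_id(void)
{
    static int counter = 0;   /* static storage: one copy, survives calls */
    int step = 1;             /* automatic storage: lives in the stack frame */
    counter += step;
    return counter;
}
```

Calling next_id() twice returns 1 and then 2, because counter lives with the global data rather than in the stack frame that is torn down on return.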
I was curious about how the kernel prevents the stack from growing too big, and I found this Q/A:
Q: how does the linux kernel enforce stack size limits?
A: The kernel can control this due to the virtual memory. The virtual
memory (also known as memory mapping), is basically a list of virtual
memory areas (base + size) and a target physically memory area that
the kernel can manipulate that is unique to each program. When a
program tries to access an address that is not on this list, an
exception happens. This exception will cause a context switch into
kernel mode. The kernel can look up the fault. If the memory is to
become valid, it will be put into place before the program can
continue (swap and mmap not read from disk yet for instance) or a
SEGFAULT can be generated.
In order to decide the stack size limit, the kernel simply manipulates
the virtual memory map. - Stian Skjelstad
But I didn't quite find this answer satisfactory. "When a program tries to access an address that is not on this list, an exception happens." - But wouldn't the text section (instructions) of the program be part of the virtual memory map?
I'm asking about how the kernel enforces the stack size of user programs.
There's a growth limit, set with ulimit -s for the main stack, that will stop the stack from getting anywhere near .text. (And the guard pages below that make sure there's a segfault if the stack does overflow past the growth limit.) See How is Stack memory allocated when using 'push' or 'sub' x86 instructions?. (Or for thread stacks (not the main thread), stack memory is just a normal mmap allocation with no growth; the only lazy allocation is physical pages to back the virtual ones.)
Also, .text is a read+exec mapping of the executable, so there's no way to modify it without calling mprotect first. (It's a private mapping, so doing so would only affect the pages in memory, not the actual file. This is how text relocations work: runtime fixups for absolute addresses, to be fixed up by the dynamic linker.)
The actual mechanism for limiting growth is simply not extending the mapping and not allocating a new page when the process triggers a hardware page fault with the stack pointer below the existing stack area. Thus the page fault is an invalid one, instead of a soft (aka minor) page fault as in the normal stack-growth case, so a SIGSEGV is delivered.
If a program used alloca or a C99 VLA with an unchecked size, malicious input could make it jump over any guard pages and into some other read/write mapping such as .data or stuff that's dynamically allocated.
To harden buggy code against that so it segfaults instead of actually allowing a stack clash attack, there are compiler options that make it touch every intervening page as the stack grows, so it's certain to set off the "tripwire" in the form of an unmapped guard page below the stack-growth limit. See Linux process stack overrun by local variables (stack guarding)
If you set ulimit -s unlimited, you could maybe grow the stack into some other mapping, if Linux truly does allow unlimited growth in that case without reserving a guard page as you approach another mapping.
I want to know how macOS allocates stack and heap memory for a process, i.e. the memory layout of a process in macOS. I only know that the segments of a Mach-O executable are loaded into pages, but I can't find a segment that corresponds to the stack or heap area of a process. Is there any document about that?
Stacks and heaps are just memory. The only thing that makes a stack a stack or a heap a heap is the way it is accessed. Stacks and heaps are allocated the same way all memory is: by mapping pages into the logical address space.
Let's take a step back: the Mach-O format describes how the binary's segments are mapped into virtual memory. Importantly, the memory pages you mentioned have read, write and execute permissions. If it's an executable (i.e. not a dylib), it must contain the __PAGEZERO segment with no permissions at all. This is the safeguard area that prevents accessing low addresses of virtual memory by accident (this is where the infamous null pointer exception and the like come from when code attempts to access memory address zero).
The __TEXT segment follows. It is readable and executable (typically not writable) and, in virtual memory, contains the file representation itself. This implies all the executable code lives here, along with immutable data like string constants.
The order may vary, but usually next you will encounter the read-only __LINKEDIT segment. This is the segment dyld uses to set up externally loaded functions. That topic is too broad to cover here, but there are numerous answers on it.
Finally we have the readable and writable __DATA segment, the first place a process can actually write to. This is used for global/static variables and for the addresses of external calls populated by dyld.
We have roughly covered the initial setup of the process, up to the point where it launches through either LC_UNIXTHREAD or, in modern macOS (10.7+), LC_MAIN. This starts the process's main thread. Each thread must have its own stack. Its creation is handled by the operating system (including allocating it). Notice that so far the process has no awareness of the heap at all (it's the operating system that's doing the heavy lifting to prepare the stack).
So to sum up, so far we have 2 independent sources of memory: the process memory representing the Mach-O structure (its size is fixed and determined by the executable structure) and the main thread stack (also with a predefined size). The process is about to run a C-like main function. Any local variables declared would move the thread's stack pointer, and so would any calls to functions (local and external), at least to set up the stack frame for the return address. Accessing a global/static variable would reference the __DATA segment's virtual memory directly.
Reserving stack space in x86-64 assembly would look like this:
sub rsp,16
There are some great SO answers on the System V / AMD64 ABI (which includes macOS) requirements for stack alignment, like this one
Any new thread created will have its own stack to allow setting up stack frames for local variables and calling functions.
Now we can cover heap allocation, which is mediated by libSystem (aka the macOS C standard library) delivering malloc/free. Internally this is handled by the mmap and munmap system calls, the kernel API for managing memory pages.
Using those system calls directly is possible, but might turn out to be inefficient, so malloc/free use an internal memory pool to limit the number of system calls (which are costly to make).
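To illustrate the pooling idea only, here is a deliberately naive sketch: one mmap call up front, then many allocations served without any further system call. It is not how libSystem's malloc is actually implemented; the pool type, the function names and POOL_SIZE are all invented for the example, and there is no per-block free.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANON on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define POOL_SIZE ((size_t)64 * 1024)   /* arbitrary size for the sketch */

typedef struct {
    unsigned char *base;   /* start of the region obtained from the kernel */
    size_t used;           /* bytes already handed out */
} pool;

/* One system call to obtain the whole pool. Returns 0 on success. */
int pool_init(pool *p)
{
    assert(p != NULL);
    void *m = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (m == MAP_FAILED)
        return -1;
    p->base = m;
    p->used = 0;
    return 0;
}

/* Bump allocation: no system call, just pointer arithmetic.
   Sizes are rounded up to 16 bytes; returns NULL when the pool is full. */
void *pool_alloc(pool *p, size_t n)
{
    size_t rounded = (n + 15) & ~(size_t)15;
    if (rounded < n || rounded > POOL_SIZE - p->used)
        return NULL;
    void *out = p->base + p->used;
    p->used += rounded;
    return out;
}

/* One system call to release everything at once. Returns 0 on success. */
int pool_destroy(pool *p)
{
    return munmap(p->base, POOL_SIZE);
}
```

A real allocator additionally tracks freed blocks for reuse and grows or shrinks its pools over time, but the trade-off is the same: few system calls, many cheap user-space allocations.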
The changing addresses you mentioned in the comment are caused by:
ASLR (which relies on PIE, position-independent executables) for process memory, a security measure that randomizes the start of virtual memory
Thread local stacks being prepared by the operating system
I'm messing around with VirtualAlloc and dynamic code generation, and I've become curious about something.
The first parameter of VirtualAlloc specifies the start of the address range to be allocated, or more accurately, the page containing that address specifies the start of the page range to be allocated. Right?
I started wondering. Could you just make a bunch of space on the stack and "allocate" that memory with VirtualAlloc? For instance, to change its permissions to PAGE_EXECUTE_READWRITE?
(As an extension of the above, I'm curious where exactly the stack is in a Windows process. How is it set up? What sets it up?)
tl;dr Can you "allocate" stack space with VirtualAlloc?
Stack space is allocated by VirtualAlloc with the MEM_RESERVE flag (or perhaps directly using the underlying syscall) when a thread is created. This causes a chunk of the process's address space to be reserved for that thread's stack.
A guard page is used to cause an access-violation when the stack grows past the region which is actually committed. The OS handles this automatically, by committing additional memory (if there is enough reserved space) or generating EXCEPTION_STACK_OVERFLOW to the process if the edge of the reserved area is reached. In the first case, a new guard page is set up. In the second, recreating the guard page is an important step if you try to handle that exception and recover.
You could use VirtualAlloc and VirtualProtect to precommit your thread's stack. But they don't touch the stack pointer, so they can't be used for stack allocation (code using the stack pointer would happily reuse "your" allocation for automatic variables, function parameters, etc). To allocate space from the stack, you need to adjust the stack pointer. Most C and C++ compilers provide an _alloca() intrinsic for doing this.
If you're doing dynamic code generation, don't use the stack for that. Non-executable stack is a valuable protection against remote execution vulnerabilities. You certainly can use VirtualAlloc for dynamic allocation in specialized cases like this, instead of the general-purpose allocators HeapAlloc and malloc and new[]. The general-purpose allocators all ultimately get their memory from VirtualAlloc, but then parcel it out in chunks that don't line up with page boundaries.
I am interested in the layout of an executable and in dynamic memory allocation using the stack, and in how the processor and kernel together manage the stack region, for example during function calls and other scenarios of stack-based memory allocation. I would also like to know how stack overflow and other hazards associated with this model occur, and whether there are other designs of code execution that are not stack-based and don't have such issues. A video or an animation would be of great help.
Typically (any processor, not just x86) there is one RAM address space, and typically the program is in lower memory and grows upwards as you run. Say your program is 0x1000 bytes and is loaded at 0x0000; if you then do a malloc of 0x3000 bytes, the address returned would be 0x1000 in this hypothetical situation, and now the lower 0x4000 bytes are being actively used by the program. Additional mallocs continue to grow in this way. Free()s do not necessarily cause this consumption to go down; it depends on how the memory is managed and on the program's mixture of malloc()s and free()s.
The stack, though, normally goes from the top down. Say 0x10000 is the address the stack pointer starts at, and you have a function that has three 32-bit unsigned int variables and takes no parameters. You need three stack locations to hold those variables (assuming no optimization has reduced that requirement), so upon entry to the function the stack pointer is reduced by 3*4 = 12 bytes and becomes 0xFFF4: one of your variables is at address 0xFFF4+0, one at 0xFFF4+4 and the third at 0xFFF4+8. If that function calls another function, the stack pointer continues to move toward zero in memory. And as you continue to malloc(), your used program memory grows upward. Unchecked, they will collide, and the code needed to do that checking is cost-prohibitive enough that it is rarely used. This is why local variables are good for optimizing and a few other things, but bad because stack consumption is often non-deterministic, or at least the analysis is not done by the average programmer.
On ISAs (instruction set architectures) like x86, where there is a limited number of usable registers, functions often need to pass arguments on the stack as well. The rules governing where and how things are passed and returned are defined and well understood by the compiler; this is not some random thing. Anyway, in addition to leaving room for the local variables, some of the arguments to the function are on the stack and sometimes the return value is on the stack. In particular with x86, each function call causes the stack to grow downward, and functions calling functions makes that worse. Think about what recursion can do to your stack.
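The arithmetic in the last two paragraphs can be replayed with a toy model. This is only a simulation of the hypothetical flat address space described above (program at 0x0000, stack starting at 0x10000), not real memory management; the type and function names are invented for the sketch, and unlike most real systems it does perform the collision check.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t heap_top;  /* first free address above program + heap (grows up) */
    uint32_t sp;        /* simulated stack pointer (grows down toward zero) */
} flat_memory;

/* Simulated malloc: bump the heap upward.
   Returns the block address, or UINT32_MAX if it would hit the stack. */
uint32_t sim_malloc(flat_memory *m, uint32_t size)
{
    assert(m->heap_top <= m->sp);
    if (size > m->sp - m->heap_top)
        return UINT32_MAX;
    uint32_t addr = m->heap_top;
    m->heap_top += size;
    return addr;
}

/* Simulated function entry: reserve frame_bytes for local variables.
   Returns the new stack pointer, or UINT32_MAX if it would hit the heap. */
uint32_t sim_enter_function(flat_memory *m, uint32_t frame_bytes)
{
    assert(m->heap_top <= m->sp);
    if (frame_bytes > m->sp - m->heap_top)
        return UINT32_MAX;
    m->sp -= frame_bytes;
    return m->sp;
}
```

Starting from heap_top = 0x1000 and sp = 0x10000, sim_malloc(&m, 0x3000) returns 0x1000 and leaves heap_top at 0x4000, and sim_enter_function(&m, 12) returns 0xFFF4, matching the numbers in the text.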
What are your alternatives? Use an instruction set with more registers with a function calling spec that uses more registers and less stack. Use fewer arguments when calling functions. Use fewer local variables. Malloc less. Use a good compiler with a good optimizer as well as help the optimizer by using easy to optimize habits when coding.
Realistically though, to have a generically useful processor for which you write generically useful programs you have to have a stack and the possibility that the stack overflows and/or collides with the heap.
Now the segmented memory model of the x86, as well as MMUs in general, give you the opportunity to keep the program memory and stack well away from each other. Protection mechanisms can also be used, so that if either the heap or the stack ventures outside its allocated space a protection fault occurs. It is still an oversight by the programmer, but it is easier to know what happened and debug it than with the random side effects that occur when the stack grows down into program memory space. Using a protection mechanism like this is a much easier way to help the programmer control stack growth than building something into the code generated by the compiler to check for a collision on every function call and malloc.
Another pitfall which is often asked in job interviews is something along the lines of:
int * myfun ( int a )
{
    int i;          /* automatic variable: lives in myfun's stack frame */
    i=a+7;
    return(&i);     /* BUG: this address is dangling once myfun returns */
}
This can take many forms. The thing to understand is that the variable i is temporarily allocated on the stack and is only allocated while the function is executing. When the function returns, the stack pointer is moved back, which frees the memory allocated for i, and the next function called may very well clobber that memory. So returning the address of a variable stored on the stack is a bad idea. Code that does something like this may run properly for weeks, months, or years before the problem is detected.
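If a pointer really has to come back from the function, one common way out is to let the caller own the storage. This is a minimal sketch of that idea, not the only fix; the name myfun_out and the extra result parameter are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of myfun: the caller supplies the storage, so the
   returned address stays valid for as long as the caller's object lives. */
int *myfun_out(int a, int *result)
{
    assert(result != NULL);
    *result = a + 7;    /* write into memory the caller owns */
    return result;      /* same pointer the caller passed in */
}
```

A caller would declare int slot; and call myfun_out(3, &slot), after which slot holds 10 and the returned pointer equals &slot.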
Now this is acceptable even on stack based cpus (the zylin zpu for example).
int myfun ( int a )
{
    int i;
    i=a+7;
    return(i);      /* fine: the value is copied out before the frame is released */
}
Partly this is because there isn't much you can do other than use globals (yes, this specific case does not require the additional variable i, but assume your code is complicated enough that you need that local return variable). The second reason is that in C, the calling code frees up its portion of the stack. On an x86, for example, if you call a function with two parameters on the stack, let's say two 4-byte ints, the calling code moves the stack pointer down by 8 and places those two parameters in that memory (sp+0 and sp+4); then, when the function returns, the calling code is the one that deallocates those two variables by adding 8 to the stack pointer. So in the above code, which uses i and returns i by value, the C calling convention for that processor knows where to get the return value, and once that value is captured the stack memory holding it is no longer needed. My understanding is that in Pascal, say Borland Turbo Pascal for example, the callee cleaned up the stack: the caller would put the two variables on the stack and the function being called would clean up the stack. Not a bad idea as far as stack management goes; you can nest much deeper this way. There are pros and cons to both approaches.