How does macOS allocate stack and heap for a process? - macos

I want to know how macOS allocate stack and heap memory for a process, i.e. the memory layout of a process in macOS. I only know that the segments of a mach-o executable are loaded into pages, but I can't find a segment that correspond to stack or heap area of a process. Is there any document about that?

Stacks and heaps are just memory. The only think that makes a stack a stack or a heap or a heap is the way it is accessed. Stacks and heaps are allocated the same way all memory is: by mapping pages into the logical address space.

Let's take a step back - the Mach-o format describes mapping the binary segments into virtual memory. Importantly the memory pages you mentioned have read write and execute permissions. If it's an executable(i.e. not a dylib) it must contain the __PAGEZERO segment with no permissions at all. This is the safe guard area to prevent accessing low addresses of virtual memory by accident (here falls the infamous Null pointer exception and such if attempting to access zero memory address).
__TEXT read executable (typically without write) segment follows which in virtual memory will contain the file representation itself. This implies all the executable code lives here. Also immmutable data like string constants.
The order may vary, but usually next you will encounter __LINKEDIT read only segment. This is the segment dyld uses to setup externally loaded functions, this is too broad to cover here, but there are numerous answers on the topic.
Finally we have the readable writable __DATA segment the first place a process can actually write to. This is used for global/static variables, external addresses to calls populated by dyld.
We have roughly covered the process initial setup when it will launch through either LC_UNIXTHREAD or in modern MacOS (10.7+) LC_MAIN. This starts the process main thread. Each thread must contain it's own stack. The creation of it is handled by operating system (including allocating it). Notice so far the process has no awareness of the heap at all (it's the operating system that's doing the heavy lifting to prepare the stack).
So to sum up so far we have 2 independent sources of memory - the process memory representing the Mach-o structure (size is fixed and determined by the executable structure) and the main thread stack (also with predefined size). The process is about to run a C-like main function , any local variables declared would move the thread stack pointer, likewise any calls to functions (local and external) to at least setup the stack frame for return address. Accessing a global/static variable would reference the __DATA segment virtual memory directly.
Reserving stack space in x86-64 assembly would look like this:
sub rsp,16
There are some great SO anwers on System V / AMD64 ABI (which includes MacOS) requirements for stack alignment like this one
Any new thread created will have its own stack to allow setting up stack frames for local variables and calling functions.
Now we can cover heap allocation - which is mitigated by the libSystem (aka MacOS C standard library) delivering the malloc/free. Internally this is handled by mmap & munmap system calls - the kernel API for managing memory pages.
Using those system calls directly is possible, but might turned out inefficient, thus an internal memory pool is utilised by malloc/free to limit the number of system calls (which are costly to make).
The changing addresses you mentioned in the comment are caused by:
ASLR aka PIE (position independent code) for process memory , which is a security measure randomizing the start of virtual memory
Thread local stacks being prepared by the operating system

Related

what is the use of attaching static data along with a program when it is loaded on the main memory?

When the operating system loads Program onto the main memory , it , along with the stack and heap memory , also attaches the static data along with it. I googled about what is present in the static data which said it contained the global variables and static variables. But I am confused as both of these are already present in the text file of the program then why do we add them seperately?
The data in the executable is often referred as the data segment. The CPU doesn't interact with the hard-disk but only with RAM. The data segment must thus be loaded in RAM before the CPU can access it. The file of the executable is not really a text file. It is an executable so it has a different extension. Text files often refer to an actual file with a .txt extension.
With that said, you also asked another question not long ago (If the amount of stack memory provided to a program is fixed then why does it grow downwards in the process architecture? Or am I getting it wrong?) so I will try to give some insight for both of these in this same answer.
I don't know much about caching and low level inner CPU workings but, today mostly, the CPU doesn't even operate on RAM directly. It will load a bunch of RAM chunks into the cache and make operations on them and keep RAM-cache consistency by implementing complex mechanisms. The OS also has its role to play in RAM-cache consistency but, like I said, I am far from an expert here. Other than that, caching is mostly transparent to the OS. The CPU handles it and the OS simply provides instructions to the CPU which executes them.
Today, you have paging used by most OS and implemented on most CPU architectures. With paging, every process sees a full contiguous virtual address space. The virtual address space is accessed contiguously and the hardware MMU translates those addresses to physical ones automatically by crossing the page tables. The OS is responsible to make sure the page tables are consistent and the MMU does the rest of the job (for more info read: What is paging exactly? OSDEV). If you understand paging well, things become much clearer.
For a process, there is mostly 3 types of memory. There is the stack (often called automatic storage), the heap and the static/global data. I will attempt to give precision on all of these to give a global picture.
The stack is given a maximum size when the process begins. The OS handles that and creates the page tables and places the proper address in the stack pointer register so that stack accesses reach the proper region of physical memory. The stack is automatic storage which means that it isn't handled manually by the high level programmer. For example, in C/C++, the stack is managed by the compiler which, at the entry of a function, will create a stack frame and place offsets from the stack base pointer in the instructions. Every local variable (within a function) will be accessed with a relative negative offset from the stack base pointer. What the compiler needs to do is to create a stack frame of the proper size so that there will be enough place for all local variables of a particular function (for more info on the stack see: Each program allocates a fixed stack size? Who defines the amount of stack memory for each application running?).
For the heap, the OS reserves a very big amount of virtual memory. Today, virtual memory is very big (2^48 bytes or more). The amount of heap available for each process is often only limited by the amount of physical memory available to back virtual memory allocations. For example, a process could use malloc() to allocate 4KB of memory in C. The OS will be called with a system call by the libc library which is an implementation of the C standard library. The OS will then reserve a page of the virtual memory available for the heap and change the page tables so that accessing that portion of virtual memory will translate to somewhere in RAM (probably somewhere another process wasn't already using).
The static/global data are simply placed in the executable in the data segment. The data segment is loaded in the virtual memory alongside the text segment. The text segment will thus be able to access this data often using RIP-relative addressing.

How does the linux kernel avoid the stack overwriting the text (instructions)?

I was curious about how the kernel prevents the stack from growing too big, and I found this Q/A:
Q: how does the linux kernel enforce stack size limits?
A: The kernel can control this due to the virtual memory. The virtual
memory (also known as memory mapping), is basically a list of virtual
memory areas (base + size) and a target physically memory area that
the kernel can manipulate that is unique to each program. When a
program tries to access an address that is not on this list, an
exception happens. This exception will cause a context switch into
kernel mode. The kernel can look up the fault. If the memory is to
become valid, it will be put into place before the program can
continue (swap and mmap not read from disk yet for instance) or a
SEGFAULT can be generated.
In order to decide the stack size limit, the kernel simply manipulates
the virtual memory map. - Stian Skjelstad
But I didn't quite find this answer satisfactory. "When a program tries to access an address that is not on this list, an exception happens." - But wouldn't the text section (instructions) of the program be part of the virtual memory map?
I'm asking about how the kernel enforces the stack size of user programs.
There's a growth limit, set with ulimit -s for the main stack, that will stop the stack from getting anywhere near .text. (And the guard pages below that make sure there's a segfault if the stack does overflow past the growth limit.) See How is Stack memory allocated when using 'push' or 'sub' x86 instructions?. (Or for thread stacks (not the main thread), stack memory is just a normal mmap allocation with no growth; the only lazy allocation is physical pages to back the virtual ones.)
Also, .text is a read+exec mapping of the executable, so there's no way to modify it without calling mprotect first. (It's a private mapping, so doing so would only affect the pages in memory, not the actual file. This is how text relocations work: runtime fixups for absolute addresses, to be fixed up by the dynamic linker.)
The actual mechanism for limiting growth is by simply not extending the mapping and allocating a new page when the process triggers a hardware page fault with the stack pointer below the existing stack area. Thus the page fault is an invalid one, instead of a soft aka minor for the normal stack-growth case, so a SIGSEGV is delivered.
If a program used alloca or a C99 VLA with an unchecked size, malicious input could make it jump over any guard pages and into some other read/write mapping such as .data or stuff that's dynamically allocated.
To harden buggy code against that so it segfaults instead of actually allowing a stack clash attack, there are compiler options that make it touch every intervening page as the stack grows, so it's certain to set off the "tripwire" in the form of an unmapped guard page below the stack-growth limit. See Linux process stack overrun by local variables (stack guarding)
If you set ulimit -s unlimited could you maybe grow the stack into some other mapping, if Linux truly does allow unlimited growth in that case without reserving a guard page as you approach another mapping.

Does memory layout of a program depend on address binding technique?

I have learned that with run-time address binding, the program can be allocated frames in the physical memory non-contiguously. Also, as described here and here, every segment of the program in the logical address space is contiguous, but not all segments are placed together side-by-side. The text, data, BSS and heap segments are placed together, but the stack segment is not. In other words, there are pages between the heap and the stack segments (between the program break and stack top) in the logical address space that are not mapped to any frames in the physical address space, thus implying that the logical address space is non-contiguous in the case of run-time address binding.
But what about the memory layout in the case of compile-time or load-time binding ? Now that the logical address space in not an abstract address space but the actual physical address space, how is a program laid out in the physical memory ? More specifically, how is the stack segment placed in the physical address space of a program ? Is it placed together with the rest of the segments or separately just as in the case of run-time binding ?
To answer your quesitons, I first have to explain a bit about stack and heap allocation in modern operating systems.
Stack as the name suggest is the continues memory allocation, where cpu uses push and pop commands to add/remove data from top of the stack. I assume that you already know how stack works. process stores - return address, function arguments and local variables over stack. Every time a function is called, more data is pushed (it could ultimately lead to stack overflow is no data is popped ever - infinite recursion?). Stack size is fixed for a program when it is loaded in the memory. Most of the programming language lets you decide stack size during compilation. If not, they will decide a default. On Linux, maximum stack size(hard limit) is limited by ulimit. You can check and set the size by ulimit -s.
Heap space however, has no upper limit in *nix systems(depends, confirm it using ulimit -v), every program starts with a default/set amount of heap and can increase as much needed. Heap space in a process is actually two linked lists, free and used blocks. Whenever memory allocation is required from heap, one or more free blocks are combined to form a bigger block and allocated to the used list as a single block. Freeing up means removing a block from the used list to the free list. After Freeing the blocks, heap can have external fragmentation. Now if the the number of free blocks can't contain the whole data, process will request more memory from the OS, Generally the newer blocks are allocated from a higher address. Thus we show upward direction diagram for heap growth. I rephrase - Heap does not allocate memory continuously in a higher direction.
Now to answer your questions.
With compile-time or load-time address binding, how are the stack and
the heap segments placed in the physical address space of a program ?
Fixed stack is allocated at compile time, with some heap memory. How are placed have been explained above.
Is the space between the heap and the stack reserved for the program
or is it available for the OS to be used for other programs ?
Yes it is reserved for the program. Process however can request more memory to add free blocks in its heap. It is different than sharing it's own heap.
Note: There are lots of topic which can be covered here as the question is broad. Some of them are - garbage collection, block selection, shared memory etc. I will soon add the references here.
References:-
Memory Management in JVM
Stack vs Heap
Heap memory allocation strategies

Java Card memory leak in for loop?

I know that Java Card VM's doesn't have have a garbage collector, but what happens with a for loop:
for(short x=0;x<10;x++)
{}
Does the x variable get utilized after the for loop, or it turns into garbage?
Just in case I have a transient byte array called index from size of 2 (instead of i in for loop) and I use the array in for loops:
for(index[0]=0;index[0]<10;index[0]++)
{}
But it is a little slower than the first version. If I use a normal variable for the index in a for loop then it gets really slow.
So, what happens with the x variable in the first for loop? Is it safe to use for loops like that, or not?
The x variable does not really exist in byte code. There are operations on a location in the Java stack that represents x (be it Java byte code or the code after conversion by the Java Card converter). Now the Java stack is a virtual stack. This stack is implemented on a CPU that has registers and a non-virtual stack. In general, if there are enough registers available, then the variable x is simply put in a register until it is out of scope. The register may of course be reused. The CPU stack itself is a LIFO (last in first out) queue in transient memory (RAM). The stack continuously grows and shrinks during the execution of the byte code that makes up your Applet. Like registers, the stack memory is reused over and over again. All the local variables (those defined inside code blocks as well as method arguments) are treated this way.
If you put your variable in a transient array then you are putting the variable on a RAM based heap. The Java Card RAM heap will never go out of scope. That means that if you update the value that the change needs to be written to transient memory. That is of course slower than a localized update of a CPU register, as you found by experimentation. Usually the memory in the transient memory is never freed. That said, you can of course reuse the memory for other purposes, as long as you have a reference to the array. Note that the references themselves (the index in index[0]) may be either in persistent memory (EEPROM or flash) or transient memory.
It's unclear what you mean with "normal variable". If this is something that has been created with new or if it is a field in an object instance then it persists in the heap in persistent memory (EEPROM or flash). EEPROM and flash have limited amount of write cycles and writing to EEPROM or flash is much much slower than writing to RAM.
Java Card contains two kinds of transient memory: CLEAR_ON_RESET and CLEAR_ON_DESELECT. The difference between the two is that the CLEAR_ON_RESET allows memory to be shared between Applet instances while the CLEAR_ON_DESELECT allows memory to be reused by different Applets.
Java Card classic doesn't contain a garbage collector that runs during Applet execution, you can usually only request garbage collection during startup using JCSystem.requestObjectDeletion() which will clean up the memory that is not referenced anymore, both on the heap in transient memory as well as in persistent memory. Cleaning up the memory means scanning all the memory, marking all the unreferenced blocks and then compacting the memory. This is similar to defragmenting a hard disk; it can take a uncomfortably long time.
ROM is filled during the manufacturing phase. It may contain the operating system, the Java Card API implementation, byte code (including constants) of pre-loaded applets etc. It only be read in the field, so it isn't of any consequence to the question asked.
Let us have a short introduction about the Memory. In brief, there is 3 types of memories in the Smart cards as below:
ROM (And sometimes FLASH)
EEPROM
RAM
ROM:
The card's OS and Java Card APIs and some factory proprietary packages stored here. The contents of this memory is fixed and you can't modify it. Writing in this memory happens only once in the chip production and the process is named Masking.
EEPROM:
This is modifiable memory that your applets load into and it is consist of 4 sections named as below:
Text : also known as code segment contains the machine instructions of the program. The code can be thought of like the text of a novel: It tells the story of what the program does
Data : contains the static data of the program, i.e. the variables that exist throughout program execution. Global variables in a C or C++ program are static, as are variables declared as static in C, C++, or Java.
Heap : is a pool of memory used for dynamically allocated memory, such as with malloc() in C or new in C++ and Java.
Stack : contains the system stack, which is used as temporary storage.
A power-less (Card tearing for example) doesn't have any effect on the contents of this memory.
RAM:
This is a modifiable type of memory also. There is three main difference between RAM and EEPROM:
RAM is really faster than EEPROM. (1000 times faster)
Contents of RAM will destroyed in the power-loss.
The number of writing in EEPROM is limited (Typically 100.000 times.) while RAM has a really higher number.
And What now?
When you write for(short x=0; x<10; x++), you define x as a local variable. Local variables stores in Stack. The stack pointer will reset on the power loss and the used stack part will reclaim. So the main problem of the absence of Garbage Collector is about Heap.
i.e when you define a local variable using new keyword, you specify that part of Heap to a local variable for ever. When the Runtime Environment finished that method, the object will destroy and become unavailable, while that section of Heap doesn't reclaimed. So you will lose that part of Heap. The case that you used for yourfor loop, seems OK and doesn't make any problem because you didn't use new keyword.
Note that in newer versions of Java Card (2.2.2 and higher) there is a manual Garbage Collector (look JCSystem.requestObjectDeletion documentation). But consider that it is really slow and even dangerous in some situations(Look Java Card power less during garbage collection question).

OS X, gcc, x86, segmentation, paging, seg fault, bus error

In the case of osx, gcc, modern x86:
How is the x86 segmentation h/w and paging h/w used?
For the most part1, the segmentation hardware isn't used. Most current OSes set CS, DS, SS, and ES to all point to all memory (base address of 0, limit of 4Gig). Each is set to allow full access to all that memory (CS->execute, DS, ES, SS->read/write).
That means nearly all real access control is done with the paging unit. The basic idea is that pages accessible by a particular process are mapped to that process. Pages that are in virtual memory are mapped, but marked not present, so attempting to read/write them will cause an exception; the OS reads the data from the paging file into RAM, marks the data as present, and re-starts the instruction.
As far as how pages are marked, most executable code will be marked read-only, and will be shared between processes. Most data and stack will be marked read/write and will not be shared. Depending on the exact system, stack space will usually have the NX bit set to prevent it from being executed.
There are a few other bits and pieces that are a bit different. For example, most OSes (including OS/X, if memory serves) set up a stack guard page -- a page at the top of the stack that allows no access. When/if you try to access it, the OS catches an exception, allocates another page of stack space, and re-starts the instruction. This means you can allocate (say) 4 megabytes of address space for the stack, but only allocate actual RAM for roughly the space that's been used (obviously in page-sized increments).
The hardware also supports "large" (4 megabyte) pages. These are used primarily for mapping large chunks of contiguous memory like the part of the memory on the graphics card that's directly visible to the CPU.
That's only a very high-level view, but it's hard to provide more detail without knowing what you care about. Trying to cover all the use of paging by an entire OS could occupy an entire (large) book.
1 Windows (unlike most other systems) does make a minimal use of segmentation -- it sets up FS as a pointer to a Thread Information Block (TIB), which gives access to some basic information about the current thread. This is useful (and used) particularly by Windows' Structured Exception Handling (and Vectored Exception Handling).

Resources