OS X, gcc, x86, segmentation, paging, seg fault, bus error - macos

In the case of OS X, gcc, and modern x86:
How are the x86 segmentation hardware and paging hardware used?

For the most part [1], the segmentation hardware isn't used. Most current OSes set CS, DS, SS, and ES to all cover all of memory (base address of 0, limit of 4 GiB). Each is set to allow full access to that memory (CS: execute; DS, ES, SS: read/write).
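A rough sketch of why a flat setup makes segmentation invisible: the segment unit computes base + offset and checks the offset against the limit, so with a base of 0 and a limit covering 4 GiB the linear address is simply the offset. The function and constant names below are made up for illustration.

```python
# Toy model of segment translation: linear = base + offset, checked against the limit.
# Names are illustrative only; this is not how you program the hardware.

def linear_address(base, limit, offset):
    """Return the 32-bit linear address for an offset within a segment."""
    if offset > limit:
        raise ValueError("offset beyond segment limit")
    return (base + offset) & 0xFFFFFFFF  # 32-bit wraparound

# The "flat" model described above: base 0, limit covering the whole 4 GiB.
FLAT_BASE, FLAT_LIMIT = 0x00000000, 0xFFFFFFFF

flat = linear_address(FLAT_BASE, FLAT_LIMIT, 0x12345678)   # offset passes through
small = linear_address(0x1000, 0x0FFF, 0x10)               # non-flat segment, for contrast
```

With the flat segment the offset comes back unchanged, which is why all the interesting checks end up in the paging unit.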
That means nearly all real access control is done with the paging unit. The basic idea is that pages accessible by a particular process are mapped into that process's address space. Pages that have been paged out to disk are still mapped, but marked not present, so attempting to read/write them will cause an exception; the OS reads the data from the paging file into RAM, marks the page as present, and re-starts the instruction.
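The not-present / fault / reload / restart cycle can be modelled in a few lines. This is only a toy: `PageTableEntry`, `swap`, and `access` are invented names, and a real kernel does this work in its page-fault handler rather than in a Python `except` block.

```python
# Toy model of demand paging: a "not present" page faults, the "OS" loads it
# from a pretend paging file, marks it present, and the access is retried.

class PageFault(Exception):
    pass

class PageTableEntry:
    def __init__(self, present=False, data=None):
        self.present = present
        self.data = data

def raw_access(page_table, page):
    """What the hardware does: fault if the page is not present."""
    entry = page_table[page]
    if not entry.present:
        raise PageFault(page)
    return entry.data

def access(page_table, swap, page, log):
    """Hardware access plus the OS fault handler and instruction restart."""
    try:
        return raw_access(page_table, page)
    except PageFault:
        log.append(f"fault on page {page}")
        entry = page_table[page]
        entry.data = swap[page]              # read the page in from the paging file
        entry.present = True                 # mark the page as present
        return raw_access(page_table, page)  # restart the access

page_table = {0: PageTableEntry(True, "code"), 1: PageTableEntry(False)}
swap = {1: "paged-out data"}
log = []
first = access(page_table, swap, 1, log)    # faults once, then succeeds
second = access(page_table, swap, 1, log)   # already present, no fault
```

Only the first access to page 1 records a fault; the second finds the page present and goes straight through.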
As far as how pages are marked, most executable code will be marked read-only, and will be shared between processes. Most data and stack will be marked read/write and will not be shared. Depending on the exact system, stack space will usually have the NX bit set to prevent it from being executed.
There are a few other bits and pieces that are a bit different. For example, most OSes (including OS X, if memory serves) set up a stack guard page -- a page just beyond the currently allocated end of the stack that allows no access. When/if you try to access it, the OS catches an exception, allocates another page of stack space, and re-starts the instruction. This means you can allocate (say) 4 megabytes of address space for the stack, but only allocate actual RAM for roughly the space that's been used (obviously in page-sized increments).
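A small simulation of that guard-page scheme, assuming 4 KiB pages and a 4 MiB reservation. `ToyStack` and its methods are hypothetical; the point is only that touching the guard page commits one more page, while touching anything further away is treated as a fault.

```python
PAGE = 4096
RESERVED = 4 * 1024 * 1024   # address space set aside for the stack (assumed)

class ToyStack:
    def __init__(self, top):
        self.top = top            # highest address; the stack grows downward
        self.committed = PAGE     # RAM actually backing the stack so far

    def guard_page(self):
        """Address range of the guard page, just below the committed region."""
        low = self.top - self.committed
        return (low - PAGE, low)

    def touch(self, addr):
        """Return True if the access succeeds, growing the stack if needed."""
        low = self.top - self.committed
        if low <= addr < self.top:
            return True                       # already backed by RAM
        g_lo, g_hi = self.guard_page()
        if g_lo <= addr < g_hi and self.committed + PAGE <= RESERVED:
            self.committed += PAGE            # commit one more page; guard moves down
            return True
        return False                          # anywhere else: a real fault

stack = ToyStack(top=0x70000000)
in_first_page = stack.touch(0x70000000 - 100)        # inside the committed page
hit_guard = stack.touch(0x70000000 - PAGE - 8)       # lands in the guard page
far_below = stack.touch(0x70000000 - 10 * PAGE)      # skips past the guard page
```

After these three accesses only two pages are committed, even though 4 MiB of address space is reserved.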
The hardware also supports "large" pages (4 megabytes in classic 32-bit paging; 2 megabytes with PAE or in 64-bit mode). These are used primarily for mapping large chunks of contiguous memory like the part of the memory on the graphics card that's directly visible to the CPU.
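To see why large pages help for big contiguous regions, compare how many mappings each page size needs. The 256 MiB aperture below is a made-up figure chosen for the arithmetic, not a property of any particular graphics card.

```python
SMALL_PAGE = 4 * 1024            # 4 KiB
LARGE_PAGE = 4 * 1024 * 1024     # 4 MiB (classic 32-bit large page)

def pages_needed(size, page_size):
    """Number of pages required to cover `size` bytes (rounded up)."""
    return -(-size // page_size)

aperture = 256 * 1024 * 1024     # hypothetical 256 MiB CPU-visible region
small_count = pages_needed(aperture, SMALL_PAGE)
large_count = pages_needed(aperture, LARGE_PAGE)
large_offset_bits = LARGE_PAGE.bit_length() - 1   # bits of in-page offset
```

The same region needs 65536 small mappings but only 64 large ones, so far fewer page-table and TLB entries are involved.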
That's only a very high-level view, but it's hard to provide more detail without knowing what you care about. Trying to cover all the use of paging by an entire OS could occupy an entire (large) book.
[1] Windows does make minimal use of segmentation -- it sets up FS (GS on x86-64) as a pointer to a Thread Information Block (TIB), which gives access to some basic information about the current thread. This is useful (and used) particularly by Windows' Structured Exception Handling (and Vectored Exception Handling). Other systems use FS or GS in a similar way for thread-local storage.

Related

what is the use of attaching static data along with a program when it is loaded on the main memory?

When the operating system loads a program into main memory, it attaches the static data along with the stack and heap memory. I googled what is present in the static data, and it said that it contains the global variables and static variables. But I am confused: both of these are already present in the text file of the program, so why do we add them separately?
The data in the executable is often referred to as the data segment. The CPU doesn't interact with the hard disk but only with RAM. The data segment must thus be loaded into RAM before the CPU can access it. The file of the executable is not really a text file. It is an executable, so it has a different extension. "Text file" usually refers to an actual file with a .txt extension.
With that said, you also asked another question not long ago (If the amount of stack memory provided to a program is fixed then why does it grow downwards in the process architecture? Or am I getting it wrong?) so I will try to give some insight for both of these in this same answer.
I don't know much about caching and low level inner CPU workings but, today mostly, the CPU doesn't even operate on RAM directly. It will load a bunch of RAM chunks into the cache and make operations on them and keep RAM-cache consistency by implementing complex mechanisms. The OS also has its role to play in RAM-cache consistency but, like I said, I am far from an expert here. Other than that, caching is mostly transparent to the OS. The CPU handles it and the OS simply provides instructions to the CPU which executes them.
Today, paging is used by most OSes and implemented on most CPU architectures. With paging, every process sees a full contiguous virtual address space. The virtual address space is accessed contiguously and the hardware MMU translates those addresses to physical ones automatically by walking the page tables. The OS is responsible for making sure the page tables are consistent, and the MMU does the rest of the job (for more info read: What is paging exactly? OSDEV). If you understand paging well, things become much clearer.
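As a minimal sketch of that translation, here is a single-level lookup with 4 KiB pages. Real MMUs use multi-level tables; `translate` and the table contents are invented for illustration. Note how contiguous virtual pages map to scattered physical frames.

```python
PAGE_SIZE = 4096

def translate(page_table, vaddr):
    """Map a virtual address to a physical one via a {virtual page: frame} table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"page fault at {vaddr:#x}")
    return page_table[vpn] * PAGE_SIZE + offset

# Virtual pages 0, 1, 2 are contiguous; their physical frames are not.
page_table = {0: 7, 1: 2, 2: 9}

end_of_page0 = translate(page_table, 0x0FFF)    # last byte of virtual page 0
start_of_page1 = translate(page_table, 0x1000)  # next virtual byte, different frame
```

Two adjacent virtual addresses end up in frames 7 and 2, which the process never notices.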
For a process, there are mainly three types of memory: the stack (often called automatic storage), the heap, and the static/global data. I will try to clarify each of these to give a global picture.
The stack is given a maximum size when the process begins. The OS handles that and creates the page tables and places the proper address in the stack pointer register so that stack accesses reach the proper region of physical memory. The stack is automatic storage which means that it isn't handled manually by the high level programmer. For example, in C/C++, the stack is managed by the compiler which, at the entry of a function, will create a stack frame and place offsets from the stack base pointer in the instructions. Every local variable (within a function) will be accessed with a relative negative offset from the stack base pointer. What the compiler needs to do is to create a stack frame of the proper size so that there will be enough place for all local variables of a particular function (for more info on the stack see: Each program allocates a fixed stack size? Who defines the amount of stack memory for each application running?).
For the heap, the OS reserves a very big amount of virtual memory. Today, virtual memory is very big (2^48 bytes or more). The amount of heap available for each process is often only limited by the amount of physical memory available to back virtual memory allocations. For example, a process could use malloc() to allocate 4KB of memory in C. The OS will be called with a system call by the libc library which is an implementation of the C standard library. The OS will then reserve a page of the virtual memory available for the heap and change the page tables so that accessing that portion of virtual memory will translate to somewhere in RAM (probably somewhere another process wasn't already using).
The static/global data are simply placed in the executable in the data segment. The data segment is loaded in the virtual memory alongside the text segment. The text segment will thus be able to access this data often using RIP-relative addressing.

How does the linux kernel avoid the stack overwriting the text (instructions)?

I was curious about how the kernel prevents the stack from growing too big, and I found this Q/A:
Q: how does the linux kernel enforce stack size limits?
A: The kernel can control this due to the virtual memory. The virtual
memory (also known as memory mapping), is basically a list of virtual
memory areas (base + size) and a target physical memory area that
the kernel can manipulate that is unique to each program. When a
program tries to access an address that is not on this list, an
exception happens. This exception will cause a context switch into
kernel mode. The kernel can look up the fault. If the memory is to
become valid, it will be put into place before the program can
continue (swap and mmap not read from disk yet for instance) or a
SEGFAULT can be generated.
In order to decide the stack size limit, the kernel simply manipulates
the virtual memory map. - Stian Skjelstad
But I didn't quite find this answer satisfactory. "When a program tries to access an address that is not on this list, an exception happens." - But wouldn't the text section (instructions) of the program be part of the virtual memory map?
I'm asking about how the kernel enforces the stack size of user programs.
There's a growth limit, set with ulimit -s for the main stack, that will stop the stack from getting anywhere near .text. (And the guard pages below that make sure there's a segfault if the stack does overflow past the growth limit.) See How is Stack memory allocated when using 'push' or 'sub' x86 instructions?. (Or for thread stacks (not the main thread), stack memory is just a normal mmap allocation with no growth; the only lazy allocation is physical pages to back the virtual ones.)
Also, .text is a read+exec mapping of the executable, so there's no way to modify it without calling mprotect first. (It's a private mapping, so doing so would only affect the pages in memory, not the actual file. This is how text relocations work: runtime fixups for absolute addresses, to be fixed up by the dynamic linker.)
The actual mechanism for limiting growth is simply not extending the mapping and not allocating a new page when the process triggers a hardware page fault with the stack pointer below the existing stack area and past the limit. Thus the page fault is an invalid one, instead of a soft (aka minor) one as in the normal stack-growth case, so a SIGSEGV is delivered.
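The decision can be sketched roughly as follows. This is a simplification with hypothetical names (`handle_stack_fault` is not a kernel function), and the real Linux logic has more checks than a single size comparison.

```python
PAGE = 4096

def handle_stack_fault(fault_addr, stack_low, stack_top, limit_bytes):
    """Return (verdict, new_stack_low) for a fault at fault_addr.

    stack_low..stack_top is the currently mapped stack region (grows down);
    limit_bytes plays the role of the `ulimit -s` growth limit.
    """
    if fault_addr >= stack_low:
        # Not below the stack at all; out of scope for this sketch.
        return "not a stack-growth fault", stack_low
    new_low = fault_addr - fault_addr % PAGE   # round down to a page boundary
    if stack_top - new_low <= limit_bytes:
        return "grow", new_low                 # valid (minor) fault: extend the mapping
    return "SIGSEGV", stack_low                # past the limit: invalid fault

TOP = 0x80000000
LOW = TOP - 128 * 1024            # 128 KiB currently mapped
LIMIT = 8 * 1024 * 1024           # 8 MiB limit, assumed for the example

just_below = handle_stack_fault(LOW - 8, LOW, TOP, LIMIT)
way_below = handle_stack_fault(TOP - 9 * 1024 * 1024, LOW, TOP, LIMIT)
```

A fault just under the mapped region extends it by a page; a fault 9 MiB down would exceed the 8 MiB limit, so the mapping is left alone and the process gets SIGSEGV.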
If a program used alloca or a C99 VLA with an unchecked size, malicious input could make it jump over any guard pages and into some other read/write mapping such as .data or stuff that's dynamically allocated.
To harden buggy code against that so it segfaults instead of actually allowing a stack clash attack, there are compiler options that make it touch every intervening page as the stack grows, so it's certain to set off the "tripwire" in the form of an unmapped guard page below the stack-growth limit. See Linux process stack overrun by local variables (stack guarding)
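The effect of such probing can be shown with a toy model, assuming 4 KiB pages and a single guard page. `first_guard_hit` is an illustrative helper, not what the compiler emits: it just lists which addresses get touched and reports the first one that lands in the guard region.

```python
PAGE = 4096

def first_guard_hit(sp, alloc_size, guard_lo, guard_hi, probe):
    """Return the first touched address inside the guard region, or None.

    Without probing, only the new stack pointer is touched. With probing,
    one address per page between the old and new stack pointer is touched
    on the way down, then the new stack pointer itself.
    """
    new_sp = sp - alloc_size
    if probe:
        touched = list(range(sp - PAGE, new_sp, -PAGE)) + [new_sp]
    else:
        touched = [new_sp]
    for addr in touched:
        if guard_lo <= addr < guard_hi:
            return addr
    return None

SP = 0x100000                    # current stack pointer (made-up layout)
GUARD_LO, GUARD_HI = 0xEF000, 0xF0000   # one guard page below the mapped stack
BIG = 0x20000                    # a 128 KiB unchecked allocation

unprobed = first_guard_hit(SP, BIG, GUARD_LO, GUARD_HI, probe=False)
probed = first_guard_hit(SP, BIG, GUARD_LO, GUARD_HI, probe=True)
```

The unprobed allocation lands at 0xE0000, below the guard page, so nothing trips; the probed one touches 0xEF000 on the way down and would fault there.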
If you set ulimit -s unlimited, you could maybe grow the stack into some other mapping, if Linux truly does allow unlimited growth in that case without reserving a guard page as you approach another mapping.

How does macOS allocate stack and heap for a process?

I want to know how macOS allocate stack and heap memory for a process, i.e. the memory layout of a process in macOS. I only know that the segments of a mach-o executable are loaded into pages, but I can't find a segment that correspond to stack or heap area of a process. Is there any document about that?
Stacks and heaps are just memory. The only thing that makes a stack a stack or a heap a heap is the way it is accessed. Stacks and heaps are allocated the same way all memory is: by mapping pages into the logical address space.
Let's take a step back - the Mach-O format describes mapping the binary segments into virtual memory. Importantly, the memory pages you mentioned have read, write and execute permissions. If it's an executable (i.e. not a dylib) it must contain the __PAGEZERO segment with no permissions at all. This is the safeguard area to prevent accessing low addresses of virtual memory by accident (this is where the infamous null pointer dereference and the like end up when attempting to access memory address zero).
The __TEXT segment (readable and executable, typically without write) follows, which in virtual memory will contain the file representation itself. This implies all the executable code lives here, along with immutable data like string constants.
The order may vary, but usually next you will encounter __LINKEDIT read only segment. This is the segment dyld uses to setup externally loaded functions, this is too broad to cover here, but there are numerous answers on the topic.
Finally we have the readable writable __DATA segment the first place a process can actually write to. This is used for global/static variables, external addresses to calls populated by dyld.
We have roughly covered the process's initial setup when it launches through either LC_UNIXTHREAD or, in modern macOS (10.7+), LC_MAIN. This starts the process's main thread. Each thread must have its own stack. The creation of it is handled by the operating system (including allocating it). Notice that so far the process has no awareness of the heap at all (it's the operating system that's doing the heavy lifting to prepare the stack).
So to sum up, so far we have 2 independent sources of memory - the process memory representing the Mach-O structure (size is fixed and determined by the executable structure) and the main thread stack (also with a predefined size). The process is about to run a C-like main function; any local variables declared would move the thread stack pointer, as would any calls to functions (local and external), at least to set up the stack frame for the return address. Accessing a global/static variable would reference the __DATA segment's virtual memory directly.
Reserving stack space in x86-64 assembly would look like this:
sub rsp,16
There are some great SO answers on System V / AMD64 ABI (which includes macOS) requirements for stack alignment like this one
Any new thread created will have its own stack to allow setting up stack frames for local variables and calling functions.
Now we can cover heap allocation, which is mediated by libSystem (the macOS C standard library), the provider of malloc/free. Internally this is handled by the mmap & munmap system calls - the kernel API for managing memory pages.
Using those system calls directly is possible, but might turn out to be inefficient, thus an internal memory pool is utilised by malloc/free to limit the number of system calls (which are costly to make).
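The pooling idea can be sketched with Python's standard `mmap` module: one anonymous mapping is requested up front, and later requests are served from it without going back to the kernel. `BumpPool` is a made-up, deliberately naive allocator (it never frees individual blocks) and is far simpler than any real malloc.

```python
import mmap

class BumpPool:
    """Hand out slices of one anonymous mapping instead of mapping per request."""

    def __init__(self, size=mmap.PAGESIZE):
        self.region = mmap.mmap(-1, size)   # one anonymous mapping for the whole pool
        self.size = size
        self.used = 0

    def alloc(self, nbytes, align=16):
        """Return the offset of a fresh block of nbytes within the pool."""
        start = (self.used + align - 1) // align * align   # round up to alignment
        if start + nbytes > self.size:
            raise MemoryError("pool exhausted")
        self.used = start + nbytes
        return start

    def close(self):
        self.region.close()                 # give the mapping back

pool = BumpPool()
a = pool.alloc(24)          # served from the pool, no new mapping
b = pool.alloc(10)          # likewise; starts at the next 16-byte boundary
pool.region[a:a + 5] = b"hello"
stored = bytes(pool.region[a:a + 5])
```

Both allocations come out of the single mapping made in the constructor, which is the saving the paragraph above describes.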
The changing addresses you mentioned in the comment are caused by:
ASLR (address space layout randomization, enabled for PIE, i.e. position-independent executables) for process memory, which is a security measure randomizing the start of virtual memory
Thread local stacks being prepared by the operating system

32-bit physical page table resolution

I'm running a 32 bit system in legacy mode on a 64-bit (x86-64 that is) capable architecture. When a new process is created, the kernel has to decide where in physical memory all of the pages needed at the time of instantiation are to be allocated (assuming a single thread this may include several memory regions such as the stack, the heaps etc).
I'm assuming the kernel keeps some sort of dynamic list of the physical RAM frames that are in use, and also a static list of all the regions of physical memory that have been taken up by devices for systems that use memory-mapped IO. Is this correct?
In addition, I also read that a 32-bit Windows system has a physical memory limit of 4GB (probably due to minimum address bus assumptions) so, even though a system may have more than 4 gigabytes of physical memory installed, a 32 bit kernel will only allocate addresses within the 4GB range.
Specific information regarding low-level operating system implementation for specific cases such as this is quite difficult to find online. Can anyone verify these statements and possibly refer me to a source where I could attain more information?
Thanks for your considerations.
When a new process is created, the kernel has to decide where in physical memory all of the pages needed at the time of instantiation are to be allocated
Why does it have to decide at process creation time? In fact, it only creates them on-demand - it simply creates the PTEs (i.e. "This address range is valid", but the pages are not backed in any way); when the process first starts executing, it immediately page-faults.
What is a page fault though? What happens is, first the CPU reads the TLB to see if it has an address <=> frame mapping. When that fails, it walks the PTEs looking for an entry that matches. If no entry is found, or if the entry indicates that the page isn't backed, a page-fault is generated. This means, that a CPU exception occurs and the CPU immediately jumps to a predefined address. The first thing the kernel then does is save the CPU Context (i.e. the registers at the location of the fault), then dispatches to the page fault handler.
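The lookup order described above (TLB first, then the page tables, then a fault) can be modelled like this. It is a toy with dictionaries standing in for hardware structures, and `lookup` is an invented name.

```python
class PageFault(Exception):
    pass

def lookup(tlb, page_table, vpn):
    """Resolve a virtual page number, reporting where the answer came from."""
    if vpn in tlb:
        return tlb[vpn], "tlb hit"
    frame = page_table.get(vpn)      # None models "no entry, or page not backed"
    if frame is None:
        raise PageFault(vpn)         # the CPU would now jump to the kernel's handler
    tlb[vpn] = frame                 # cache the translation for next time
    return frame, "page walk"

tlb = {}
page_table = {5: 42, 6: None}

first = lookup(tlb, page_table, 5)    # TLB miss, found by walking the table
second = lookup(tlb, page_table, 5)   # now served from the TLB
try:
    lookup(tlb, page_table, 6)
    faulted = False
except PageFault:
    faulted = True
```

The first access to page 5 walks the table, the second hits the TLB, and page 6 (valid range but not backed) raises the fault that the kernel would then service.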
When the page fault occurs, Mm (the Memory Manager in NT) will read the mapping in its own data structures (remember that all PE images are memory-mapped files) and determine at that time which physical frame (i.e. 'a real piece of memory') will be used.
Once the page fault is serviced, the handler restores the saved CPU context, jumps back to where it was, and retries the instruction that faulted.
You're correct that a 32-bit OS will only use 4GB of address space (not RAM! Don't forget those memory-mapped devices and files!). The processor will operate in 32-bit mode and interpret the PTEs as 32-bit (remember that AMD64 long mode uses four levels of page tables and extends the virtual address space to 48 bits).
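For a concrete picture of the two layouts, the sketch below splits an address into its table indices: 10 + 10 + 12 bits for classic 32-bit paging with 4 KiB pages, and 9 + 9 + 9 + 9 + 12 bits for 48-bit long mode. The function names are made up, and PAE (which uses a different 32-bit layout) is not covered.

```python
def split_x86_32(vaddr):
    """Classic 32-bit paging: (directory index, table index, page offset)."""
    return (vaddr >> 22) & 0x3FF, (vaddr >> 12) & 0x3FF, vaddr & 0xFFF

def split_x86_64(vaddr):
    """48-bit long mode: four 9-bit table indices plus a 12-bit page offset."""
    offset = vaddr & 0xFFF
    pt = (vaddr >> 12) & 0x1FF
    pd = (vaddr >> 21) & 0x1FF
    pdpt = (vaddr >> 30) & 0x1FF
    pml4 = (vaddr >> 39) & 0x1FF
    return pml4, pdpt, pd, pt, offset

legacy = split_x86_32(0x00403004)
long_mode = split_x86_64(0x00007FFFFFFFF000)
```

Each index selects one entry at its level of the page-table tree; the final offset is the byte position within the 4 KiB page.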
32-bit systems can only ever address 4 GB directly (2^32 bytes = 4 GB). There are PAE hacks, which let the system have more than 4 GB of physical RAM, but no process can ever have more than 4 GB of address space available. As well, even if you have 4 GB of RAM, you'll never see more than 3.5 GB or so actually available - some of the address space is reserved for memory-mapped hardware devices, such as your video RAM.
For one method of dealing with the physical-virtual memory mapping, look at the TLB (translation lookaside buffer).

What is the state of the art in Memory Protection?

The more I read about low level languages like C and pointers and memory management, it makes me wonder about the current state of the art with modern operating systems and memory protection. For example what kind of checks are in place that prevent some rogue program from randomly trying to read as much address space as it can and disregard the rules set in place by the operating system?
In general terms how do these memory protection schemes work? What are their strength and weaknesses? To put it another way, are there things that simply cannot be done anymore when running a compiled program in a modern OS even if you have C and you own compiler with whatever tweaks you want?
The protection is enforced by the hardware (i.e., by the CPU). Applications can only express addresses as virtual addresses, and the CPU resolves the mapping of virtual address to physical address using the page tables, caching recent translations in translation lookaside buffers (TLBs). Whenever the CPU cannot find a valid mapping for an address, it generates a 'page fault', which interrupts the running application and switches control to the operating system. The operating system is responsible for consulting its internal structures and either establishing a mapping between the virtual address touched by the application and an actual physical address, or rejecting the access. Once a mapping is in place, the CPU can resume the application.
The CPU instructions needed to load a mapping between a physical address and a virtual one are protected and as such can only be executed by a protected component (ie. the OS kernel).
Overall the scheme works because:
applications cannot address physical memory
resolving mapping from virtual to physical requires protected operations
only the OS kernel is allowed to execute protected operations
The scheme fails though if a rogue module is loaded in the kernel, because at that protection level it can read and write into any physical address.
Applications can read and write other processes' memory, but only by asking the kernel to do this operation for them (e.g. in Win32, ReadProcessMemory), and such APIs are protected by access control (certain privileges are required on the caller).
Memory protection is enforced in hardware, typically with a minimum granularity on the order of KBs.
From the Wikipedia article about memory protection:
In paging, the memory address space is
divided into equal, small pieces,
called pages. Using a virtual memory
mechanism, each page can be made to
reside in any location of the physical
memory, or be flagged as being
protected. Virtual memory makes it
possible to have a linear virtual
memory address space and to use it to
access blocks fragmented over physical
memory address space.
Most computer architectures based on
pages, most notably x86 architecture,
also use pages for memory protection.
A page table is used for mapping
virtual memory to physical memory. The
page table is usually invisible to the
process. Page tables make it easier to
allocate new memory, as each new page
can be allocated from anywhere in
physical memory.
By such design, it is impossible for
an application to access a page that
has not been explicitly allocated to
it, simply because any memory address,
even a completely random one, that
application may decide to use, either
points to an allocated page, or
generates a page fault (PF) error.
Unallocated pages simply do not have
any addresses from the application
point of view.
You should search for Segmentation Fault, Memory Violation Error and General Protection Fault. These are errors reported by various OSes in response to a program trying to access a memory address it shouldn't access.
And Windows Vista (or 7) randomizes the addresses at which DLLs are loaded (ASLR), which means that a buffer overflow can take you to different addresses each time it occurs. This also makes buffer overflow attacks a little bit less repeatable.
So, to link together the answers posted with your question: a program that attempts to read any memory address that is not mapped in its address space will cause the processor to raise a page fault exception, transferring control to the operating system (trusted code). The kernel then checks the faulting address; if there is no mapping for it in the current process's address space, it sends the SIGSEGV (segmentation fault) signal to the process, which typically kills it (talking about Linux/Unix here). On Windows you get something along the same lines.
Note: take a look at mprotect() on Linux and other POSIX operating systems. It allows you to set the protection of pages of memory explicitly. Functions like malloc() return memory on pages with default protection, which you can then modify; this way you can mark areas of memory read-only (but only in page-size chunks, typically 4 KB).
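Because protection applies to whole pages, a buffer that is not page-aligned drags its neighbours along. The arithmetic below works out which page range would have to be covered for a given buffer; `protected_range` is an illustrative helper, and the 4096-byte page size is an assumption (the real value should be queried from the system).

```python
PAGE_SIZE = 4096  # assumed; the actual page size is system-dependent

def protected_range(addr, length, page_size=PAGE_SIZE):
    """Return the page-aligned [start, end) range that covers addr..addr+length."""
    start = addr - addr % page_size                     # round start down
    end = -(-(addr + length) // page_size) * page_size  # round end up
    return start, end

within_one_page = protected_range(0x1234, 100)
straddles_two = protected_range(0x1FF0, 0x20)
```

A 100-byte buffer at 0x1234 makes the whole page 0x1000..0x2000 read-only, and a 32-byte buffer that straddles a page boundary affects two full pages, including whatever else happens to live on them.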

Resources