Correct way to identify stack areas of Pthreads - linux-kernel

I want to identify the stack areas of pthreads (on Linux), for this I am using a kernel module to walk through the VMAs of the process (as I have learned that stack areas of pthreads are allocated in the heap area of process) and checking whether VM_STACK_FLAGS (include/linux/mm.h) is set for the VMA, if yes, I consider the VMA as stack.
Is this approach the correct way? or Is there a better way to identify stacks for pthreads on Linux?
Thanks

Related

Does the Windows ABI allow me to change the stack pointer?

I know that the windows ABI have some restrictions about code generation for procedure's prologs & epilog, but I was wondering if it's fine by the OS to allocate a large heap storage and point the stack pointer to this location (and restore the RSP before the function returns)?
Basically, from what I understand windows threads have a hard limit of 4GB and I wonder if it's OK to increase the stack limit that way or if there's another way to do so?
I have read the information that MSDN has about the x64 stack usage here but I could not find any information about assigning new value to the stack register

How to calculate total RSS of all thread stacks under Linux?

I have a heavily multi-threaded application under Linux consuming lots of memory and I am trying to categorize its RSS. I found particularly challenging to estimate total RSS of all thread stacks in program. I had following ideas:
Idea 1: look into /proc/<pid>/smaps and consider mappings for stacks; there is an information regarding resident size of each mapping but only the main thread mapping is annotated like [stack]; the rest of them is indistinguishable from regular 8 MiB mappings (with default stack size). Also reading /proc/<pid>/smaps is pretty expensive as it produces contention on kernel innternal VMA data structures.
Idea 2: look into /proc/<tid>/status; there is VmStk section which should describe stack resident size, but it always shows stack size of a main thread. It looks pretty clear why: beacuse main thread is the only one for which kernel allocates stack by itself, while the rest of threads gets stack from pthreads code which allocates it as a regular memory mapping.
Idea 3: traverse threads from user-space using some stuff from pthreads, retrieve stack mapping address and stack size for each thread and then find out how many pages are resident using mincore(2). As a possible optimization, we may skip calling mincore for sleeping threads using the cached value for them. Unfortunately, I did not find any suitable way to iterate over pthread_t structures. Note that part of the threads comes from the libraries which I am not able to control, so maintaining any kind of thread registry by registering threads on startup is not possible.
Idea 4: use ptrace(2) to retrieve thread registers, retrive stack pointers from them, then proceed with Idea 1. This way looks excessively hard and intrusive.
Can anybody provide me more or less intended way to do so? Being non-portable is OK.
Two more ideas I got after some extra research:
Idea 5: from man 5 proc on /proc/<pid>/maps:
There are additional helpful pseudo-paths:
[stack]
The initial process's (also known as the main thread's) stack.
[stack:<tid>] (since Linux 3.4)
A thread's stack (where the <tid> is a thread ID). It corresponds to the /proc/[pid]/task/[tid]/ path.
It looks intriguing, but it seems that this logic has been reverted as it was implemented ineffiiently: https://lore.kernel.org/patchwork/patch/716239/. Man page seems obsolete (at least on my Ubuntu Disco 19.04).
Idea 6: This one may actually work. There is an /proc/<tid>/syscall file which may expose thread stack register for a blocked thread. Considering the fact that most of my threads are sleeping on I/O, this allows me to track their rsp value, which I may project onto /proc/<pid>/maps to find the correspondence between thread and its stack mapping. After that I may implement Idea 3.

How does macOS allocate stack and heap for a process?

I want to know how macOS allocate stack and heap memory for a process, i.e. the memory layout of a process in macOS. I only know that the segments of a mach-o executable are loaded into pages, but I can't find a segment that correspond to stack or heap area of a process. Is there any document about that?
Stacks and heaps are just memory. The only think that makes a stack a stack or a heap or a heap is the way it is accessed. Stacks and heaps are allocated the same way all memory is: by mapping pages into the logical address space.
Let's take a step back - the Mach-o format describes mapping the binary segments into virtual memory. Importantly the memory pages you mentioned have read write and execute permissions. If it's an executable(i.e. not a dylib) it must contain the __PAGEZERO segment with no permissions at all. This is the safe guard area to prevent accessing low addresses of virtual memory by accident (here falls the infamous Null pointer exception and such if attempting to access zero memory address).
__TEXT read executable (typically without write) segment follows which in virtual memory will contain the file representation itself. This implies all the executable code lives here. Also immmutable data like string constants.
The order may vary, but usually next you will encounter __LINKEDIT read only segment. This is the segment dyld uses to setup externally loaded functions, this is too broad to cover here, but there are numerous answers on the topic.
Finally we have the readable writable __DATA segment the first place a process can actually write to. This is used for global/static variables, external addresses to calls populated by dyld.
We have roughly covered the process initial setup when it will launch through either LC_UNIXTHREAD or in modern MacOS (10.7+) LC_MAIN. This starts the process main thread. Each thread must contain it's own stack. The creation of it is handled by operating system (including allocating it). Notice so far the process has no awareness of the heap at all (it's the operating system that's doing the heavy lifting to prepare the stack).
So to sum up so far we have 2 independent sources of memory - the process memory representing the Mach-o structure (size is fixed and determined by the executable structure) and the main thread stack (also with predefined size). The process is about to run a C-like main function , any local variables declared would move the thread stack pointer, likewise any calls to functions (local and external) to at least setup the stack frame for return address. Accessing a global/static variable would reference the __DATA segment virtual memory directly.
Reserving stack space in x86-64 assembly would look like this:
sub rsp,16
There are some great SO anwers on System V / AMD64 ABI (which includes MacOS) requirements for stack alignment like this one
Any new thread created will have its own stack to allow setting up stack frames for local variables and calling functions.
Now we can cover heap allocation - which is mitigated by the libSystem (aka MacOS C standard library) delivering the malloc/free. Internally this is handled by mmap & munmap system calls - the kernel API for managing memory pages.
Using those system calls directly is possible, but might turned out inefficient, thus an internal memory pool is utilised by malloc/free to limit the number of system calls (which are costly to make).
The changing addresses you mentioned in the comment are caused by:
ASLR aka PIE (position independent code) for process memory , which is a security measure randomizing the start of virtual memory
Thread local stacks being prepared by the operating system

How to ensure that malloc and mmap cooperate, i.e., work on non-overlapping memory regions?

My main problem is that I need to enable multiple OS processes to communicate via a large shared memory heap that is mapped to identical address ranges in all processes. (To make sure that pointer values are actually meaningful.)
Now, I run into trouble that part of the program/library is using standard malloc/free and it seems to me that the underlying implementation does not respect mappings I create with mmap.
Or, another option is that I create mappings in regions that malloc already planned to use.
Unfortunately, I am not able to guarantee 100% identical malloc/free behavior in all processes before I establish the mmap-mappings.
This leads me to give the MAP_FIXED flag to mmap. The first process is using 0x0 as base address to ensure that the mapping range is at least somehow reasonable, but that does not seem to transfer to other processes. (The binary is also linked with -Wl,-no_pie.)
I tried to figure out whether I could query the system to know which pages it plans to use for malloc by reading up on malloc_default_zone, but that API does not seem to offer what I need.
Is there any way to ensure that malloc is not using particular memory pages/address ranges?
(It needs to work on OSX. Linux tips, which guide me in the right direction are appreciate, too.)
I notice this in the mmap documentation:
If MAP_FIXED is specified, a successful mmap deletes any previous mapping in the allocated address range
However, malloc won't use map fixed, so as long as you get in before malloc, you'd be okay: you could test whether a region is free by first trying to map it without MAP_FIXED, and if that succeeds at the same address (which it will do if the address is free) then you can remap with MAP_FIXED knowing that you're not choosing a section of address space that malloc had already grabbed
The only guaranteed way to guarantee that the same block of logical memory will be available in two processes is to have one fork from the other.
However, if you're compiling with 64-bit pointers, then you can just pick an (unusual) region of memory, and hope for the best, since the chance of collision is tiny.
See also this question about valid address spaces.
OpenBSD malloc() implementation uses mmap() for memory allocation. I suggest you to see how does it work then write your own custom implementation of malloc() and tell your program and the libraries used by it to use your own implementation of malloc().
Here is OpenBSD malloc():
http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/stdlib/malloc.c?rev=1.140
RBA

how does stack growing work on windows and linux?

I just read that windows programs call _alloca on function entry to grow the stack if they need more than 4k on the stack. I guss that every time the guard page is hit windows allocates a new page for the stack, therefore _alloca accesses the stack in 4k steps to allocate the space.
I also read that this only applies to windows. How does linux (or other oses) solve this problem if they don't need _alloca?
Linux relies on a heavily optimized page fault handling, so what happens is that the program just pushes things on the stack and the page fault handler will extend the stack on the fly.

Resources