How to calculate total RSS of all thread stacks under Linux? - linux-kernel

I have a heavily multi-threaded application under Linux that consumes a lot of memory, and I am trying to categorize its RSS. I found it particularly challenging to estimate the total RSS of all thread stacks in the program. I had the following ideas:
Idea 1: look into /proc/<pid>/smaps and consider the mappings for stacks; it reports the resident size of each mapping, but only the main thread's mapping is annotated as [stack]; the rest are indistinguishable from regular 8 MiB mappings (with the default stack size). Also, reading /proc/<pid>/smaps is pretty expensive, as it produces contention on the kernel's internal VMA data structures.
Idea 2: look into /proc/<tid>/status; there is a VmStk field which should describe the stack size, but it always shows the stack size of the main thread. It looks pretty clear why: the main thread is the only one for which the kernel allocates the stack itself, while the other threads get their stacks from pthreads code, which allocates them as regular memory mappings.
Idea 3: traverse the threads from user space using something from pthreads, retrieve the stack mapping address and stack size for each thread, and then find out how many pages are resident using mincore(2). As a possible optimization, we may skip calling mincore for sleeping threads and use a cached value for them. Unfortunately, I did not find any suitable way to iterate over pthread_t structures. Note that some of the threads come from libraries which I am not able to control, so maintaining any kind of thread registry by registering threads on startup is not possible.
Idea 4: use ptrace(2) to retrieve thread registers, retrieve the stack pointers from them, then proceed with Idea 1. This way looks excessively hard and intrusive.
Can anybody suggest a more or less intended way to do this? Being non-portable is OK.
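For what it's worth, the counting step of Idea 3 is small once a stack's base address and size are known. Below is a minimal sketch in Python via ctypes, assuming libc exposes mincore(2) with its usual signature. The names resident_pages and demo are made up for illustration, and the demo uses a small anonymous mapping as a stand-in for a thread stack, since finding the real stack ranges is exactly the unsolved part:

```python
import ctypes
import mmap
import os

PAGE = mmap.PAGESIZE

# Load the C library already linked into the interpreter (assumption: it provides mincore).
libc = ctypes.CDLL(None, use_errno=True)
libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                         ctypes.POINTER(ctypes.c_ubyte)]
libc.mincore.restype = ctypes.c_int


def resident_pages(addr, length):
    """Count resident pages in [addr, addr + length); addr must be page-aligned."""
    n_pages = (length + PAGE - 1) // PAGE
    vec = (ctypes.c_ubyte * n_pages)()
    if libc.mincore(addr, length, vec) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # Only the least significant bit of each byte says "resident".
    return sum(b & 1 for b in vec)


def demo():
    """Touch 3 of 16 pages in an anonymous mapping and count resident pages."""
    region = mmap.mmap(-1, 16 * PAGE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for i in (0, 5, 9):
        region[i * PAGE] = 1          # writing faults the page in
    buf = ctypes.c_char.from_buffer(region)
    try:
        return resident_pages(ctypes.addressof(buf), 16 * PAGE)
    finally:
        del buf                       # release the buffer export before closing
        region.close()
```

On a typical kernel demo() should report the touched pages only, but some sandboxes report every mapped page as resident, so treat the number as an upper bound on real usage.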
Two more ideas I got after some extra research:
Idea 5: from man 5 proc on /proc/<pid>/maps:
There are additional helpful pseudo-paths:
[stack]
The initial process's (also known as the main thread's) stack.
[stack:<tid>] (since Linux 3.4)
A thread's stack (where the <tid> is a thread ID). It corresponds to the /proc/[pid]/task/[tid]/ path.
It looks intriguing, but it seems that this logic has been reverted because it was implemented inefficiently: https://lore.kernel.org/patchwork/patch/716239/. The man page seems obsolete (at least on my Ubuntu Disco 19.04).
Idea 6: This one may actually work. There is a /proc/<tid>/syscall file which may expose the stack pointer register of a blocked thread. Considering that most of my threads are sleeping on I/O, this allows me to track their rsp values, which I can project onto /proc/<pid>/maps to find the correspondence between each thread and its stack mapping. After that I can implement Idea 3.
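A rough sketch of Idea 6 in Python, assuming the file layout described in man 5 proc: the syscall number, six argument registers, then the stack pointer and program counter for a thread blocked in a syscall; "-1 sp pc" for a thread blocked outside a syscall; and just "running" otherwise. All function names here are illustrative, and per-thread files are read through /proc/<pid>/task/<tid>/:

```python
import os


def parse_syscall_sp(text):
    """Return the stack pointer from a /proc/.../syscall line, or None if unavailable."""
    fields = text.split()
    if len(fields) < 3 or fields[0] == "running":
        return None
    # In both blocked layouts the stack pointer is the second-to-last field.
    return int(fields[-2], 16)


def parse_maps(text):
    """Parse /proc/<pid>/maps text into (start, end, perms, name) tuples."""
    mappings = []
    for line in text.splitlines():
        parts = line.split(None, 5)
        if len(parts) < 5:
            continue
        start_s, end_s = parts[0].split("-")
        name = parts[5].strip() if len(parts) > 5 else ""
        mappings.append((int(start_s, 16), int(end_s, 16), parts[1], name))
    return mappings


def find_mapping(mappings, addr):
    """Return the mapping containing addr, or None."""
    for m in mappings:
        if m[0] <= addr < m[1]:
            return m
    return None


def thread_stack_mappings(pid):
    """Map tid -> stack mapping for every thread whose stack pointer is readable."""
    with open(f"/proc/{pid}/maps") as f:
        mappings = parse_maps(f.read())
    result = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/syscall") as f:
                sp = parse_syscall_sp(f.read())
        except OSError:
            continue                  # thread exited or access denied
        if sp is not None:
            result[int(tid)] = find_mapping(mappings, sp)
    return result
```

Threads that happen to be running at the moment of the read are skipped, so a cached result per tid would be needed to cover them.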

Related

Some clarification on TCB of an operating system

I'm a computer science undergraduate taking an operating systems course. For my assignment, I am required to implement a simple thread management system.
I'm in the process of creating a struct for a TCB. According to my lecture notes, what I could have in my TCB are:
registers,
program counter,
stack pointer,
thread ID and
process ID
Now according to my lecture notes, each thread should have its own stack. And my problem is this:
Just by storing the stack pointer, can I keep a unique stack per thread? If I did so, won't one thread's stack overwrite another's?
How can I prevent that? Limit the stack for each thread? Please tell me how this is usually done in a normal operating system.
Please help. Thanks in advance.
The OS may control stack growth by monitoring page faults from inaccessible pages located around the stack portion of the address space. This can help with detection of stack overflows by small amounts.
But if you move the stack pointer way outside the stack region of the address space and use it to access memory, you may step into the global variables or into the heap or the code or another thread's stack and corrupt whatever's there.
Threads run in the same address space for a reason: to share code and data with one another with minimal overhead. Their stacks usually aren't exempt from this sharing, so they remain accessible to other threads.
The OS is generally unable to do anything about preventing programs from stack overflows and corruptions and helping them to recover from those. The OS simply doesn't and can't know how an arbitrary program works and what it's supposed to do, hence it can't know when things start going wrong and what to do about them. The only thing the OS can do is just terminate a program that's doing something very wrong like trying to access inaccessible resources (memory, system registers, etc) or execute invalid or inaccessible instructions.

Do coroutine stacks grow in Lua, Python, Ruby or any other languages?

There are some languages which support deterministic lightweight concurrency - coroutine.
Lua - coroutine
Stack-less Python - tasklet
Ruby - fiber
There should be many more, but currently I don't know much about them.
Anyway, as far as I know, this needs many separate stacks, so I want to know how these languages handle stack growth. This is because I read some mention of Ruby Fiber, which comes with 4KB - obviously a big overhead - and they are advertising this as a feature that prevents stack overflow. But I don't understand why they're just saying the stacks will grow automatically. It doesn't make sense that the VM - which is not restricted to the C stack - can't handle stack growth, but I can't confirm this because I don't know the internals well.
How do they handle stack growth for these kinds of micro-threads? Are there any explicit or implicit limitations? Or is it just handled cleanly and automatically?
For Ruby:
As per this Google Tech Talk, the Ruby VM uses a slightly hacky system that keeps a copy of the C stack for each thread and then copies that stack onto the main stack every time it switches between fibers. This means that Ruby still restricts each fiber to no more than a 4KB stack, but the interpreter does not overflow if you switch between deeply nested fibers.
For Python:
Tasklets are only available in the Stackless variant. Each thread gets its own heap-based stack, as the Stackless Python VM uses heap-based stacks. This means they are inherently limited only by the size of the heap in stack growth. This means that for 32-bit systems there is still an effective limit of 1-4 GB.
For Lua:
Lua uses a heap-based stack, so coroutines are inherently limited only by the size of the heap in stack growth. Each coroutine gets its own stack in memory. This means that for 32-bit systems there is still an effective limit of 1-4 GB.
To add a couple more to your list, C# and VB.Net both now support async/await. This is a system that allows the program to perform a time-consuming operation and have the rest of that function continue afterwards. It is implemented by creating an object that represents the method, with a single method that is called to advance to the next step; that method is called when you attempt to get a result and from various other internal locations. The original method is replaced with one that creates the object. This means that the recursion depth is not affected, as the method is never more than a few steps further down the stack than you would expect.
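The transformation described for async/await can be sketched by hand. The following Python class is a loose illustration of the idea only, not the code the C# compiler actually generates (which implements IAsyncStateMachine.MoveNext); the class name, its fields and the start_operation callback are all made up for this example. It represents a method that awaits one slow value and returns it doubled:

```python
class DoubleAfterAwait:
    """Hand-written state machine for: value = await start_operation(); return value * 2"""

    def __init__(self, start_operation):
        self._start = start_operation   # hypothetical callable that begins the slow work
        self.state = 0                  # 0 = not started, 1 = waiting, 2 = finished
        self.done = False
        self.result = None

    def move_next(self, awaited_value=None):
        """Advance to the next step; called once to start and once when the value arrives."""
        if self.state == 0:
            pending = self._start()     # kick off the slow operation
            self.state = 1
            return pending              # hand the pending operation back to the caller
        if self.state == 1:
            self.result = awaited_value * 2   # the code after the await
            self.state = 2
            self.done = True
            return None
        raise RuntimeError("state machine already finished")
```

Because the locals live in the object rather than on the call stack, each call to move_next returns to its caller before the next step runs, so the stack never gets deeper than one step of the method at a time.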

Default memory block for Unix/Linux threads?

Does anybody know how much memory is allocated by default to a thread created on a Unix/Linux operating system?
For Windows XP I found that it allocates a memory block of 1MB; is that correct?
Thanks in advance.
There's not going to be a single answer to that question.
In fact there's not even a single answer on Windows. Different executables specify different stack limits. And even within a single process, individual threads can have different stack limits.
And it gets even more complicated when you factor in the differences between .NET and native executables. Rather strangely, .NET executables commit the entire stack allocation for each thread as soon as the thread starts. On the other hand, native executables reserve the stack allocation and then commit memory on demand using guard pages.
You can see the stack size limit (measured in kbytes), which also serves as the default size of thread stacks, with ulimit -s.
Quoting from the pthread_create(3) manpage:
On Linux/x86-32, the default stack size for a new thread is 2 megabytes. Under the NPTL threading implementation, if the RLIMIT_STACK soft resource limit at the time the program started has any value other than "unlimited", then it determines the default stack size of new threads. Using pthread_attr_setstacksize(3), the stack size attribute can be explicitly set in the attr argument used to create a thread, in order to obtain a stack size other than the default.
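The same two knobs can be poked at from Python, as a small sketch: resource.getrlimit reads RLIMIT_STACK (the value ulimit -s reports), and threading.stack_size sets the size used for threads created afterwards. The helper names format_limit, stack_limits and run_with_stack_size are made up for this example:

```python
import resource
import threading


def format_limit(value):
    """Render an rlimit value in bytes the way ulimit -s would, in KiB."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def stack_limits():
    """Return the (soft, hard) RLIMIT_STACK limits as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return format_limit(soft), format_limit(hard)


def run_with_stack_size(size_bytes, fn):
    """Run fn in a new thread created with the given stack size, then restore the old setting."""
    previous = threading.stack_size(size_bytes)   # returns the previous setting
    try:
        result = []
        t = threading.Thread(target=lambda: result.append(fn()))
        t.start()                                 # the size is applied at thread creation
    finally:
        threading.stack_size(previous)
    t.join()
    return result[0]
```

For example, run_with_stack_size(512 * 1024, work) runs a hypothetical work function on a 512 KiB stack; the size must be at least 32 KiB, and some platforms require a multiple of the page size.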

Confusion regarding threads in Linux

I know that there is no special difference between a thread and a process in Linux, except that the cr3 register is left untouched during a thread switch, while a process switch changes it and flushes the TLB.
Since the threads in a group share the same address space and the pgd (page table) is not changed, the whole memory layout is shared, and hence the stack space is shared too. But by the general definition a thread owns its own stack, so how is this achieved in Linux?
If, say, threadA has its stack in the range x-y, then the first page fault occurs and the page table is updated; similarly threadB, which uses the range u-v, would update the same page table. Hence it is possible to mess up the stack of threadB from threadA.
I just want to get a clear picture of this; please help me out. Is this a safe implementation of threads?
That's correct, there is no OS-enforced protection of the stack memory between threads. One thread A can corrupt the stack of another thread B (if thread A knows where in memory to look).

How many stacks does a windows program use?

Are return address and data mixed/stored in the same stack, or in 2 different stacks, which is the case?
They are mixed. However, it depends on the actual programming language / compiler. I can imagine a compiler allocating space for local variables on the heap and keeping a pointer to the storage on the stack.
There is one stack per thread in each process. Hence, for example, a process with 20 threads has 20 independent stacks.
As others have already pointed out, it's mostly a single, mixed stack. I'll just add one minor detail: reasonably recent processors also have a small cache of return addresses that's stored in the processor itself, and this stores only return addresses, not other data. It's mostly invisible outside of faster execution though...
It depends on the compiler, but the x86 architecture is geared towards a single stack, due to the way push and pop instructions work with a single stack pointer. The compiler would have to do more work maintaining more than one stack.
One more note: every thread in Win32 has its own stack. So when you say "windows program", it depends on how many threads it has. (Of course, threads are created and exit during runtime.)
