How many stacks does a Windows program use?

Are return address and data mixed/stored in the same stack, or in 2 different stacks, which is the case?

They are mixed. However, it depends on the actual programming language / compiler. I can imagine a compiler allocating space for local variables on the heap and keeping a pointer to that storage on the stack.
There is one stack per thread in each process. Hence, for example, a process with 20 threads has 20 independent stacks.

As others have already pointed out, it's mostly a single, mixed stack. I'll just add one minor detail: reasonably recent processors also have a small cache of return addresses that's stored in the processor itself, and this stores only return addresses, not other data. It's mostly invisible outside of faster execution though...

It depends on the compiler, but the x86 architecture is geared towards a single stack, due to the way push and pop instructions work with a single stack pointer. The compiler would have to do more work maintaining more than one stack.

One more note: every thread in Win32 has its own stack. So when you say "Windows program", it depends on how many threads it has. (Of course, threads are created and exit during runtime.)
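To make the one-stack-per-thread point concrete, here is a small sketch (plain Win32 C, not taken from any of the answers above) that starts a few threads and has each print the address of one of its locals. The printed addresses land in separate per-thread stack regions:

#include <stdio.h>
#include <windows.h>

/* Each thread prints the address of one of its own locals.
   The addresses fall in distinct per-thread stack regions. */
static DWORD WINAPI ThreadProc(LPVOID param)
{
    int local = 0;
    (void)param;    /* unused */
    printf("thread %lu: local variable at %p\n",
           (unsigned long)GetCurrentThreadId(), (void *)&local);
    return 0;
}

int main(void)
{
    HANDLE threads[4];
    int i;

    for (i = 0; i < 4; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProc, NULL, 0, NULL);

    WaitForMultipleObjects(4, threads, TRUE, INFINITE);

    for (i = 0; i < 4; ++i)
        CloseHandle(threads[i]);
    return 0;
}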

Related

How to calculate total RSS of all thread stacks under Linux?

I have a heavily multi-threaded application under Linux consuming lots of memory, and I am trying to categorize its RSS. I found it particularly challenging to estimate the total RSS of all thread stacks in the program. I had the following ideas:
Idea 1: look into /proc/<pid>/smaps and consider the mappings used for stacks; there is information regarding the resident size of each mapping, but only the main thread's mapping is annotated as [stack]; the rest are indistinguishable from regular 8 MiB mappings (with the default stack size). Also, reading /proc/<pid>/smaps is pretty expensive as it produces contention on kernel-internal VMA data structures.
Idea 2: look into /proc/<tid>/status; there is a VmStk field which should describe the stack resident size, but it always shows the stack size of the main thread. It is pretty clear why: the main thread is the only one for which the kernel allocates the stack by itself, while the rest of the threads get their stacks from the pthreads code, which allocates them as regular memory mappings.
Idea 3: traverse threads from user space using pthreads facilities, retrieve the stack mapping address and stack size for each thread, and then find out how many pages are resident using mincore(2) (sketched below). As a possible optimization, we could skip calling mincore for sleeping threads and use a cached value for them. Unfortunately, I did not find any suitable way to iterate over pthread_t structures. Note that some of the threads come from libraries which I am not able to control, so maintaining any kind of thread registry by registering threads on startup is not possible.
Idea 4: use ptrace(2) to retrieve thread registers, retrieve stack pointers from them, then proceed with Idea 1. This way looks excessively hard and intrusive.
Can anybody suggest a more or less intended way to do this? Being non-portable is OK.
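A minimal sketch of the mincore(2) part of Idea 3, under the assumption that you do have a pthread_t for the thread in question (here each thread simply measures its own stack); the function names are mine, and pthread_getattr_np is a GNU extension:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count resident pages of the calling thread's own stack mapping,
   using pthread_getattr_np (GNU extension) plus mincore(2). */
static size_t own_stack_rss(void)
{
    pthread_attr_t attr;
    void *stack_addr;
    size_t stack_size, resident = 0;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, &stack_addr, &stack_size);
    pthread_attr_destroy(&attr);

    size_t pages = stack_size / page;
    unsigned char vec[pages];               /* one residency byte per page */
    if (mincore(stack_addr, pages * page, vec) == 0) {
        for (size_t i = 0; i < pages; ++i)
            if (vec[i] & 1)
                resident += page;
    }
    return resident;
}

static void *worker(void *arg)
{
    (void)arg;
    printf("worker stack RSS: %zu bytes\n", own_stack_rss());
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}

Compile with -pthread; the missing piece, as noted above, is enumerating threads you did not create yourself.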
Two more ideas I got after some extra research:
Idea 5: from man 5 proc on /proc/<pid>/maps:
There are additional helpful pseudo-paths:
[stack]
The initial process's (also known as the main thread's) stack.
[stack:<tid>] (since Linux 3.4)
A thread's stack (where the <tid> is a thread ID). It corresponds to the /proc/[pid]/task/[tid]/ path.
It looks intriguing, but it seems that this logic has been reverted because it was implemented inefficiently: https://lore.kernel.org/patchwork/patch/716239/. The man page seems obsolete (at least on my Ubuntu Disco 19.04).
Idea 6: This one may actually work. There is a /proc/<tid>/syscall file which may expose the thread's stack pointer for a blocked thread. Given that most of my threads are sleeping on I/O, this allows me to obtain their rsp values, which I can then match against /proc/<pid>/maps to find the correspondence between a thread and its stack mapping. After that I can implement Idea 3.
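A rough sketch of the Idea 6 approach, assuming the /proc/<tid>/syscall layout described in proc(5), where the last two fields of a blocked thread's entry are the stack pointer and program counter (pass a target PID as the first argument, or it defaults to self):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* For every thread of the target process, try to read its saved stack
   pointer from /proc/<pid>/task/<tid>/syscall. This works for blocked
   threads; running threads just report "running". Reading another
   process's files requires ptrace-level permissions. */
int main(int argc, char **argv)
{
    const char *pid = (argc > 1) ? argv[1] : "self";
    char taskdir[64];
    snprintf(taskdir, sizeof taskdir, "/proc/%s/task", pid);

    DIR *d = opendir(taskdir);
    struct dirent *e;

    while (d && (e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;

        char path[384], line[256];
        snprintf(path, sizeof path, "%s/%s/syscall", taskdir, e->d_name);

        FILE *f = fopen(path, "r");
        if (!f || !fgets(line, sizeof line, f)) {
            if (f) fclose(f);
            continue;
        }
        fclose(f);

        if (strncmp(line, "running", 7) == 0) {
            printf("tid %s: running, no saved sp\n", e->d_name);
            continue;
        }

        /* The last two whitespace-separated fields are sp and pc. */
        unsigned long sp = 0, pc = 0;
        char *tok, *prev = NULL, *prev2 = NULL;
        for (tok = strtok(line, " \n"); tok; tok = strtok(NULL, " \n")) {
            prev2 = prev;
            prev = tok;
        }
        if (prev2 && prev) {
            sscanf(prev2, "%lx", &sp);
            sscanf(prev, "%lx", &pc);
            printf("tid %s: sp=0x%lx pc=0x%lx\n", e->d_name, sp, pc);
        }
    }
    if (d)
        closedir(d);
    return 0;
}

Matching each printed sp against the ranges in /proc/<pid>/maps, and then running mincore over the matching mapping, is the remaining step.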

Some clarification on the TCB of an operating system

I'm a computer undergraduate taking operating systems course. For my assignment, I am required to implement a simple thread management system.
I'm in the process of creating a struct for a TCB. According to my lecture notes, what I could have in my TCB are:
registers,
program counter,
stack pointer,
thread ID and
process ID
Now according to my lecture notes, each thread should have its own stack. And my problem is this:
Just by storing the stack pointer, can I keep a unique stack per thread? If I did so, won't one thread's stack overwrite another thread's stack?
How can I prevent that? By limiting the stack for each thread? Please tell me how this is usually done in a normal operating system.
Please help. Thanks in advance.
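For a concrete starting point, a TCB along the lines of the lecture notes might look like the sketch below; the field names and the saved-register set are illustrative, not prescribed by any particular course or OS:

#include <stddef.h>
#include <stdint.h>

#define THREAD_STACK_SIZE (64 * 1024)   /* illustrative fixed per-thread stack size */

/* Saved CPU context; which registers you keep is architecture-specific. */
struct cpu_context {
    uintptr_t registers[8];     /* general-purpose registers */
    uintptr_t program_counter;
    uintptr_t stack_pointer;
};

/* One thread control block per thread. */
struct tcb {
    int                tid;         /* thread ID */
    int                pid;         /* owning process ID */
    struct cpu_context context;     /* saved while the thread is not running */
    void              *stack_base;  /* lowest address of this thread's own stack */
    size_t             stack_size;  /* so the bounds of each stack are known */
    struct tcb        *next;        /* e.g. for a ready queue */
};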
The OS may control stack growth by monitoring page faults from inaccessible pages located around the stack portion of the address space. This can help with detection of stack overflows by small amounts.
But if you move the stack pointer way outside the stack region of the address space and use it to access memory, you may step into the global variables or into the heap or the code or another thread's stack and corrupt whatever's there.
Threads run in the same address space for a reason: to share code and data with one another with minimal overhead, and their stacks usually aren't exempted from that sharing; every thread's stack is accessible to the other threads.
The OS is generally unable to prevent programs from overflowing or corrupting their stacks, or to help them recover from it. The OS simply doesn't and can't know how an arbitrary program works and what it's supposed to do, hence it can't know when things start going wrong and what to do about them. The only thing the OS can do is terminate a program that is doing something very wrong, like trying to access inaccessible resources (memory, system registers, etc.) or execute invalid or inaccessible instructions.
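As an illustration of the guard-page idea above, here is a hedged sketch (POSIX mmap/mprotect, arbitrary sizes) of allocating a per-thread stack with an inaccessible page at its low end, so that a modest overflow faults instead of silently corrupting whatever sits below:

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate a thread stack with one PROT_NONE guard page at its low end.
   A downward-growing stack that overruns by up to a page hits the guard
   page and faults (SIGSEGV) instead of scribbling over adjacent memory. */
static void *alloc_stack(size_t usable_size, void **stack_top)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t total = usable_size + page;                  /* + guard page */

    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Make the lowest page inaccessible. */
    mprotect(base, page, PROT_NONE);

    *stack_top = (char *)base + total;                  /* stacks grow downward */
    return base;                                        /* keep for munmap later */
}

int main(void)
{
    size_t usable = 64 * 1024;
    void *top = NULL;
    void *base = alloc_stack(usable, &top);
    if (!base)
        return 1;
    printf("stack region base %p, initial stack pointer %p\n", base, top);
    munmap(base, usable + (size_t)sysconf(_SC_PAGESIZE));
    return 0;
}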

Call stack management is machine dependent?

I think I understand the basics of stack memory, but I still do not fully understand what is responsible for the mechanism of managing the stack: is it the compiler or the CPU architecture? Is it programming-language dependent?
For example, I read that on ARM there is a tendency to reduce the use of the stack in function calls, so arguments to functions are usually passed through four registers. However, it seems to me that this could be implemented using general-purpose registers on other CPUs as well. How can the architecture impose this requirement?
Elsewhere I read that in FORTRAN 77 there is no use of the stack.
And there is the question of the stack growing upwards or downwards: who is responsible for that?
Overall, I wish to know: is it CPU dependent, and how is it imposed? Otherwise, what is responsible for these decisions?
Thanks.
It can't be imposed by the processor. Calling conventions are determined by the compiler, and most compilers will not break their language standard just to do this.
The growth direction of the stack is determined by the processor as long as the code uses instructions like push/pop. If it manipulates esp directly, it should follow the same convention, but it doesn't have to.
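If you want to see which way your platform's stack grows, a common probe looks like the following; note that comparing addresses of objects in different frames is not something the C standard defines, so this is strictly an implementation-specific experiment:

#include <stdio.h>

/* Compare the address of a local in the caller with one in the callee.
   Such comparisons are not defined by the C standard, but on typical
   implementations they reveal the growth direction. Compile without
   aggressive inlining (e.g. -O0) so the call keeps its own frame. */
static void callee(char *caller_local)
{
    char callee_local;
    if (&callee_local < caller_local)
        puts("stack grows downward (toward lower addresses)");
    else
        puts("stack grows upward (toward higher addresses)");
}

int main(void)
{
    char caller_local;
    callee(&caller_local);
    return 0;
}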

If malloc can fail, how come stack variable initialization can't (at least we don't check for that)?

Would the OS send a warning to the user before some threshold is reached, and would the application then actually crash if there is not enough memory to allocate the stack (local) variables of the current function?
Yes, you would get a Stack Overflow run-time error.
Side note: There is a popular web site named after this very error!
Stack allocation can fail and there's nothing you can do about it.
On a modern OS, a significant amount of memory will be committed for the stack to begin with (on Linux it seems to be 128k or so these days) and a (usually much larger, e.g. 8M on Linux, and usually configurable) range of virtual addresses will be reserved for stack growth. If you exceed the committed part, committing more memory could fail due to out-of-memory condition and your program will crash with SIGSEGV. If you exceed the reserved address range, your program will definitely fail, possibly catastrophically if it ends up overwriting other data just below the stack address range.
The solution is not to do insane things with the stack. Even the initial committed amount on Linux (128k) is more stack space than you should ever use. Don't use call recursion unless you have a logarithmic bound on the number of call levels, don't use gigantic automatic arrays or structures (including ones that might result from user-provided VLA dimensions), and you'll be just fine.
Note that there is no portable and no future-safe way to measure current stack usage and remaining availability, so you just have to be safe about it.
Edit: One guarantee you do have about stack allocations, at least on real-world systems (without the split-stack hack), is that stack space you've already verified you have won't magically disappear. For instance, if you once successfully call c() from b() from a() from main(), and they're not using any VLAs that could vary in size, a second repetition of this same call pattern in the same instance of your program won't fail. You can also find tools to perform static analysis on some programs (ones without fancy use of function pointers and/or recursion) that will determine the maximum amount of stack space ever consumed by your program, after which you could verify at program start that you can successfully use that much space before proceeding.
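You cannot portably measure remaining stack, but you can at least query the configured limits. A small POSIX sketch (getrlimit for the main thread's limit, pthread_attr_getstacksize for the default size handed to new threads); compile with -pthread:

#include <pthread.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Soft limit that applies to the main thread's stack. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
        printf("main-thread stack limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);

    /* Default stack size handed to threads created with default attributes. */
    pthread_attr_t attr;
    size_t size = 0;
    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &size);
    pthread_attr_destroy(&attr);
    printf("default pthread stack size: %zu bytes\n", size);

    return 0;
}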
Well... semantically speaking, there is no stack.
From the point of view of the language, automatic storage just works and dynamic storage may fail in well-determined ways (malloc returns NULL, new throws a std::bad_alloc).
Of course, implementations will usually bring up a stack to implement the automatic storage, and one that is limited in size at that. However this is an implementation detail, and need not be so.
For example, gcc -fsplit-stack lets you have a segmented stack that grows as you need it. This technique is quite recent for C or C++ AFAIK, but languages with continuations (and thousands or millions of them) like Haskell have this built in, and Go made a point of it too.
Still, at some point, the memory will get exhausted if you keep hammering at it. This is actually undefined behavior since the Standard does not attempt to deal with this, at all. In this case, typically, the OS will send a signal to the program which will shut off and the stack will not get unwound.
The process would get killed by the OS if it runs out of stack space.
The exact mechanics are OS-specific. For example, running out of stack space on Linux triggers a segfault.
While the operating system may not inform you that you're out of stack space, you can check this yourself with a bit of inline assembly:
unsigned long StackSpace()
{
    unsigned long retn = 0;
    /* 32-bit MSVC inline assembly; not available when targeting x64. */
    __asm
    {
        mov eax, FS:[0x08]   ; FS:[0x08] is the StackLimit field of the Thread Information Block
        sub eax, esp         ; difference between that limit and the current stack pointer
        mov retn, eax        ; store the result in the local variable
    }
    return retn;
}
You can determine the value at FS:[*] by referring to the Windows Thread Information Block.
Edit: Meant to subtract esp from eax, not ebx XD
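A less fragile way to get similar information on Windows, without hard-coding TIB offsets or using inline assembly (which 64-bit MSVC does not support), is to ask VirtualQuery about the stack region containing a local variable. This is a sketch of that idea, not a drop-in replacement for the function above:

#include <stdio.h>
#include <windows.h>

/* Approximate the stack space still available to the current thread:
   VirtualQuery on the address of a local gives the allocation base of
   the stack's reserved region; everything between that base and the
   local is stack not used yet (committed + reserved + guard page). */
static size_t ApproxStackSpaceRemaining(void)
{
    MEMORY_BASIC_INFORMATION mbi;
    char local;

    if (VirtualQuery(&local, &mbi, sizeof mbi) == 0)
        return 0;

    return (size_t)((char *)&local - (char *)mbi.AllocationBase);
}

int main(void)
{
    printf("approx. stack space remaining: %zu bytes\n",
           ApproxStackSpaceRemaining());
    return 0;
}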

Seeking articles on shared memory locking issues

I'm reviewing some code and feel suspicious of the technique being used.
In a linux environment, there are two processes that attach multiple shared memory segments. The first process periodically loads a new set of files to be shared, and writes the shared memory id (shmid) into a location in the "master" shared memory segment. The second process continually reads this "master" location and uses the shmid to attach the other shared segments.
On a multi-cpu host, it seems to me it might be implementation dependent as to what happens if one process tries to read the memory while it's being written by the other. But perhaps hardware-level bus locking prevents mangled bits on the wire? It wouldn't matter if the reading process got a very-soon-to-be-changed value, it would only matter if the read was corrupted to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat stuff hasn't led me to anything that's definitive in this area.
I suspect strongly it's not safe or sane, and what I'd really like is some pointers to articles that describe the problems in detail.
It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire". They will come out either as ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them. And there are NO guarantees on how fast they'll be written vs how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware-level bus locking does not happen unless you get it right. It can be harder than expected to make your compiler / library / OS / CPU get it right. Synchronization primitives are written to make sure it happens right.
Locking will make it safe, and it's not that hard to do. So just do it.
#unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is definitely platform (hardware, OS, library and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have 32 bits written out in one shot. Imagine a situation where the 32-bit word is not aligned on a word boundary. This unaligned access is acceptable on x86, and in that case the access is turned into a series of aligned accesses by the CPU.
An interrupt can occur between those operations. If a context switch happens in the middle, some of the bits are written, some aren't. Bang, you're dead.
Also, let's think about 16-bit CPUs or 64-bit CPUs, both of which are still popular and don't necessarily work the way you think.
So, actually, you can have a situation where "some other CPU core picks up a word-sized value half-written". You should write your code as if this type of thing is expected to happen whenever you are not using synchronization.
Now, there are ways to perform your writes so as to make sure that you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the type of thing that's best left to the library, compiler, OS, and hardware designers. Especially if you are interested in portability (which you should be, even if you never port your code).
The problem's actually worse than some of the people have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be the case - memory writes are only atomic for a single core - other cores may not see the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location b modified before location a is. Also, if you're writing a value that's larger than the native word size (32 bits on a 32-bit x86 processor), the writes are not atomic - so the high 32 bits of a 64-bit write will hit the bus at a different time from the low 32 bits of the write. This can complicate things immensely.
Use a memory barrier and you'll be ok.
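In C11 terms, the memory-barrier advice looks roughly like this; the variable names are made up, and in the cross-process case both variables would live inside the shared segment:

#include <stdatomic.h>
#include <stdio.h>

static int payload;             /* ordinary shared data */
static atomic_int published;    /* flag that publishes it */

static void writer(void)
{
    payload = 42;                                   /* write the data first */
    atomic_store_explicit(&published, 1,
                          memory_order_release);    /* then publish it      */
}

static void reader(void)
{
    if (atomic_load_explicit(&published, memory_order_acquire)) {
        /* The acquire load pairs with the release store above, so this
           read of payload is guaranteed to see the value 42. */
        printf("payload = %d\n", payload);
    }
}

int main(void)
{
    /* Single-threaded demo of the API; in real use writer() and reader()
       run in different threads or processes. */
    writer();
    reader();
    return 0;
}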
You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-PentiumPro Intel CPU. From what I just read, Intel made their later CPUs essentially ignore the LOCK prefix on machine code. Instead the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache-lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core2 system doesn't mean that your code is correct. If you want to check portability, try your code also on other SMP architectures like PPC (an older MacPro or a Cell blade) or an Itanium or an IBM Power or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.
Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html
I actually believe this should be completely safe (but it depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then it's probably okay), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or when the values being written cannot be written atomically. You will never get corrupted (half-written) values. Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also on ARM, PowerPC, and other common architectures).
Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practice.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you missed out the memory barrier. (In practice, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use a memory barrier.)
Also, it's not clear how you can safely free the shared memory region with this design.
With the latest kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable would mean you don't have to worry about arcane memory barrier issues.
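The process-shared mutex mentioned above is set up roughly as follows; this sketch uses an anonymous shared mapping and a made-up segment layout, but with SysV shm you would place the same struct at the start of the master segment (compile with -pthread):

#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

/* Layout of the "master" shared segment: a mutex plus the data it protects. */
struct master_segment {
    pthread_mutex_t lock;
    int             current_shmid;
};

int main(void)
{
    struct master_segment *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return 1;

    /* Done once, by the process that creates the segment. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&m->lock, &attr);
    pthread_mutexattr_destroy(&attr);

    /* Writer side: */
    pthread_mutex_lock(&m->lock);
    m->current_shmid = 1234;        /* hypothetical new shmid */
    pthread_mutex_unlock(&m->lock);

    /* Reader side (in the other process): */
    pthread_mutex_lock(&m->lock);
    printf("current shmid: %d\n", m->current_shmid);
    pthread_mutex_unlock(&m->lock);

    return 0;
}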
I can't believe you're asking this. NO it's not safe necessarily. At the very least, this will depend on whether the compiler produces code that will atomically set the shared memory location when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.
I agree that it might work - so it might be safe, but not sane.
The main question is whether this low-level sharing is really needed. I am not an expert on Linux, but I would consider using, for instance, a FIFO queue for the master shared memory segment, so that the OS does the locking work for you. Consumers/producers usually need queues for synchronization anyway.
Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page; particularly the section on "Coordinating access to resources". In particular, the Wikipedia discussion essentially describes a confidence failure; non-locked access to shared resources can, even for atomic resources, cause a misreporting / misrepresentation of the confidence that an action was done. Essentially, in the time period between checking to see whether or not it CAN modify the resource, the resource gets externally modified, and therefore, the confidence inherent in the conditional check is busted.
I don't believe anybody here has discussed how much of an impact lock contention can have over the bus, especially on bus-bandwidth-constrained systems.
Here is an article about this issue in some depth; it discusses some alternative scheduling algorithms which reduce the overall demand for exclusive access through the bus, which increases total throughput, in some cases by over 60% compared to a naive scheduler (when considering the cost of an explicit lock prefix instruction or implicit xchg cmpx..). The paper is not the most recent work and not much in the way of real code (dang academics), but it is worth the read and consideration for this problem.
More recent CPUs provide alternative operations beyond a simple lock.
Jeffr, from FreeBSD (author of many internal kernel components), discusses monitor and mwait, two instructions added with SSE3, which in a simple test case identified an improvement of 20%. He later postulates:
So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
...
In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB;
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H; pause
add ecx,-1
jne L1
ret
end
The Art of Assembly also does synchronization without the use of the lock prefix or xchg. I haven't read that book in a while and won't speak directly to its applicability in a user-land protected-mode SMP context, but it's worth a look.
Good luck!
If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky because multithreading can do more interleaving than signals can do.
If the shmid crosses cache lines (partly in one cache line and partly in another) then, while the writing CPU is writing, you may well find a reading CPU reading part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
Sounds like you need a Reader-Writer Lock : http://en.wikipedia.org/wiki/Readers-writer_lock.
The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism provides bare-bones tools for the user. All access control must be taken care of by the programmer. Locking and synchronization is being kindly provided by the kernel, this means the user have less worries about race conditions. Note that this model provides only a symmetric way of sharing data between processes. If a process wishes to notify another process that new data has been inserted to the shared memory, it will have to use signals, message queues, pipes, sockets, or other types of IPC.
From Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are synchronised with memory bus internally.
