In 32-bit Windows (at least with Microsoft compilers), exception handling is implemented using a stack of exception frames allocated dynamically on the call stack; the top of the exception stack is pointed to by a TIB entry. The runtime cost is a couple of PUSH/POP instructions per function that needs to handle exceptions, spilling the variables accessed by the exception handler onto the stack, and when handling an exception, a simple linked list walk.
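For concreteness, each registration record is essentially just two pointers pushed onto the stack (a simplified sketch of the layout; FS:[0], the first TIB field, points at the newest one):

// Simplified 32-bit SEH registration record (cf. EXCEPTION_REGISTRATION_RECORD
// in winnt.h). Entering a guarded region pushes one of these on the stack and
// points FS:[0] at it; leaving the region pops it again.
struct ExceptionRegistration
{
    ExceptionRegistration *next;    // previous record: the singly linked list
                                    // the dispatcher walks when an exception fires
    void                  *handler; // this frame's exception handler routine
};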
In both 64-bit Windows and the Itanium / System V x86-64 ABI, unwinding instead uses a big sorted list describing all the functions in memory. The runtime cost is some tables for every function (not just the ones involved in exception handling), complications for dynamically generated code, and, when handling an exception, walking the function list once for every active function, regardless of whether it has anything to do with exceptions or not.
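For comparison, the x64 Windows flavor of those tables is one small entry per function in the sorted .pdata section (sketched from the winnt.h declaration):

typedef unsigned long DWORD;     // as in <windows.h>

// One entry per function, emitted into the sorted .pdata section, which the
// unwinder binary-searches at throw time.
typedef struct _RUNTIME_FUNCTION {
    DWORD BeginAddress;   // RVA of the function's first byte
    DWORD EndAddress;     // RVA just past the function's last byte
    DWORD UnwindData;     // RVA of the UNWIND_INFO describing the prologue's effects
} RUNTIME_FUNCTION;
// JIT-style generated code has to register tables of its own, e.g. via
// RtlAddFunctionTable (the "complications for dynamically generated code").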
How is the latter better than the former? I understand why the Itanium model is cheaper in the common case than the traditional UNIX one based on setjmp/longjmp, but a couple of PUSHes and POPs plus some register spillage in 32-bit Windows doesn't seem that bad, given the (seemingly) much quicker and simpler handling it provides. (IIRC, Windows API calls routinely consume Ks of stack space anyway, so it's not like we gain anything by forcing this data out into tables.)
In addition to optimizing the happy case, perhaps there was also a concern that buffer overflow vulnerabilities could expose the exception-handling information on the stack to corruption. If this information gets corrupted, it could seriously confuse the unwinder, or maybe even cause further errors (remember that std::terminate() is called if another exception gets thrown).
Source: http://www.osronline.com/article.cfm%5earticle=469.htm
There are some languages that support deterministic lightweight concurrency in the form of coroutines:
Lua - coroutine
Stackless Python - tasklet
Ruby - fiber
There should be many more, but I don't currently know them well.
Anyway, as far as I know, this requires many separate stacks, so I want to know how these languages handle stack growth. I ask because I read some mention of Ruby Fibers, which come with a 4KB stack (obviously a big overhead), and this being advertised as a feature that prevents stack overflow. But I don't understand why they don't just make the stacks grow automatically. It doesn't make sense that the VM, which is not restricted to the C stack, can't handle stack growth, but I can't confirm this because I don't know the internals well.
How do they handle stack growth for these kinds of micro-threads? Are there any explicit/implicit limitations? Or is it all handled cleanly and automatically?
For Ruby:
As per this Google Tech Talk, the Ruby VM uses a slightly hacky system: it keeps a copy of the C stack for each fiber and copies that stack onto the main stack every time it switches between fibers. This means that Ruby still restricts each fiber to a 4KB stack, but the interpreter does not overflow if you switch between deeply nested fibers.
For Python:
Tasklets are only available in the Stackless variant. Each tasklet gets its own heap-based stack, since the Stackless Python VM uses heap-based stacks. This means that stack growth is inherently limited only by the size of the heap, so for 32-bit systems there is still an effective limit of 1-4 GB.
For Lua:
Lua uses a heap-based stack, and each coroutine gets its own stack in memory, so stack growth is inherently limited only by the size of the heap. This means that for 32-bit systems there is still an effective limit of 1-4 GB.
To add a couple more to your list: C# and VB.NET both now support async/await. This is a system that allows a program to perform a time-consuming operation and have the rest of the function continue afterwards. It is implemented by creating an object that represents the method, with a single method that advances it to its next step; this is called when you attempt to get a result and at various other internal points. The original method is replaced with one that creates the object. This means that the recursion depth is not affected, as the method is never more than a few steps further down the stack than you would expect.
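As a rough illustration of that transformation (a hand-written C++ sketch of the idea, not actual compiler output; all names are made up):

#include <iostream>

// Hypothetical hand-rolled version of what the compiler generates for an
// async method: locals become fields, and Step() resumes at the point
// recorded in `state` instead of growing the call stack.
struct DownloadAndPrint
{
    int state = 0;      // where to resume next time
    int result = 0;     // a "local variable" hoisted into the object

    // Returns true once the method has run to completion.
    bool Step()
    {
        switch (state)
        {
        case 0:
            std::cout << "start long operation\n";
            state = 1;      // "await": remember where to resume...
            return false;   // ...and give the stack back to the caller
        case 1:
            result = 42;    // the awaited operation has finished
            std::cout << "result: " << result << "\n";
            state = 2;
            return true;
        default:
            return true;
        }
    }
};

int main()
{
    DownloadAndPrint m;   // what the rewritten original method would create
    while (!m.Step()) { /* in real code the awaiter schedules the resumption */ }
}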
I see many articles suggesting not to map huge files with mmap, so that the virtual address space isn't consumed solely by the mapping.
How does that change with a 64-bit process, where the address space increases dramatically?
If I need to access a file randomly, is there a reason not to map the whole thing at once? (a file of dozens of GBs)
On 64-bit, go ahead and map the file.
One thing to consider, based on Linux experience: if the access is truly random and the file is much bigger than you can expect to cache in RAM (so the chances of hitting a page again are slim), then it can be worth specifying MADV_RANDOM to madvise, to stop once-hit file pages from steadily and pointlessly accumulating and swapping other, actually useful, stuff out. No idea what the Windows equivalent API is, though.
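A minimal Linux sketch of that advice (error handling kept terse; the file name comes from the command line):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);          // the big file to map
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    // Map the whole file read-only; on 64-bit, address space is plentiful.
    char *p = (char *)mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Tell the kernel the access pattern is random, so it doesn't read ahead
    // aggressively or let once-touched pages crowd out more useful ones.
    if (madvise(p, st.st_size, MADV_RANDOM) < 0)
        perror("madvise");

    /* ... random reads through p[offset] ... */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}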
There's a reason to think carefully about using memory-mapped files, even on a 64-bit platform (where virtual address space size is not an issue). It's related to (potential) error handling.
When reading the file "conventionally", any I/O error is reported by the appropriate function's return value. The rest of the error handling is up to you.
OTOH, if the error arises during implicit I/O (resulting from a page fault and the attempt to load the needed portion of the file into the appropriate memory page), the error-handling mechanism depends on the OS.
In Windows, the error handling is performed via SEH, the so-called "structured exception handling". The exception propagates to user mode (the application's code), where you have a chance to handle it properly. Proper handling requires you to compile with the appropriate exception-handling settings in the compiler (to guarantee the invocation of destructors, if applicable).
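A minimal sketch of what that looks like with MSVC's SEH keywords (assuming view points into a mapping created with MapViewOfFile; EXCEPTION_IN_PAGE_ERROR is the code raised when the implicit I/O fails):

#include <windows.h>
#include <stdio.h>

// Filter: handle only failed in-page I/O, let every other exception propagate.
static int InPageErrorFilter(DWORD code)
{
    return code == EXCEPTION_IN_PAGE_ERROR ? EXCEPTION_EXECUTE_HANDLER
                                           : EXCEPTION_CONTINUE_SEARCH;
}

// Reads one byte from a mapped view; returns FALSE if paging in the backing
// file failed (disk error, truncated file, vanished network share, ...).
BOOL ReadMappedByte(const char *view, size_t offset, char *out)
{
    __try
    {
        *out = view[offset];   // may fault and trigger implicit file I/O
        return TRUE;
    }
    __except (InPageErrorFilter(GetExceptionCode()))
    {
        fprintf(stderr, "I/O error while paging in the mapped file\n");
        return FALSE;
    }
}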
I don't know how the error handling is performed in Unix/Linux, though.
P.S. I'm not saying don't use memory-mapped files; I'm saying do so carefully.
One thing to be aware of is that memory mapping requires a big contiguous chunk of (virtual) address space when the mapping is created; on a 32-bit system this particularly sucks because, on a loaded system, getting a long run of contiguous address space is unlikely and the mapping will fail. On a 64-bit system this is much easier, as the upper bound of 64 bits is... huge.
If you are running code in controlled environments (e.g. 64-bit server environments that you are building yourself and know will run this code just fine), go ahead and map the entire file and just deal with it.
If you are trying to write general-purpose code that will be in software that could run on any number of configurations, you'll want to stick to a smaller, chunked mapping strategy. For example, mapping large files as collections of 1GB chunks, with an abstraction layer that takes operations like read(offset) and converts them to an offset in the right chunk before performing the op.
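A sketch of what such an abstraction layer could look like on POSIX, with made-up names, and with unmapping/cleanup omitted for brevity:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Hypothetical chunked wrapper: the file is mapped as 1GB pieces and
// read(offset) is routed to the right piece, so no single huge mapping
// is ever created.
class ChunkedFile
{
    static const size_t kChunk = (size_t)1 << 30;   // 1GB per mapping
    std::vector<char *> chunks_;
    size_t size_;
public:
    ChunkedFile() : size_(0) {}

    bool open(const char *path)
    {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return false;
        struct stat st;
        if (fstat(fd, &st) < 0) { ::close(fd); return false; }
        size_ = (size_t)st.st_size;
        for (size_t off = 0; off < size_; off += kChunk)
        {
            size_t len = size_ - off < kChunk ? size_ - off : kChunk;
            void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, (off_t)off);
            if (p == MAP_FAILED) { ::close(fd); return false; }
            chunks_.push_back((char *)p);
        }
        ::close(fd);   // the established mappings keep the file data reachable
        return true;
    }

    // read(offset): pick the chunk, then index within it.
    char read(size_t offset) const
    {
        return chunks_[offset / kChunk][offset % kChunk];
    }
};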
Hope that helps.
Would the OS warn the user before some threshold, and would the application then actually crash if there is not enough memory to allocate the stack (local) variables of the current function?
Yes, you would get a Stack Overflow run-time error.
Side note: There is a popular web site named after this very error!
Stack allocation can fail and there's nothing you can do about it.
On a modern OS, a significant amount of memory will be committed for the stack to begin with (on Linux it seems to be 128k or so these days) and a (usually much larger, e.g. 8M on Linux, and usually configurable) range of virtual addresses will be reserved for stack growth. If you exceed the committed part, committing more memory could fail due to out-of-memory condition and your program will crash with SIGSEGV. If you exceed the reserved address range, your program will definitely fail, possibly catastrophically if it ends up overwriting other data just below the stack address range.
The solution is not to do insane things with the stack. Even the initial committed amount on Linux (128k) is more stack space than you should ever use. Don't use call recursion unless you have a logarithmic bound on the number of call levels, don't use gigantic automatic arrays or structures (including ones that might result from user-provided VLA dimensions), and you'll be just fine.
Note that there is no portable and no future-safe way to measure current stack usage and remaining availability, so you just have to be safe about it.
Edit: One guarantee you do have about stack allocations, at least on real-world systems (without the split-stack hack), is that stack space you've already verified you have won't magically disappear. For instance, if you've once successfully called c() from b() from a() from main(), and they're not using any VLAs that could vary in size, a second repetition of this same call pattern in the same instance of your program won't fail. You can also find tools that perform static analysis on some programs (ones without fancy use of function pointers and/or recursion) to determine the maximum amount of stack space your program will ever consume, after which you could verify at program start that you can successfully use that much space before proceeding.
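A rough sketch of that start-up check, with kNeeded standing in for whatever bound the analysis produced (touching one byte per page forces the kernel to commit the space now, so failure happens at startup rather than at some arbitrary point later):

#include <stddef.h>

enum { kNeeded = 256 * 1024 };   // hypothetical bound from static analysis

// Touch one byte per page of a kNeeded-sized frame, forcing that much
// stack to be committed immediately.
static void probe_stack(void)
{
    volatile char probe[kNeeded];
    for (size_t i = 0; i < sizeof probe; i += 4096)
        probe[i] = 0;
}

int main(void)
{
    probe_stack();
    /* ... real work, now known to have kNeeded bytes of stack available ... */
    return 0;
}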
Well... semantically speaking, there is no stack.
From the point of view of the language, automatic storage just works and dynamic storage may fail in well-determined ways (malloc returns NULL, new throws a std::bad_alloc).
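Both failure channels side by side, in a trivial C++ example:

#include <cstdio>
#include <cstdlib>
#include <new>

int main()
{
    void *p = std::malloc(16);      // C-style: failure reported as NULL
    if (p == NULL)
        std::puts("malloc failed");
    std::free(p);

    try
    {
        int *q = new int[100];      // C++-style: failure reported by throwing
        delete[] q;
    }
    catch (const std::bad_alloc &)
    {
        std::puts("new failed");
    }
    return 0;
}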
Of course, implementations will usually bring up a stack to implement the automatic storage, and one that is limited in size at that. However this is an implementation detail, and need not be so.
For example, gcc's -fsplit-stack gives you a segmented stack that grows as you need. This technique is quite recent for C and C++ AFAIK, but languages with continuations (and thousands or millions of them) like Haskell have this built in, and Go made a point of it too.
Still, at some point, the memory will get exhausted if you keep hammering at it. This is actually undefined behavior, since the Standard does not attempt to deal with this at all. In this case, typically, the OS will send a signal to the program, which will be killed without the stack being unwound.
The process would get killed by the OS if it runs out of stack space.
The exact mechanics are OS-specific. For example, running out of stack space on Linux triggers a segfault.
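If you want the program to at least report the condition before dying, one common POSIX trick is to catch the SIGSEGV on an alternate signal stack (a minimal Linux sketch; the sizes are arbitrary):

#include <signal.h>
#include <string.h>
#include <unistd.h>

// The normal stack is unusable during a stack overflow, so the handler
// runs on this separate buffer instead (registered via sigaltstack).
static char altstack[64 * 1024];

static void on_segv(int sig)
{
    (void)sig;
    // Only async-signal-safe calls are allowed here.
    write(2, "fatal: stack overflow or bad access\n", 36);
    _exit(1);
}

int main(void)
{
    stack_t ss;
    ss.ss_sp = altstack;
    ss.ss_size = sizeof altstack;
    ss.ss_flags = 0;
    sigaltstack(&ss, NULL);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_segv;
    sa.sa_flags = SA_ONSTACK;   // run the handler on the alternate stack
    sigaction(SIGSEGV, &sa, NULL);

    /* ... code that might blow the stack ... */
    return 0;
}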
While the operating system may not inform you that you're out of stack space, you can check this yourself with a bit of inline assembly (32-bit Windows / MSVC):
unsigned long StackSpace()
{
    unsigned long retn = 0;
    unsigned long *rv = &retn;
    __asm
    {
        mov eax, esp        // current stack pointer
        sub eax, FS:[0x08]  // FS:[0x08] is the TIB StackLimit, the lowest committed stack address
        mov [rv], eax       // remaining committed stack space, in bytes
    }
    return retn;
}
You can determine the meaning of FS:[*] by referring to the layout of the Windows Thread Information Block (FS:[0x08] is StackLimit, the lowest address of the committed stack).
Edit: Meant to subtract FS:[0x08] from esp, not the other way around XD
Are return addresses and data mixed/stored in the same stack, or in two different stacks? Which is the case?
They are mixed. However, it depends on the actual programming language / compiler. I can imagine a compiler allocating space for local variables on the heap and keeping a pointer to that storage on the stack.
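A quick way to see the mixing, using the MSVC _AddressOfReturnAddress intrinsic (GCC/Clang have similar built-ins): a local variable and the slot holding the return address sit only a few bytes apart, in the same frame:

#include <stdio.h>
#include <intrin.h>   // _AddressOfReturnAddress (MSVC intrinsic)

__declspec(noinline) void show_frame(void)
{
    int local = 42;                                    // data...
    void **ret = (void **)_AddressOfReturnAddress();   // ...and the slot holding
                                                       // the return address
    printf("local at          %p\n", (void *)&local);
    printf("return addr slot  %p\n", (void *)ret);
}

int main(void)
{
    show_frame();
    return 0;
}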
There is one stack per thread in each process. Hence, for example, a process with 20 threads has 20 independent stacks.
As others have already pointed out, it's mostly a single, mixed stack. I'll just add one minor detail: reasonably recent processors also have a small cache of return addresses that's stored in the processor itself, and this stores only return addresses, not other data. It's mostly invisible outside of faster execution though...
It depends on the compiler, but the x86 architecture is geared towards a single stack, due to the way push and pop instructions work with a single stack pointer. The compiler would have to do more work maintaining more than one stack.
One more note: every thread in Win32 has its own stack. So when you say "Windows program", it depends on how many threads it has. (Of course, threads are created and exited during runtime.)
I have a process running under Windows Server 2003 SP2. When I want to check the stack trace of one of its threads, it is always limited to 9 entries. Those entries are resolved correctly (I have PDBs in place) but the list is just cut off in the middle.
Do you know of any limitation in Process Explorer?
I am assuming that you think the complete stack trace for this thread should have more than 9 entries. You don't mention whether it's a 32 bit OS or a 64 bit OS, but I will assume 32 bit and cover 64 bit as an afterthought.
Sometimes when collecting a stack trace on 32 bit systems you cannot collect any items for the stack trace or you can only collect a limited amount of stack frame information even though you know the callstack is deeper. The reasons for this are:
Different calling conventions put data in different places on the stack, making it hard to walk the stack. I can think of 4 definitions, 3 in common use, one more exotic: cdecl, fastcall, stdcall, naked.
For release builds, the code optimizer may do away with frame pointers using a technique known as Frame Pointer Omission (FPO). Without the FPO data (and sometimes even with the FPO data in a PDB file) you cannot successfully walk the callstack.
Hooks - any helper DLLs, anti-virus, debugging hooks, instrumented code, malware, etc. - may mess up the callstack at some point because they've inserted their own stub code on the callstack, and that small section may not be walkable by the stack walker.
Bytecode virtual machines. Depending upon how the virtual machine is written, the VM may place trampolines on the callstack to aid its execution. These will make the stack hard to walk successfully.
Because of the variety of calling conventions on 32 bit Windows (from both Microsoft and other vendors) it is hard to work out what to expect when you move from one frame to another.
For 64 bit systems there is one calling convention specified. That makes life a lot easier. That said, you still have the issues of helper DLLs and hooks doing their own thing with the stack and that may still cause you problems when walking the stack.
I doubt there is a limitation in Process Explorer. I think the issue is just that walking the callstack for that thread is problematic because of one of the reasons I've listed above.