How is stack overflow checked at runtime on Windows?

I am mainly thinking about Windows.
AFAIK on such platforms there are many stacks: each program, or maybe even each thread, has its own stack, and each of those threads can push bytes onto it. AFAIK every such push should be checked at runtime for stack overflow, so it seems there is some cost attached to each and every push (something like array bounds checking). How exactly is this checking implemented?
On old machines, as I remember, there was no checking at all; some fff simply became 000, so checking cost nothing. But today, on the Windows platform, it seems to me that every stack is probably bounds checked, yet I do not know how it is implemented.

I'm not aware of any fully-compiled language on Windows or Linux platforms that does call stack bounds checking by default. Thus, overflowing the available stack space leads to a segmentation fault as described in (for instance) the questions Segmentation fault due to recursion and What is the difference between a segmentation fault and a stack overflow?.
The benefit of not doing bounds checking, as observed in the question, is that the code runs more quickly. If one wanted to bounds check for some particular reason, one could insert the bounds checks for that specific case.
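As a rough illustration of what such a hand-inserted check could look like (a sketch only; stack_start, stack_ok, count_down and the 512 KB budget are all invented for this example, and nothing like this is generated by the compiler by default):
#include <cstddef>
#include <cstdlib>

static char *stack_start;                               // recorded near the bottom of the stack
static const std::ptrdiff_t kStackBudget = 512 * 1024;  // arbitrary budget for this sketch

static bool stack_ok() {
    char marker;                                        // lives near the current top of the stack
    return std::abs(stack_start - &marker) < kStackBudget;
}

static int count_down(int n) {
    if (!stack_ok())
        return -1;                                      // bail out instead of overflowing
    return n <= 0 ? 0 : count_down(n - 1);
}

int main() {
    char anchor;
    stack_start = &anchor;
    return count_down(10000000) == -1 ? 1 : 0;
}
The point is simply that the check is something you pay for explicitly, in the specific place you decide you need it.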

Related

Measuring Stack Space

Recently I interviewed with Intel, and I was asked this question
How would you measure the stack space without using the task manager, when an application is running on a computer? Write an algorithm if possible.
I have no idea how to approach the question. I have even searched the internet but have not found any related answer. So I am hoping the people of StackOverflow can help me understand the question and how to approach a solution.
First of all, I'd note that a great deal here relies on how C++ is most often implemented. What I'm going to describe will almost always work on any real implementation on an Intel x86, or anything remotely similar--but it'll probably break completely on something like an IBM mainframe or an early Cray (neither of which really supported a stack in hardware). For that matter, it's not really guaranteed to work on anything--but it will work about as expected on most current hardware (especially: just about anything with an Intel x86 processor).
A great deal here depends on whether we're trying to do this from inside the program, or from the outside (i.e., from another program).
Inside the Program
If we're doing it from inside the program, we can start main with something like this:
char *stack_bottom;

int main() {
    int a;
    stack_bottom = reinterpret_cast<char *>(&a);   // record an address near the bottom of the stack
    // ... the rest of the program ...
This gives us an address at the bottom of the stack (well, technically, not quite the bottom, but really close to it anyway).
Then at any given point that we want to know the current stack depth, we can call a function something like this:
// needs <cstddef> (ptrdiff_t) and <cstdlib> (std::abs)
ptrdiff_t stack_size() {
    char a;                                // sits at (roughly) the current top of the stack
    return std::abs(stack_bottom - &a);    // distance back to the recorded bottom
}
This creates a variable at the top of the stack, then subtracts the two addresses to get the size. The stack usually grows downward in memory, so the "top" of the stack will typically be at a lower address than the bottom. Since we only care about the magnitude of the difference, we take its absolute value (though we can avoid that if we know which direction the stack grows).
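Putting the two pieces together, a hypothetical end-to-end demo might look like this (the depth limit of 5 is arbitrary, just enough to show the measured size growing; %td is the printf conversion for ptrdiff_t):
#include <cstddef>
#include <cstdio>
#include <cstdlib>

char *stack_bottom;

ptrdiff_t stack_size() {
    char a;
    return std::abs(stack_bottom - &a);
}

void recurse(int depth) {
    std::printf("depth %d: ~%td bytes of stack in use\n", depth, stack_size());
    if (depth < 5)
        recurse(depth + 1);
}

int main() {
    int a;
    stack_bottom = reinterpret_cast<char *>(&a);
    recurse(0);
}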
Windows-specific
Since this seems to be targeting Windows, we can also ask the OS directly with VirtualQuery, which reports the bounds of the memory region the stack lives in, rather than inferring them from the addresses of local variables.
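For instance, a minimal Windows-only sketch of that idea might look like the following; it assumes the thread's stack is a single reservation (which is how Windows normally sets it up), and the numbers it prints are only approximate:
#include <windows.h>
#include <cstdio>

int main() {
    char local;                                   // something that lives on the stack
    MEMORY_BASIC_INFORMATION mbi;
    VirtualQuery(&local, &mbi, sizeof mbi);       // describe the region containing &local

    // AllocationBase is the low end of the whole stack reservation; the committed
    // region containing &local ends near the high end, where the stack started.
    char *region_top = static_cast<char *>(mbi.BaseAddress) + mbi.RegionSize;
    std::printf("stack reservation starts at %p\n", mbi.AllocationBase);
    std::printf("query point (current top)   %p\n", static_cast<void *>(&local));
    std::printf("roughly %zu bytes in use below the query point\n",
                static_cast<size_t>(region_top - &local));
    return 0;
}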
Outside the Program
From the "outside" of a program the basic idea is pretty much the same: find the bottom of the stack, find the top of the stack, subtract to find the size. But, chances are pretty good that we have to find the top and bottom a bit differently.
This tends to get into much more OS-specific territory though. For example, on Windows you'd probably use StackWalk64 to walk the stack and tell you the addresses of the top and bottom. On Linux, you'd typically use backtrace instead. On some other OS, chances are pretty good you'll use a function with some other name, producing data in some other format. But most will provide something on the same general order.
Once we have the addresses of the top and bottom of the stack, we're back to pretty much the same situation though: subtract the addresses of the top and bottom to find the size.
Using a Debugger
Another possibility would be to run the program under a debugger. Although we usually want to run the startup code before we do any debugging, in this case we can just load the program, and look at the ESP register to see where the stack is to start with. Then when we want to find the stack size we break, get the value in ESP and (of course) subtract to get the size.
Caveat
If you want to get technical, what I've shown in the first section won't be precisely the entire size of the stack. There's typically some startup routine that does things like retrieving environment variables and the command line, formatting them as expected, calling constructors of global objects, etc., that runs before main. So there's some small amount of space this won't capture. But what it misses is a small (and usually fixed) quantity, so usually we simply won't care about it. If we really do need to know the precise amount, we can (for one example) look at the source code of the startup routine to figure it out.

Best practices to determine stack usage in Ravenscar program

I am writing an Ada program using the Ravenscar subset (thus, I am aware of the number of running tasks at execution time). The code is compiled by gcc with the -fstack-check switch enabled. This should cause the program to raise a STORAGE_ERROR at runtime if any of my tasks exceeds its stack.
Ada allows you to set the upper limit for those (task-specific) stacks in the specification of the respective task, like so:
pragma Storage_Size (Some_Value);
Now I was wondering what options I have to determine Some_Value. What I have heard of so far:
Do wild guesses until no STORAGE_ERROR is raised anymore. This is more or less what the OP suggests here.
Feed the output of -fstack-usage in there.
Use some GNAT-specific extensions as outlined here (how does this technically differ from item #2?).
Get a stack analyzer like GNATstack and let it do the work for you.
If I understand this correctly, all of the above techniques are dynamic (i.e. they require the program to run in order to work). Are static approaches also conceivable? E.g. by restricting myself further through some of Ada's high-integrity options (such as No_Recursion; what else?).
Perhaps any of you can name some best practices to tackle this problem and/or extend/comment on my (surely incomplete) list.
Bonus question: What is the default size of a task's stack when the above pragma is not specified? GCC's docs only state this value depends on the runtime, without giving any concrete numbers.
You can generally check the stack space required by objects of individual types with the 'Size attribute (which counts in bits).
Once you have tabulated this (you may need to round it up to whole words/double words), you can add up how much stack space is used by each declarative region, and then walk through your calls to find the maximum stack usage.
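As a purely illustrative example (the numbers are made up): if a subprogram declares an Integer (32 bits, i.e. 4 bytes) and an array of 100 32-bit components (3,200 bits, i.e. 400 bytes), its declarative region accounts for roughly 404 bytes, plus per-call overhead for the return address and saved registers. If the deepest call chain from the task body adds, say, two more frames of 120 and 64 bytes, the worst case for that task is on the order of 404 + 120 + 64 = 588 bytes plus overhead, and Some_Value should be set comfortably above that.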

Can stack overflow happen without a recursive function call?

Usually when a program crashes due to stack overflow, it means there was a recursive call without a proper exit condition. But are there other ways to get a stack overflow?
If you allocate on the stack then yes, it can happen, depending on the language. For example, using the alloca function (a widespread but non-standard extension): it specifically says on the man page:
The allocation made may exceed the bounds of the stack, or even go further into other objects in memory, and alloca() cannot determine such an error.
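A large automatic variable can do the same thing with no recursion at all. A minimal sketch (the 8 MB figure is arbitrary, chosen to exceed the common 1 MB default thread stack on Windows):
#include <cstdio>

int main() {
    char big[8 * 1024 * 1024];          // one 8 MB stack frame
    for (unsigned i = 0; i < sizeof big; ++i)
        big[i] = static_cast<char>(i);  // touch every page so the array is really used
    std::printf("%d\n", big[123]);      // also keeps the optimizer from discarding it
    return 0;                           // most systems crash with a stack overflow long before here
}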

Getting Stack overflow with GNU CLisp (Windows)

I'm getting a "Program stack overflow RESET" message while running my program. So I added a counter to see how many times I'm recursively calling the main function in my program. It turns out to be around 30,000 times, and the data I'm stacking are lists of around 10 elements each, which I don't think is very much. My question is whether this amount of recursion and memory usage is common or not, or is it more likely that I'm doing something wrong? I checked the resource manager in Vista and found that memory only grew by about 1 MB for the lisp.exe process. And how do I adjust the stack overflow limit of CLisp?
http://clisp.cons.org/impnotes.html#faq-stack
Note that if you use tail calls and compile your function(s), there will be no limit at all.
1 MB seems to be the default stack size on Windows. I do not know whether it is possible to change it without relinking the program, but in any case I would recommend either converting the program to tail-recursive form and using the CLisp byte compiler, which will optimize the tail calls away, or just converting it to iterative form. While many Common Lisp compilers do implement tail call optimization, the standard does not require it, so unbounded recursion should not be used.

What is the purpose of the EBP frame pointer register?

I'm a beginner in assembly language and have noticed that the x86 code emitted by compilers usually keeps the frame pointer around even in release/optimized mode when it could use the EBP register for something else.
I understand why the frame pointer might make code easier to debug, and might be necessary if alloca() is called within a function. However, x86 has very few registers and using two of them to hold the location of the stack frame when one would suffice just doesn't make sense to me. Why is omitting the frame pointer considered a bad idea even in optimized/release builds?
The frame pointer is a reference pointer that lets a debugger know where a local variable or an argument is with a single constant offset. Although ESP's value changes over the course of execution, EBP remains the same within a function, making it possible to reach the same variable at the same offset (for example, the first parameter will always be at EBP+8, while ESP-relative offsets can change significantly, since you'll be pushing and popping things).
Why don't compilers throw away the frame pointer? Because with a frame pointer, the debugger can figure out where local variables and arguments are using the symbol table, since they are guaranteed to be at a constant offset from EBP. Otherwise there isn't an easy way to figure out where a local variable is at any given point in the code.
As Greg mentioned, it also helps stack unwinding for a debugger, since EBP provides a reverse linked list of stack frames, letting the debugger figure out the size of the stack frame (local variables + arguments) of the function.
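To make that "reverse linked list" concrete, here is a rough sketch (GCC/Clang-specific via __builtin_frame_address, and it only works when the build keeps frame pointers): each saved frame-pointer slot points at the caller's saved frame pointer, with the return address stored just above it.
#include <cstdio>

void dump_frames() {
    // __builtin_frame_address(0) yields the current frame pointer (EBP/RBP).
    void **fp = static_cast<void **>(__builtin_frame_address(0));
    for (int depth = 0; fp != nullptr && depth < 16; ++depth) {
        std::printf("frame %2d at %p, return address %p\n",
                    depth, static_cast<void *>(fp), fp[1]);
        void **caller = static_cast<void **>(fp[0]);  // the caller's saved frame pointer
        if (caller <= fp)                             // sanity check: callers live at higher addresses
            break;
        fp = caller;
    }
}

int main() {
    dump_frames();
    return 0;
}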
Most compilers provide an option to omit frame pointers although it makes debugging really hard. That option should never be used globally, even in release code. You don't know when you'll need to debug a user's crash.
Just adding my two cents to already good answers.
It's part of a good language architecture to have a chain of stack frames. The BP points to the current frame, where subroutine-local variables are stored. (Locals are at negative offsets, and arguments are at positive offsets.)
The idea that it is preventing a perfectly good register from being used in optimization raises the question: when and where is optimization actually worthwhile?
Optimization is only worthwhile in tight loops that 1) do not call functions, 2) where the program counter spends a significant fraction of its time, and 3) in code the compiler actually will ever see (i.e. non-library functions). This is usually a very small fraction of the overall code, especially in large systems.
Other code can be twisted and squeezed to get rid of cycles, and it simply won't matter, because the program counter is practically never there.
I know you didn't ask this, but in my experience, 99% of performance problems have nothing at all to do with compiler optimization. They have everything to do with over-design.
It depends on the compiler, certainly. I've seen optimized code emitted by x86 compilers that freely uses the EBP register as a general purpose register. (I don't recall which compiler I noticed that with, though.)
Compilers may also choose to maintain the EBP register to assist with stack unwinding during exception handling, but again this depends on the precise compiler implementation.
However, x86 has very few registers
This is true only in the sense that opcodes can only address 8 registers. The processor itself will actually have many more registers than that and use register renaming, pipelining, speculative execution, and other processor buzzwords to get around that limit. Wikipedia has a good introductory paragraph as to what an x86 processor can do to overcome the register limit: http://en.wikipedia.org/wiki/X86#Current_implementations.
Using stack frames has become incredibly cheap on any hardware that is even remotely modern. If you have cheap stack frames, then saving a couple of registers isn't as important. I'm sure fast stack frames vs. more registers was an engineering trade-off, and fast stack frames won.
How much are you saving going pure register? Is it worth it?
