Measuring Stack Space - algorithm

Recently I interviewed with Intel, and I was asked this question: "How would you measure the stack space, without using the task manager, when an application is running on a computer? Write an algorithm if possible."
I have no idea how to approach the question. I have even searched the internet but have not found any sort of related answer. So I am looking for help from the people of StackOverflow to help me understand the question and how to approach the solution.

First of all, I'd note that a great deal here relies on how C++ is most often implemented. What I'm going to describe will almost always work on any real implementation on an Intel x86, or anything remotely similar--but it'll probably break completely on something like an IBM mainframe or an early Cray (neither of which really supported a stack in hardware). For that matter, it's not really guaranteed to work on anything--but it will work about as expected on most current hardware (especially: just about anything with an Intel x86 processor).
A great deal here depends on whether we're trying to do this from inside the program, or from the outside (i.e., from another program).
Inside the Program
If we're doing it from inside the program, we can start main with something like this:
char *stack_bottom;

int main() {
    int a;
    stack_bottom = reinterpret_cast<char *>(&a);
    // ... rest of the program ...
}
This gives us an address at the bottom of the stack (well, technically, not quite the bottom, but really close to it anyway).
Then at any given point that we want to know the current stack depth, we can call a function something like this:
#include <cstddef>   // ptrdiff_t
#include <cstdlib>   // std::abs

ptrdiff_t stack_size() {
    char a;
    return std::abs(stack_bottom - &a);
}
This creates a variable at the top of the stack, then subtracts the two addresses to get the size. The stack usually grows downward in memory, so the "top" of the stack will typically be at a lower address than the bottom. Since we only care about the magnitude of the difference, we take the absolute value (though we can avoid that if we know which direction the stack grows).
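A quick way to see it in action, assuming the two snippets above are in the same translation unit (this driver is hypothetical, and an optimizer may inline or flatten the recursion, so build without optimization to watch the numbers grow):
#include <cstdio>

void recurse(int depth) {
    if (depth == 0) return;
    // %td is the printf conversion for ptrdiff_t
    printf("depth %d: ~%td bytes of stack in use\n", depth, stack_size());
    recurse(depth - 1);
}

// e.g. call recurse(5) from main, after stack_bottom has been set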
Windows-specific
Since this seems to be targeting Windows, we can use VirtualQuery to get similar information about stack memory.
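A minimal sketch of that idea, assuming what we want is the size of the committed stack region containing a local variable (my own illustration, not a guaranteed contract):
#include <windows.h>
#include <cstdio>

size_t committed_stack_bytes() {
    MEMORY_BASIC_INFORMATION mbi = {};
    char probe;                        // lives somewhere in the committed stack
    VirtualQuery(&probe, &mbi, sizeof mbi);
    // The committed region containing `probe` starts at mbi.BaseAddress and
    // spans mbi.RegionSize bytes; the stack grows down toward mbi.AllocationBase.
    return mbi.RegionSize;
}

int main() {
    printf("committed stack region: %zu bytes\n", committed_stack_bytes());
}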
Outside the Program
From the "outside" of a program the basic idea is pretty much the same: find the bottom of the stack, find the top of the stack, subtract to find the size. But, chances are pretty good that we have to find the top and bottom a bit differently.
This tends to get into much more OS-specific territory though. For example, on Windows you'd probably use StackWalk64 to walk the stack and tell you the addresses of the top and bottom. On Linux, you'd typically use backtrace instead. On some other OS, chances are pretty good you'll use a function with some other name, producing data in some other format. But most will provide something on the same general order.
Once we have the addresses of the top and bottom of the stack, we're back to pretty much the same situation though: subtract the addresses of the top and bottom to find the size.
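As one concrete outside-the-process illustration, here is a Linux-specific sketch of my own (not one of the APIs named above): read /proc/<pid>/maps, find the main thread's [stack] mapping, and subtract its bounds. This measures how far the kernel has grown the stack region rather than the exact live depth:
#include <cstdio>
#include <cstring>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    char path[64];
    snprintf(path, sizeof path, "/proc/%s/maps", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); return 1; }
    char line[512];
    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "[stack]")) {          // main thread's stack mapping
            unsigned long lo = 0, hi = 0;
            sscanf(line, "%lx-%lx", &lo, &hi);
            printf("stack mapping: %lu bytes\n", hi - lo);
        }
    }
    fclose(f);
    return 0;
}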
Using a Debugger
Another possibility would be to run the program under a debugger. Although we usually want to run the startup code before we do any debugging, in this case we can just load the program, and look at the ESP register to see where the stack is to start with. Then when we want to find the stack size we break, get the value in ESP and (of course) subtract to get the size.
Caveat
If you want to get technical, what I've shown first won't capture precisely the entire size of the stack. There's typically some startup routine that does things like retrieving environment variables and the command line, formatting them as expected, calling constructors of global objects, etc., that runs before main. So there's some small amount of space this won't capture. But what it misses is a small (and usually fixed) quantity, so usually we simply won't care about it. But if we really need to know the precise amount, we can (for one example) look at the source code of the startup routine to figure it out.

Related

Why are results of CRT and VS memory profiling so different?

Traditionally I have used the CRT memory reporting functions like this:
#include <crtdbg.h>  // debug-CRT heap reporting functions (Debug builds only)

_CrtMemState state[3];
_CrtMemCheckpoint(&state[0]);                                   // snapshot before
foo();
_CrtMemCheckpoint(&state[1]);                                   // snapshot after
int const diff = _CrtMemDifference(&state[2], &state[0], &state[1]);
_CrtMemDumpStatistics(&state[2]);                               // dump the delta
More recently I have used Visual Studio's builtin heap profiling tool with snapshots. Create first snapshot before foo(), second snapshot after foo(), then look at the diff output.
Now I used both at the same time and compared the results. I expected both results to be pretty much the same, if not exactly the same. But this is not the case. Memory sizes vary wildly. The only thing they share is the number of allocations. I don't know what to make of this. How should I interpret those results? What causes the difference? Whom should I trust?
Note that the CRT results are the same independently of whether heap profiling is enabled or not.
So it seems that memory profiling is neither an exact science nor a very popular one.
Short answer to my own question: Just pick any type of measurement, then stick with it and compare the abstract numbers for every measurement you take. But never start wondering what those numbers could mean.
FYI, here's what I explored:
Snapshot diffing by _CrtMemCheckpoint / _CrtMemDifference.
In addition, _CrtSetAllocHook to track total allocations during the task, because _CrtMemState's lHighWaterCount (peak since app start, not since the last snapshot) and lTotalCount (overflows, sometimes weird glitches) are not reliable. Sadly, _CrtSetAllocHook does not enable you to match allocs with de-allocs (a sketch of such a hook appears below).
GetProcessHeap / HeapSummary to inspect the default process heap.
GetProcessMemoryInfo yields similar numbers as HeapSummary, but of course not the same. Occasionally, there's even a large gap between the two. Apparently GetProcessMemoryInfo also provides the values you see in Windows' Task Manager.
In the end, I used _CrtMemCheckpoint, _CrtMemDifference and _CrtSetAllocHook in debug, because I felt that I could interpret those numbers. However, these functions are not available in release, so I used GetProcessMemoryInfo there. I had no clue how to interpret those numbers, but every time they went down because of my optimizations, they gave me a happy face.
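For reference, here is a minimal sketch of the kind of allocation hook described above. The hook signature is the documented one for _CrtSetAllocHook; the byte counter and the choice to ignore frees are my own simplifications, and it only works with the debug CRT:
#include <crtdbg.h>
#include <cstddef>

static size_t g_bytes_requested = 0;

static int CountingAllocHook(int allocType, void * /*userData*/, size_t size,
                             int /*blockType*/, long /*requestNumber*/,
                             const unsigned char * /*filename*/, int /*lineNumber*/)
{
    if (allocType == _HOOK_ALLOC || allocType == _HOOK_REALLOC)
        g_bytes_requested += size;
    return 1;   // nonzero: let the allocation proceed
}

// Usage:
//   _CRT_ALLOC_HOOK prev = _CrtSetAllocHook(CountingAllocHook);
//   foo();
//   _CrtSetAllocHook(prev);
//   // g_bytes_requested now holds the bytes requested while foo() ran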

How can I subdivide a function when using Instruments Time Profiler

I've got a relatively long function that's dominant in the Instruments Time Profiler. Is there a way to add additional symbols to this function so the sampling will show time allocated to different parts of the function? I'm looking for something like the MARK macro that existed for prof(1) years ago.
Using the macro:
#define MARK(K) asm("M."#K":");
has been working well for me. This is really just a simplification of the old MARK macro I mentioned in my original question. Placing MARK(LOOP1); somewhere in a function will add a new symbol, M.LOOP1, that will show up in the list of functions shown by Shark or Instruments.
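For instance, a hypothetical use (the function, data, and label names below are made up; the idea is that samples taken past a mark are attributed to that mark's symbol instead of the enclosing function):
// Hypothetical example using the MARK macro defined above.
void process(int *data, int n, int *out_sum)
{
    MARK(SETUP);
    int sum = 0;

    MARK(LOOP1);                      // samples inside the loop show up under M.LOOP1
    for (int i = 0; i < n; ++i)
        sum += data[i];

    *out_sum = sum;
}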
I discovered recently that in the Time Profiler in Instruments, if you double-click on a method, it'll show you your source code with the percentage of time spent on each line.
http://douglasheriot.com/blog/2011/04/xcode-4-instruments-awesomeness/
I've found it very useful, but I'm not sure if that's what you're asking for.
I'm told that Shark can do this, so Instruments should too, but you have to tell it what to do: sample on wall-clock time (not just CPU time), sample the function call stack (not just the program counter, PC), and have it report the lines of code (not just the functions) that appear on a good percentage of stack samples.
A stack sample includes the PC and every call instruction leading to where the PC is.
Every instruction on the stack is jointly responsible for that slice of time being spent.
So any line of code responsible for X% of the time will be on the stack X% of the time. If it's big enough to be worth looking at, you will see it on the samples. You may get a lot of samples, but you don't actually need a lot. This is because it's more important to locate the problem than to measure it with much precision.
If your biggest problem, when fixed, would save you 5%, it will appear on about 5% or more of samples. If it's any smaller than that, your code's pretty optimal. Chances are it's a lot bigger than that, so you won't have any trouble seeing precisely where it is.
Added: An example of a profiler that does wall-time stack sampling and shows percent-by-line is Zoom, so I suggest you watch that video. Then, try to get Instruments to do the same thing.
UPDATE:
I have updated the code and created a separate project:
https://github.com/nielsbot/Profiler
I have some code you might find useful that can do this here: https://gist.github.com/952456 HTH
You can profile sections of your function using this code like this:
- (void)myMethod
{
    ProfilerEnter( __PRETTY_FUNCTION__ );

    // ... code ...
    {
        ProfilerEnter( "operation x" );
        // your code here
        // ...
        ProfilerExit( "operation x" );
    }

    ProfilerExit( __PRETTY_FUNCTION__ );
}

Can this kernel function be more readable? (Ideas needed for an academic research!)

Following my previous question regarding the rationale behind extremely long functions, I would like to present a specific question regarding a piece of code I am studying for my research. It's a function from the Linux kernel which is quite long (412 lines) and complicated (an MCC index, i.e. McCabe cyclomatic complexity, of 133). Basically, it's one long, nested switch statement.
Frankly, I can't think of any way to improve this mess. A dispatch table seems both huge and inefficient, and any subroutine call would require an inconceivable number of arguments in order to cover a large-enough segment of code.
Do you think of any way this function can be rewritten in a more readable way, without losing efficiency? If not, does the code seem readable to you?
Needless to say, any answer that will appear in my research will be given full credit - both here and in the submitted paper.
Link to the function in an online source browser
I don't think that function is a mess. I've had to write such a mess before.
That function is the translation into code of a table from a microprocessor manufacturer. It's very low-level stuff, copying the appropriate hardware registers for the particular interrupt or error reason. In this kind of code, you often can't touch registers which have not been filled in by the hardware - that can cause bus errors. This prevents the use of code that is more general (like copying all registers).
I did see what appeared to be some code duplication. However, at this level (operating at interrupt level), speed is more important. I wouldn't use Extract Method on the common code unless I knew that the extracted method would be inlined.
BTW, while you're in there (the kernel), be sure to capture the change history of this code. I have a suspicion that you'll find there have not been very many changes in here, since it's tied to hardware. The nature of the changes over time of this sort of code will be quite different from the nature of the changes experienced by most user-mode code.
This is the sort of thing that will change, for instance, when a new consolidated IO chip is implemented. In that case, the change is likely to be copy and paste and change the new copy, rather than to modify the existing code to accommodate the changed registers.
Utterly horrible, IMHO. The obvious first-order fix is to make each case in the switch a call to a function. And before anyone starts mumbling about efficiency, let me just say one word - "inlining".
Edit: Is this code part of the Linux FPU emulator by any chance? If so this is very old code that was a hack to get linux to work on Intel chips like the 386 which didn't have an FPU. If it is, it's probably not a suitable study for academics, except for historians!
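As a toy sketch of that "one small function per case, rely on inlining" suggestion (the struct, exception codes, and handlers below are invented for illustration, not taken from the kernel function in question):
#include <cstdio>

struct cpu_state { int dummy_reg; };

enum excp_code { EXCP_DIVIDE_ERROR, EXCP_INVALID_OPCODE };

// Each former `case` body becomes a small, named, static inline function;
// the compiler can fold the calls back in, so there is no call overhead.
static inline int handle_divide_error(cpu_state *st)
{
    st->dummy_reg = 0;      // pretend to fix up a register
    return 0;
}

static inline int handle_invalid_opcode(cpu_state *st)
{
    st->dummy_reg = -1;
    return 0;
}

static int dispatch_exception(cpu_state *st, excp_code code)
{
    switch (code) {
    case EXCP_DIVIDE_ERROR:   return handle_divide_error(st);
    case EXCP_INVALID_OPCODE: return handle_invalid_opcode(st);
    default:                  return -1;
    }
}

int main()
{
    cpu_state st = {0};
    printf("%d\n", dispatch_exception(&st, EXCP_INVALID_OPCODE));
}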
There's a kind of regularity here; I suspect that for a domain expert this actually feels very coherent.
Also, having the variations in close proximity allows immediate visual inspection.
I don't see a need to refactor this code.
I'd start by defining constants for the various classes. Coming into this code cold, it's a mystery what the switching is for; if the switching was against named constants, I'd have a starting point.
Update: You can get rid of about 70 lines where the cases return MAJOR_0C_EXCP; simply let them fall through to the end of the routine. Since this is kernel code I'll mention that there might be some performance issues with that, particularly if the case order has already been optimized, but it would at least reduce the amount of code you need to deal with.
I don't know much about kernels or about how re-factoring them might work.
The main thing that comes to my mind is taking that switch statement and breaking each sub-step into a separate function with a name that describes what the section is doing. Basically, more descriptive names.
But I don't think this optimizes the function any further. It just breaks it into smaller functions, which might be helpful... I don't know.
That is my 2 cents.

What's a good profiling tool to use when source code isn't available?

I have a big problem. My boss said to me that he wants two "magic black boxes":
1- something that takes a microprocessor as input and returns, as output, its MIPS and/or MFLOPS.
2- something that takes C code as input and returns, as output, something that characterizes the code in terms of performance (something like the MIPS a microprocessor must have to execute the code within a given time).
For the first "black box" I think I could use a benchmark from EEMBC or SPEC: different microprocessors, same benchmark, which gives me the MIPS/MFLOPS of each one. So the first problem is OK (I hope).
But the second... the second black box is my nightmare. The only thing I have found is to use a profiling tool, but I'm asking about a rather particular kind of profiling tool.
Is there a profiling tool that can take simple C code as input and give me, as output, the performance characteristics of my C code (or the number of times each assembly instruction is executed)?
The real problem is that we must choose the correct microprocessor for a certain piece of C code. We want a microprocessor tailored to our C code, so we need to know the MIPS (and the architectural structure of the microprocessor, memory structure, ...) that our code needs.
Thanks to everyone.
I have to agree with Adam, though I would be a little more gracious about it. Compiler optimizations only matter in hotspot code, i.e. tight loops that a) don't call functions, and b) take a large percentage of time.
On a positive note, here's what I would suggest:
Run the C code on a processor, any processor. On that processor, find out what takes the most time.
You could use a profiler for this. The simple method I prefer is to just run it under a debugger and manually halt it some number of times (like 10), and each time write down the call stack. Suppose there is something in the code taking a good percentage of the time, like 50%. If so, you will see it doing that thing on roughly that percentage of samples, so you won't have to guess what it is.
If that activity is something that would be helped by some special processor, then try that processor.
It is important not to guess. If you say "I think this needs a DSP chip", or "I think it needs a multi-core chip", that is a guess. The guess might be right, but probably not. It is probably the case that what takes the most time is something you never would guess, like memory management or I/O formatting. Performance issues are very good at hiding from you.
No. If someone made a tool that could analyse (non-trivial) source code and tell you its performance characteristics, it would be commonplace, i.e. everyone would be using it.
Until source code is compiled for a particular target architecture, you will not be able to determine its overall performance. For instance, a parallelising compiler targeting n processors might conceivably be able to change an O(n^2) algorithm to one of O(n).
You won't find a tool to do what you want.
Your only option is to cross-compile the code and profile it on an emulator for the architecture you're targeting. The problem with profiling high-level code is that the compiler makes a stack of non-trivial optimizations, and you'd need to know how the particular compiler did that.
It sounds dumb, but why do you want to fit your code to a uP and a uP to your code? If you're writing signal processing buy a DSP. If you're building a SCADA box then look into Atmel or ARM stuff. Are you building a general purpose appliance with a user interface? Look into PPC or X86 compatible stuff.
Simply put, choose a bloody architecture that's suitable and provides the features you need. Optimizing before choosing the processor is premature (very roughly paraphrasing Knuth).
Fix the architecture at something roughly appropriate, work out roughly the processing requirements (you can scratch up an estimate by hand which will always be too high when looking at C code) and buy a uP to match.

What is the purpose of the EBP frame pointer register?

I'm a beginner in assembly language and have noticed that the x86 code emitted by compilers usually keeps the frame pointer around even in release/optimized mode when it could use the EBP register for something else.
I understand why the frame pointer might make code easier to debug, and might be necessary if alloca() is called within a function. However, x86 has very few registers and using two of them to hold the location of the stack frame when one would suffice just doesn't make sense to me. Why is omitting the frame pointer considered a bad idea even in optimized/release builds?
The frame pointer is a reference pointer that allows a debugger to know where a local variable or an argument is with a single constant offset. Although ESP's value changes over the course of execution, EBP remains the same, making it possible to reach the same variable at the same offset (for example, the first parameter will always be at EBP+8, while ESP offsets can change significantly since you'll be pushing/popping things).
Why don't compilers throw away the frame pointer? Because with the frame pointer, the debugger can figure out where local variables and arguments are using the symbol table, since they are guaranteed to be at a constant offset from EBP. Otherwise there isn't an easy way to figure out where a local variable is at any given point in the code.
As Greg mentioned, it also helps stack unwinding for a debugger, since EBP provides a reverse linked list of stack frames, letting the debugger figure out the size of a function's stack frame (local variables + arguments).
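For concreteness, here is roughly what a conventional EBP-based frame looks like, shown as comments around a small C++ function (a sketch of the usual 32-bit cdecl layout; the exact instructions any particular compiler emits will differ):
// A plain function...
int add_one(int x) {
    int local = x + 1;
    return local;
}
// ...conventionally compiles (32-bit x86, frame pointer kept) to something like:
//
//   push ebp            ; save the caller's frame pointer (the frame-chain link)
//   mov  ebp, esp       ; EBP now marks this frame
//   sub  esp, 4         ; room for `local`
//   mov  eax, [ebp+8]   ; first argument: always at EBP+8
//   add  eax, 1
//   mov  [ebp-4], eax   ; `local`: a fixed negative offset from EBP
//   mov  esp, ebp       ; tear the frame down
//   pop  ebp            ; restore the caller's EBP
//   ret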
Most compilers provide an option to omit frame pointers although it makes debugging really hard. That option should never be used globally, even in release code. You don't know when you'll need to debug a user's crash.
Just adding my two cents to already good answers.
It's part of a good language architecture to have a chain of stack frames. The BP points to the current frame, where subroutine-local variables are stored. (Locals are at negative offsets, and arguments are at positive offsets.)
The idea that it is preventing a perfectly good register from being used in optimization raises the question: when and where is optimization actually worthwhile?
Optimization is only worthwhile in tight loops that 1) do not call functions, 2) are where the program counter spends a significant fraction of its time, and 3) are in code the compiler will ever actually see (i.e. non-library functions). This is usually a very small fraction of the overall code, especially in large systems.
Other code can be twisted and squeezed to get rid of cycles, and it simply won't matter, because the program counter is practically never there.
I know you didn't ask this, but in my experience, 99% of performance problems have nothing at all to do with compiler optimization. They have everything to do with over-design.
It depends on the compiler, certainly. I've seen optimized code emitted by x86 compilers that freely uses the EBP register as a general purpose register. (I don't recall which compiler I noticed that with, though.)
Compilers may also choose to maintain the EBP register to assist with stack unwinding during exception handling, but again this depends on the precise compiler implementation.
However, x86 has very few registers
This is true only in the sense that opcodes can only address 8 registers. The processor itself will actually have many more registers than that and use register renaming, pipelining, speculative execution, and other processor buzzwords to get around that limit. Wikipedia has a good introductory paragraph as to what an x86 processor can do to overcome the register limit: http://en.wikipedia.org/wiki/X86#Current_implementations.
Using stack frames has gotten incredibly cheap on any hardware that's even remotely modern. If stack frames are cheap, then saving a couple of registers isn't as important. I'm sure fast stack frames vs. more registers was an engineering trade-off, and fast stack frames won.
How much are you saving going pure register? Is it worth it?

Resources