How to analyze the computational overhead of a Processing sketch

Please forgive me if the terms are inaccurate or this is the wrong forum.
Essentially I write sketches in Processing and I am struggling to find out why my code runs slow.
Sometimes a sketch runs fast and I have no idea why, other than it has fewer lines of code. Sometimes a different sketch runs slow and I have no idea why.
I am curious if there is a way within the Processing IDE, or maybe a general tool, to determine or analyze which lines of code are causing the sketch to run slow?
Such as a way to isolate "Oh, these lines are causing it to run the slowest. Looks like it's a section of this loop. Maybe I should concentrate on improving this function rather than searching for a needle in a haystack."
Similar to how when my computer is running slow I can use task manager to take a look at which programs are running slow and then adjust. I don't just guess. Or develop an unfounded penchant for quitting one program over another.
I could of course upload my sketch, but this is an example-independent problem I am trying to get better at solving. I would like to find a way to analyze all sketches. Or even code in different languages, etc.
If there is no general tool for analyzing a Processing sketch how do people go about this? Surely there must be a better method than trial and error, brute force, intuition, or otherwise. Of course those methods could yield better running code but there must be some formal analysis.
Even if you didn't have anything specific to share for Processing any suggestions on search terms, subjects, or topics would be appreciated as I have no idea how to proceed other than the brute force/trial and error method.
Thank you,

Without an example code snippet I can only provide a couple of general approaches.
The simplest thing you could do is time sections of code and add print statements (a poor man's profiler). The idea is to take a time snapshot before running a function/block of code you suspect is slow, take another one after, then print the difference. The chunks of code with the largest time difference are what you want to focus on. Here's a minimal example, assuming doSomethingInteresting() is a function you suspect is slow:
// get the current millis since the sketch started
int before = millis();
// call the function here
doSomethingInteresting();
// get the millis again, after the function and calculate the time difference
int diff = millis() - before;
println("executed in ~" + diff + "ms");
If you use Processing's default Java mode you can use VisualVM to sample your PApplet. In your case you will want to sample CPU usage, and the nice thing about VisualVM is that you will see the results sorted by the slowest function calls (the first ones you should improve), and you can drill down to see what is your code versus what is part of the runtime or other native code.

Related

Competitive Programming : Generating test cases and validating the program correctness

I have been doing competitive programming for a while and I am still improving day by day. One thing I have always thought is that it would be really nice if I could automate the test-case generation and cross-validation of my program. It would definitely be a brute-force approach, as some test cases would be algorithm specific.
A Google search gives me a nice link on Quora: How do programming contest problem setters make test cases? and the popular testlib used by problem setters.
But isn't this a chicken-egg problem?
Assume I generated 1 million input test cases, but what would I check them against? How will I generate the outputs? Because I am still in the process of validating the program... If my script generates the correct outputs as well, then what's the point of writing the program in the first place? I could submit the script itself. Also, it's not possible to write 1 million outputs for generated test cases manually. Can anyone please clarify this confusion?
I hope I have stated the problem clearly.
It's common to generate the answer by a slow but obviously correct solution (like an exhaustive search). It can't be used as a main solution as it's too slow for large test cases, but you can check the output of your fast (but possibly incorrect) program using it.
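As a sketch of that workflow (written in Java here; solveFast, solveBrute and randomCase are hypothetical placeholders for your fast solution, your exhaustive reference and your input generator), a stress-test loop just keeps generating small random cases until the two disagree:
import java.util.Arrays;
import java.util.Random;

public class StressTest {
    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int i = 0; i < 100000; i++) {
            int[] input = randomCase(rng);        // small random test case
            String expected = solveBrute(input);  // slow but obviously correct
            String actual = solveFast(input);     // fast but possibly wrong
            if (!expected.equals(actual)) {
                System.out.println("Mismatch on case " + i + ": " + Arrays.toString(input)
                        + " expected=" + expected + " actual=" + actual);
                return;                           // a small failing case to debug by hand
            }
        }
        System.out.println("All cases passed");
    }

    // --- placeholders: replace these with your own implementations ---
    static int[] randomCase(Random rng) {
        int n = 1 + rng.nextInt(8);               // keep cases tiny so brute force stays fast
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = rng.nextInt(20);
        return a;
    }
    static String solveBrute(int[] a) { return ""; }  // exhaustive reference solution
    static String solveFast(int[] a) { return ""; }   // the solution you actually submit
}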
Well, the thing is it is not as broad as you think it to be. Test generation in competitive programming is guided by the algorithm of the problem and its correctness proof.
So even when you think there are millions of test cases, if you analyze the different situations the program can be in, you will likely cover all the cases that matter. Maybe in a certain algorithm you sometimes process the even-index elements of an array and sometimes the odd-index ones. What do you do then? Divide it into two cases, even and odd, and consider the smallest case for each. This way you are basically visiting every control-flow path of the program.
In competitive programming, since we first determine the algorithm, then decide on proper input sizes, and only then write the test cases and validations, it is often easy to think of the corner cases: a test case with 1000000 elements, or one where the input is 0 or 1, and so on.
Another point: most of the time we write a brute-force solution that is much slower than the real one. Then we just generate random medium-sized test cases, run them through the slow program, and check the fast solution's output against it with a simple checker.
Correctness is also guided by mathematical proof (heuristics, induction, the pigeonhole principle, number theory, etc.). That way we are sure about the correctness of the solution.
I faced the same issue earlier this year and saw some of my colleagues also figuring out a way to deal with this. There are times when I just couldn't think of any more test cases, and that's when I decided to make a test case generator tool of my own. It's free and open-source, so anyone can use it.
You can easily generate a lot of test cases using this tool and validate the results using the output of the correct but slower approach (in terms of time and space complexity). You can either run the two programs in parallel and check their outputs, or write a simple script to compare the output of both programs (the slower but correct one and the faster but unverified one).
I believe good coders won’t be needing it anyway, but for the middle level (div2, div3) coders and newbies, it can prove to be a lot helpful.
You can access it from GitHub : Test Case generator.
Both the Python source code and the .exe files are there with instructions.
If you want to make some changes of your own you can work with the Python file.
If you just want to generate some test cases directly, use the .exe file (inside the zip).
It’ll help a lot if you’re beginning your Competitive programming journey.
Also, any suggestions or improvements are always welcome. You can also contribute to this project by adding a feature you think it is lacking, adding some new test case formats yourself, or requesting them.

Profiling vs debugging what is the practice

Most often, I wonder how best to make my applications perform optimally: how to identify the functions/methods that are more resource intensive than the others and make the necessary adjustments. In software development, irrespective of language, I believe there should be some way of finding out how processor/network resources are being used by different parts of my code. I will illustrate what I mean using the simplest example I can think of. I have a background in Java, Python and PHP and feel more comfortable working in a Linux environment. Please feel free to advise me using any of these languages you are comfortable with:
In Javascript one can comfortably test and assign a value to a variable by doing:
//METHOD 1:
if (true) {
  console.log("It will always be true");
} else {
  console.log("You can never see me");
}

//METHOD 2:
var print;
if (true) {
  print = "It will always be true";
} else {
  print = "You can never see me";
}
console.log(print);

//METHOD 3:
console.log((true) ? "It will always be true" : "You can never see me");
If different people were asked which of these methods would perform faster than the others, I am sure different individuals would come up with different ideas. But I need a more reliable way to learn about resource usage, both for desktop and mobile applications. Thanks.
First of all, debugging is done when you know the exact bug or wrong functionality.
It is done for functional testing, i.e. to check for defects in the application.
For performance, profilers are used to find resource-intensive methods. They will point out heavy modules/functions/DB queries, etc., so that after analysis you can tune and improve your system's performance. If some defect arises after a modification, a debugger can then be used to pinpoint and correct the issue.
There are many open-source as well as paid profilers for Java (I say Java because you pasted JavaScript code).
Please have a look at them and use them to tune your system.
IMHO the downvoting was for two reasons: bad English and a very basic question.
In the examples you gave, it simply doesn't matter. That's like asking which is faster, a snail or a worm. BUT, if by
"profiling" you mean "measurements of time taken by functions", and if by
"debugging a performance problem" you mean "the bug is that time is being spent unnecessarily, and I need to find out why",
then I strongly agree with the premise of your question.
Debugging works much better.
A performance problem consists of excess time being spent for unnecessary reasons.
The way to find it is by breaking into the execution at random times, which will naturally gravitate toward whatever is taking the most time.
The more time the problem takes, the more likely the break is to land in the problem.
Just do it several times, and each time look carefully at the program's state, to understand why it's doing whatever it's doing.
If you see it doing something that can be avoided, and you see it doing it on more than one break, you've found a performance problem.
Fix it and observe the speedup.
Then repeat the process, because what was the next biggest problem is now the biggest one.
The speedups multiply together.
Here's an example where the time goes from 2700us to 1800, then to 1500, 1300, 440, 170, and finally 3.7us. None of the speedups were all that big as fractions of the original time, but in the aggregate, they were stunning.
This is very much different from measuring.
In measuring, the first thing that happens is you assume numerical accuracy is important, so you assume you either need lots of samples or you have to instrument the code to measure time precisely.
As if numerical accuracy helps to find the problem.
This false assumption entered programmers' consciousness around 1982, when GPROF appeared.
In measuring, you assume the problem can be localized to a function, when in fact you need to see all of what's happening at a point in time to know if it can be avoided.
IMHO, the best profilers sample the stack, on wall-clock time, and report for each line of code that appears on samples, the percent of samples it appears on.
However, even these profilers don't tell you the context that you can get by simply looking carefully at individual stack samples, and data as well.
(Other forms of eye-candy have the same problem: call-graph, hot-path, flame-graph, etc.)

MATLAB: GUI progressively getting slower

I've been programming some MATLAB GUIs (not using GUIDE), mainly for viewing images and some other simple operations (such as selecting points and plotting some data from the images).
When the GUI starts, all the operations are performed quickly.
However, as the GUI is used (showing different frames from 3D/4D volumes and performing the operations mentioned above), it starts getting progressively slower, reaching a point where it is too slow for normal usage.
I would like to hear some input regarding:
Possible strategies to find out why the GUI is getting slower;
Good MATLAB GUI programming practices to avoid this;
Possible references that address these issues.
I'm using set/getappdata to save variables in the main figure of the GUI and communicate between functions.
(I wish I could provide a minimal working example, but I don't think it is suitable in this case because this only happens in somewhat more complex GUIs.)
Thanks a lot.
EDIT: (Reporting back some findings using the profiler:)
I used the profiler in two occasions:
immediately after starting the GUI;
after playing around with it for some time, until it started getting too slow.
I performed the exact same procedure in both profiling operations, which was simply moving the mouse around the GUI (same "path" both times).
The profiler results are as follows:
I am having difficulties in interpreting these results...
Why is the number of calls to certain functions (such as impixelinfo) so much bigger in the second case?
Any opinions?
Thanks a lot.
The single best way I have found around this problem was hinted at above: forced garbage collection. Great advice, though the command forceGarbageCollection is not recognized in MATLAB. The command you want is java.lang.System.gc()... such a beast.
I was working on a project in which I was reading 2 serial ports at 40 Hz (using a timer) and one NIDAQ at 1000 Hz (using startBackground()) and graphing them all in real time. MATLAB's parallel-processing limitations ensured that one of those processes would cause a buffer choke at any given time. Animations would not be able to keep up, and eventually freeze, etc. I gained some initial success by making sure that I defined a single plot and only updated the parameters that changed inside my animation loop with the set command (e.g. figure, subplot(311), axis([...]), hold on, p1 = plot(x1,y1,'erasemode','xor',...); etc., then tic, while (toc<8), set(p1,'xdata',x1,'ydata',y1), ...).
Using set will make your animations MUCH faster and more fluid. However, you will still run into the buffer wall if you animate long enough with too much going on in the background, especially real-time data inputs. Garbage collection is your answer. It isn't instantaneous, so you don't want it to execute on every loop cycle unless your loop is extremely long. My solution is to set up a counter variable outside the while loop and use a mod function so that it only executes every n cycles (e.g. counter = 0; while () ... counter = counter + 1; if (~mod(counter,n)), java.lang.System.gc(); end), and so on.
This will save you (and hopefully others) loads of time and headache, trust me, and you will have MATLAB executing real-time data acq and animation on par with LabVIEW.
A good strategy to find out why anything is slow in Matlab is to use the profiler. Here is the basic way to use the profiler:
profile on
% do stuff now that you want to measure
profile off
profile viewer
I would suggest profiling a freshly opened GUI, and also one that has been open for a while and is noticeably slow. Then compare results and look for functions that have a significant increase in "Self Time" or "Total Time" for clues as to what is causing the slowdown.

Practical tips debugging deep recursion?

I'm working on a board game algorithm where a large tree is traversed using recursion, however, it's not behaving as expected. How do I handle this and what are you experiences with these situations?
To make things worse, it's using alpha-beta pruning, which means entire parts of the tree are never visited, and it simply stops recursing when certain conditions are met. I can't change the search depth to a lower number either, because while the algorithm is deterministic, the outcome varies with how deep it searches, and it may behave as expected at a lower search depth (and it does).
Now, I'm not gonna ask you "where is the problem in my code?" but I am looking for general tips, tools, visualizations, anything to debug code like this. Personally, I'm developing in C#, but any and all tools are welcome. Although I think that this may be most applicable to imperative languages.
Logging. Log extensively in your code. In my experience, logging is THE solution for these types of problems. When it's hard to figure out what your code is doing, logging it extensively is a very good approach, as it lets you output the internal state from within your code. It's really not a perfect solution, but as far as I've seen it works better than any other method.
One thing I have done in the past is to format the logs to reflect the recursion depth. So you might add a new level of indentation for every recursive call, or some other delimiter. Then make a debug DLL that logs everything you need to know about each call. Between the two, you should be able to read the execution path and hopefully tell what's wrong.
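For illustration, here is a minimal sketch of that idea in Java (the same pattern works in C#); the Node class, the best() function and the toy tree are made up for the example and stand in for your real game-tree search:
import java.util.List;

public class TraceDemo {
    static int best(Node node, int depth) {
        String indent = "  ".repeat(depth);  // two spaces per recursion level
        System.out.println(indent + "enter depth=" + depth + " node=" + node.name);
        if (node.children.isEmpty()) {
            System.out.println(indent + "leaf value=" + node.value);
            return node.value;
        }
        int best = Integer.MIN_VALUE;
        for (Node child : node.children) {
            best = Math.max(best, best(child, depth + 1));
        }
        System.out.println(indent + "exit depth=" + depth + " best=" + best);
        return best;
    }

    static class Node {
        String name; int value; List<Node> children;
        Node(String name, int value, List<Node> children) {
            this.name = name; this.value = value; this.children = children;
        }
    }

    public static void main(String[] args) {
        Node tree = new Node("root", 0, List.of(
                new Node("a", 3, List.of()),
                new Node("b", 7, List.of())));
        best(tree, 0);
    }
}
Reading the indented output top to bottom then mirrors the shape of the search tree, so a wrong cutoff or a skipped branch tends to jump out.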
I would normally unit-test such algorithms with one or more predefined datasets that have well-defined outcomes. I would typically make several such tests in increasing order of complexity.
If you insist on debugging, it is sometimes useful to doctor the code with statements that check for a given value, so you can attach a breakpoint at that time and place in the code:
if (depth == X && item.id == 32) {
    // Breakpoint here
}
Maybe you could convert the recursion into an iteration with an explicit stack for the parameters. Testing is easier that way because you can directly log values and inspect the stack, and you don't have to pass data/variables into each self-invocation or worry about them falling out of scope.
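A rough sketch of that transformation in Java (the Frame record and the toy child lists are invented for the example; a real game-tree search would also carry alpha/beta bounds and move data in each frame):
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class IterativeWalk {
    // Frame holds exactly what the recursive version would have kept
    // in parameters and local variables.
    record Frame(String name, int depth, List<String> children) {}

    public static void main(String[] args) {
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame("root", 0, List.of("a", "b", "c")));

        while (!stack.isEmpty()) {
            Frame frame = stack.pop();
            // The whole pending search state is visible right here,
            // which makes logging and inspection much easier.
            System.out.println("visiting " + frame.name() + " at depth " + frame.depth());
            for (String child : frame.children()) {
                stack.push(new Frame(child, frame.depth() + 1, List.of()));
            }
        }
    }
}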
I once had a similar problem when I was developing an AI algorithm to play a Tetris game. After trying many things and losing a LOT of hours reading my own logs, debugging, and stepping in and out of functions, what worked for me was to code a quick visualizer and test my code with FIXED input.
So, if time is not a problem and you really want to understand what is going on, get a fixed board state and SEE what your program is doing with the data using a mix of debug logs/output and some sort of your own tools that shows information on each step.
Once you find a board state that gives you this problem, try to pin-point the function(s) where it starts and then you will be in a position to fix it.
I know what a pain this can be. At my job, we are currently working with a 3rd party application that basically behaves as a black box, so we have to devise some interesting debugging techniques to help us work around issues.
When I was taking a compiler theory course in college, we used a software library to visualize our trees; this might help you as well, as it could help you see what the tree looks like. In fact, you could build yourself a WinForms/WPF application to dump the contents of your tree into a TreeView control--it's messy, but it'll get the job done.
You might want to consider some kind of debug output, too. I know you mentioned that your tree is large, but perhaps debug statements or breakpoints at key points during the execution you're having trouble visualizing would lend you a hand.
Bear in mind, too, that intelligent debugging using Visual Studio can work wonders. It's tough to see how state is changing across multiple breaks, but Visual Studio 2010 should actually help with this.
Unfortunately, it's not particularly easy to help you debug without further information. Have you identified the first depth at which it starts to break? Does it continue to break with higher search depths? You might want to evaluate your working cases and try to determine how it's different.
Since you say that the traversal is not working as expected, I assume you have some idea of where things may go wrong. Then inspect the code to verify that you have not overlooked something basic.
After that I suggest you set up some simple unit tests. If they pass, then keep adding tests until they fail. If they fail, then reduce the tests until they either pass or are as simple as they can be. That should help you pinpoint the problems.
If you want to debug as well, I suggest you employ conditional breakpoints. Visual Studio lets you modify breakpoints, so you can set conditions on when the breakpoint should be triggered. That can reduce the number of iterations you need to look at.
I would start by instrumenting the function(s). At each recursive call log the data structures and any other info that will be useful in helping you identify the problem.
Print out the dump along with the source code then get away from the computer and have a nice paper-based debugging session over a cup of coffee.
Start from the base case (the if/else statements you've mentioned), and then try to channel your thinking by writing it down with pen and paper, plus printing the values to the console when the first few recursive calls are made.
The goal is to find the trend in the values you print and match them against the values you wrote down on paper for the first few steps of your recursive algorithm.

One could use a profiler, but why not just halt the program? [closed]

If something is making a single-thread program take, say, 10 times as long as it should, you could run a profiler on it. You could also just halt it with a "pause" button, and you'll see exactly what it's doing.
Even if it's only 10% slower than it should be, if you halt it more times, before long you'll see it repeatedly doing the unnecessary thing. Usually the problem is a function call somewhere in the middle of the stack that isn't really needed. This doesn't measure the problem, but it sure does find it.
Edit: The objections mostly assume that you only take 1 sample. If you're serious, take 10. Any line of code causing some percentage of wastage, like 40%, will appear on the stack on that fraction of samples, on average. Bottlenecks (in single-thread code) can't hide from it.
EDIT: To show what I mean, many objections are of the form "there aren't enough samples, so what you see could be entirely spurious" - vague ideas about chance. But if something of any recognizable description, not just being in a routine or the routine being active, is in effect for 30% of the time, then the probability of seeing it on any given sample is 30%.
Then suppose only 10 samples are taken. The number of times the problem will be seen in 10 samples follows a binomial distribution, and the probability of seeing it 0 times is .028. The probability of seeing it 1 time is .121. For 2 times, the probability is .233, and for 3 times it is .267, after which it falls off. Since the probability of seeing it fewer than two times is .028 + .121 = .149, the probability of seeing it two or more times is 1 - .149 = .851. The general rule is that if you see something you could fix on two or more samples, it is worth fixing.
In this case, the chance of seeing it on two or more of the 10 samples is about 85%. If you're in the 15% who don't see it, just take more samples until you do. (If the number of samples is increased to 20, the chance of seeing it two or more times rises to more than 99%.) So it hasn't been precisely measured, but it has been precisely found, and it's important to understand that it could easily be something a profiler could not actually find, such as something involving the state of the data rather than the program counter.
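Those binomial figures are easy to check; here is a quick sketch in Java that assumes nothing beyond the 30%, 10-sample and 20-sample numbers above:
public class SampleOdds {
    // Probability of exactly k "hits" in n samples when the problem is on
    // the stack a fraction p of the time: C(n,k) * p^k * (1-p)^(n-k)
    static double binom(int n, int k, double p) {
        double c = 1;
        for (int i = 0; i < k; i++) c = c * (n - i) / (i + 1);
        return c * Math.pow(p, k) * Math.pow(1 - p, n - k);
    }

    public static void main(String[] args) {
        double p = 0.30;  // problem is on the stack 30% of the time
        for (int n : new int[] {10, 20}) {
            double lessThanTwo = binom(n, 0, p) + binom(n, 1, p);
            System.out.printf("n=%d: P(0)=%.3f P(1)=%.3f P(>=2)=%.3f%n",
                    n, binom(n, 0, p), binom(n, 1, p), 1 - lessThanTwo);
        }
    }
}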
On Java servers it has always been a neat trick to do 2-3 quick Ctrl-Breaks in a row and get 2-3 thread dumps of all running threads. Simply looking at where all the threads "are" can pinpoint your performance problems extremely quickly.
This technique can reveal more performance problems in 2 minutes than any other technique I know of.
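If you can add a little code to the process, a rough programmatic version of the same trick is a daemon thread that periodically dumps every thread's stack. This is only a sketch; the one-second interval, the busy-loop stand-in for real work, and printing to standard output are arbitrary choices:
public class StackSampler {
    public static void main(String[] args) {
        // Poor man's sampler: once a second, dump every thread's stack.
        Thread sampler = new Thread(() -> {
            while (true) {
                for (var entry : Thread.getAllStackTraces().entrySet()) {
                    System.out.println("--- " + entry.getKey().getName());
                    for (StackTraceElement frame : entry.getValue()) {
                        System.out.println("    at " + frame);
                    }
                }
                try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
            }
        }, "stack-sampler");
        sampler.setDaemon(true);
        sampler.start();

        // Stand-in for the real work of the application.
        double x = 0;
        for (long i = 0; i < 2_000_000_000L; i++) {
            x += Math.sqrt(i);
        }
        System.out.println(x);
    }
}
Reading a handful of these dumps usually shows where most of the wall-clock time is going.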
Because sometimes it works, and sometimes it gives you completely wrong answers. A profiler has a far better record of finding the right answer, and it usually gets there faster.
Doing this manually can't really be called "quick" or "effective", but there are several profiling tools which do this automatically; also known as statistical profiling.
Callstack sampling is a very useful technique for profiling, especially when looking at a large, complicated codebase that could be spending its time in any number of places. It has the advantage of measuring the CPU's usage by wall-clock time, which is what matters for interactivity, and getting callstacks with each sample lets you see why a function is being called. I use it a lot, but I use automated tools for it, such as Luke Stackwalker and OProfile and various hardware-vendor-supplied things.
The reason I prefer automated tools over manual sampling for the work I do is statistical power. Grabbing ten samples by hand is fine when you've got one function taking up 40% of runtime, because on average you'll get four samples in it, and always at least one. But you need more samples when you have a flat profile, with hundreds of leaf functions, none taking more than 1.5% of the runtime.
Say you have a lake with many different kinds of fish. If 40% of the fish in the lake are salmon (and 60% "everything else"), then you only need to catch ten fish to know there's a lot of salmon in the lake. But if you have hundreds of different species of fish, and each species is individually no more than 1%, you'll need to catch a lot more than ten fish to be able to say "this lake is 0.8% salmon and 0.6% trout."
Similarly in the games I work on, there are several major systems each of which call dozens of functions in hundreds of different entities, and all of this happens 60 times a second. Some of those functions' time funnels into common operations (like malloc), but most of it doesn't, and in any case there's no single leaf that occupies more than 1000 μs per frame.
I can look at the trunk functions and see, "we're spending 10% of our time on collision", but that's not very helpful: I need to know exactly where in collision, so I know which functions to squeeze. Just "do less collision" only gets you so far, especially when it means throwing out features. I'd rather know "we're spending an average 600 μs/frame on cache misses in the narrow phase of the octree because the magic missile moves so fast and touches lots of cells," because then I can track down the exact fix: either a better tree, or slower missiles.
Manual sampling would be fine if there were a big 20% lump in, say, stricmp, but with our profiles that's not the case. Instead I have hundreds of functions that I need to get from, say, 0.6% of frame to 0.4% of frame. I need to shave 10 μs off every 50 μs function that is called 300 times per second. To get that kind of precision, I need more samples.
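To put rough numbers on that, with the same reasoning as the binomial argument above: the number of samples needed for a function to show up a few times grows roughly as one over its time fraction. A small sketch (the target of four expected hits is an arbitrary choice; the percentages are the ones quoted above):
public class SampleCount {
    public static void main(String[] args) {
        // How many stack samples until a function that takes fraction p of
        // the time is expected to appear on at least `hits` samples?
        int hits = 4;
        for (double p : new double[] {0.40, 0.015, 0.006}) {
            long needed = (long) Math.ceil(hits / p);
            System.out.printf("p=%.3f -> about %d samples for %d expected hits%n",
                    p, needed, hits);
        }
    }
}
Ten hand-taken samples are plenty at 40%, but at 0.6% you are into the hundreds, which is why the automated tools pay off.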
But at heart what Luke Stackwalker does is what you describe: every millisecond or so, it halts the program and records the callstack (including the precise instruction and line number of the IP). Some programs just need tens of thousands of samples to be usefully profiled.
(We've talked about this before, of course, but I figured this was a good place to summarize the debate.)
There's a difference between things that programmers actually do, and things that they recommend others do.
I know of lots of programmers (myself included) that actually use this method. It only really helps to find the most obvious of performance problems, but it's quick and dirty and it works.
But I wouldn't really tell other programmers to do it, because it would take me too long to explain all the caveats. It's far too easy to make an inaccurate conclusion based on this method, and there are many areas where it just doesn't work at all. (for example, that method doesn't reveal any code that is triggered by user input).
So just like using lie detectors in court, or the "goto" statement, we just don't recommend that you do it, even though they all have their uses.
I'm surprised by the religious tone on both sides.
Profiling is great, and it is certainly more refined and precise when you can do it. Sometimes you can't, and it's nice to have a trusty backup. The pause technique is like the manual screwdriver you use when your power tool is too far away or the batteries have run down.
Here is a short true story. An application (a kind of batch-processing task) had been running fine in production for six months, when suddenly the operators called the developers because it was going "too slow". They weren't going to let us attach a sampling profiler in production! You have to work with the tools already installed. Without stopping the production process, just using Process Explorer (which the operators had already installed on the machine), we could see a snapshot of a thread's stack. You can glance at the top of the stack, dismiss it with the Enter key, and get another snapshot with another mouse click. You can easily get a sample every second or so.
It doesn't take long to see whether the top of the stack is most often in the database client library DLL (waiting on the database), in another system DLL (waiting for a system operation), or actually in some method of the application itself. In this case, if I remember right, we quickly noticed that 8 times out of 10 the application was in a system DLL call reading or writing a network file. Sure enough, recent "upgrades" had changed the performance characteristics of a file share. Without a quick-and-dirty (and system-administrator-sanctioned) way to see what the application was doing in production, we would have spent far more time trying to measure the issue than correcting it.
On the other hand, when performance requirements move beyond "good enough" to really pushing the envelope, a profiler becomes essential so that you can try to shave cycles from all of your closely tied top ten or twenty hot spots. Even if you are just trying to hold to a moderate performance requirement during a project, getting the right tools lined up to help you measure and test, and even integrating them into your automated test process, can be fantastically helpful.
But when the power is out (so to speak) and the batteries are dead, it's nice to know how to use that manual screwdriver.
So the direct answer: Know what you can learn from halting the program, but don't be afraid of precision tools either. Most importantly know which jobs call for which tools.
Hitting the pause button during the execution of a program in "debug" mode might not provide the right data to perform any performance optimizations. To put it bluntly, it is a crude form of profiling.
If you must avoid using a profiler, a better bet is to use a logger, and then apply a slowdown factor to "guesstimate" where the real problem is. Profilers however, are better tools for guesstimating.
The reason why hitting the pause button in debug mode may not give a true picture of application behavior is that debuggers introduce additional executable code that can slow down certain parts of the application. One can refer to Mike Stall's blog post on possible reasons for application slowdown in a debugging environment. The post sheds light on reasons such as too many breakpoints, creation of exception objects, unoptimized code, etc. The part about unoptimized code is important: "debug" mode results in a lot of optimizations (usually code inlining and reordering) being thrown out of the window, to enable the debug host (the process running your code) and the IDE to synchronize code execution. Therefore, hitting pause repeatedly in "debug" mode might be a bad idea.
If we take the question "Why isn't it better known?" then the answer is going to be subjective. Presumably the reason it is not better known is that profiling provides a long-term solution rather than a solution to the current problem. It isn't effective for multi-threaded applications and isn't effective for applications like games, which spend a significant portion of their time rendering.
Furthermore, in single-threaded applications, if you have a method that you expect to consume the most run time and you want to reduce the run time of all the other methods, then it is going to be harder to determine which secondary methods to focus your efforts on first.
Your process for profiling is an acceptable method that can and does work, but profiling provides you with more information and has the benefit of showing you more detailed performance improvements and regressions.
If you have well-instrumented code then you can examine more than just how long a particular method takes; you can see all the methods.
With profiling:
You can then rerun your scenario after each change to determine the degree of performance improvement/regression.
You can profile the code on different hardware configurations to determine if your production hardware is going to be sufficient.
You can profile the code under load and stress testing scenarios to determine how the volume of information impacts performance
You can make it easier for junior developers to visualise the impacts of their changes to your code because they can re-profile the code in six months time while you're off at the beach or the pub, or both. Beach-pub, ftw.
Profiling is given more weight because enterprise code should always have some degree of profiling, because of the benefits it gives to the organisation over an extended period of time. The more important the code, the more profiling and testing you do.
Your approach is valid and is another item in the developer's toolbox. It just gets outweighed by profiling.
Sampling profilers are only useful when
You are monitoring a runtime with a small number of threads. Preferably one.
The call stack depth of each thread is relatively small (to reduce the incredible overhead in collecting a sample).
You are only concerned about wall clock time and not other meters or resource bottlenecks.
You have not instrumented the code for management and monitoring purposes (hence the stack dump requests)
You mistakenly believe that removing a stack frame is an effective performance improvement strategy, whether or not the inherent cost (excluding callees) is practically zero
You can't be bothered to learn how to apply software performance engineering day-to-day in your job
....
Stack trace snapshots only allow you to see stroboscopic x-rays of your application. You may require more accumulated knowledge which a profiler may give you.
The trick is knowing your tools well and choose the best for the job at hand.
These must be some trivial examples that you are working with to get useful results with your method. I can't think of a project where profiling was useful (by whatever method) that would have gotten decent results with your "quick and effective" method. The time it takes to start and stop some applications already puts your assertion of "quick" in question.
Again, with non-trivial programs the method you advocate is useless.
EDIT:
Regarding "why isn't it better known"?
In my experience code reviews avoid poor quality code and algorithms, and profiling would find these as well. If you wish to continue with your method that is great - but I think for most of the professional community this is so far down on the list of things to try that it will never get positive reinforcement as a good use of time.
It appears to be quite inaccurate with small sample sets, and getting large sample sets would take lots of time that would have been better spent on other useful activities.
What if the program is in production and being used at the same time by paying clients or colleagues? A profiler allows you to observe without interfering (as much; of course it will have a small impact too, as per the Heisenberg principle).
Profiling can also give you much richer and more detailed accurate reports. This will be quicker in the long run.
EDIT 2008/11/25: OK, Vineet's response has finally made me see what the issue is here. Better late than never.
Somehow the idea got loose in the land that performance problems are found by measuring performance. That is confusing means with ends. Somehow I avoided this by single-stepping entire programs long ago. I did not berate myself for slowing it down to human speed. I was trying to see if it was doing wrong or unnecessary things. That's how to make software fast - find and remove unnecessary operations.
Nobody has the patience for single-stepping these days, but the next best thing is to pick a number of cycles at random and ask what their reasons are. (That's what the call stack can often tell you.) If a good percentage of them don't have good reasons, you can do something about it.
It's harder these days, what with threading and asynchrony, but that's how I tune software - by finding unnecessary cycles. Not by seeing how fast it is - I do that at the end.
Here's why sampling the call stack cannot give a wrong answer, and why not many samples are needed.
During the interval of interest, when the program is taking more time than you would like, the call stack exists continuously, even when you're not sampling it.
If an instruction I is on the call stack for fraction P(I) of that time, removing it from the program, if you could, would save exactly that much. If this isn't obvious, give it a bit of thought.
If the instruction shows up on M = 2 or more samples, out of N, its P(I) is approximately M/N, and is definitely significant.
The only way you can fail to see the instruction is to magically time all your samples for when the instruction is not on the call stack. The simple fact that it is present for a fraction of the time is what exposes it to your probes.
So the process of performance tuning is a simple matter of picking off instructions (mostly function call instructions) that raise their heads by turning up on multiple samples of the call stack. Those are the tall trees in the forest.
Notice that we don't have to care about the call graph, or how long functions take, or how many times they are called, or recursion.
I'm against obfuscation, not against profilers. They give you lots of statistics, but most don't give P(I), and most users don't realize that that's what matters.
You can talk about forests and trees, but for any performance problem that you can fix by modifying code, you need to modify instructions, specifically instructions with high P(I). So you need to know where those are, preferably without playing Sherlock Holmes. Stack sampling tells you exactly where they are.
This technique is harder to employ in multi-thread, event-driven, or systems in production. That's where profilers, if they would report P(I), could really help.
Stepping through code is great for seeing the nitty-gritty details and troubleshooting algorithms. It's like looking at a tree really up close and following each vein of bark and branch individually.
Profiling lets you see the big picture, and quickly identify trouble points -- like taking a step backwards and looking at the whole forest and noticing the tallest trees. By sorting your function calls by length of execution time, you can quickly identify the areas that are the trouble points.
I used this method for Commodore 64 BASIC many years ago. It is surprising how well it works.
I've typically used it on real-time programs that were overrunning their timeslice. You can't manually stop and restart code that has to run 60 times every second.
I've also used it to track down the bottleneck in a compiler I had written. You wouldn't want to try to break such a program manually, because you really have no way of knowing whether you are breaking at the spot where the bottleneck is, or just at the spot after the bottleneck when the OS is allowed back in to stop it. Also, what if the major bottleneck is something you can't do anything about, but you'd like to get rid of all the other large-ish bottlenecks in the system? How do you prioritize which bottlenecks to attack first, when you don't have good data on where they all are and what their relative impact is?
The larger your program gets, the more useful a profiler will be. If you need to optimize a program which contains thousands of conditional branches, a profiler can be indispensable. Feed in your largest sample of test data, and when it's done, import the profiling data into Excel. Then check your assumptions about likely hot spots against the actual data. There are always surprises.
