What is proper practice for performance rules testing?

What is proper practice for performance rules testing? - performance

I know that what we're doing is incorrect/strange practice.
We have an object that is constructed in many places in the app, and lags in its construction can severely impact our performance.
We want a gate to stop check-ins which affect this construction's performance too adversely...
So what we did was create a unit test which is basically the following:
myStopwatch.StartNew()
newMyObject = New myObject()
myStopwatch.Stop()
Assert(myStopwatch.ElapsedMilliseconds < 100)
Or: Fail if construction takes longer than 100ms
This "works" in the sense that check-ins will not commit if they impact this performance too negatively... However it's inherently a bad unit test because it can fail intermittently... If, for example, our build-server happens to be slow for whatever reason.
In response to some of the answers; we explicitly want our gates to reject check-ins that impact this performance, we don't want to check logs or watch for trends in data.
What is the correct way to meter performance in our check-in gate?

To avoid the machine dependence, you could first time the construction of a "reference object" which has a known acceptable construction time. Then compare the time to construct your object to the reference object's time.
This may help prevent false failures on an overloaded server since the reference code will also be slower. I'd also run the test several times and only require X% of them to pass. (since there are many external events which can slow down code, but none that will speed it up. )

First I would say: Can't you allow some of that logic be lazily run rather than executing all of it in the constructor / initialization? Or can you partition the object? An useful metric for this is LCOM4.
Secondly, can you cache those instances? In a previous project we had a similar situation, and we decided to cache the object for a few minutes. This brought some other smaller issues, but the performance of the app skyrocketed.
And last, I do think it's a good approach, but I would take an average, rathen than just one sample (the OS might just at that time decide to run something else and it might take more than 100ms).
Also, one issue with this approach is, if you update your hardware and forget to update this, you might add even more logic, without realizing.
I think a better approach, but more a bit more tricky to implement, is to store how long it takes to run N iterations, and if that value increases more than X% you fail the build. The benefit of this, is that since you store how long it takes, you can generate a graph from it and see the trend.

I don't think that you should really do this in such a way as to block check ins because it is too much work to be done during the check in process. Check ins need to be fast because your developers can do nothing else whilst they run.
This unit test would have to compile and run whilst the developer sits and waits for it. As you pointed out, one iteration of the test is not good enough to produce consistant results. How many times would it need to be run to be reliable? 10? A run of 10 iterations would increase the check in time by up to 1 second and still isn't reliable enough in my opinion. If you increased that to 100 iterations you'd get a better result but that's adding 10 seconds to the check in time.
Also, what happens if two developers check in code at the same time? Does the second one have to wait for the first test to complete before theirs starts or would the tests be run simultaneously? The first scenario is bad because the second developer has to wait twice as long. The second scenario is bad as you'd be likely to fail both tests.
I think that a better option would have the unit test be run after the check in has completed and, if it fails, have it communicate this to somebody. You could have the test run after each check in but that still has the potential for two people to check in at the same time. I think that it would be better to run the test every N minutes. That way you'd be able to track it down fairly quickly.
You could do it so that it blocks check ins but you'd have to make sure that it only runs when that object (or a dependancy) changes so that don't slow down every commit. You'd also have to make sure that the test isn't run more than once at a time.
As to the specific test, I don't think that you can get away with anything other than run the test through a number of iterations to get a more accurate result. I wouldn't like to rely on anything less than a 5 or 10 second test (so 50 to 100 iterations).

Related

What is the max time a function call should take to run in 2015?

I have a function that needs to run on every keypress and I want to know if it is doing too much work. To find that out I would time it. If it is less than a millisecond I would have to iterate over it 1k - 1 million times then divide by that amount.
While this may seem broad I'm going to ask anyway, what is the max amount of time a function should take to run in 2015? And is there a standard?
In video games if you want to hit 30 frames per second then you have 33 milliseconds to play with since 1 second equals 1000 milliseconds and 1000ms/30frames per second equals 33milliseconds. If you aim for 60 frames per second you only have 15 milliseconds to work with. So at what point would you say to your fellow software developer that their function takes too long? After, 1 millisecond? 100 milliseconds?
For example, if you were to use JQuery to find an element on a page of a 1000 elements how long should that function take? Is 100 ms too long? Is 1 ms too long? What standards to people use to go by? I've heard that database administrators have a standard search time for queries.
Update:
I really don't know how to word this question the way I meant it to. The reason I ask is because I do not know if I need to optimize a function or just move on. And if I do time a function I don't know if it's fast or slow. So, I'm going to ignore it if it's less than a millisecond and work on it if it's longer than a millisecond.
Update 2:
I found a post that describes a scenario. Look at the numbers in this question. It takes 3ms to query the database. If I was a database administrator I would want to know if that is taking too long to scale to a million users. I would want to know how long it should take to connect to the database or perform a query before adding another server to help load balance.

Okay. I think you're asking the wrong question. Or maybe thinking about the problem in a way that's not likely to yield productive results.
There's absolutely no rule that says, "your function should return after no more than X milliseconds". There are plenty of robust web applications that utilize functions that may not return for 250 milliseconds. That can be okay, depending on the context.
And, keep in mind that a function that runs for, say 3 milliseconds, on your dev machine, may run much faster or much slower on someone else's machine.
But here are some tips to get you thinking a little bit more clearly:
1) It's all about the user.
Really (and I kind of hesitate to say this because someone is going to take me too literally and either write bad code, or start a flame war with me), but as long as the performance of your code doesn't affect the user experience, your functions can take as long as you want to do their business. I'm not saying you should be lazy in your code optimization; I'm saying as long as the user doesn't perceive any delay, you don't really need to stress about it.
2) Ask your self two questions:
a) How often is the functions going to run? and b) How many times is it going to run in quick succession, with no breaks in between (i.e. is it going to run in a loop)?
In your example of an calling a function every time the user types a key, you can base the decision about "how long is too long to finish a function" on how often the user is going to hit a key. Again, if it doesn't fuss with the user's ability to use your application effectively, you're wasting time haggling over 3 milliseconds.
On the other hand, if your function is going to get called inside a loop that runs 100,000 times, then you definitely want to spend some time making it as lean as possible.
3) Utilize promises where appropriate.
Javascript promises are a nice feature (although they're not unique to javascript). Promises are way of saying, "listen, I don't know how long this function is going to take. Ima go start working on it, and I'll just let you know whenever I'm finished." Then, the rest of your code can keep on executing and whenever the promise is fulfilled, you are notified, and can do something at that point.
The most common example of promises is the AJAX design pattern. You never know how long a call to the back end might take, so you just say "alright, let me know when the backend responds with some useful information".
Read more about promises here
4) Benchmarking for fun and profit
As mentioned above, sometimes you really do need to shave off every last miserable millisecond. In those cases I've found jsperf.com to be very useful. They have made it easy to quickly test two pieces of code to see which one runs faster. There are tons of other benchmarking tools, but I like that one a lot. (I realize this is a bit of a tangent to the original post, but at some point somebody else is going to come a long and read this, so I feel it's relevant)
After all is said and done, remember to keep relating this question to the user. Optimize for the user, not for your ego, or for some arbitrary number.

Why BDD test scenarios are separated?

When I'm writing tests using lettuce, I want to create a huge scenario that contains an user making every possible action on website. But testing tools are making me aim separating them. What is the benefit of it?

You put "BDD" into the title of this question, and tag it with both "BDD" and "TDD" tags. So you're interested in Behavior Driven Development and Test Driven Development.
Why should you drive development a little bit at a time instead of driving the entire application all at once? That's what your question amounts to in the context of BDD and TDD.
You are going to write one method, one additional bit of functionality, next. Of course that bit will contribute to the overall behavior, and it's good to have an understanding of the overall behavior you're trying to develop, but you need focus. You need to know when that next bit is working and complete, so you can move on to the next bit. A full-blown test of the entire app will fail at the beginning; it will fail after your first bit of functionality is implemented; it will fail when you are half-done, and it will fail when you are 99% done. Unfortunately, it will probably also fail when you are 100% done - and now you'll have to find where you went wrong, and fix that bit (or those bits).
But if you write a test of just that new bit of functionality, it will fail right now, and it will pass in five minutes or ten minutes or twenty minutes or maybe an hour. Then you'll know it's time for the next bit. And it will keep passing as you write more tests and implement more functionality. When you're 99% done, you'll have 99% of your ultimate tests passing - and 100%, minus one, of your current tests. You can see your real progress and know that what you have written up to date is really working.
That's why you should write small tests, one at a time, and make them pass one at a time.

Three important things come to my mind:
readability: when a scenario fails, it is easier to understand from a first glance at the scenario's name what went wrong and significantly easier to fix when it's small and focused
maintainability: it is easier to modify/update small scenario
independence: large scenarios tend to make steps dependent on each other. This way, the further an action in a scenario is, the more it is depending on previous actions in more complicated ways that are hard to comprehend. This directly influences two previous reasons.

testing with automated testing

I am reading Test driven development by TDD by example KentBenk.
--->Stress--------$----->RunTests
|
|<------------$--------|
Above diagram says that if an arrow wiht $ means that an increase in the first node implies a decrease in the second node.
Above is positive feedback loop. The more stress you feel the less testing you do and more errors and more stress.
How do we get out of such a loop? Here authore mentioned that either introduce a new element, replace one of the elements, or change arrows. In this case we'll replace testing with automated testing.
Below is text notes after diagram:
Did I just break something else with that change? With automated
tests, when I start to feel stress, I run the tests. Running the tests
immediately gives me a good feeling and reduces the number of errors I
make, which further reduces the stress I feel.
"We don't have time to run the tests. Just release it!" The second
picture isn't guarantted. If the stress level raises high enough, it
breaks down. However, with automated tests you have to a chance to
choose your level of fear.
My questions are
Can any one represent new feed back with new element automated testing? Here how when I feel stress we run less automated tests with above diagram so how we reduce stress?
What does author mean by "The second picture isn't guarantted. If the stress level raises high enough, it breaks down. However, with automated tests you have to a chance to choose your level of fear."?

Don't have the book at hand.. but from your quoted passage.
Having quick automated tests makes it possible to
get fast feedback (within secs of making a change) - did you just break something that used to work?
run it frequently.. ideally after every tiny change. The faster your test suite the greater the probability that your tests will be run frequently.
The difference with manual testing is that the feedback cycle is too long causing you to bunch a lot of changes before you spend a day/week testing everything (you don't want to lose a day/week after every tiny change). This leads to issues with isolating the change when a defect is found - more stress.

Unit testing: is "good enough" good enough?

I have a unit test that can spontaneously fail 1 in 1,000,000 (guesstimation) times even when there is no fault in the code. Is this an acceptable tolerance or does the TDD manifesto requires iron fisted absoluteness?
Just for those who are interested it goes something like this
stuff = randomCrap.get()
stuff2 = randomCrap.get()
assert(stuff != stuff2)

Well, that really depends on the source of the failure. Do you know why it fails? If so, have you tried to isolate that fault so that it doesn't trip up the unit test?
Personally, I'd say that if it's genuinely 1 in a million and you know why it's happening, then add a comment to that effect and don't worry about it. It's not likely to bother people significantly in a continuous build, after all. Of course, if it's really one in ten, or something like that, that's a very different matter.
I would at least try to remove the source of incorrectness though. For one thing, it suggests your test isn't repeatable. Sometimes that's okay - there are some sources of randomness which are very difficult to extract out - but I would try not to do it. If you've tried and reached a block, then the pragmatic thing to do is document it and move on, IMO.

The question isn't "does TDD permit me to allow an occasionally failing test?" but, rather: "do my system's requirements permit occasional failures?" And that's a question only you can answer.
From the TDD perspective, it's clumsy to have tests which fail occasionally - you don't know, when they fail, whether it's because this is one of those rare permissible failures, or whether it's because your code is broken in an unacceptable way. So an occasionally failing test is significantly less useful to you than one which always passes.
If your requirement is to have a different behavior one time out of a million, then you should test to that requirement. Test the general case, not with a random number, but with a meaningful subset of valid inputs. Test the special case with the value that should bring about the special behavior.

What is your requirement for how often before this error should happen?
Are you comfortable, and your functional users, with what is happening?
You may want to make certain it is written down that your random generation will repeat, sometimes in duplicate requests.
To me that is the more troubling part, that it can happen as you showed above, and I would think that this is something to look into.

A unit test cannot fail if there is no fault in the code. There is a fault, be it related to timing, networking, etc... you just haven't figured it out yet. Once you do figure it out, you just learned something that you didn't know. +1 to you.
The real problem if it is not fixed is psychological. That method will tend to get blamed whenever there is something strange/random/unexplained that happens in the system. Better to fix the red herring now when you are thinking about it.
And FYI, randomness does not imply unique.

Do you absolutely have to use random data? Are you testing a system that is made to return two distinct random values?
Otherwise create a stub.
Unit tests should be 100% repeatable, this is hard to do when for example using threads of file system stuff, but that is what stubs are for.

Your test asserts that stuff1 is never equal to stuff2, but seeing the test fail occasionally means that this is not true.
You might be able to assert that stuff1 is occasionally equal to stuff2 by taking a million samples and asserting that the frequency of being equal is less than 10, but still this will fail occasionally - but perhaps much less often which might be acceptable.
You might be better off with:
stuff = 4
stuff2 = 5
assert(stuff != stuff2)
You can be pretty certain that the above code will perform the same as your original code once in a million million times - but you are certain that this code will pass every time!

One could use a profiler, but why not just halt the program? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
If something is making a single-thread program take, say, 10 times as long as it should, you could run a profiler on it. You could also just halt it with a "pause" button, and you'll see exactly what it's doing.
Even if it's only 10% slower than it should be, if you halt it more times, before long you'll see it repeatedly doing the unnecessary thing. Usually the problem is a function call somewhere in the middle of the stack that isn't really needed. This doesn't measure the problem, but it sure does find it.
Edit: The objections mostly assume that you only take 1 sample. If you're serious, take 10. Any line of code causing some percentage of wastage, like 40%, will appear on the stack on that fraction of samples, on average. Bottlenecks (in single-thread code) can't hide from it.
EDIT: To show what I mean, many objections are of the form "there aren't enough samples, so what you see could be entirely spurious" - vague ideas about chance. But if something of any recognizable description, not just being in a routine or the routine being active, is in effect for 30% of the time, then the probability of seeing it on any given sample is 30%.
Then suppose only 10 samples are taken. The number of times the problem will be seen in 10 samples follows a binomial distribution, and the probability of seeing it 0 times is .028. The probability of seeing it 1 time is .121. For 2 times, the probability is .233, and for 3 times it is .267, after which it falls off. Since the probability of seeing it less than two times is .028 + .121 = .139, that means the probability of seeing it two or more times is 1 - .139 = .861. The general rule is if you see something you could fix on two or more samples, it is worth fixing.
In this case, the chance of seeing it in 10 samples is 86%. If you're in the 14% who don't see it, just take more samples until you do. (If the number of samples is increased to 20, the chance of seeing it two or more times increases to more than 99%.) So it hasn't been precisely measured, but it has been precisely found, and it's important to understand that it could easily be something that a profiler could not actually find, such as something involving the state of the data, not the program counter.

On Java servers it's always been a neat trick to do 2-3 quick Ctrl-Breakss in a row and get 2-3 threaddumps of all running threads. Simply looking at where all the threads "are" may extremely quickly pinpoint where your performance problems are.
This technique can reveal more performance problems in 2 minutes than any other technique I know of.

Because sometimes it works, and sometimes it gives you completely wrong answers. A profiler has a far better record of finding the right answer, and it usually gets there faster.

Doing this manually can't really be called "quick" or "effective", but there are several profiling tools which do this automatically; also known as statistical profiling.

Callstack sampling is a very useful technique for profiling, especially when looking at a large, complicated codebase that could be spending its time in any number of places. It has the advantage of measuring the CPU's usage by wall-clock time, which is what matters for interactivity, and getting callstacks with each sample lets you see why a function is being called. I use it a lot, but I use automated tools for it, such as Luke Stackwalker and OProfile and various hardware-vendor-supplied things.
The reason I prefer automated tools over manual sampling for the work I do is statistical power. Grabbing ten samples by hand is fine when you've got one function taking up 40% of runtime, because on average you'll get four samples in it, and always at least one. But you need more samples when you have a flat profile, with hundreds of leaf functions, none taking more than 1.5% of the runtime.
Say you have a lake with many different kinds of fish. If 40% of the fish in the lake are salmon (and 60% "everything else"), then you only need to catch ten fish to know there's a lot of salmon in the lake. But if you have hundreds of different species of fish, and each species is individually no more than 1%, you'll need to catch a lot more than ten fish to be able to say "this lake is 0.8% salmon and 0.6% trout."
Similarly in the games I work on, there are several major systems each of which call dozens of functions in hundreds of different entities, and all of this happens 60 times a second. Some of those functions' time funnels into common operations (like malloc), but most of it doesn't, and in any case there's no single leaf that occupies more than 1000 μs per frame.
I can look at the trunk functions and see, "we're spending 10% of our time on collision", but that's not very helpful: I need to know exactly where in collision, so I know which functions to squeeze. Just "do less collision" only gets you so far, especially when it means throwing out features. I'd rather know "we're spending an average 600 μs/frame on cache misses in the narrow phase of the octree because the magic missile moves so fast and touches lots of cells," because then I can track down the exact fix: either a better tree, or slower missiles.
Manual sampling would be fine if there were a big 20% lump in, say, stricmp, but with our profiles that's not the case. Instead I have hundreds of functions that I need to get from, say, 0.6% of frame to 0.4% of frame. I need to shave 10 μs off every 50 μs function that is called 300 times per second. To get that kind of precision, I need more samples.
But at heart what Luke Stackwalker does is what you describe: every millisecond or so, it halts the program and records the callstack (including the precise instruction and line number of the IP). Some programs just need tens of thousands of samples to be usefully profiled.
(We've talked about this before, of course, but I figured this was a good place to summarize the debate.)

There's a difference between things that programmers actually do, and things that they recommend others do.
I know of lots of programmers (myself included) that actually use this method. It only really helps to find the most obvious of performance problems, but it's quick and dirty and it works.
But I wouldn't really tell other programmers to do it, because it would take me too long to explain all the caveats. It's far too easy to make an inaccurate conclusion based on this method, and there are many areas where it just doesn't work at all. (for example, that method doesn't reveal any code that is triggered by user input).
So just like using lie detectors in court, or the "goto" statement, we just don't recommend that you do it, even though they all have their uses.

I'm surprised by the religous tone on both sides.
Profiling is great, and certainly is a more refined and precise when you can do it. Sometimes you can't, and it's nice to have a trusty back-up. The pause technique is like the manual screwdriver you use when your power tool is too far away or the bateries have run-down.
Here is a short true story. An application (kind of a batch proccessing task) had been running fine in production for six months, suddenly the operators are calling developers because it is going "too slow". They aren't going to let us attach a sampling profiler in production! You have to work with the tools already installed. Without stopping the production process, just using Process Explorer, (which operators had already installed on the machine) we could see a snapshot of a thread's stack. You can glance at the top of the stack, dismiss it with the enter key and get another snapshot with another mouse click. You can easily get a sample every second or so.
It doesn't take long to see if the top of the stack is most often in the database client library DLL (waiting on the database), or in another system DLL (waiting for a system operation), or actually in some method of the application itself. In this case, if I remember right, we quickly noticed that 8 times out of 10 the application was in a system DLL file call reading or writing a network file. Sure enough recent "upgrades" had changed the performance characteristics of a file share. Without a quick and dirty and (system administrator sanctioned) approach to see what the application was doing in production, we would have spent far more time trying to measure the issue, than correcting the issue.
On the other hand, when performance requirements move beyond "good enough" to really pushing the envelope, a profiler becomes essential so that you can try to shave cycles from all of your closely-tied top-ten or twenty hot spots. Even if you are just trying to hold to a moderate performance requirement durring a project, when you can get the right tools lined-up to help you measure and test, and even get them integrated into your automated test process it can be fantasticly helpful.
But when the power is out (so to speak) and the batteries are dead, it's nice know how to use that manual screwdriver.
So the direct answer: Know what you can learn from halting the program, but don't be afraid of precision tools either. Most importantly know which jobs call for which tools.

Hitting the pause button during the execution of a program in "debug" mode might not provide the right data to perform any performance optimizations. To put it bluntly, it is a crude form of profiling.
If you must avoid using a profiler, a better bet is to use a logger, and then apply a slowdown factor to "guesstimate" where the real problem is. Profilers however, are better tools for guesstimating.
The reason why hitting the pause button in debug mode, may not give a real picture of application behavior is because debuggers introduce additional executable code that can slowdown certain parts of the application. One can refer to Mike Stall's blog post on possible reasons for application slowdown in a debugging environment. The post sheds light on certain reasons like too many breakpoints,creation of exception objects, unoptimized code etc. The part about unoptimized code is important - the "debug" mode will result in a lot of optimizations (usually code in-lining and re-ordering) being thrown out of the window, to enable the debug host (the process running your code) and the IDE to synchronize code execution. Therefore, hitting pause repeatedly in "debug" mode might be a bad idea.

If we take the question "Why isn't it better known?" then the answer is going to be subjective. Presumably the reason why it is not better known is because profiling provides a long term solution rather than a current problem solution. It isn't effective for multi-threaded applications and isn't effective for applications like games which spend a significant portion of its time rendering.
Furthermore, in single threaded applications if you have a method that you expect to consume the most run time, and you want to reduce the run-time of all other methods then it is going to be harder to determine which secondary methods to focus your efforts upon first.
Your process for profiling is an acceptable method that can and does work, but profiling provides you with more information and has the benefit of showing you more detailed performance improvements and regressions.
If you have well instrumented code then you can examine more than just the how long a particular method; you can see all the methods.
With profiling:
You can then rerun your scenario after each change to determine the degree of performance improvement/regression.
You can profile the code on different hardware configurations to determine if your production hardware is going to be sufficient.
You can profile the code under load and stress testing scenarios to determine how the volume of information impacts performance
You can make it easier for junior developers to visualise the impacts of their changes to your code because they can re-profile the code in six months time while you're off at the beach or the pub, or both. Beach-pub, ftw.
Profiling is given more weight because enterprise code should always have some degree of profiling because of the benefits it gives to the organisation of an extended period of time. The more important the code the more profiling and testing you do.
Your approach is valid and is another item is the toolbox of the developer. It just gets outweighed by profiling.

Sampling profilers are only useful when
You are monitoring a runtime with a small number of threads. Preferably one.
The call stack depth of each thread is relatively small (to reduce the incredible overhead in collecting a sample).
You are only concerned about wall clock time and not other meters or resource bottlenecks.
You have not instrumented the code for management and monitoring purposes (hence the stack dump requests)
You mistakenly believe removing a stack frame is an effective performance improvement strategy whether the inherent costs (excluding callees) are practically zero or not
You can't be bothered to learn how to apply software performance engineering day-to-day in your job
....

Stack trace snapshots only allow you to see stroboscopic x-rays of your application. You may require more accumulated knowledge which a profiler may give you.
The trick is knowing your tools well and choose the best for the job at hand.

These must be some trivial examples that you are working with to get useful results with your method. I can't think of a project where profiling was useful (by whatever method) that would have gotten decent results with your "quick and effective" method. The time it takes to start and stop some applications already puts your assertion of "quick" in question.
Again, with non-trivial programs the method you advocate is useless.
EDIT:
Regarding "why isn't it better known"?
In my experience code reviews avoid poor quality code and algorithms, and profiling would find these as well. If you wish to continue with your method that is great - but I think for most of the professional community this is so far down on the list of things to try that it will never get positive reinforcement as a good use of time.
It appears to be quite inaccurate with small sample sets and to get large sample sets would take lots of time that would have been better spent with other useful activities.

What if the program is in production and being used at the same time by paying clients or colleagues. A profiler allows you to observe without interferring (as much, because of course it will have a little hit too as per the Heisenberg principle).
Profiling can also give you much richer and more detailed accurate reports. This will be quicker in the long run.

EDIT 2008/11/25: OK, Vineet's response has finally made me see what the issue is here. Better late than never.
Somehow the idea got loose in the land that performance problems are found by measuring performance. That is confusing means with ends. Somehow I avoided this by single-stepping entire programs long ago. I did not berate myself for slowing it down to human speed. I was trying to see if it was doing wrong or unnecessary things. That's how to make software fast - find and remove unnecessary operations.
Nobody has the patience for single-stepping these days, but the next best thing is to pick a number of cycles at random and ask what their reasons are. (That's what the call stack can often tell you.) If a good percentage of them don't have good reasons, you can do something about it.
It's harder these days, what with threading and asynchrony, but that's how I tune software - by finding unnecessary cycles. Not by seeing how fast it is - I do that at the end.
Here's why sampling the call stack cannot give a wrong answer, and why not many samples are needed.
During the interval of interest, when the program is taking more time than you would like, the call stack exists continuously, even when you're not sampling it.
If an instruction I is on the call stack for fraction P(I) of that time, removing it from the program, if you could, would save exactly that much. If this isn't obvious, give it a bit of thought.
If the instruction shows up on M = 2 or more samples, out of N, its P(I) is approximately M/N, and is definitely significant.
The only way you can fail to see the instruction is to magically time all your samples for when the instruction is not on the call stack. The simple fact that it is present for a fraction of the time is what exposes it to your probes.
So the process of performance tuning is a simple matter of picking off instructions (mostly function call instructions) that raise their heads by turning up on multiple samples of the call stack. Those are the tall trees in the forest.
Notice that we don't have to care about the call graph, or how long functions take, or how many times they are called, or recursion.
I'm against obfuscation, not against profilers. They give you lots of statistics, but most don't give P(I), and most users don't realize that that's what matters.
You can talk about forests and trees, but for any performance problem that you can fix by modifying code, you need to modify instructions, specifically instructions with high P(I). So you need to know where those are, preferably without playing Sherlock Holmes. Stack sampling tells you exactly where they are.
This technique is harder to employ in multi-thread, event-driven, or systems in production. That's where profilers, if they would report P(I), could really help.

Stepping through code is great for seeing the nitty-gritty details and troubleshooting algorithms. It's like looking at a tree really up close and following each vein of bark and branch individually.
Profiling lets you see the big picture, and quickly identify trouble points -- like taking a step backwards and looking at the whole forest and noticing the tallest trees. By sorting your function calls by length of execution time, you can quickly identify the areas that are the trouble points.

I used this method for Commodore 64 BASIC many years ago. It is surprising how well it works.

I've typically used it on real-time programs that were overrunning their timeslice. You can't manually stop and restart code that has to run 60 times every second.
I've also used it to track down the bottleneck in a compiler I had written. You wouldn't want to try to break such a program manually, because you really have no way of knowing if you are breaking at the spot where the bottlenck is, or just at the spot after the bottleneck when the OS is allowed back in to stop it. Also, what if the major bottleneck is something you can't do anything about, but you'd like to get rid of all the other largeish bottlenecks in the system? How to you prioritize which bottlenecks to attack first, when you don't have good data on where they all are, and what their relative impact each is?

The larger your program gets, the more useful a profiler will be. If you need to optimize a program which contains thousands of conditional branches, a profiler can be indispensible. Feed in your largest sample of test data, and when it's done import the profiling data into Excel. Then you check your assumptions about likely hot spots against the actual data. There are always surprises.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio