How do you reason about fluctuations in benchmarking data?

Suppose you're trying to optimize a function and using some benchmarking framework (like Google Benchmark) for measurement. You run the benchmarks on the original function 3 times and see average wall clock time/CPU times of 100 ms, 110 ms, 90 ms. Then you run the benchmarks on the "optimized" function 3 times and see 80 ms, 95 ms, 105 ms. (I made these numbers up). Do you conclude that your optimizations were successful?
Another problem I often run into is that I'll go do something else and run the benchmarks later in the day and get numbers that are further away than the delta between the original and optimized earlier in the day (say, 80 ms, 85 ms, 75 ms for the original function).
I know there are statistical methods to determine whether the improvement is "significant". Do software engineers actually use these formal calculations in practice?
I'm looking for some kind of process to follow when optimizing code.

Rule of Thumb
1. Minimum(!) of each series => 90ms vs 80ms
2. Estimate noise => ~ 10ms
3. Pessimism => It probably didn't get any slower.
Not happy yet?
1. Take more measurements. (~13 runs each)
2. Interleave the runs. (Don't measure 13x A followed by 13x B.)
   Ideally you always randomize whether you run A or B next (scientific: randomized trial), but it's probably overkill. Any source of error should affect each variant with the same probability. (Like the CPU building up heat over time, or a background task starting after run 11.)
3. Go back to step 1.
Still not happy? Time to admit that you've been nerd-sniped. The difference, if it exists, is so small that you can't even measure it. Pick the more readable variant and move on. (Or alternatively, lock your CPU frequency, isolate a core just for the test, quiet down your system...)
Explanation
Minimum: Many people (and tools, even) take the average, but the minimum is statistically more stable. There is a lower limit on how fast your benchmark can run on given hardware, but no upper limit on how much it can get slowed down by other programs. Also, taking the minimum will automatically drop the initial "warm-up" run.
Noise: Apply common sense, just glance over the numbers. If you look at the standard deviation, make that a very skeptical look! A single outlier will influence it so much that it becomes nearly useless. (It's not a normal distribution, usually.)
Pessimism: You were really clever to find this optimization, you really want the optimized version to be faster! If it looks better just by chance, you will believe it. (You knew it!) So if you care about being correct, you must counter this tendency.
Disclaimer
Those are just basic guidelines. Worst-case latency is relevant in some applications (smooth animations or motor control), but it will be harder to measure. It's easy (and fun!) to optimize something that doesn't matter in practice. Instead of wondering if your 1% gain is statistically significant, try something else. Measure the full program including OS overhead. Comment out code, or run work twice, only to check if optimizing it might be worth it.
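The rule of thumb above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `original` and `optimized` are hypothetical stand-ins for the two variants, `interleaved_minimums` is a made-up helper name, and the "spread" it prints is just a crude way to glance at the noise.

```python
import random
import time


def interleaved_minimums(variants, runs_each=13, seed=0):
    """Time each callable in `variants` `runs_each` times in shuffled order.

    Returns {name: (minimum_seconds, all_timings)}.
    """
    schedule = [name for name in variants for _ in range(runs_each)]
    random.Random(seed).shuffle(schedule)  # randomized order: A and B interleaved

    timings = {name: [] for name in variants}
    for name in schedule:
        start = time.perf_counter()
        variants[name]()
        timings[name].append(time.perf_counter() - start)

    return {name: (min(values), values) for name, values in timings.items()}


def original():
    # Hypothetical stand-in for the function before optimization.
    return sum(i * i for i in range(20_000))


def optimized():
    # Hypothetical stand-in for the "optimized" function.
    return sum(i * i for i in range(15_000))


if __name__ == "__main__":
    results = interleaved_minimums({"original": original, "optimized": optimized})
    for name, (best, values) in results.items():
        spread = max(values) - best  # crude noise estimate, for glancing only
        print(f"{name}: min={best * 1e3:.3f} ms, spread={spread * 1e3:.3f} ms")
```

If the two minimums differ by less than the spread, that is the "probably didn't get any slower" case: take more runs or move on.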

Do you conclude that your optimizations were successful?
No. Three runs are not enough, especially given the huge variation and the fact that the timings of the two groups overlap once merged and sorted.
For small timings like this, the first run should be removed and at least dozens of runs should be performed. I would personally use at least hundreds of runs.
Do software engineers actually use these formal calculations in practice?
Only very few developers do advanced statistical analysis. A formal analysis is often unnecessary when the gap before/after the target optimization is huge and the variation within groups is small.
For example, if your program is twice as fast as before with a min-max variation of <5%, then you can quite safely say that the optimization is successful. That being said, this is sometimes not the case due to unexpected external factors (though it is very rare when the gap is so big).
If the result is not obvious, then you need to do some basic statistics. You need to compute the standard deviation, the mean and the median time, remove the first run, interleave runs and use many runs (at least dozens). The distribution of the timings almost always follows a normal distribution due to the central limit theorem. It is sometimes a mixture distribution due to threshold effects (e.g. caching). You can plot the values to see that easily if you see some outliers in the timings.
If there are threshold effects, then you need to apply an advanced statistical analysis, but this is complex to do and such effects are generally not expected behaviour. They are generally a sign that the benchmark is biased, that there is a bug, or that there is a complex effect you have to consider during the analysis of the results anyway. Thus, I strongly advise you to fix/mitigate the problem before analysing the results in that case.
Assuming the timings follow a normal distribution, you can just check whether the median is close to the mean and whether the standard deviation is small compared to the gap between the means.
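A minimal sketch of that naive check, using only Python's standard `statistics` module. The helper names and the `factor=2.0` threshold are assumptions made for illustration (the answer does not specify how small "small" is), and the data are the made-up numbers from the question, which are far too few runs for a real decision.

```python
import statistics


def summarize(timings_ms):
    """Return mean, median and sample standard deviation of a list of timings."""
    return {
        "mean": statistics.mean(timings_ms),
        "median": statistics.median(timings_ms),
        "stdev": statistics.stdev(timings_ms),
    }


def gap_exceeds_noise(before_ms, after_ms, factor=2.0):
    """Naive check: is the gap between the means larger than `factor` times
    the larger standard deviation? `factor` is an arbitrary choice."""
    before, after = summarize(before_ms), summarize(after_ms)
    gap = abs(before["mean"] - after["mean"])
    return gap > factor * max(before["stdev"], after["stdev"])


# The made-up numbers from the question.
original = [100, 110, 90]
optimized = [80, 95, 105]
print(summarize(original))
print(summarize(optimized))
print(gap_exceeds_noise(original, optimized))  # False: the gap is within the noise
```

Here the gap between the means (about 6.7 ms) is smaller than either standard deviation (10 ms and about 12.6 ms), so nothing can be concluded.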
A more formal way to do that is to compute the Student t-test and its associated p-value and check the significance of the p-value (e.g. <5%). If there are more groups, an ANOVA can be used. If you are unsure about the distribution, you can apply non-parametric statistical tests like the Wilcoxon and Kruskal-Wallis tests (note that the statistical power of these tests is not the same). In practice, doing such a formal analysis is time-consuming and it is generally not so useful compared to a naive basic check (using the mean and standard deviation) unless your modification impacts a lot of users or you plan to write research papers.
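To give a feel for what such a test does without pulling in a statistics library, here is a sketch of a permutation test on the difference of means. Note that this is not one of the tests named above: it is a different non-parametric test, chosen here only because it fits in a few lines of standard-library Python. The function name, the number of resamples and the seed are assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Estimates how often a random relabelling of the pooled timings gives a
    difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(resamples):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return hits / resamples


# The made-up numbers from the question.
original = [100, 110, 90]
optimized = [80, 95, 105]
p = permutation_p_value(original, optimized)
print(f"p ~ {p:.2f}")  # far above 0.05: no evidence of a real difference
```

With dozens of interleaved runs per variant, the same function gives a usable answer; with three runs each it mostly confirms that three runs are not enough.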
Keep in mind that using a good statistical analysis does not prevent biased benchmarks. You need to minimize the external factors that can cause biased results. One frequent bias is frequency scaling: the first benchmark can be faster than the second because of turbo-boost or it can be slower because the processor can take some time to reach a high frequency. Caches also play a huge role in benchmark biases. There are many other factors that can cause biases in practice like the compiler/runtime versions, environment variables, configuration files, OS/driver updates, memory alignment, OS paging (especially on NUMA systems), the hardware (e.g. thermal throttling), software bugs (it is not rare to find bugs by analysing strange performance behaviours), etc.
As a result, it is critical to make benchmarks as reproducible as possible by fixing versions and reporting the environment parameters (and possibly by running the benchmarks in a sandbox, if you are paranoid and if it does not affect the timings too much). Software like Nix/Spack helps with packaging, and containers like LXD or Docker could help provide a more reproducible environment.
Many big software teams use automated benchmarking to check for performance regressions. Tools can run the benchmarks properly and do the statistical analysis for you on a regular basis. A good example is the NumPy team, which uses a package called Airspeed Velocity (see the results). The PyPy team also designed their own benchmarking tool. The Linux kernel also has benchmarking suites to check for regressions (e.g. PTS), and many companies focusing on performance have such automated benchmarking tools (often home-made). There are many existing tools for that.
For more information about this topic, please have a look at the great Performance Matters presentation by Emery Berger.

Related

How to accurately measure performance of sorting algorithms

I have a bunch of sorting algorithms in C I wish to benchmark. I am concerned regarding good methodology for doing so. Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method. How do I minimize the effect of said variables on the benchmark's results?
To give you a few examples, I've considered multiple implementations on two different languages to adjust for the first two variables. Moreover I could compile the code with different compilers on fairly mundane (and specified) arguments. Now I'm going to be running the test on my machine, which features turbo boost and whatnot and often boosts a core running stuff to the moon. Of course I will be disabling that and doing multiple runs and likely taking their mean completion time to adjust for that as well. Regarding the input data, I will be taking different array sizes, from very small to relatively large. I do not know what the increments should ideally be like, and what the range of the elements should be as well. Also I presume duplicate elements should be allowed.
I know that theoretical analysis of algorithms accounts for all of these methods, but it is crucial that I complement my study with actual benchmarks. How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected? I'm comfortable with the technologies I'm working with, less so with strict methodology for studying a topic. Thank you.

You can't benchmark abstract algorithms, only specific implementations of them, compiled with specific compilers running on specific machines.
Choose a couple different relevant compilers and machines (e.g. a Haswell, Ice Lake, and/or Zen2, and an Apple M1 if you can get your hands on one, and/or an AArch64 cloud server) and measure your real implementations. If you care about in-order CPUs like ARM Cortex-A53, measure on one of those, too. (Simulation with GEM5 or similar performance simulators might be worth trying. Also maybe relevant are low-power implementations like Intel Silvermont whose out-of-order window is much smaller, but also have a shorter pipeline so smaller branch mispredict penalty.)
If some algorithm allows a useful micro-optimization in the source, or that a compiler finds, that's a real advantage of that algorithm.
Compile with options you'd use in practice for the use-cases you care about, like clang -O3 -march=native, or just -O2.
Benchmarking on cloud servers makes it hard / impossible to get an idle system, unless you pay a lot for a huge instance, but modern AArch64 servers are relevant and may have different ratios of memory bandwidth vs. branch mispredict costs vs. cache sizes and bandwidths.
(You might well find that the same code is the fastest sorting implementation on all or most of the systems you test on.)
Re: sizes: yes, a variety of sizes would be good.
You'll normally want to test with random data, perhaps always generated from the same PRNG seed so you're sorting the same data every time.
You may also want to test some unusual cases like already-sorted or almost-sorted, because algorithms that are extra fast for those cases are useful.
If you care about sorting things other than integers, you might want to test with structs of different sizes, with an int key as a member. Or a comparison function that does some amount of work, if you want to explore how sorts do with a compare function that isn't as simple as just one compare machine instruction.
As always with microbenchmarking, there are many pitfalls around warm-up of arrays (page faults) and CPU frequency, and more; see "Idiomatic way of performance evaluation?".
taking their mean completion time
You might want to discard high outliers, or take the median which will have that effect for you. Usually that means "something happened" during that run to disturb it. If you're running the same code on the same data, often you can expect the same performance. (Randomization of code / stack addresses with page granularity usually doesn't affect branches aliasing each other in predictors or not, or data-cache conflict misses, but tiny changes in one part of the code can change performance of other code via effects like that if you're re-compiling.)
If you're trying to see how it would run when it has the machine to itself, you don't want to consider runs where something else interfered. If you're trying to benchmark under "real world" cloud server conditions, or with other threads doing other work in a real program, that's different and you'd need to come up with realistic other loads that use some amount of shared resources like L3 footprint and memory bandwidth.
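The input-data and median advice above could look roughly like this. It is a sketch, not a benchmark harness: Python's built-in `sorted` stands in for the C implementations under test, and `make_inputs`, `median_time`, the seed value and the sizes are all assumptions chosen for illustration.

```python
import random
import statistics
import time


def make_inputs(size, seed=12345):
    """Build random, sorted and almost-sorted inputs from one fixed PRNG seed."""
    rng = random.Random(seed)
    data = [rng.randrange(size) for _ in range(size)]  # duplicates allowed
    almost_sorted = sorted(data)
    if size >= 2:
        # Swap a few random pairs so the array is only almost sorted.
        for _ in range(max(1, size // 100)):
            i, j = rng.randrange(size), rng.randrange(size)
            almost_sorted[i], almost_sorted[j] = almost_sorted[j], almost_sorted[i]
    return {"random": data, "sorted": sorted(data), "almost_sorted": almost_sorted}


def median_time(sort_fn, data, runs=7):
    """Median wall-clock time of `sort_fn` on fresh copies of `data`."""
    timings = []
    for _ in range(runs):
        copy = list(data)  # copying is kept outside the timed region
        start = time.perf_counter()
        sort_fn(copy)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    for size in (10, 1_000, 100_000):
        for case, data in make_inputs(size).items():
            t = median_time(sorted, data)
            print(f"n={size:>7} {case:<13} median={t * 1e6:10.1f} us")
```

Because the seed is fixed, every run sorts exactly the same data, and the median quietly discards the occasional run where "something happened".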

Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method.
Let's look at this from a very different perspective - how to present information to humans.
With 2 variables you get a nice 2-dimensional grid of results, maybe like this:
         A = 1        A = 2
B = 1    4 seconds    2 seconds
B = 2    6 seconds    3 seconds
This is easy to display and easy for humans to understand and draw conclusions from (e.g. from my silly example table it's trivial to make 2 very different observations - "A=2 is twice as fast as A=1 (regardless of B)" and "B=1 is faster than B=2 (regardless of A)").
With 3 variables you get a 3-dimensional grid of results, and with N variables you get an N-dimensional grid of results. Humans struggle with "3-dimensional data on 2-dimensional screen" and more dimensions become a disaster. You can mitigate this a little by "peeling off" a dimension (e.g. instead of trying to present a 3D grid of results you could show multiple 2D grids); but that doesn't help humans much.
Your primary goal is to reduce the number of variables.
To reduce the number of variables:
a) Determine how important each variable is for what you intend to observe (e.g. "which algorithm" will be extremely important and "which language" will be less important).
b) Merge variables based on importance and "logical grouping". For example, you might get three "lower importance" variables (language, compiler, compiler options) and merge them into a single "language+compiler+options" variable.
Note that it's very easy to overlook a variable. For example, you might benchmark "algorithm 1" on one computer and benchmark "algorithm 2" on an almost identical computer, but overlook the fact that (even though both benchmarks used identical languages, compilers, compiler options and CPUs) one computer has faster RAM chips, and overlook "RAM speed" as a possible variable.
Your secondary goal is to reduce the number of values each variable can have.
You don't want massive table/s with 12345678 million rows; and you don't want to spend the rest of your life benchmarking to generate such a large table.
To reduce the number of values each variable can have:
a) Figure out which values matter most
b) Select the right number of values in order of importance (and ignore/skip all other values)
For example, if you merged three "lower importance" variables (language, compiler, compiler options) into a single variable; then you might decide that 2 possibilities ("C compiled by GCC with -O3" and "C++ compiled by MSVC with -Ox") are important enough to worry about (for what you're intending to observe) and all of the other possibilities get ignored.
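To see how quickly the grid grows, and how much merging variables and pruning values helps, here is a small sketch. All variables and values in it are hypothetical; only the two merged "language+compiler+options" values are taken from the example above.

```python
from itertools import product

# Hypothetical variables and values, for illustration only.
full = {
    "algorithm": ["quicksort", "mergesort", "heapsort"],
    "language": ["C", "C++"],
    "compiler": ["GCC", "Clang", "MSVC"],
    "options": ["-O2", "-O3"],
    "array_size": [10, 1_000, 100_000, 10_000_000],
}

reduced = {
    "algorithm": full["algorithm"],
    # language + compiler + options merged into one variable with two chosen values
    "toolchain": ["C / GCC / -O3", "C++ / MSVC / -Ox"],
    "array_size": [1_000, 100_000],
}


def configurations(variables):
    """Every combination of values, i.e. every cell of the N-dimensional grid."""
    names = list(variables)
    return [dict(zip(names, combo)) for combo in product(*variables.values())]


print(len(configurations(full)))     # 3 * 2 * 3 * 2 * 4 = 144
print(len(configurations(reduced)))  # 3 * 2 * 2 = 12
```

The reduced grid has three dimensions instead of five, so it can be shown as a couple of 2D tables.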
How do I minimize the effect of said variables on the benchmark's results?
How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected?
By identifying the variables (as part of the primary goal) and explicitly deciding which values the variables may have (as part of the secondary goal).
You've already been doing this. What I've described is a formal method of doing what people would unconsciously/instinctively do anyway. For one example, you have identified that "turbo boost" is a variable, and you've decided that "turbo boost disabled" is the only value for that variable you care about (but do note that this may have consequences - e.g. consider "single-threaded merge sort without the turbo boost it'd likely get in practice" vs. "parallel merge sort that isn't as influenced by turning turbo boost off").
My hope is that by describing the formal method you gain confidence in the unconscious/instinctive decisions you're already making, and realize that you were very much on the right path before you asked the question.

Preventing performance regressions in R

What is a good workflow for detecting performance regressions in R packages? Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.
What is a good workflow in general? What other languages provide good tools? Is it something that can be built on top of unit testing, or is it usually done separately?

This is a very challenging question, and one that I'm frequently dealing with, as I swap out different code in a package to speed things up. Sometimes a performance regression comes along with a change in algorithms or implementation, but it may also arise due to changes in the data structures used.
What is a good workflow for detecting performance regressions in R packages?
In my case, I tend to have very specific use cases that I'm trying to speed up, with different fixed data sets. As Spacedman wrote, it's important to have a fixed computing system, but that's almost infeasible: sometimes a shared computer may have other processes that slow things down 10-20%, even when it looks quite idle.
My steps:
Standardize the platform (e.g. one or a few machines, a particular virtual machine, or a virtual machine + specific infrastructure, a la Amazon's EC2 instance types).
Standardize the data set that will be used for speed testing.
Create scripts and fixed intermediate data output (i.e. saved to .rdat files) that involve very minimal data transformations. My focus is on some kind of modeling, rather than data manipulation or transformation. This means that I want to give exactly the same block of data to the modeling functions. If, however, data transformation is the goal, then be sure that the pre-transformed/manipulated data is as close as possible to standard across tests of different versions of the package. (See this question for examples of memoization, caching, etc., that can be used to standardize or speed up non-focal computations. It references several packages by the OP.)
Repeat tests multiple times.
Scale the results relative to fixed benchmarks, e.g. the time to perform a linear regression, to sort a matrix, etc. This can allow for "local" or transient variations in infrastructure, such as may be due to I/O, the memory system, dependent packages, etc.
Examine the profiling output as vigorously as possible (see this question for some insights, also referencing tools from the OP).
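The "scale the results relative to fixed benchmarks" step above can be sketched as follows. This is a Python illustration rather than R, and everything in it is an assumption: the helper names, the choice of best-of-N timing, and the reference workload (sorting a reversed range instead of the linear regression or matrix sort the answer mentions).

```python
import time


def timed(fn, repeats=5):
    """Best-of-`repeats` wall-clock time of calling `fn()`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def reference_workload():
    # Hypothetical fixed benchmark that never changes between package versions.
    sorted(range(200_000, 0, -1))


def relative_cost(fn, repeats=5):
    """Time of `fn` expressed in units of the reference workload's time."""
    return timed(fn, repeats) / timed(reference_workload, repeats)


def code_under_test():
    # Hypothetical stand-in for the package function being tracked.
    sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(f"cost = {relative_cost(code_under_test):.2f} reference units")
```

Storing the ratio rather than the raw seconds absorbs part of the machine-to-machine and day-to-day variation, as long as the reference workload stresses similar resources.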
Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.
Unfortunately, I don't have an answer for this.
What is a good workflow in general?
For me, it's quite similar to general dynamic code testing: is the output (execution time in this case) reproducible, optimal, and transparent? Transparency comes from understanding what affects the overall time. This is where Mike Dunlavey's suggestions are important, but I prefer to go further, with a line profiler.
Regarding a line profiler, see my previous question, which refers to options in Python and Matlab for other examples. It's most important to examine clock time, but also very important to track memory allocation, number of times the line is executed, and call stack depth.
What other languages provide good tools?
Almost all other languages have better tools. :) Interpreted languages like Python and Matlab have good and possibly familiar examples of tools that can be adapted for this purpose. Although dynamic analysis is very important, static analysis can help identify where there may be some serious problems. Matlab has a great static analyzer that can report when objects (e.g. vectors, matrices) are growing inside of loops, for instance. It is terrible to find this only via dynamic analysis - you've already wasted execution time to discover something like this, and it's not always discernible if your execution context is pretty simple (e.g. just a few iterations, or small objects).
As far as language-agnostic methods, you can look at:
Valgrind & cachegrind
Monitoring of disk I/O, dirty buffers, etc.
Monitoring of RAM (Cachegrind is helpful, but you could just monitor RAM allocation, and lots of details about RAM usage)
Usage of multiple cores
Is it something that can be built on top of unit testing, or is it usually done separately?
This is hard to answer. For static analysis, it can occur before unit testing. For dynamic analysis, one may want to add more tests. Think of it as sequential design (i.e. from an experimental design framework): if the execution costs appear to be, within some statistical allowances for variation, the same, then no further tests are needed. If, however, method B seems to have an average execution cost greater than method A, then one should perform more intensive tests.
Update 1: If I may be so bold, there's another question that I'd recommend including, which is: "What are some gotchas in comparing the execution time of two versions of a package?" This is analogous to assuming that two programs that implement the same algorithm should have the same intermediate objects. That's not exactly true (see this question - not that I'm promoting my own questions, here - it's just hard work to make things better and faster...leading to multiple SO questions on this topic :)). In a similar way, two executions of the same code can differ in time consumed due to factors other than the implementation.
So, some gotchas that can occur, either within the same language or across languages, within the same execution instance or across "identical" instances, which can affect runtime:
Garbage collection - different implementations or languages can hit garbage collection under different circumstances. This can make two executions appear different, though it can be very dependent on context, parameters, data sets, etc. The GC-obsessive execution will look slower.
Caching at the level of the disk, CPU (e.g. L1, L2, L3 caches), or other levels (e.g. memoization). Often, the first execution will pay a penalty.
Dynamic voltage scaling - This one sucks. When there is a problem, this may be one of the hardest beasties to find, since it can go away quickly. It looks like caching, but it isn't.
Any job priority manager that you don't know about.
One method uses multiple cores or does some clever stuff about how work is parceled among cores or CPUs. For instance, getting a process locked to a core can be useful in some scenarios. One execution of an R package may be luckier in this regard, another package may be very clever...
Unused variables, excessive data transfer, dirty caches, unflushed buffers, ... the list goes on.
The key result is: Ideally, how should we test for differences in expected values, subject to the randomness created due to order effects? Well, pretty simple: go back to experimental design. :)
When the empirical differences in execution times are different from the "expected" differences, it's great to have enabled additional system and execution monitoring so that we don't have to re-run the experiments until we're blue in the face.

The only way to do anything here is to make some assumptions. So let us assume an unchanged machine, or else require a 'recalibration'.
Then use a unit-test-like framework, and treat 'has to be done in X units of time' as just yet another testing criterion to be fulfilled. In other words, do something like
stopifnot( system.time( someExpression )[["elapsed"]] < savedValue + fudge )
so we would have to associate prior timings with given expressions. Equality-testing comparisons from any one of the three existing unit testing packages could be used as well.
Nothing that Hadley couldn't handle so I think we can almost expect a new package timr after the next long academic break :). Of course, this has to either be optional, because on an "unknown" machine (think: CRAN testing the package) we have no reference point, or else the fudge factor has to "go to 11" to automatically accept on a new machine.

A recent change announced on the R-devel feed could give a crude measure for this.
CHANGES IN R-devel UTILITIES
‘R CMD check’ can optionally report timings on various parts of the check: this is controlled by environment variables documented in ‘Writing R Extensions’.
See http://developer.r-project.org/blosxom.cgi/R-devel/2011/12/13#n2011-12-13
The overall time spent running the tests could be checked and compared to previous values. Of course, adding new tests will increase the time, but dramatic performance regressions could still be seen, albeit manually.
This is not as fine grained as timing support within individual test suites, but it also does not depend on any one specific test suite.

When is performance gain significant enough to implement that optimization?

Following the textbook, I do measure performance whenever I try optimizing my code. Sometimes, however, the performance gain is rather small and I can't decide whether I should implement that optimization.
For example, when a fix shortens an average response time of 100ms to 90ms under some conditions, should I implement that fix? What if it shortens 200ms to 190ms? How many conditions should I try before I can conclude that it will be beneficial overall?
I guess it's not possible to give a straightforward answer to this, as it depends on too many things, but is there a good rule of thumb that I should follow? Are there any guidelines/best practices?
EDIT: Thanks for the great answers! I guess the moral of the story is, there is no easy way to tell whether you should, but there ARE guidelines that can aid that process: things you should consider, things you shouldn't do, etc. This particular time I ended up implementing the fix, even though it turned a few lines of code into 20-30 lines of code, because our app is very performance critical and it was a consistent 10% gain in various realistic cases.

I think the rule of thumb (at least for me) is two-fold:
1. "It matters if it matters"--in the business world, this generally means that it matters if the clients care. That is, if the end users will "notice" the difference between 100ms and 90ms (I'm not being facetious here), then it matters.
2. If "it matters," then you will want to test your code thoroughly against a realistic variety of use cases that are likely to arise or at least may arise. If an optimization speeds up code in 50% of cases, but actually runs slower than what you previously had the other 50% of the time, obviously, it may not be worth implementing.
Regarding point 1 above: by suggesting an end user of your software might "notice" a 10ms difference, I don't mean to suggest that they will actually visibly see a difference. But if your app runs on a server with millions of connections and every little speed increase takes a substantial load off the server, that might matter to the client running the server. Or if your app performs extremely time-critical work, this is another case where the result of a 10ms speedup might be noticeable, even if the speedup itself isn't.

The only sensible approach to your question is something along the lines of "when the benefit is large enough to warrant the time you invest in exploring, implementing and testing the optimization."
The "benefit is large enough" is extremely subjective. Can you or your employer sell more units of software if you make this change? Will your user base notice? Will it give you personal gratification to have the fastest-possible code? Which of those or similar questions apply is something only you can know.
By and large, most of the software I have written (in a 20+ year career) has been "fast enough" out of the box, and the code I cared to optimize presented itself as an obvious bottleneck to the end users: Queries taking a long time, scrolling too slow, that sort of thing.

Donald Knuth made the following two statements on optimization:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" [2]
and
"In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering" [5]
src: http://en.wikipedia.org/wiki/Program_optimization

Is the optimization obfuscating your code too much?
Do you really need an optimization? If your app just runs fine, then readability of the code is probably more important.
Did you work on the general design and algorithms of your application before trying small hacky optimizations?

You should focus optimisation efforts on the parts of code that account for the most runtime. If a particular piece of code takes up 80% of the total runtime, then optimising it to reduce the time it takes by 5% will have as much impact as reducing the time of the rest of the code by 20%.
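The arithmetic behind that claim, as a tiny sketch (the function name is made up for illustration):

```python
def total_saving(share_of_runtime, local_reduction):
    """Fraction of total runtime saved when a part that takes
    `share_of_runtime` of the total has its own time cut by `local_reduction`."""
    return share_of_runtime * local_reduction


hot = total_saving(0.80, 0.05)   # the 80% part reduced by 5%
rest = total_saving(0.20, 0.20)  # the remaining 20% reduced by 20%
print(f"{hot:.2%} vs {rest:.2%}")  # prints "4.00% vs 4.00%"
```

Both changes save 4% of the total runtime, but the first needs only a 5% local improvement to get there.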
In general, optimisations make code less readable (not always, but often). Therefore you should avoid optimising until you are sure that there is a problem.

If it speeds up your program at all, why not implement it? You have already done the work by creating the new implementation, so you are not doing extra work by applying the new implementation.
Unless the code is THAT much harder to understand.
Also, 100 ms to 90 ms is a 10% gain in performance. A 10% gain should not be taken lightly.
The real question is, if it only took 100 ms to run in the first place, what was the point in trying to optimize it?

As long as it's fast enough, then you don't need to optimise any more. But then, you wouldn't even bother profiling if that was the case...

If the performance gain is small, consider the other factors: maintainability, risk of making the change, understandability, etc. If it reduces the ability to maintain or understand the code, it probably isn't worth doing. If it improves those attributes, then it's more reason to implement the change.

In most cases, your time is more valuable than the computer's. If you think it'll take you half an hour longer to work out what the code is doing later (say if there's a bug in it), and it's only saved you a few seconds, ever, you're at a net loss.

It depends very much on the usage scenario. I'll assume here that the code in question has been profiled and thus it is known to be the bottleneck--i.e. not just "this could be faster", but "the program would give results/finish running faster if this were faster". In situations where this is not the case--e.g. if you spend 99% of your time waiting for more data to come over an ethernet connection--then you should care about correctness but not optimize for speed.
If you are writing a piece of user interface code, what you care about is perceived speed. Generally anything under ~100 ms is perceived as "instant"--no point speeding it up.
If you are writing a piece of code for a giant server farm, then if the cost of your salary to make the code fast is less than the cost of the extra electricity for the server farm, it's worthwhile. (But be sure to prioritize your time.)
If you are writing a piece of code that is used rarely or when unattended, as long as it completes in a semi-sane duration, don't worry about it. Install scripts tend to be of this sort (unless you start running into many minutes, at which point users might start abandoning the install because it's taking too long).
If you are writing code to automate a task for someone else, then if (your time spent coding + their time spent using the optimized code) is less than (their time spent using the slow code), it's worthwhile. If you're doing this in a commercial setting, weight this by your respective salaries.
If you are writing library code that will be used by many thousands of people, always make it faster if you have time to.
If you are under time pressure to simply have something working e.g. as a demo, don't optimize (except through sensible choice of algorithms from libraries) unless the result would be so slow that it isn't even "working".
One of the biggest annoyances for me personally is finding software which perhaps initially fell into one category and then later fell into another, but for which nobody went back to do the needed optimizations. Until recently JavaScript performance was a great example of this. Moral of the story: don't just decide once; revisit the issue as the situation demands.
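The break-even rule for automating someone else's task can be written down as a small check. This is only a sketch: the function name, its parameters, and the idea of expressing the salary weighting as hourly rates are assumptions made for illustration.

```python
def optimization_pays_off(dev_hours, user_hours_saved, dev_rate=1.0, user_rate=1.0):
    """Return True when the cost of optimizing is below the value of the time saved.

    All four parameters are hypothetical inputs. The rates weight each
    party's time (for example by salary); leave them at 1.0 to compare
    raw hours.
    """
    cost = dev_hours * dev_rate
    benefit = user_hours_saved * user_rate
    return cost < benefit
```

For example, optimization_pays_off(0.5, 0.001) is False: half an hour of your time spent to save a few seconds of theirs is a net loss.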

One could use a profiler, but why not just halt the program? [closed]

If something is making a single-thread program take, say, 10 times as long as it should, you could run a profiler on it. You could also just halt it with a "pause" button, and you'll see exactly what it's doing.
Even if it's only 10% slower than it should be, if you halt it more times, before long you'll see it repeatedly doing the unnecessary thing. Usually the problem is a function call somewhere in the middle of the stack that isn't really needed. This doesn't measure the problem, but it sure does find it.
Edit: The objections mostly assume that you only take 1 sample. If you're serious, take 10. Any line of code causing some percentage of wastage, like 40%, will appear on the stack on that fraction of samples, on average. Bottlenecks (in single-thread code) can't hide from it.
EDIT: To show what I mean, many objections are of the form "there aren't enough samples, so what you see could be entirely spurious" - vague ideas about chance. But if something of any recognizable description, not just being in a routine or the routine being active, is in effect for 30% of the time, then the probability of seeing it on any given sample is 30%.
Then suppose only 10 samples are taken. The number of times the problem will be seen in 10 samples follows a binomial distribution, and the probability of seeing it 0 times is .028. The probability of seeing it 1 time is .121. For 2 times, the probability is .233, and for 3 times it is .267, after which it falls off. Since the probability of seeing it fewer than two times is .028 + .121 = .149, the probability of seeing it two or more times is 1 - .149 = .851. The general rule is that if you see something you could fix on two or more samples, it is worth fixing.
In this case, the chance of seeing it two or more times in 10 samples is 85%. If you're in the 15% who don't see it, just take more samples until you do. (If the number of samples is increased to 20, the chance of seeing it two or more times increases to more than 99%.) So it hasn't been precisely measured, but it has been precisely found, and it's important to understand that it could easily be something that a profiler could not actually find, such as something involving the state of the data, not the program counter.
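The binomial figures above can be checked with a few lines of Python. This is a minimal sketch; the function name is made up for illustration, and it assumes the samples are independent and taken at random times.

```python
from math import comb

def prob_seen_at_least(k, n, p):
    """Probability that an activity occupying fraction p of the time
    shows up on at least k of n independent random stack samples."""
    # Sum the binomial probabilities of seeing it 0 .. k-1 times,
    # then take the complement.
    below = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))
    return 1.0 - below

# The 30% problem from the text, with 10 and then 20 samples.
for n in (10, 20):
    print(n, round(prob_seen_at_least(2, n, 0.3), 3))
```

Changing p shows how quickly the odds drop for smaller problems: something that costs only 5% of the time is unlikely to show up twice in 10 samples.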

On Java servers it's always been a neat trick to do 2-3 quick Ctrl-Breaks in a row and get 2-3 thread dumps of all running threads. Simply looking at where all the threads "are" may very quickly pinpoint where your performance problems are.
This technique can reveal more performance problems in 2 minutes than any other technique I know of.

Because sometimes it works, and sometimes it gives you completely wrong answers. A profiler has a far better record of finding the right answer, and it usually gets there faster.

Doing this manually can't really be called "quick" or "effective", but there are several profiling tools which do this automatically; also known as statistical profiling.

Callstack sampling is a very useful technique for profiling, especially when looking at a large, complicated codebase that could be spending its time in any number of places. It has the advantage of measuring the CPU's usage by wall-clock time, which is what matters for interactivity, and getting callstacks with each sample lets you see why a function is being called. I use it a lot, but I use automated tools for it, such as Luke Stackwalker and OProfile and various hardware-vendor-supplied things.
The reason I prefer automated tools over manual sampling for the work I do is statistical power. Grabbing ten samples by hand is fine when you've got one function taking up 40% of runtime, because on average you'll get four samples in it, and almost always at least one. But you need more samples when you have a flat profile, with hundreds of leaf functions, none taking more than 1.5% of the runtime.
Say you have a lake with many different kinds of fish. If 40% of the fish in the lake are salmon (and 60% "everything else"), then you only need to catch ten fish to know there's a lot of salmon in the lake. But if you have hundreds of different species of fish, and each species is individually no more than 1%, you'll need to catch a lot more than ten fish to be able to say "this lake is 0.8% salmon and 0.6% trout."
Similarly in the games I work on, there are several major systems each of which call dozens of functions in hundreds of different entities, and all of this happens 60 times a second. Some of those functions' time funnels into common operations (like malloc), but most of it doesn't, and in any case there's no single leaf that occupies more than 1000 μs per frame.
I can look at the trunk functions and see, "we're spending 10% of our time on collision", but that's not very helpful: I need to know exactly where in collision, so I know which functions to squeeze. Just "do less collision" only gets you so far, especially when it means throwing out features. I'd rather know "we're spending an average 600 μs/frame on cache misses in the narrow phase of the octree because the magic missile moves so fast and touches lots of cells," because then I can track down the exact fix: either a better tree, or slower missiles.
Manual sampling would be fine if there were a big 20% lump in, say, stricmp, but with our profiles that's not the case. Instead I have hundreds of functions that I need to get from, say, 0.6% of frame to 0.4% of frame. I need to shave 10 μs off every 50 μs function that is called 300 times per second. To get that kind of precision, I need more samples.
But at heart what Luke Stackwalker does is what you describe: every millisecond or so, it halts the program and records the callstack (including the precise instruction and line number of the IP). Some programs just need tens of thousands of samples to be usefully profiled.
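To put a rough number on "more samples", here is a sketch that uses the normal approximation for a sampled proportion. The function name, the 95% confidence level, and the tolerances in the two calls are assumptions chosen for illustration, not figures from any particular profiler.

```python
from math import ceil

def samples_needed(p, rel_error, z=1.96):
    """Approximate number of independent samples needed to estimate a
    proportion p to within +/- rel_error * p.

    Uses the normal approximation: the standard error of the estimate is
    sqrt(p * (1 - p) / n), and we require z * standard_error <= rel_error * p.
    z = 1.96 corresponds to roughly 95% confidence.
    """
    return ceil(z**2 * (1 - p) / (p * rel_error**2))

print(samples_needed(0.40, 0.5))   # one big lump, coarse estimate is enough
print(samples_needed(0.006, 0.1))  # flat profile, need to resolve 0.6% to within a tenth
```

The first case needs only a couple of dozen samples; the second lands in the tens of thousands, which is the regime where an automated sampler is the only practical option.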
(We've talked about this before, of course, but I figured this was a good place to summarize the debate.)

There's a difference between things that programmers actually do, and things that they recommend others do.
I know of lots of programmers (myself included) that actually use this method. It only really helps to find the most obvious of performance problems, but it's quick and dirty and it works.
But I wouldn't really tell other programmers to do it, because it would take me too long to explain all the caveats. It's far too easy to draw an inaccurate conclusion based on this method, and there are many areas where it just doesn't work at all. (For example, that method doesn't reveal any code that is triggered by user input.)
So just like using lie detectors in court, or the "goto" statement, we just don't recommend that you do it, even though they all have their uses.

I'm surprised by the religious tone on both sides.
Profiling is great, and certainly is more refined and precise when you can do it. Sometimes you can't, and it's nice to have a trusty back-up. The pause technique is like the manual screwdriver you use when your power tool is too far away or the batteries have run down.
Here is a short true story. An application (a kind of batch processing task) had been running fine in production for six months when suddenly the operators were calling developers because it was going "too slow". They weren't going to let us attach a sampling profiler in production! We had to work with the tools already installed. Without stopping the production process, just using Process Explorer (which the operators had already installed on the machine), we could see a snapshot of a thread's stack. You can glance at the top of the stack, dismiss it with the enter key and get another snapshot with another mouse click. You can easily get a sample every second or so.
It doesn't take long to see if the top of the stack is most often in the database client library DLL (waiting on the database), or in another system DLL (waiting for a system operation), or actually in some method of the application itself. In this case, if I remember right, we quickly noticed that 8 times out of 10 the application was in a system DLL file call reading or writing a network file. Sure enough recent "upgrades" had changed the performance characteristics of a file share. Without a quick and dirty and (system administrator sanctioned) approach to see what the application was doing in production, we would have spent far more time trying to measure the issue, than correcting the issue.
On the other hand, when performance requirements move beyond "good enough" to really pushing the envelope, a profiler becomes essential so that you can try to shave cycles from all of your closely-tied top ten or twenty hot spots. Even if you are just trying to hold to a moderate performance requirement during a project, getting the right tools lined up to help you measure and test, and even integrating them into your automated test process, can be fantastically helpful.
But when the power is out (so to speak) and the batteries are dead, it's nice to know how to use that manual screwdriver.
So the direct answer: Know what you can learn from halting the program, but don't be afraid of precision tools either. Most importantly know which jobs call for which tools.

Hitting the pause button during the execution of a program in "debug" mode might not provide the right data to perform any performance optimizations. To put it bluntly, it is a crude form of profiling.
If you must avoid using a profiler, a better bet is to use a logger, and then apply a slowdown factor to "guesstimate" where the real problem is. Profilers however, are better tools for guesstimating.
The reason hitting the pause button in debug mode may not give a real picture of application behavior is that debuggers introduce additional executable code that can slow down certain parts of the application. One can refer to Mike Stall's blog post on possible reasons for application slowdown in a debugging environment. The post sheds light on reasons such as too many breakpoints, creation of exception objects, unoptimized code, etc. The part about unoptimized code is important - "debug" mode will result in a lot of optimizations (usually code inlining and reordering) being thrown out of the window, to enable the debug host (the process running your code) and the IDE to synchronize code execution. Therefore, hitting pause repeatedly in "debug" mode might be a bad idea.

If we take the question "Why isn't it better known?" then the answer is going to be subjective. Presumably the reason it is not better known is that profiling provides a long-term solution rather than a fix for the current problem. It isn't effective for multi-threaded applications, and isn't effective for applications like games which spend a significant portion of their time rendering.
Furthermore, in single threaded applications if you have a method that you expect to consume the most run time, and you want to reduce the run-time of all other methods then it is going to be harder to determine which secondary methods to focus your efforts upon first.
Your process for profiling is an acceptable method that can and does work, but profiling provides you with more information and has the benefit of showing you more detailed performance improvements and regressions.
If you have well-instrumented code then you can examine more than just how long a particular method takes; you can see all the methods.
With profiling:
You can then rerun your scenario after each change to determine the degree of performance improvement/regression.
You can profile the code on different hardware configurations to determine if your production hardware is going to be sufficient.
You can profile the code under load and stress testing scenarios to determine how the volume of information impacts performance.
You can make it easier for junior developers to visualise the impacts of their changes to your code because they can re-profile the code in six months time while you're off at the beach or the pub, or both. Beach-pub, ftw.
Profiling is given more weight because enterprise code should always have some degree of profiling, given the benefits it brings to the organisation over an extended period of time. The more important the code, the more profiling and testing you do.
Your approach is valid and is another item in the toolbox of the developer. It just gets outweighed by profiling.

Sampling profilers are only useful when
You are monitoring a runtime with a small number of threads. Preferably one.
The call stack depth of each thread is relatively small (to reduce the incredible overhead in collecting a sample).
You are only concerned about wall clock time and not other meters or resource bottlenecks.
You have not instrumented the code for management and monitoring purposes (hence the stack dump requests)
You mistakenly believe removing a stack frame is an effective performance improvement strategy whether the inherent costs (excluding callees) are practically zero or not
You can't be bothered to learn how to apply software performance engineering day-to-day in your job
....

Stack trace snapshots only allow you to see stroboscopic x-rays of your application. You may require more accumulated knowledge which a profiler may give you.
The trick is knowing your tools well and choosing the best one for the job at hand.

These must be some trivial examples that you are working with to get useful results with your method. I can't think of a project where profiling was useful (by whatever method) that would have gotten decent results with your "quick and effective" method. The time it takes to start and stop some applications already puts your assertion of "quick" in question.
Again, with non-trivial programs the method you advocate is useless.
EDIT:
Regarding "why isn't it better known"?
In my experience code reviews avoid poor quality code and algorithms, and profiling would find these as well. If you wish to continue with your method that is great - but I think for most of the professional community this is so far down on the list of things to try that it will never get positive reinforcement as a good use of time.
It appears to be quite inaccurate with small sample sets and to get large sample sets would take lots of time that would have been better spent with other useful activities.

What if the program is in production and being used at the same time by paying clients or colleagues? A profiler allows you to observe without interfering (as much; of course it will have a small impact too, as per the Heisenberg principle).
Profiling can also give you much richer and more detailed accurate reports. This will be quicker in the long run.

EDIT 2008/11/25: OK, Vineet's response has finally made me see what the issue is here. Better late than never.
Somehow the idea got loose in the land that performance problems are found by measuring performance. That is confusing means with ends. Somehow I avoided this by single-stepping entire programs long ago. I did not berate myself for slowing it down to human speed. I was trying to see if it was doing wrong or unnecessary things. That's how to make software fast - find and remove unnecessary operations.
Nobody has the patience for single-stepping these days, but the next best thing is to pick a number of cycles at random and ask what their reasons are. (That's what the call stack can often tell you.) If a good percentage of them don't have good reasons, you can do something about it.
It's harder these days, what with threading and asynchrony, but that's how I tune software - by finding unnecessary cycles. Not by seeing how fast it is - I do that at the end.
Here's why sampling the call stack cannot give a wrong answer, and why not many samples are needed.
During the interval of interest, when the program is taking more time than you would like, the call stack exists continuously, even when you're not sampling it.
If an instruction I is on the call stack for fraction P(I) of that time, removing it from the program, if you could, would save exactly that much. If this isn't obvious, give it a bit of thought.
If the instruction shows up on M = 2 or more samples, out of N, its P(I) is approximately M/N, and is definitely significant.
The only way you can fail to see the instruction is to magically time all your samples for when the instruction is not on the call stack. The simple fact that it is present for a fraction of the time is what exposes it to your probes.
So the process of performance tuning is a simple matter of picking off instructions (mostly function call instructions) that raise their heads by turning up on multiple samples of the call stack. Those are the tall trees in the forest.
Notice that we don't have to care about the call graph, or how long functions take, or how many times they are called, or recursion.
I'm against obfuscation, not against profilers. They give you lots of statistics, but most don't give P(I), and most users don't realize that that's what matters.
You can talk about forests and trees, but for any performance problem that you can fix by modifying code, you need to modify instructions, specifically instructions with high P(I). So you need to know where those are, preferably without playing Sherlock Holmes. Stack sampling tells you exactly where they are.
This technique is harder to employ in multi-threaded or event-driven systems, or in systems running in production. That's where profilers, if they would report P(I), could really help.
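For readers who want to try the idea without a debugger, here is a minimal sketch of stack sampling inside a Python process. It is not the author's tooling: it relies on the CPython-specific sys._current_frames(), and the workload and wasteful functions are hypothetical stand-ins for a real program. Each sample is stored as a set, so a line that appears several times on one stack (recursion) still counts once, and the fraction reported for each entry is the M/N estimate of P(I).

```python
import sys
import threading
import time
from collections import Counter

def sample_stacks(thread, n_samples=20, interval=0.01):
    """Take n_samples snapshots of the call stack of a running thread.

    Each snapshot is the set of (function name, line number) pairs that
    were on the stack at that instant. CPython-specific.
    """
    samples = []
    for _ in range(n_samples):
        time.sleep(interval)
        frame = sys._current_frames().get(thread.ident)
        stack = set()
        while frame is not None:  # walk from the innermost frame outwards
            stack.add((frame.f_code.co_name, frame.f_lineno))
            frame = frame.f_back
        samples.append(stack)
    return samples

def fraction_on_stack(samples):
    """Estimate P(I) as M/N: the share of samples each entry appears on."""
    counts = Counter(entry for stack in samples for entry in stack)
    return {entry: m / len(samples) for entry, m in counts.items()}

def wasteful():
    # Hypothetical hot spot standing in for the unnecessary work.
    return sum(i * i for i in range(50_000))

def workload(stop):
    # Hypothetical program under test: runs until asked to stop.
    while not stop.is_set():
        wasteful()
```

To use it, start workload in a threading.Thread, pass that thread to sample_stacks, and sort the result of fraction_on_stack by value; the entries near the top are the ones present on most samples.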

Stepping through code is great for seeing the nitty-gritty details and troubleshooting algorithms. It's like looking at a tree really up close and following each vein of bark and branch individually.
Profiling lets you see the big picture, and quickly identify trouble points -- like taking a step backwards and looking at the whole forest and noticing the tallest trees. By sorting your function calls by length of execution time, you can quickly identify the areas that are the trouble points.

I used this method for Commodore 64 BASIC many years ago. It is surprising how well it works.

I've typically used it on real-time programs that were overrunning their timeslice. You can't manually stop and restart code that has to run 60 times every second.
I've also used it to track down the bottleneck in a compiler I had written. You wouldn't want to try to break such a program manually, because you really have no way of knowing if you are breaking at the spot where the bottleneck is, or just at the spot after the bottleneck when the OS is allowed back in to stop it. Also, what if the major bottleneck is something you can't do anything about, but you'd like to get rid of all the other largeish bottlenecks in the system? How do you prioritize which bottlenecks to attack first, when you don't have good data on where they all are, and what the relative impact of each is?

The larger your program gets, the more useful a profiler will be. If you need to optimize a program which contains thousands of conditional branches, a profiler can be indispensable. Feed in your largest sample of test data, and when it's done import the profiling data into Excel. Then you check your assumptions about likely hot spots against the actual data. There are always surprises.

Can you estimate an application's performance before testing?

It's a tricky question I was asked the other day... We're working on a pretty complex telephony (SIP) application with mixed C++ and PHP code with MySQL databases and several open source components.
A telecom engineer asked us to estimate the performance of the application (which is not ready yet). He went like 'well, you know how many packets can pass through the Linux kernel per second, plus you might know how quick your app is, so tell me how many calls will pass through your stuff per second'.
Seems nonsense to me, as there are a million scenarios that might happen (well, literally...)
However... is there a way to estimate application performance (knowing the hardware it will run on, being able to run standard benchmarks on it, etc) before actual testing?

You certainly can bound the problem with upper (max throughput) limits. There is nothing nonsense about that. In fact, not knowing that stuff indicates a pretty haphazard approach to a problem - especially in the telephony world.
You can work through the problem yourself - what is the minimum "work" you have to accomplish for a transaction or whatever unit of task you have in your app?
Some messages to and from, some processing and a database hit for example? Getting information on the individual pieces will give you an idea of the fastest possible throughput. If you load up the system and see significantly lower performance then you can take time to figure out where you are possibly losing throughput with inefficient algorithms, etc.
EDIT
To do this exercise you need to know all the steps your app does for each use case. Then you can identify the max throughput for each use case. You should definitely know this stuff prior to release and going live.
I'm ignoring the worst case analysis as that - as you point out - is quite a bit harder.
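One simple way to turn the per-step figures into a throughput ceiling is the bottleneck bound from operational analysis: the resource that needs the most time per transaction limits everything else. The sketch below is an illustration only; the resource names and timings are invented, and it assumes a single instance of each resource handling one transaction at a time.

```python
def max_throughput(service_demands):
    """Upper bound on transactions per second.

    service_demands maps each resource to the seconds of work that one
    transaction needs from it. The busiest resource caps throughput at
    1 / demand, no matter how well everything else performs.
    """
    bottleneck = max(service_demands, key=service_demands.get)
    return bottleneck, 1.0 / service_demands[bottleneck]

# Hypothetical per-call demands in seconds; replace with measured values.
demands = {"network": 0.0002, "app_cpu": 0.0015, "database": 0.004}
print(max_throughput(demands))
```

If the loaded system delivers far less than this bound, the gap is where to look for lost throughput.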

See Capacity Planning for Web Performance: Metrics, Models, and Methods. There are also some tools that can do this sort of discrete event simulation:
Hyperformix
SimPy
Wikipedia list of simulation tools
This stuff ain't easy, and the commercial tools will cost ya. The Capacity Planning book comes with a CD with lots of Excel workbook templates and examples of models that can jump start you.
Good luck :)

If you really have to answer this you could say something like this:
"I don't know off the top of my head. I am willing to estimate this for you but it will take time. Obviously the accuracy of my answer depends upon how much effort (i.e. time) I put into calculating my estimate. How much time should I put into calculating my estimate?"
Put the burden back on them. If they really want an accurate answer, they're going to have to let you build at least some test applications that can simulate the actual environment.

You can spike to measure performance. Your whole system may not be working yet, but you know how the parts are intended to fit together. You can whip something up in a few hours that does the same kind of work as the final app will, across all the layers, and use it to measure performance of your design.
Remember: prototypes are broad, spikes are deep.

You should do the estimate. An estimate won't give you the right answer. It will however make you think about the problem. Right now it sounds like you're coding and hoping that everything will be OK. Or you are in panic mode and feel you don't have time for estimates.
Spend some time thinking about it. Analyse the important use cases. Think about the memory you may need; think about database access; think about network access (local and remote). These will affect the performance of your system. Get the whole team together to do this.
Regularly measure your system's performance during development for these important use cases. Mock up components/other systems if you have to. Analyse the results. How do these compare to your estimate? Maybe components are memory/database/network bound. Maybe you need more memory; less database access; simpler queries; caching. You don't have to make these changes straight away. However you do know how your system operates and what you need to do.
Result: Fewer nasty surprises at system test. Less panic as the release date looms.

You can definitely do capacity planning in advance, but the quality of the estimate will depend on the quality of the data available.
The best estimate is to build the system in test, run simulated workloads, then predict capacity as a function of performance requirements and workload. These 3 form a prediction space - given 2 of the 3, you can predict the third:
Given performance requirements and capacity (i.e. hardware) you can calculate the workload you can handle.
Given performance requirements and workload, you can calculate the capacity (i.e. hardware) that you need.
Given workload and capacity, you can predict your expected performance.
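The "given two, predict the third" relationship can be illustrated with the simplest textbook queueing model, a single M/M/1 queue, where mean response time is R = 1 / (mu - lambda). This is a deliberately simplified stand-in chosen for illustration, not the model the answer has in mind; real systems need measured data and usually a richer model. All function names are hypothetical. Here capacity is the service rate mu (requests per second the hardware can complete), workload is the arrival rate lambda, and performance is the mean response time R in seconds.

```python
def response_time(service_rate, arrival_rate):
    """Given capacity and workload, predict performance: R = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("workload meets or exceeds capacity; the queue is unstable")
    return 1.0 / (service_rate - arrival_rate)

def max_workload(service_rate, target_response):
    """Given capacity and a response-time requirement, find the workload
    the system can handle: lambda = mu - 1 / R."""
    return max(0.0, service_rate - 1.0 / target_response)

def required_capacity(arrival_rate, target_response):
    """Given workload and a response-time requirement, find the capacity
    needed: mu = lambda + 1 / R."""
    return arrival_rate + 1.0 / target_response
```

The three functions are rearrangements of the same equation, so feeding the output of one into another should return the starting value.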

This is true in some domains, but unless you are an expert in that domain you won't have any idea. For example I write code to control industrial robots. The speed is limited by the robot motion, not by the execution speed of the code. Knowing how fast the robot is and how far it has to go, we can make fairly good estimates of "speed". I'd have no idea how to estimate time for your application.
