Do (sampling) profilers still "lie" these days? - performance

Most of my limited experience with profiling native code is on a GPU rather than on a CPU, but I see some CPU profiling in my future...
Now, I've just read this blog post:
How profilers lie: The case of gprof and KCacheGrind
about how what profilers measure and what they show you, which is likely not what you expect if you're interested in discerning between different call paths and the time spent in them.
My question is: Is this still the case today (5 years later)? That is, do sampling profilers (i.e. those who don't slow execution down terribly) still behave the way gprof used to (or callgrind without --separate-callers=N)? Or do profilers nowadays customarily record the entire call stack when sampling?

No, many modern sampling profilers don't exhibit the problem described regarding gprof.
In fact, even when that was written, the specific problem was actually more a quirk of the way gprof uses a mix of instrumentation and sampling and then tries to reconstruct a hypothetical call graph based on limited caller/callee information and combine that with the sampled timing information.
Modern sampling profilers, such as perf, VTune, and various language-specific profilers to languages that don't compile to native code can capture the full call stack with each sample, which provides accurate times with respect to that issue. Alternately, you might sample without collecting call stacks (which reduces greatly the sampling cost) and then present the information without any caller/callee information which would still be accurate.
This was largely true even in the past, so I think it's fair to say that sampling profilers never, as a group, really exhibited that problem.
Of course, there are still various ways in which profilers can lie. For example, getting results accurate to the instruction level is a very tricky problem, given modern CPUs with 100s of instructions in flight at once, possibly across many functions, and complex performance models where instructions may have a very different in-context cost as compared to their nominal latency and throughput values. Even that tricky issues can be helped with "hardware assist" such as on recent x86 chips with PEBS support and later related features that help you pin-point an instruction in a less biased way.

Regarding gprof, yes, it's still the case today. This is by design, to keep the profiling overhead small. From the up-to-date documentation:
Some of the figures in the call graph are estimates—for example, the
children time values and all the time figures in caller and subroutine
lines.
There is no direct information about these measurements in the profile
data itself. Instead, gprof estimates them by making an assumption
about your program that might or might not be true.
The assumption made is that the average time spent in each call to any
function foo is not correlated with who called foo. If foo used 5
seconds in all, and 2/5 of the calls to foo came from a, then foo
contributes 2 seconds to a’s children time, by assumption.
Regarding KCacheGrind, little has changed since the article was written. You can check out the change log and see that the latest version was published in April 5, 2013, which includes unrelated changes. You can also refer to Josef Weidendorfer's comments under the article (Josef is the author of KCacheGrind).

If you noticed, I contributed several comments to that post you referenced, but it's not just that profilers give you bad information, it's that people fool themselves about what performance actually is.
What is your goal? Is it to A) find out how to make the program as fast as possible? Or is it to B) measure time taken by various functions, hoping that will lead to A? (Hint - it doesn't.) Here's a detailed list of the issues.
To illustrate: You could, for example, be calling a teeny innocent-looking little function somewhere that just happens to invoke nine yards of system code including reading a .dll to extract a string resource in order to internationalize it. This could be taking 50% of wall-clock time and therefore be on the stack 50% of wall-clock time. Would a "CPU-profiler" show it to you? No, because practically all of that 50% is doing I/O. Do you need many many stack samples to know to 3 decimal places exactly how much time it's taking? Of course not. If you only got 10 samples it would be on 5 of them, give or take. Once you know that teeny routine is a big problem, does that mean you're out of luck because somebody else wrote it? What if you knew what the string was that it was looking up? Does it really need to be internationalized, so much so that you're willing to pay a factor of two in slowness just for that? Do you see how useless measurements are when your real problem is to understand qualitatively what takes time?
I could go on and on with examples like this...

Related

Profiling vs debugging what is the practice

Most often, I do wounder how best to make my applications have optimal performance. How to optimize and identify functions/methods that are more resource intensive than the others and make necessary adjustments. In software development irrespective of language I believe that, there should some ways of finding out how the processor/network resources are being used by different parts of my codes. I will illustrate what I mean using the simplest example I can think of: I have background on Java, Python and PHP and feel more comfortable working on linux environment. Please feel free to advice me using any of these languages you are comfortable with:
In Javascript one can comfortably test and assign a value to a variable by doing:
//METHOD 1:
if(true){
console.log("It will always be true");
}else{
console.log("You can never see me");
}
//METHOD 2:
var print;
if(true){
print="It will always be true";
}else{
print="You can never see me";
}
console.log(print);
//METHOD 3:
console.log((true)? "It will always be true" : "You can never see me");
If different people were to be asked which of these methods will perform faster than the other. I am sure that different individuals will come up with different ideas. But I need a more reliable way to know about resource usage both on desktop and mobile applications. Thanks.
First of all, Debugging is done when you know the exact bug or wrong functionality.
It is done for functional testing i.e. to check defects in application.
For performance, profilers are used to find out resource intensive methods. they will provide heavy modules/functions/DB queries etc. so that after analysis you can tune and improve your system performance.If after modification some defect arises then you debugger can be used to pinpoint and correct the issue.
There are many opensource as well as paid profilers for java(I am saying java because you pasted javascript code).
Please have a look at them and use them to tune your system.
IMHO downvoting was for 2 reasons bad english and very basic question.
In the examples you gave, it simply doesn't matter. That's like asking which is faster, a snail or a worm BUT, if by
"profiling" you mean "measurements of time taken by functions", and if by
"debugging performance problem" you mean "the bug is that time is being spent unnecessarily, and I need to find out why",
then I strongly agree with the premise of your question.
Debugging works much better.
A performance problem consists of excess time being spent for unnecessary reasons.
The way to find it is by breaking into the execution at random times, which will naturally gravitate toward whatever is taking the most time.
The more time the problem takes, the more likely the break is to land in the problem.
Just do it several times, and each time look carefully at the program's state, to understand why it's doing whatever it's doing.
If you see it doing something that can be avoided, and you see it doing it on more than one break, you've found a performance problem.
Fix it and observe the speedup.
Then repeat the process, because what was the next biggest problem is now the biggest one.
The speedups multiply together.
Here's an example where the time goes from 2700us to 1800, then to 1500, 1300, 440, 170, and finally 3.7us. None of the speedups were all that big as fractions of the original time, but in the aggregate, they were stunning.
This is very much different from measuring.
In measuring, the first thing that happens is you assume numerical accuracy is important, so you assume you either need lots of samples or you have to instrument the code to measure time precisely.
As if numerical accuracy helps to find the problem.
This false assumption entered programmers' consciousness around 1982, when GPROF appeared.
In measuring, you assume the problem can be localized to a function, when in fact you need to see all of what's happening at a point in time to know if it can be avoided.
IMHO, the best profilers sample the stack, on wall-clock time, and report for each line of code that appears on samples, the percent of samples it appears on.
However, even these profilers don't tell you the context that you can get by simply looking carefully at individual stack samples, and data as well.
(Other forms of eye-candy have the same problem: call-graph, hot-path, flame-graph, etc.)

Most hazardous performance bottleneck misconceptions

The guys who wrote Bespin (cloud-based canvas-based code editor [and more]) recently spoke about how they re-factored and optimize a portion of the Bespin code because of a misconception that JavaScript was slow. It turned out that when all was said and done, their optimization produced no significant improvements.
I'm sure many of us go out of our way to write "optimized" code based on misconceptions similar to that of the Bespin team.
What are some common performance bottleneck misconceptions developers commonly subscribe to?
In no particular order:
"Ready, Fire, Aim" - thinking you know what needs to be optimized without proving it (i.e. guessing) and then acting on that, and since it doesn't help much, therefore assuming the code must have been optimal to begin with.
"Penny Wise, Pound Foolish" - thinking that optimization is all about compiler optimization, fussing about ++i vs. i++ while mountains of time are being spent needlessly in overblown designs, especially of data structures and databases.
"Swat Flies With a Bazooka" - being so enamored of the fanciest ideas heard in classrooms that they are just used for everything, regardless of scale.
"Fuzzy Thinking about Performance" - throwing around terms like "hotspot" and "bottleneck" and "profiler" and "measure" as if these things were well understood and / or relevant. (I bet I get clobbered for that!) OK, one at a time:
hotspot - What's the definition? I have one: it is a region of physical addresses where the PC register is found a significant fraction of time. It is the kind of thing PC samplers are good at finding. Many performance problems exhibit hotspots, but only in the simplest programs is the problem in the same place as the hotspot is.
bottleneck - A catch-all term used for performance problems, it implies a limited channel constraining how fast work can be accomplished. The unstated assumption is that the work is necessary. In my decades of performance tuning, I have in fact found a few problems like that - very few. Nearly all are of a very different nature. Rather than taking the shortest route from point A to B, little detours are taken, in the form of function calls which take little code, but not little time. Then those detours take further nested detours, sometimes 30 levels deep. The more the detours are nested, the more likely it is that some of them are less than necessary - wasteful, in fact - and it nearly always arises from galloping generality - unquestioning over-indulgence in "abstraction".
profiler - a universal good thing, right? All you gotta do is get a profiler and do some profiling, right? Ever think about how easy it is to fool a profiler into telling you a lot of nothing, when your goal is to find out what you need to fix to get better performance? Somewhere in your call tree, tuck a little file I/O, or a little call to some system routine, or have your evil twin do it without your knowledge. At some point, that will be your problem, and most profilers will miss it completely because the only inefficiency they contemplate is algorithmic inefficiency. Or, not all your routines will be little, and they may not call another routine in a small number of places, so your call graph says there's a link between the two routines, but which call? Or suppose you can figure out that some big percent of time is spent in a path A calls B calls C. You can look at that and think there's not much you can do about it, when if you could also look at the data being passed in those calls, you could see if it's necesssary. Here's a fun project - pick your favorite profiler, and then see how many ways you could fool it. That is, find ways to make the program take longer without the profiler being able to tell what you did, because if you can do it intentionally, you can also do it without intending to.
measure - (i.e. measure time) that's what profilers have done for decades, and they take pride in the accuracy and precision with which they measure. But measure what time? and why with accuracy? Remember, the goal is to precisely locate performance problems, such that you could fruitfully optimize them to gain a speedup. When you get that speedup, it is what it is, regardless of how precisely you estimated it beforehand. If that precision of measurement is bought at the expense of precision of location, then you've just bought apples when what you needed was oranges.
Here's a list of myths about performance.
And this is what happens when one optimizes without a valid profile in hand. All you are doing without a profile is guessing and probably wasting time and money. I could list a bunch of other misconceptions, but many come down to the fact that if the code in question isn't a top resource consumer, it is probably fine as is. Kinda like unrolling a for loop that is doing disk I/O...
If I convert the whole code base over to [Insert xxx latest technology here], it'll be much faster.
Relational databases are slow.
I'm smarter than the optimizer.
This should be optimized.
Java is slow
And, unrelated:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-jwz
Optimizing the WRONG part of the code (people, use your profiler!).
The optimizer in my compiler is smart, so I don't have to help it.
Java is fast (LOL)
Relational databases are fast (ROTFL LOL LMAO)
"This has to be as fast as possible."
If you don't have a performance problem, you don't need to worry about optimizing performance (beyond just paying attention to using good algorithms).
This misconception also manifests in attempts to optimize performance for every single aspect of your program. I see this most often with people trying to shave every last millisecond off of a low-volume web application's execution time while failing to take into account that network latency will take orders of magnitude longer than their code's execution time, making any reduction in execution time irrelevant anyhow.
My rules of optimization.
Don't optimize
Don't optimize now.
Profile to identify the problem.
Optimize the component that is taking at least 80% of the time.
Find an optimization that is 10 times faster.
My best optimization has been reducing a report from 3 days to 9 minutes. The optimized code was sped up from three days to three minutes.
In my carreer I have met three people who had been tasked with producing a faster sort on VAX than the native sort. They invariably had been able to produce sorts that took only three times longer.
The rules are simple:
Try to use standard library functions first.
Try to use brute-force and ignorance second.
Prove you've got a problem before trying to do any optimization.

Performance optimization strategies of last resort [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
There are plenty of performance questions on this site already, but it occurs to me that almost all are very problem-specific and fairly narrow. And almost all repeat the advice to avoid premature optimization.
Let's assume:
the code already is working correctly
the algorithms chosen are already optimal for the circumstances of the problem
the code has been measured, and the offending routines have been isolated
all attempts to optimize will also be measured to ensure they do not make matters worse
What I am looking for here is strategies and tricks to squeeze out up to the last few percent in a critical algorithm when there is nothing else left to do but whatever it takes.
Ideally, try to make answers language agnostic, and indicate any down-sides to the suggested strategies where applicable.
I'll add a reply with my own initial suggestions, and look forward to whatever else the Stack Overflow community can think of.
OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobbs article in November 1993, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was this. The sequence of changes was this:
The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.
Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.
Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.
Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.
Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.
That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.
Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.
More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.
Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.
Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds.
Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.
Total speedup factor: 43.6
Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.
P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.
I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.
ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:
/* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist){
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task)
These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.
Here is the second problem, in two separate lines:
/* ADD TASK TO TASK LIST */
ILST_APPEND(ptop->tasklist, ptask)
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn)
These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.
I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.
REFERENCE ADDED:
The source code, both original and redesigned, can be found in www.ddj.com, for 1993, in file 9311.zip, files slug.asc and slug.zip.
EDIT 2011/11/26:
There is now a SourceForge project containing source code in Visual C++ and a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn't follow exactly the same sequence, but still gets a 2-3 order of magnitude speedup.
Suggestions:
Pre-compute rather than re-calculate: any loops or repeated calls that contain calculations that have a relatively limited range of inputs, consider making a lookup (array or dictionary) that contains the result of that calculation for all values in the valid range of inputs. Then use a simple lookup inside the algorithm instead.
Down-sides: if few of the pre-computed values are actually used this may make matters worse, also the lookup may take significant memory.
Don't use library methods: most libraries need to be written to operate correctly under a broad range of scenarios, and perform null checks on parameters, etc. By re-implementing a method you may be able to strip out a lot of logic that does not apply in the exact circumstance you are using it.
Down-sides: writing additional code means more surface area for bugs.
Do use library methods: to contradict myself, language libraries get written by people that are a lot smarter than you or me; odds are they did it better and faster. Do not implement it yourself unless you can actually make it faster (i.e.: always measure!)
Cheat: in some cases although an exact calculation may exist for your problem, you may not need 'exact', sometimes an approximation may be 'good enough' and a lot faster in the deal. Ask yourself, does it really matter if the answer is out by 1%? 5%? even 10%?
Down-sides: Well... the answer won't be exact.
When you can't improve the performance any more - see if you can improve the perceived performance instead.
You may not be able to make your fooCalc algorithm faster, but often there are ways to make your application seem more responsive to the user.
A few examples:
anticipating what the user is going
to request and start working on that
before then
displaying results as
they come in, instead of all at once
at the end
Accurate progress meter
These won't make your program faster, but it might make your users happier with the speed you have.
I spend most of my life in just this place. The broad strokes are to run your profiler and get it to record:
Cache misses. Data cache is the #1 source of stalls in most programs. Improve cache hit rate by reorganizing offending data structures to have better locality; pack structures and numerical types down to eliminate wasted bytes (and therefore wasted cache fetches); prefetch data wherever possible to reduce stalls.
Load-hit-stores. Compiler assumptions about pointer aliasing, and cases where data is moved between disconnected register sets via memory, can cause a certain pathological behavior that causes the entire CPU pipeline to clear on a load op. Find places where floats, vectors, and ints are being cast to one another and eliminate them. Use __restrict liberally to promise the compiler about aliasing.
Microcoded operations. Most processors have some operations that cannot be pipelined, but instead run a tiny subroutine stored in ROM. Examples on the PowerPC are integer multiply, divide, and shift-by-variable-amount. The problem is that the entire pipeline stops dead while this operation is executing. Try to eliminate use of these operations or at least break them down into their constituent pipelined ops so you can get the benefit of superscalar dispatch on whatever the rest of your program is doing.
Branch mispredicts. These too empty the pipeline. Find cases where the CPU is spending a lot of time refilling the pipe after a branch, and use branch hinting if available to get it to predict correctly more often. Or better yet, replace branches with conditional-moves wherever possible, especially after floating point operations because their pipe is usually deeper and reading the condition flags after fcmp can cause a stall.
Sequential floating-point ops. Make these SIMD.
And one more thing I like to do:
Set your compiler to output assembly listings and look at what it emits for the hotspot functions in your code. All those clever optimizations that "a good compiler should be able to do for you automatically"? Chances are your actual compiler doesn't do them. I've seen GCC emit truly WTF code.
Throw more hardware at it!
More suggestions:
Avoid I/O: Any I/O (disk, network, ports, etc.) is
always going to be far slower than any code that is
performing calculations, so get rid of any I/O that you do
not strictly need.
Move I/O up-front: Load up all the data you are going
to need for a calculation up-front, so that you do not
have repeated I/O waits within the core of a critical
algorithm (and maybe as a result repeated disk seeks, when
loading all the data in one hit may avoid seeking).
Delay I/O: Do not write out your results until the
calculation is over, store them in a data structure and
then dump that out in one go at the end when the hard work
is done.
Threaded I/O: For those daring enough, combine 'I/O
up-front' or 'Delay I/O' with the actual calculation by
moving the loading into a parallel thread, so that while
you are loading more data you can work on a calculation on
the data you already have, or while you calculate the next
batch of data you can simultaneously write out the results
from the last batch.
Since many of the performance problems involve database issues, I'll give you some specific things to look at when tuning queries and stored procedures.
Avoid cursors in most databases. Avoid looping as well. Most of the time, data access should be set-based, not record by record processing. This includes not reusing a single record stored procedure when you want to insert 1,000,000 records at once.
Never use select *, only return the fields you actually need. This is especially true if there are any joins as the join fields will be repeated and thus cause unnecesary load on both the server and the network.
Avoid the use of correlated subqueries. Use joins (including joins to derived tables where possible) (I know this is true for Microsoft SQL Server, but test the advice when using a differnt backend).
Index, index, index. And get those stats updated if applicable to your database.
Make the query sargable. Meaning avoid things which make it impossible to use the indexes such as using a wildcard in the first character of a like clause or a function in the join or as the left part of a where statement.
Use correct data types. It is faster to do date math on a date field than to have to try to convert a string datatype to a date datatype, then do the calculation.
Never put a loop of any kind into a trigger!
Most databases have a way to check how the query execution will be done. In Microsoft SQL Server this is called an execution plan. Check those first to see where problem areas lie.
Consider how often the query runs as well as how long it takes to run when determining what needs to be optimized. Sometimes you can gain more perfomance from a slight tweak to a query that runs millions of times a day than you can from wiping time off a long_running query that only runs once a month.
Use some sort of profiler tool to find out what is really being sent to and from the database. I can remember one time in the past where we couldn't figure out why the page was so slow to load when the stored procedure was fast and found out through profiling that the webpage was asking for the query many many times instead of once.
The profiler will also help you to find who are blocking who. Some queries that execute quickly while running alone may become really slow due to locks from other queries.
The single most important limiting factor today is the limited memory bandwitdh. Multicores are just making this worse, as the bandwidth is shared betwen cores. Also, the limited chip area devoted to implementing caches is also divided among the cores and threads, worsening this problem even more. Finally, the inter-chip signalling needed to keep the different caches coherent also increase with an increased number of cores. This also adds a penalty.
These are the effects that you need to manage. Sometimes through micro managing your code, but sometimes through careful consideration and refactoring.
A lot of comments already mention cache friendly code. There are at least two distinct flavors of this:
Avoid memory fetch latencies.
Lower memory bus pressure (bandwidth).
The first problem specifically has to do with making your data access patterns more regular, allowing the hardware prefetcher to work efficiently. Avoid dynamic memory allocation which spreads your data objects around in memory. Use linear containers instead of linked lists, hashes and trees.
The second problem has to do with improving data reuse. Alter your algorithms to work on subsets of your data that do fit in available cache, and reuse that data as much as possible while it is still in the cache.
Packing data tighter and making sure you use all data in cache lines in the hot loops, will help avoid these other effects, and allow fitting more useful data in the cache.
What hardware are you running on? Can you use platform-specific optimizations (like vectorization)?
Can you get a better compiler? E.g. switch from GCC to Intel?
Can you make your algorithm run in parallel?
Can you reduce cache misses by reorganizing data?
Can you disable asserts?
Micro-optimize for your compiler and platform. In the style of, "at an if/else, put the most common statement first"
Although I like Mike Dunlavey's answer, in fact it is a great answer indeed with supporting example, I think it could be expressed very simply thus:
Find out what takes the largest amounts of time first, and understand why.
It is the identification process of the time hogs that helps you understand where you must refine your algorithm. This is the only all-encompassing language agnostic answer I can find to a problem that's already supposed to be fully optimised. Also presuming you want to be architecture independent in your quest for speed.
So while the algorithm may be optimised, the implementation of it may not be. The identification allows you to know which part is which: algorithm or implementation. So whichever hogs the time the most is your prime candidate for review. But since you say you want to squeeze the last few % out, you might want to also examine the lesser parts, the parts that you have not examined that closely at first.
Lastly a bit of trial and error with performance figures on different ways to implement the same solution, or potentially different algorithms, can bring insights that help identify time wasters and time savers.
HPH,
asoudmove.
You should probably consider the "Google perspective", i.e. determine how your application can become largely parallelized and concurrent, which will inevitably also mean at some point to look into distributing your application across different machines and networks, so that it can ideally scale almost linearly with the hardware that you throw at it.
On the other hand, the Google folks are also known for throwing lots of manpower and resources at solving some of the issues in projects, tools and infrastructure they are using, such as for example whole program optimization for gcc by having a dedicated team of engineers hacking gcc internals in order to prepare it for Google-typical use case scenarios.
Similarly, profiling an application no longer means to simply profile the program code, but also all its surrounding systems and infrastructure (think networks, switches, server, RAID arrays) in order to identify redundancies and optimization potential from a system's point of view.
Inline routines (eliminate call/return and parameter pushing)
Try eliminating tests/switches with table look ups (if they're faster)
Unroll loops (Duff's device) to the point where they just fit in the CPU cache
Localize memory access so as not to blow your cache
Localize related calculations if the optimizer isn't already doing that
Eliminate loop invariants if the optimizer isn't already doing that
When you get to the point that you're using efficient algorithms its a question of what you need more speed or memory. Use caching to "pay" in memory for more speed or use calculations to reduce the memory footprint.
If possible (and more cost effective) throw hardware at the problem - faster CPU, more memory or HD could solve the problem faster then trying to code it.
Use parallelization if possible - run part of the code on multiple threads.
Use the right tool for the job. some programing languages create more efficient code, using managed code (i.e. Java/.NET) speed up development but native programing languages creates faster running code.
Micro optimize. Only were applicable you can use optimized assembly to speed small pieces of code, using SSE/vector optimizations in the right places can greatly increase performance.
Divide and conquer
If the dataset being processed is too large, loop over chunks of it. If you've done your code right, implementation should be easy. If you have a monolithic program, now you know better.
First of all, as mentioned in several prior answers, learn what bites your performance - is it memory or processor or network or database or something else. Depending on that...
...if it's memory - find one of the books written long time ago by Knuth, one of "The Art of Computer Programming" series. Most likely it's one about sorting and search - if my memory is wrong then you'll have to find out in which he talks about how to deal with slow tape data storage. Mentally transform his memory/tape pair into your pair of cache/main memory (or in pair of L1/L2 cache) respectively. Study all the tricks he describes - if you don's find something that solves your problem, then hire professional computer scientist to conduct a professional research. If your memory issue is by chance with FFT (cache misses at bit-reversed indexes when doing radix-2 butterflies) then don't hire a scientist - instead, manually optimize passes one-by-one until you're either win or get to dead end. You mentioned squeeze out up to the last few percent right? If it's few indeed you'll most likely win.
...if it's processor - switch to assembly language. Study processor specification - what takes ticks, VLIW, SIMD. Function calls are most likely replaceable tick-eaters. Learn loop transformations - pipeline, unroll. Multiplies and divisions might be replaceable / interpolated with bit shifts (multiplies by small integers might be replaceable with additions). Try tricks with shorter data - if you're lucky one instruction with 64 bits might turn out replaceable with two on 32 or even 4 on 16 or 8 on 8 bits go figure. Try also longer data - eg your float calculations might turn out slower than double ones at particular processor. If you have trigonometric stuff, fight it with pre-calculated tables; also keep in mind that sine of small value might be replaced with that value if loss of precision is within allowed limits.
...if it's network - think of compressing data you pass over it. Replace XML transfer with binary. Study protocols. Try UDP instead of TCP if you can somehow handle data loss.
...if it's database, well, go to any database forum and ask for advice. In-memory data-grid, optimizing query plan etc etc etc.
HTH :)
Caching! A cheap way (in programmer effort) to make almost anything faster is to add a caching abstraction layer to any data movement area of your program. Be it I/O or just passing/creation of objects or structures. Often it's easy to add caches to factory classes and reader/writers.
Sometimes the cache will not gain you much, but it's an easy method to just add caching all over and then disable it where it doesn't help. I've often found this to gain huge performance without having to micro-analyse the code.
I think this has already been said in a different way. But when you're dealing with a processor intensive algorithm, you should simplify everything inside the most inner loop at the expense of everything else.
That may seem obvious to some, but it's something I try to focus on regardless of the language I'm working with. If you're dealing with nested loops, for example, and you find an opportunity to take some code down a level, you can in some cases drastically speed up your code. As another example, there are the little things to think about like working with integers instead of floating point variables whenever you can, and using multiplication instead of division whenever you can. Again, these are things that should be considered for your most inner loop.
Sometimes you may find benefit of performing your math operations on an integer inside the inner loop, and then scaling it down to a floating point variable you can work with afterwards. That's an example of sacrificing speed in one section to improve the speed in another, but in some cases the pay off can be well worth it.
I've spent some time working on optimising client/server business systems operating over low-bandwidth and long-latency networks (e.g. satellite, remote, offshore), and been able to achieve some dramatic performance improvements with a fairly repeatable process.
Measure: Start by understanding the network's underlying capacity and topology. Talking to the relevant networking people in the business, and make use of basic tools such as ping and traceroute to establish (at a minimum) the network latency from each client location, during typical operational periods. Next, take accurate time measurements of specific end user functions that display the problematic symptoms. Record all of these measurements, along with their locations, dates and times. Consider building end-user "network performance testing" functionality into your client application, allowing your power users to participate in the process of improvement; empowering them like this can have a huge psychological impact when you're dealing with users frustrated by a poorly performing system.
Analyze: Using any and all logging methods available to establish exactly what data is being transmitted and received during the execution of the affected operations. Ideally, your application can capture data transmitted and received by both the client and the server. If these include timestamps as well, even better. If sufficient logging isn't available (e.g. closed system, or inability to deploy modifications into a production environment), use a network sniffer and make sure you really understand what's going on at the network level.
Cache: Look for cases where static or infrequently changed data is being transmitted repetitively and consider an appropriate caching strategy. Typical examples include "pick list" values or other "reference entities", which can be surprisingly large in some business applications. In many cases, users can accept that they must restart or refresh the application to update infrequently updated data, especially if it can shave significant time from the display of commonly used user interface elements. Make sure you understand the real behaviour of the caching elements already deployed - many common caching methods (e.g. HTTP ETag) still require a network round-trip to ensure consistency, and where network latency is expensive, you may be able to avoid it altogether with a different caching approach.
Parallelise: Look for sequential transactions that don't logically need to be issued strictly sequentially, and rework the system to issue them in parallel. I dealt with one case where an end-to-end request had an inherent network delay of ~2s, which was not a problem for a single transaction, but when 6 sequential 2s round trips were required before the user regained control of the client application, it became a huge source of frustration. Discovering that these transactions were in fact independent allowed them to be executed in parallel, reducing the end-user delay to very close to the cost of a single round trip.
Combine: Where sequential requests must be executed sequentially, look for opportunities to combine them into a single more comprehensive request. Typical examples include creation of new entities, followed by requests to relate those entities to other existing entities.
Compress: Look for opportunities to leverage compression of the payload, either by replacing a textual form with a binary one, or using actual compression technology. Many modern (i.e. within a decade) technology stacks support this almost transparently, so make sure it's configured. I have often been surprised by the significant impact of compression where it seemed clear that the problem was fundamentally latency rather than bandwidth, discovering after the fact that it allowed the transaction to fit within a single packet or otherwise avoid packet loss and therefore have an outsize impact on performance.
Repeat: Go back to the beginning and re-measure your operations (at the same locations and times) with the improvements in place, record and report your results. As with all optimisation, some problems may have been solved exposing others that now dominate.
In the steps above, I focus on the application related optimisation process, but of course you must ensure the underlying network itself is configured in the most efficient manner to support your application too. Engage the networking specialists in the business and determine if they're able to apply capacity improvements, QoS, network compression, or other techniques to address the problem. Usually, they will not understand your application's needs, so it's important that you're equipped (after the Analyse step) to discuss it with them, and also to make the business case for any costs you're going to be asking them to incur. I've encountered cases where erroneous network configuration caused the applications data to be transmitted over a slow satellite link rather than an overland link, simply because it was using a TCP port that was not "well known" by the networking specialists; obviously rectifying a problem like this can have a dramatic impact on performance, with no software code or configuration changes necessary at all.
Very difficult to give a generic answer to this question. It really depends on your problem domain and technical implementation. A general technique that is fairly language neutral: Identify code hotspots that cannot be eliminated, and hand-optimize assembler code.
Last few % is a very CPU and application dependent thing....
cache architectures differ, some chips have on-chip RAM
you can map directly, ARM's (sometimes) have a vector
unit, SH4's a useful matrix opcode. Is there a GPU -
maybe a shader is the way to go. TMS320's are very
sensitive to branches within loops (so separate loops and
move conditions outside if possible).
The list goes on.... But these sorts of things really are
the last resort...
Build for x86, and run Valgrind/Cachegrind against the code
for proper performance profiling. Or Texas Instruments'
CCStudio has a sweet profiler. Then you'll really know where
to focus...
Not nearly as in depth or complex as previous answers, but here goes:
(these are more beginner/intermediate level)
obvious: dry
run loops backwards so you're always comparing to 0 rather than a variable
use bitwise operators whenever you can
break repetitive code into modules/functions
cache objects
local variables have slight performance advantage
limit string manipulation as much as possible
Did you know that a CAT6 cable is capable of 10x better shielding off external inteferences than a default Cat5e UTP cable?
For any non-offline projects, while having best software and best hardware, if your throughoutput is weak, then that thin line is going to squeeze data and give you delays, albeit in milliseconds...
Also the maximum throughput is higher on CAT6 cables because there is a higher chance that you will actually receive a cable whose strands exist of cupper cores, instead of CCA, Cupper Cladded Aluminium, which is often fount in all your standard CAT5e cables.
I if you are facing lost packets, packet drops, then an increase in throughput reliability for 24/7 operation can make the difference that you may be looking for.
For those who seek the ultimate in home/office connection reliability, (and are willing to say NO to this years fastfood restaurants, at the end of the year you can there you can) gift yourself the pinnacle of LAN connectivity in the form of CAT7 cable from a reputable brand.
Impossible to say. It depends on what the code looks like. If we can assume that the code already exists, then we can simply look at it and figure out from that, how to optimize it.
Better cache locality, loop unrolling, Try to eliminate long dependency chains, to get better instruction-level parallelism. Prefer conditional moves over branches when possible. Exploit SIMD instructions when possible.
Understand what your code is doing, and understand the hardware it's running on. Then it becomes fairly simple to determine what you need to do to improve performance of your code. That's really the only truly general piece of advice I can think of.
Well, that, and "Show the code on SO and ask for optimization advice for that specific piece of code".
If better hardware is an option then definitely go for that. Otherwise
Check you are using the best compiler and linker options.
If hotspot routine in different library to frequent caller, consider moving or cloning it to the callers module. Eliminates some of the call overhead and may improve cache hits (cf how AIX links strcpy() statically into separately linked shared objects). This could of course decrease cache hits also, which is why one measure.
See if there is any possibility of using a specialized version of the hotspot routine. Downside is more than one version to maintain.
Look at the assembler. If you think it could be better, consider why the compiler did not figure this out, and how you could help the compiler.
Consider: are you really using the best algorithm? Is it the best algorithm for your input size?
The google way is one option "Cache it.. Whenever possible don't touch the disk"
Here are some quick and dirty optimization techniques I use. I consider this to be a 'first pass' optimization.
Learn where the time is spent Find out exactly what is taking the time. Is it file IO? Is it CPU time? Is it the network? Is it the Database? It's useless to optimize for IO if that's not the bottleneck.
Know Your Environment Knowing where to optimize typically depends on the development environment. In VB6, for example, passing by reference is slower than passing by value, but in C and C++, by reference is vastly faster. In C, it is reasonable to try something and do something different if a return code indicates a failure, while in Dot Net, catching exceptions are much slower than checking for a valid condition before attempting.
Indexes Build indexes on frequently queried database fields. You can almost always trade space for speed.
Avoid lookups Inside of the loop to be optimized, I avoid having to do any lookups. Find the offset and/or index outside of the loop and reuse the data inside.
Minimize IO try to design in a manner that reduces the number of times you have to read or write especially over a networked connection
Reduce Abstractions The more layers of abstraction the code has to work through, the slower it is. Inside the critical loop, reduce abstractions (e.g. reveal lower-level methods that avoid extra code)
Spawn Threads for projects with a user interface, spawning a new thread to preform slower tasks makes the application feel more responsive, although isn't.
Pre-process You can generally trade space for speed. If there are calculations or other intense operations, see if you can precompute some of the information before you're in the critical loop.
If you have a lot of highly parallel floating point math-especially single-precision-try offloading it to a graphics processor (if one is present) using OpenCL or (for NVidia chips) CUDA. GPUs have immense floating point computing power in their shaders, which is much greater than that of a CPU.
Adding this answer since I didnt see it included in all the others.
Minimize implicit conversion between types and sign:
This applies to C/C++ at least, Even if you already think you're free of conversions - sometimes its good to test adding compiler warnings around functions that require performance, especially watch-out for conversions within loops.
GCC spesific: You can test this by adding some verbose pragmas around your code,
#ifdef __GNUC__
# pragma GCC diagnostic push
# pragma GCC diagnostic error "-Wsign-conversion"
# pragma GCC diagnostic error "-Wdouble-promotion"
# pragma GCC diagnostic error "-Wsign-compare"
# pragma GCC diagnostic error "-Wconversion"
#endif
/* your code */
#ifdef __GNUC__
# pragma GCC diagnostic pop
#endif
I've seen cases where you can get a few percent speedup by reducing conversions raised by warnings like this.
In some cases I have a header with strict warnings that I keep included to prevent accidental conversions, however this is a trade-off since you may end up adding a lot of casts to quiet intentional conversions which may just make the code more cluttered for minimal gains.
Sometimes changing the layout of your data can help. In C, you might switch from an array or structures to a structure of arrays, or vice versa.
Tweak the OS and framework.
It may sound an overkill but think about it like this: Operating Systems and Frameworks are designed to do many things. Your application only does very specific things. If you could get the OS do to exactly what your application needs and have your application understand how the the framework (php,.net,java) works, you could get much better out of your hardware.
Facebook, for example, changed some kernel level thingys in Linux, changed how memcached works (for example they wrote a memcached proxy, and used udp instead of tcp).
Another example for this is Window2008. Win2K8 has a version were you can install just the basic OS needed to run X applicaions (e.g. Web-Apps, Server Apps). This reduces much of the overhead that the OS have on running processes and gives you better performance.
Of course, you should always throw in more hardware as the first step...

Optimization! - What is it? How is it done?

Its common to hear about "highly optimized code" or some developer needing to optimize theirs and whatnot. However, as a self-taught, new programmer I've never really understood what exactly do people mean when talking about such things.
Care to explain the general idea of it? Also, recommend some reading materials and really whatever you feel like saying on the matter. Feel free to rant and preach.
Optimize is a term we use lazily to mean "make something better in a certain way". We rarely "optimize" something - more, we just improve it until it meets our expectations.
Optimizations are changes we make in the hopes to optimize some part of the program. A fully optimized program usually means that the developer threw readability out the window and has recoded the algorithm in non-obvious ways to minimize "wall time". (It's not a requirement that "optimized code" be hard to read, it's just a trend.)
One can optimize for:
Memory consumption - Make a program or algorithm's runtime size smaller.
CPU consumption - Make the algorithm computationally less intensive.
Wall time - Do whatever it takes to make something faster
Readability - Instead of making your app better for the computer, you can make it easier for humans to read it.
Some common (and overly generalized) techniques to optimize code include:
Change the algorithm to improve performance characteristics. If you have an algorithm that takes O(n^2) time or space, try to replace that algorithm with one that takes O(n * log n).
To relieve memory consumption, go through the code and look for wasted memory. For example, if you have a string intensive app you can switch to using Substring References (where a reference contains a pointer to the string, plus indices to define its bounds) instead of allocating and copying memory from the original string.
To relieve CPU consumption, cache as many intermediate results if you can. For example, if you need to calculate the standard deviation of a set of data, save that single numerical result instead looping through the set each time you need to know the std dev.
I'll mostly rant with no practical advice.
Measure First. Optimization should be done to places where it matters. Highly optimized code is often difficult to maintain and a source of problems. In places where the code does not slow down execution anyway, I alwasy prefer maintainability to optimizations. Familiarize yourself with Profiling, both intrusive (instrumented) and non-intrusive (low overhead statistical). Learn to read a profiled stack, understand where the time inclusive/time exclusive is spent, why certain patterns show up and how to identify the trouble spots.
You can't fix what you cannot measure. Have your program report through some performance infrastructure the thing it does and the times it takes. I come from a Win32 background so I'm used to the Performance Counters and I'm extremely generous at sprinkling them all over my code. I even automatized the code to generate them.
And finally some words about optimizations. Most discussion about optimization I see focus on stuff any compiler will optimize for you for free. In my experience the greatest source of gains for 'highly optimized code' lies completely elsewhere: memory access. On modern architectures the CPU is idling most of the times, waiting for memory to be served into its pipelines. Between L1 and L2 cache misses, TLB misses, NUMA cross-node access and even GPF that must fetch the page from disk, the memory access pattern of a modern application is the single most important optimization one can make. I'm exaggerating slightly, of course there will be counter example work-loads that will not benefit memory access locality this techniques. But most application will. To be specific, what these techniques mean is simple: cluster your data in memory so that a single CPU can work an a tight memory range containing all it needs, no expensive referencing of memory outside your cache lines or your current page. In practice this can mean something as simple as accessing an array by rows rather than by columns.
I would recommend you read up the Alpha-Sort paper presented at the VLDB conference in 1995. This paper presented how cache sensitive algorithms designed specifically for modern CPU architectures can blow out of the water the old previous benchmarks:
We argue that modern architectures
require algorithm designers to
re-examine their use of the memory
hierarchy. AlphaSort uses clustered
data structures to get good cache
locality...
The general idea is that when you create your source tree in the compilation phase, before generating the code by parsing it, you do an additional step (optimization) where, based on certain heuristics, you collapse branches together, delete branches that aren't used or add extra nodes for temporary variables that are used multiple times.
Think of stuff like this piece of code:
a=(b+c)*3-(b+c)
which gets translated into
-
* +
+ 3 b c
b c
To a parser it would be obvious that the + node with its 2 descendants are identical, so they would be merged into a temp variable, t, and the tree would be rewritten:
-
* t
t 3
Now an even better parser would see that since t is an integer, the tree could be further simplified to:
*
t 2
and the intermediary code that you'd run your code generation step on would finally be
int t=b+c;
a=t*2;
with t marked as a register variable, which is exactly what would be written for assembly.
One final note: you can optimize for more than just run time speed. You can also optimize for memory consumption, which is the opposite. Where unrolling loops and creating temporary copies would help speed up your code, they would also use more memory, so it's a trade off on what your goal is.
Here is an example of some optimization (fixing a poorly made decision) that I did recently. Its very basic, but I hope it illustrates that good gains can be made even from simple changes, and that 'optimization' isn't magic, its just about making the best decisions to accomplish the task at hand.
In an application I was working on there were several LinkedList data structures that were being used to hold various instances of foo.
When the application was in use it was very frequently checking to see if the LinkedListed contained object X. As the ammount of X's started to grow, I noticed that the application was performing more slowly than it should have been.
I ran an profiler, and realized that each 'myList.Contains(x)' call had O(N) because the list has to iterate through each item it contains until it reaches the end or finds a match. This was definitely not efficent.
So what did I do to optimize this code? I switched most of the LinkedList datastructures to HashSets, which can do a '.Contains(X)' call in O(1)- much better.
This is a good question.
Usually the best practice is 1) just write the code to do what you need it to do, 2) then deal with performance, but only if it's an issue. If the program is "fast enough" it's not an issue.
If the program is not fast enough (like it makes you wait) then try some performance tuning. Performance tuning is not like programming. In programming, you think first and then do something. In performance tuning, thinking first is a mistake, because that is guessing.
Don't guess what to fix; diagnose what the program is doing.
Everybody knows that, but mostly they do it anyway.
It is natural to say "Could be the problem is X, Y, or Z" but only the novice acts on guesses. The pro says "but I'm probably wrong".
There are different ways to diagnose performance problems.
The simplest is just to single-step through the program at the assembly-language level, and don't take any shortcuts. That way, if the program is doing unnecessary things, then you are doing the same things, and it will become painfully obvious.
Another is to get a profiling tool, and as others say, measure, measure, measure.
Personally I don't care for measuring. I think it's a fuzzy microscope for the purpose of pinpointing performance problems. I prefer this method, and this is an example of its use.
Good luck.
ADDED: I think you will find, if you go through this exercise a few times, you will learn what coding practices tend to result in performance problems, and you will instinctively avoid them. (This is subtly different from "premature optimization", which is assuming at the beginning that you must be concerned about performance. In fact, you will probably learn, if you don't already know, that premature concern about performance can well cause the very problem it seeks to avoid.)
Optimizing a program means: make it run faster
The only way of making the program faster is making it do less:
find an algorithm that uses fewer operations (e.g. N log N instead of N^2)
avoid slow components of your machine (keep objects in cache instead of in main memory, or in main memory instead of on disk); reducing memory consumption nearly always helps!
Further rules:
In looking for optimization opportunities, adhere to the 80-20-rule: 20% of typical program code accounts for 80% of execution time.
Measure the time before and after every attempted optimization; often enough, optimizations don't.
Only optimize after the program runs correctly!
Also, there are ways to make a program appear to be faster:
separate GUI event processing from back-end tasks; priorize user-visible changes against back-end calculation to keep the front-end "snappy"
give the user something to read while performing long operations (every noticed the slideshows displayed by installers?)
However, as a self-taught, new programmer I've never really understood what exactly do people mean when talking about such things.
Let me share a secret with you: nobody does. There are certain areas where we know mathematically what is and isn't slow. But for the most part, performance is too complicated to be able to understand. If you speed up one part of your code, there's a good possibility you're slowing down another.
Therefore, anyone who tells you that one method is faster than another, there's a good possibility they're just guessing unless one of three things are true:
They have data
They're choosing an algorithm that they know is faster mathematically.
They're choosing a data structure that they know is faster mathematically.
Optimization means trying to improve computer programs for such things as speed. The question is very broad, because optimization can involve compilers improving programs for speed, or human beings doing the same.
I suggest you read a bit of theory first (from books, or Google for lecture slides):
Data structures and algorithms - what the O() notation is, how to calculate it,
what datastructures and algorithms can be used to lower the O-complexity
Book: Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest
Compilers and assembly - how code is translated to machine instructions
Computer architecture - how the CPU, RAM, Cache, Branch predictions, out of order execution ... work
Operating systems - kernel mode, user mode, scheduling processes/threads, mutexes, semaphores, message queues
After reading a bit of each, you should have a basic grasp of all the different aspects of optimization.
Note: I wiki-ed this so people can add book recommendations.
I am going with the idea that optimizing a code is to get the same results in less time. And fully optimized only means they ran out of ideas to make it faster. I throw large buckets of scorn on claims of "fully optimized" code! There's no such thing.
So you want to make your application/program/module run faster? First thing to do (as mentioned earlier) is measure also known as profiling. Do not guess where to optimize. You are not that smart and you will be wrong. My guesses are wrong all the time and large portions of my year are spent profiling and optimizing. So get the computer to do it for you. For PC VTune is a great profiler. I think VS2008 has a built in profiler, but I haven't looked into it. Otherwise measure functions and large pieces of code with performance counters. You'll find sample code for using performance counters on MSDN.
So where are your cycles going? You are probably waiting for data coming from main memory. Go read up on L1 & L2 caches. Understanding how the cache works is half the battle. Hint: Use tight, compact structures that will fit more into a cache-line.
Optimization is lots of fun. And it's never ending too :)
A great book on optimization is Write Great Code: Understanding the Machine by Randall Hyde.
Make sure your application produces correct results before you start optimizing it.

One could use a profiler, but why not just halt the program? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
If something is making a single-thread program take, say, 10 times as long as it should, you could run a profiler on it. You could also just halt it with a "pause" button, and you'll see exactly what it's doing.
Even if it's only 10% slower than it should be, if you halt it more times, before long you'll see it repeatedly doing the unnecessary thing. Usually the problem is a function call somewhere in the middle of the stack that isn't really needed. This doesn't measure the problem, but it sure does find it.
Edit: The objections mostly assume that you only take 1 sample. If you're serious, take 10. Any line of code causing some percentage of wastage, like 40%, will appear on the stack on that fraction of samples, on average. Bottlenecks (in single-thread code) can't hide from it.
EDIT: To show what I mean, many objections are of the form "there aren't enough samples, so what you see could be entirely spurious" - vague ideas about chance. But if something of any recognizable description, not just being in a routine or the routine being active, is in effect for 30% of the time, then the probability of seeing it on any given sample is 30%.
Then suppose only 10 samples are taken. The number of times the problem will be seen in 10 samples follows a binomial distribution, and the probability of seeing it 0 times is .028. The probability of seeing it 1 time is .121. For 2 times, the probability is .233, and for 3 times it is .267, after which it falls off. Since the probability of seeing it less than two times is .028 + .121 = .139, that means the probability of seeing it two or more times is 1 - .139 = .861. The general rule is if you see something you could fix on two or more samples, it is worth fixing.
In this case, the chance of seeing it in 10 samples is 86%. If you're in the 14% who don't see it, just take more samples until you do. (If the number of samples is increased to 20, the chance of seeing it two or more times increases to more than 99%.) So it hasn't been precisely measured, but it has been precisely found, and it's important to understand that it could easily be something that a profiler could not actually find, such as something involving the state of the data, not the program counter.
On Java servers it's always been a neat trick to do 2-3 quick Ctrl-Breakss in a row and get 2-3 threaddumps of all running threads. Simply looking at where all the threads "are" may extremely quickly pinpoint where your performance problems are.
This technique can reveal more performance problems in 2 minutes than any other technique I know of.
Because sometimes it works, and sometimes it gives you completely wrong answers. A profiler has a far better record of finding the right answer, and it usually gets there faster.
Doing this manually can't really be called "quick" or "effective", but there are several profiling tools which do this automatically; also known as statistical profiling.
Callstack sampling is a very useful technique for profiling, especially when looking at a large, complicated codebase that could be spending its time in any number of places. It has the advantage of measuring the CPU's usage by wall-clock time, which is what matters for interactivity, and getting callstacks with each sample lets you see why a function is being called. I use it a lot, but I use automated tools for it, such as Luke Stackwalker and OProfile and various hardware-vendor-supplied things.
The reason I prefer automated tools over manual sampling for the work I do is statistical power. Grabbing ten samples by hand is fine when you've got one function taking up 40% of runtime, because on average you'll get four samples in it, and always at least one. But you need more samples when you have a flat profile, with hundreds of leaf functions, none taking more than 1.5% of the runtime.
Say you have a lake with many different kinds of fish. If 40% of the fish in the lake are salmon (and 60% "everything else"), then you only need to catch ten fish to know there's a lot of salmon in the lake. But if you have hundreds of different species of fish, and each species is individually no more than 1%, you'll need to catch a lot more than ten fish to be able to say "this lake is 0.8% salmon and 0.6% trout."
Similarly in the games I work on, there are several major systems each of which call dozens of functions in hundreds of different entities, and all of this happens 60 times a second. Some of those functions' time funnels into common operations (like malloc), but most of it doesn't, and in any case there's no single leaf that occupies more than 1000 μs per frame.
I can look at the trunk functions and see, "we're spending 10% of our time on collision", but that's not very helpful: I need to know exactly where in collision, so I know which functions to squeeze. Just "do less collision" only gets you so far, especially when it means throwing out features. I'd rather know "we're spending an average 600 μs/frame on cache misses in the narrow phase of the octree because the magic missile moves so fast and touches lots of cells," because then I can track down the exact fix: either a better tree, or slower missiles.
Manual sampling would be fine if there were a big 20% lump in, say, stricmp, but with our profiles that's not the case. Instead I have hundreds of functions that I need to get from, say, 0.6% of frame to 0.4% of frame. I need to shave 10 μs off every 50 μs function that is called 300 times per second. To get that kind of precision, I need more samples.
But at heart what Luke Stackwalker does is what you describe: every millisecond or so, it halts the program and records the callstack (including the precise instruction and line number of the IP). Some programs just need tens of thousands of samples to be usefully profiled.
(We've talked about this before, of course, but I figured this was a good place to summarize the debate.)
There's a difference between things that programmers actually do, and things that they recommend others do.
I know of lots of programmers (myself included) that actually use this method. It only really helps to find the most obvious of performance problems, but it's quick and dirty and it works.
But I wouldn't really tell other programmers to do it, because it would take me too long to explain all the caveats. It's far too easy to make an inaccurate conclusion based on this method, and there are many areas where it just doesn't work at all. (for example, that method doesn't reveal any code that is triggered by user input).
So just like using lie detectors in court, or the "goto" statement, we just don't recommend that you do it, even though they all have their uses.
I'm surprised by the religous tone on both sides.
Profiling is great, and certainly is a more refined and precise when you can do it. Sometimes you can't, and it's nice to have a trusty back-up. The pause technique is like the manual screwdriver you use when your power tool is too far away or the bateries have run-down.
Here is a short true story. An application (kind of a batch proccessing task) had been running fine in production for six months, suddenly the operators are calling developers because it is going "too slow". They aren't going to let us attach a sampling profiler in production! You have to work with the tools already installed. Without stopping the production process, just using Process Explorer, (which operators had already installed on the machine) we could see a snapshot of a thread's stack. You can glance at the top of the stack, dismiss it with the enter key and get another snapshot with another mouse click. You can easily get a sample every second or so.
It doesn't take long to see if the top of the stack is most often in the database client library DLL (waiting on the database), or in another system DLL (waiting for a system operation), or actually in some method of the application itself. In this case, if I remember right, we quickly noticed that 8 times out of 10 the application was in a system DLL file call reading or writing a network file. Sure enough recent "upgrades" had changed the performance characteristics of a file share. Without a quick and dirty and (system administrator sanctioned) approach to see what the application was doing in production, we would have spent far more time trying to measure the issue, than correcting the issue.
On the other hand, when performance requirements move beyond "good enough" to really pushing the envelope, a profiler becomes essential so that you can try to shave cycles from all of your closely-tied top-ten or twenty hot spots. Even if you are just trying to hold to a moderate performance requirement durring a project, when you can get the right tools lined-up to help you measure and test, and even get them integrated into your automated test process it can be fantasticly helpful.
But when the power is out (so to speak) and the batteries are dead, it's nice know how to use that manual screwdriver.
So the direct answer: Know what you can learn from halting the program, but don't be afraid of precision tools either. Most importantly know which jobs call for which tools.
Hitting the pause button during the execution of a program in "debug" mode might not provide the right data to perform any performance optimizations. To put it bluntly, it is a crude form of profiling.
If you must avoid using a profiler, a better bet is to use a logger, and then apply a slowdown factor to "guesstimate" where the real problem is. Profilers however, are better tools for guesstimating.
The reason why hitting the pause button in debug mode, may not give a real picture of application behavior is because debuggers introduce additional executable code that can slowdown certain parts of the application. One can refer to Mike Stall's blog post on possible reasons for application slowdown in a debugging environment. The post sheds light on certain reasons like too many breakpoints,creation of exception objects, unoptimized code etc. The part about unoptimized code is important - the "debug" mode will result in a lot of optimizations (usually code in-lining and re-ordering) being thrown out of the window, to enable the debug host (the process running your code) and the IDE to synchronize code execution. Therefore, hitting pause repeatedly in "debug" mode might be a bad idea.
If we take the question "Why isn't it better known?" then the answer is going to be subjective. Presumably the reason why it is not better known is because profiling provides a long term solution rather than a current problem solution. It isn't effective for multi-threaded applications and isn't effective for applications like games which spend a significant portion of its time rendering.
Furthermore, in single threaded applications if you have a method that you expect to consume the most run time, and you want to reduce the run-time of all other methods then it is going to be harder to determine which secondary methods to focus your efforts upon first.
Your process for profiling is an acceptable method that can and does work, but profiling provides you with more information and has the benefit of showing you more detailed performance improvements and regressions.
If you have well instrumented code then you can examine more than just the how long a particular method; you can see all the methods.
With profiling:
You can then rerun your scenario after each change to determine the degree of performance improvement/regression.
You can profile the code on different hardware configurations to determine if your production hardware is going to be sufficient.
You can profile the code under load and stress testing scenarios to determine how the volume of information impacts performance
You can make it easier for junior developers to visualise the impacts of their changes to your code because they can re-profile the code in six months time while you're off at the beach or the pub, or both. Beach-pub, ftw.
Profiling is given more weight because enterprise code should always have some degree of profiling because of the benefits it gives to the organisation of an extended period of time. The more important the code the more profiling and testing you do.
Your approach is valid and is another item is the toolbox of the developer. It just gets outweighed by profiling.
Sampling profilers are only useful when
You are monitoring a runtime with a small number of threads. Preferably one.
The call stack depth of each thread is relatively small (to reduce the incredible overhead in collecting a sample).
You are only concerned about wall clock time and not other meters or resource bottlenecks.
You have not instrumented the code for management and monitoring purposes (hence the stack dump requests)
You mistakenly believe removing a stack frame is an effective performance improvement strategy whether the inherent costs (excluding callees) are practically zero or not
You can't be bothered to learn how to apply software performance engineering day-to-day in your job
....
Stack trace snapshots only allow you to see stroboscopic x-rays of your application. You may require more accumulated knowledge which a profiler may give you.
The trick is knowing your tools well and choose the best for the job at hand.
These must be some trivial examples that you are working with to get useful results with your method. I can't think of a project where profiling was useful (by whatever method) that would have gotten decent results with your "quick and effective" method. The time it takes to start and stop some applications already puts your assertion of "quick" in question.
Again, with non-trivial programs the method you advocate is useless.
EDIT:
Regarding "why isn't it better known"?
In my experience code reviews avoid poor quality code and algorithms, and profiling would find these as well. If you wish to continue with your method that is great - but I think for most of the professional community this is so far down on the list of things to try that it will never get positive reinforcement as a good use of time.
It appears to be quite inaccurate with small sample sets and to get large sample sets would take lots of time that would have been better spent with other useful activities.
What if the program is in production and being used at the same time by paying clients or colleagues. A profiler allows you to observe without interferring (as much, because of course it will have a little hit too as per the Heisenberg principle).
Profiling can also give you much richer and more detailed accurate reports. This will be quicker in the long run.
EDIT 2008/11/25: OK, Vineet's response has finally made me see what the issue is here. Better late than never.
Somehow the idea got loose in the land that performance problems are found by measuring performance. That is confusing means with ends. Somehow I avoided this by single-stepping entire programs long ago. I did not berate myself for slowing it down to human speed. I was trying to see if it was doing wrong or unnecessary things. That's how to make software fast - find and remove unnecessary operations.
Nobody has the patience for single-stepping these days, but the next best thing is to pick a number of cycles at random and ask what their reasons are. (That's what the call stack can often tell you.) If a good percentage of them don't have good reasons, you can do something about it.
It's harder these days, what with threading and asynchrony, but that's how I tune software - by finding unnecessary cycles. Not by seeing how fast it is - I do that at the end.
Here's why sampling the call stack cannot give a wrong answer, and why not many samples are needed.
During the interval of interest, when the program is taking more time than you would like, the call stack exists continuously, even when you're not sampling it.
If an instruction I is on the call stack for fraction P(I) of that time, removing it from the program, if you could, would save exactly that much. If this isn't obvious, give it a bit of thought.
If the instruction shows up on M = 2 or more samples, out of N, its P(I) is approximately M/N, and is definitely significant.
The only way you can fail to see the instruction is to magically time all your samples for when the instruction is not on the call stack. The simple fact that it is present for a fraction of the time is what exposes it to your probes.
So the process of performance tuning is a simple matter of picking off instructions (mostly function call instructions) that raise their heads by turning up on multiple samples of the call stack. Those are the tall trees in the forest.
Notice that we don't have to care about the call graph, or how long functions take, or how many times they are called, or recursion.
I'm against obfuscation, not against profilers. They give you lots of statistics, but most don't give P(I), and most users don't realize that that's what matters.
You can talk about forests and trees, but for any performance problem that you can fix by modifying code, you need to modify instructions, specifically instructions with high P(I). So you need to know where those are, preferably without playing Sherlock Holmes. Stack sampling tells you exactly where they are.
This technique is harder to employ in multi-thread, event-driven, or systems in production. That's where profilers, if they would report P(I), could really help.
Stepping through code is great for seeing the nitty-gritty details and troubleshooting algorithms. It's like looking at a tree really up close and following each vein of bark and branch individually.
Profiling lets you see the big picture, and quickly identify trouble points -- like taking a step backwards and looking at the whole forest and noticing the tallest trees. By sorting your function calls by length of execution time, you can quickly identify the areas that are the trouble points.
I used this method for Commodore 64 BASIC many years ago. It is surprising how well it works.
I've typically used it on real-time programs that were overrunning their timeslice. You can't manually stop and restart code that has to run 60 times every second.
I've also used it to track down the bottleneck in a compiler I had written. You wouldn't want to try to break such a program manually, because you really have no way of knowing if you are breaking at the spot where the bottlenck is, or just at the spot after the bottleneck when the OS is allowed back in to stop it. Also, what if the major bottleneck is something you can't do anything about, but you'd like to get rid of all the other largeish bottlenecks in the system? How to you prioritize which bottlenecks to attack first, when you don't have good data on where they all are, and what their relative impact each is?
The larger your program gets, the more useful a profiler will be. If you need to optimize a program which contains thousands of conditional branches, a profiler can be indispensible. Feed in your largest sample of test data, and when it's done import the profiling data into Excel. Then you check your assumptions about likely hot spots against the actual data. There are always surprises.

Resources