One could use a profiler, but why not just halt the program? [closed]

One could use a profiler, but why not just halt the program? [closed] - performance

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
If something is making a single-thread program take, say, 10 times as long as it should, you could run a profiler on it. You could also just halt it with a "pause" button, and you'll see exactly what it's doing.
Even if it's only 10% slower than it should be, if you halt it more times, before long you'll see it repeatedly doing the unnecessary thing. Usually the problem is a function call somewhere in the middle of the stack that isn't really needed. This doesn't measure the problem, but it sure does find it.
Edit: The objections mostly assume that you only take 1 sample. If you're serious, take 10. Any line of code causing some percentage of wastage, like 40%, will appear on the stack on that fraction of samples, on average. Bottlenecks (in single-thread code) can't hide from it.
EDIT: To show what I mean, many objections are of the form "there aren't enough samples, so what you see could be entirely spurious" - vague ideas about chance. But if something of any recognizable description, not just being in a routine or the routine being active, is in effect for 30% of the time, then the probability of seeing it on any given sample is 30%.
Then suppose only 10 samples are taken. The number of times the problem will be seen in 10 samples follows a binomial distribution, and the probability of seeing it 0 times is .028. The probability of seeing it 1 time is .121. For 2 times, the probability is .233, and for 3 times it is .267, after which it falls off. Since the probability of seeing it less than two times is .028 + .121 = .139, that means the probability of seeing it two or more times is 1 - .139 = .861. The general rule is if you see something you could fix on two or more samples, it is worth fixing.
In this case, the chance of seeing it in 10 samples is 86%. If you're in the 14% who don't see it, just take more samples until you do. (If the number of samples is increased to 20, the chance of seeing it two or more times increases to more than 99%.) So it hasn't been precisely measured, but it has been precisely found, and it's important to understand that it could easily be something that a profiler could not actually find, such as something involving the state of the data, not the program counter.

On Java servers it's always been a neat trick to do 2-3 quick Ctrl-Breakss in a row and get 2-3 threaddumps of all running threads. Simply looking at where all the threads "are" may extremely quickly pinpoint where your performance problems are.
This technique can reveal more performance problems in 2 minutes than any other technique I know of.

Because sometimes it works, and sometimes it gives you completely wrong answers. A profiler has a far better record of finding the right answer, and it usually gets there faster.

Doing this manually can't really be called "quick" or "effective", but there are several profiling tools which do this automatically; also known as statistical profiling.

Callstack sampling is a very useful technique for profiling, especially when looking at a large, complicated codebase that could be spending its time in any number of places. It has the advantage of measuring the CPU's usage by wall-clock time, which is what matters for interactivity, and getting callstacks with each sample lets you see why a function is being called. I use it a lot, but I use automated tools for it, such as Luke Stackwalker and OProfile and various hardware-vendor-supplied things.
The reason I prefer automated tools over manual sampling for the work I do is statistical power. Grabbing ten samples by hand is fine when you've got one function taking up 40% of runtime, because on average you'll get four samples in it, and always at least one. But you need more samples when you have a flat profile, with hundreds of leaf functions, none taking more than 1.5% of the runtime.
Say you have a lake with many different kinds of fish. If 40% of the fish in the lake are salmon (and 60% "everything else"), then you only need to catch ten fish to know there's a lot of salmon in the lake. But if you have hundreds of different species of fish, and each species is individually no more than 1%, you'll need to catch a lot more than ten fish to be able to say "this lake is 0.8% salmon and 0.6% trout."
Similarly in the games I work on, there are several major systems each of which call dozens of functions in hundreds of different entities, and all of this happens 60 times a second. Some of those functions' time funnels into common operations (like malloc), but most of it doesn't, and in any case there's no single leaf that occupies more than 1000 μs per frame.
I can look at the trunk functions and see, "we're spending 10% of our time on collision", but that's not very helpful: I need to know exactly where in collision, so I know which functions to squeeze. Just "do less collision" only gets you so far, especially when it means throwing out features. I'd rather know "we're spending an average 600 μs/frame on cache misses in the narrow phase of the octree because the magic missile moves so fast and touches lots of cells," because then I can track down the exact fix: either a better tree, or slower missiles.
Manual sampling would be fine if there were a big 20% lump in, say, stricmp, but with our profiles that's not the case. Instead I have hundreds of functions that I need to get from, say, 0.6% of frame to 0.4% of frame. I need to shave 10 μs off every 50 μs function that is called 300 times per second. To get that kind of precision, I need more samples.
But at heart what Luke Stackwalker does is what you describe: every millisecond or so, it halts the program and records the callstack (including the precise instruction and line number of the IP). Some programs just need tens of thousands of samples to be usefully profiled.
(We've talked about this before, of course, but I figured this was a good place to summarize the debate.)

There's a difference between things that programmers actually do, and things that they recommend others do.
I know of lots of programmers (myself included) that actually use this method. It only really helps to find the most obvious of performance problems, but it's quick and dirty and it works.
But I wouldn't really tell other programmers to do it, because it would take me too long to explain all the caveats. It's far too easy to make an inaccurate conclusion based on this method, and there are many areas where it just doesn't work at all. (for example, that method doesn't reveal any code that is triggered by user input).
So just like using lie detectors in court, or the "goto" statement, we just don't recommend that you do it, even though they all have their uses.

I'm surprised by the religous tone on both sides.
Profiling is great, and certainly is a more refined and precise when you can do it. Sometimes you can't, and it's nice to have a trusty back-up. The pause technique is like the manual screwdriver you use when your power tool is too far away or the bateries have run-down.
Here is a short true story. An application (kind of a batch proccessing task) had been running fine in production for six months, suddenly the operators are calling developers because it is going "too slow". They aren't going to let us attach a sampling profiler in production! You have to work with the tools already installed. Without stopping the production process, just using Process Explorer, (which operators had already installed on the machine) we could see a snapshot of a thread's stack. You can glance at the top of the stack, dismiss it with the enter key and get another snapshot with another mouse click. You can easily get a sample every second or so.
It doesn't take long to see if the top of the stack is most often in the database client library DLL (waiting on the database), or in another system DLL (waiting for a system operation), or actually in some method of the application itself. In this case, if I remember right, we quickly noticed that 8 times out of 10 the application was in a system DLL file call reading or writing a network file. Sure enough recent "upgrades" had changed the performance characteristics of a file share. Without a quick and dirty and (system administrator sanctioned) approach to see what the application was doing in production, we would have spent far more time trying to measure the issue, than correcting the issue.
On the other hand, when performance requirements move beyond "good enough" to really pushing the envelope, a profiler becomes essential so that you can try to shave cycles from all of your closely-tied top-ten or twenty hot spots. Even if you are just trying to hold to a moderate performance requirement durring a project, when you can get the right tools lined-up to help you measure and test, and even get them integrated into your automated test process it can be fantasticly helpful.
But when the power is out (so to speak) and the batteries are dead, it's nice know how to use that manual screwdriver.
So the direct answer: Know what you can learn from halting the program, but don't be afraid of precision tools either. Most importantly know which jobs call for which tools.

Hitting the pause button during the execution of a program in "debug" mode might not provide the right data to perform any performance optimizations. To put it bluntly, it is a crude form of profiling.
If you must avoid using a profiler, a better bet is to use a logger, and then apply a slowdown factor to "guesstimate" where the real problem is. Profilers however, are better tools for guesstimating.
The reason why hitting the pause button in debug mode, may not give a real picture of application behavior is because debuggers introduce additional executable code that can slowdown certain parts of the application. One can refer to Mike Stall's blog post on possible reasons for application slowdown in a debugging environment. The post sheds light on certain reasons like too many breakpoints,creation of exception objects, unoptimized code etc. The part about unoptimized code is important - the "debug" mode will result in a lot of optimizations (usually code in-lining and re-ordering) being thrown out of the window, to enable the debug host (the process running your code) and the IDE to synchronize code execution. Therefore, hitting pause repeatedly in "debug" mode might be a bad idea.

If we take the question "Why isn't it better known?" then the answer is going to be subjective. Presumably the reason why it is not better known is because profiling provides a long term solution rather than a current problem solution. It isn't effective for multi-threaded applications and isn't effective for applications like games which spend a significant portion of its time rendering.
Furthermore, in single threaded applications if you have a method that you expect to consume the most run time, and you want to reduce the run-time of all other methods then it is going to be harder to determine which secondary methods to focus your efforts upon first.
Your process for profiling is an acceptable method that can and does work, but profiling provides you with more information and has the benefit of showing you more detailed performance improvements and regressions.
If you have well instrumented code then you can examine more than just the how long a particular method; you can see all the methods.
With profiling:
You can then rerun your scenario after each change to determine the degree of performance improvement/regression.
You can profile the code on different hardware configurations to determine if your production hardware is going to be sufficient.
You can profile the code under load and stress testing scenarios to determine how the volume of information impacts performance
You can make it easier for junior developers to visualise the impacts of their changes to your code because they can re-profile the code in six months time while you're off at the beach or the pub, or both. Beach-pub, ftw.
Profiling is given more weight because enterprise code should always have some degree of profiling because of the benefits it gives to the organisation of an extended period of time. The more important the code the more profiling and testing you do.
Your approach is valid and is another item is the toolbox of the developer. It just gets outweighed by profiling.

Sampling profilers are only useful when
You are monitoring a runtime with a small number of threads. Preferably one.
The call stack depth of each thread is relatively small (to reduce the incredible overhead in collecting a sample).
You are only concerned about wall clock time and not other meters or resource bottlenecks.
You have not instrumented the code for management and monitoring purposes (hence the stack dump requests)
You mistakenly believe removing a stack frame is an effective performance improvement strategy whether the inherent costs (excluding callees) are practically zero or not
You can't be bothered to learn how to apply software performance engineering day-to-day in your job
....

Stack trace snapshots only allow you to see stroboscopic x-rays of your application. You may require more accumulated knowledge which a profiler may give you.
The trick is knowing your tools well and choose the best for the job at hand.

These must be some trivial examples that you are working with to get useful results with your method. I can't think of a project where profiling was useful (by whatever method) that would have gotten decent results with your "quick and effective" method. The time it takes to start and stop some applications already puts your assertion of "quick" in question.
Again, with non-trivial programs the method you advocate is useless.
EDIT:
Regarding "why isn't it better known"?
In my experience code reviews avoid poor quality code and algorithms, and profiling would find these as well. If you wish to continue with your method that is great - but I think for most of the professional community this is so far down on the list of things to try that it will never get positive reinforcement as a good use of time.
It appears to be quite inaccurate with small sample sets and to get large sample sets would take lots of time that would have been better spent with other useful activities.

What if the program is in production and being used at the same time by paying clients or colleagues. A profiler allows you to observe without interferring (as much, because of course it will have a little hit too as per the Heisenberg principle).
Profiling can also give you much richer and more detailed accurate reports. This will be quicker in the long run.

EDIT 2008/11/25: OK, Vineet's response has finally made me see what the issue is here. Better late than never.
Somehow the idea got loose in the land that performance problems are found by measuring performance. That is confusing means with ends. Somehow I avoided this by single-stepping entire programs long ago. I did not berate myself for slowing it down to human speed. I was trying to see if it was doing wrong or unnecessary things. That's how to make software fast - find and remove unnecessary operations.
Nobody has the patience for single-stepping these days, but the next best thing is to pick a number of cycles at random and ask what their reasons are. (That's what the call stack can often tell you.) If a good percentage of them don't have good reasons, you can do something about it.
It's harder these days, what with threading and asynchrony, but that's how I tune software - by finding unnecessary cycles. Not by seeing how fast it is - I do that at the end.
Here's why sampling the call stack cannot give a wrong answer, and why not many samples are needed.
During the interval of interest, when the program is taking more time than you would like, the call stack exists continuously, even when you're not sampling it.
If an instruction I is on the call stack for fraction P(I) of that time, removing it from the program, if you could, would save exactly that much. If this isn't obvious, give it a bit of thought.
If the instruction shows up on M = 2 or more samples, out of N, its P(I) is approximately M/N, and is definitely significant.
The only way you can fail to see the instruction is to magically time all your samples for when the instruction is not on the call stack. The simple fact that it is present for a fraction of the time is what exposes it to your probes.
So the process of performance tuning is a simple matter of picking off instructions (mostly function call instructions) that raise their heads by turning up on multiple samples of the call stack. Those are the tall trees in the forest.
Notice that we don't have to care about the call graph, or how long functions take, or how many times they are called, or recursion.
I'm against obfuscation, not against profilers. They give you lots of statistics, but most don't give P(I), and most users don't realize that that's what matters.
You can talk about forests and trees, but for any performance problem that you can fix by modifying code, you need to modify instructions, specifically instructions with high P(I). So you need to know where those are, preferably without playing Sherlock Holmes. Stack sampling tells you exactly where they are.
This technique is harder to employ in multi-thread, event-driven, or systems in production. That's where profilers, if they would report P(I), could really help.

Stepping through code is great for seeing the nitty-gritty details and troubleshooting algorithms. It's like looking at a tree really up close and following each vein of bark and branch individually.
Profiling lets you see the big picture, and quickly identify trouble points -- like taking a step backwards and looking at the whole forest and noticing the tallest trees. By sorting your function calls by length of execution time, you can quickly identify the areas that are the trouble points.

I used this method for Commodore 64 BASIC many years ago. It is surprising how well it works.

I've typically used it on real-time programs that were overrunning their timeslice. You can't manually stop and restart code that has to run 60 times every second.
I've also used it to track down the bottleneck in a compiler I had written. You wouldn't want to try to break such a program manually, because you really have no way of knowing if you are breaking at the spot where the bottlenck is, or just at the spot after the bottleneck when the OS is allowed back in to stop it. Also, what if the major bottleneck is something you can't do anything about, but you'd like to get rid of all the other largeish bottlenecks in the system? How to you prioritize which bottlenecks to attack first, when you don't have good data on where they all are, and what their relative impact each is?

The larger your program gets, the more useful a profiler will be. If you need to optimize a program which contains thousands of conditional branches, a profiler can be indispensible. Feed in your largest sample of test data, and when it's done import the profiling data into Excel. Then you check your assumptions about likely hot spots against the actual data. There are always surprises.

Related

Do (sampling) profilers still "lie" these days?

Most of my limited experience with profiling native code is on a GPU rather than on a CPU, but I see some CPU profiling in my future...
Now, I've just read this blog post:
How profilers lie: The case of gprof and KCacheGrind
about how what profilers measure and what they show you, which is likely not what you expect if you're interested in discerning between different call paths and the time spent in them.
My question is: Is this still the case today (5 years later)? That is, do sampling profilers (i.e. those who don't slow execution down terribly) still behave the way gprof used to (or callgrind without --separate-callers=N)? Or do profilers nowadays customarily record the entire call stack when sampling?

No, many modern sampling profilers don't exhibit the problem described regarding gprof.
In fact, even when that was written, the specific problem was actually more a quirk of the way gprof uses a mix of instrumentation and sampling and then tries to reconstruct a hypothetical call graph based on limited caller/callee information and combine that with the sampled timing information.
Modern sampling profilers, such as perf, VTune, and various language-specific profilers to languages that don't compile to native code can capture the full call stack with each sample, which provides accurate times with respect to that issue. Alternately, you might sample without collecting call stacks (which reduces greatly the sampling cost) and then present the information without any caller/callee information which would still be accurate.
This was largely true even in the past, so I think it's fair to say that sampling profilers never, as a group, really exhibited that problem.
Of course, there are still various ways in which profilers can lie. For example, getting results accurate to the instruction level is a very tricky problem, given modern CPUs with 100s of instructions in flight at once, possibly across many functions, and complex performance models where instructions may have a very different in-context cost as compared to their nominal latency and throughput values. Even that tricky issues can be helped with "hardware assist" such as on recent x86 chips with PEBS support and later related features that help you pin-point an instruction in a less biased way.

Regarding gprof, yes, it's still the case today. This is by design, to keep the profiling overhead small. From the up-to-date documentation:
Some of the figures in the call graph are estimates—for example, the
children time values and all the time figures in caller and subroutine
lines.
There is no direct information about these measurements in the profile
data itself. Instead, gprof estimates them by making an assumption
about your program that might or might not be true.
The assumption made is that the average time spent in each call to any
function foo is not correlated with who called foo. If foo used 5
seconds in all, and 2/5 of the calls to foo came from a, then foo
contributes 2 seconds to a’s children time, by assumption.
Regarding KCacheGrind, little has changed since the article was written. You can check out the change log and see that the latest version was published in April 5, 2013, which includes unrelated changes. You can also refer to Josef Weidendorfer's comments under the article (Josef is the author of KCacheGrind).

If you noticed, I contributed several comments to that post you referenced, but it's not just that profilers give you bad information, it's that people fool themselves about what performance actually is.
What is your goal? Is it to A) find out how to make the program as fast as possible? Or is it to B) measure time taken by various functions, hoping that will lead to A? (Hint - it doesn't.) Here's a detailed list of the issues.
To illustrate: You could, for example, be calling a teeny innocent-looking little function somewhere that just happens to invoke nine yards of system code including reading a .dll to extract a string resource in order to internationalize it. This could be taking 50% of wall-clock time and therefore be on the stack 50% of wall-clock time. Would a "CPU-profiler" show it to you? No, because practically all of that 50% is doing I/O. Do you need many many stack samples to know to 3 decimal places exactly how much time it's taking? Of course not. If you only got 10 samples it would be on 5 of them, give or take. Once you know that teeny routine is a big problem, does that mean you're out of luck because somebody else wrote it? What if you knew what the string was that it was looking up? Does it really need to be internationalized, so much so that you're willing to pay a factor of two in slowness just for that? Do you see how useless measurements are when your real problem is to understand qualitatively what takes time?
I could go on and on with examples like this...

Profiling vs debugging what is the practice

Most often, I do wounder how best to make my applications have optimal performance. How to optimize and identify functions/methods that are more resource intensive than the others and make necessary adjustments. In software development irrespective of language I believe that, there should some ways of finding out how the processor/network resources are being used by different parts of my codes. I will illustrate what I mean using the simplest example I can think of: I have background on Java, Python and PHP and feel more comfortable working on linux environment. Please feel free to advice me using any of these languages you are comfortable with:
In Javascript one can comfortably test and assign a value to a variable by doing:
//METHOD 1:
if(true){
console.log("It will always be true");
}else{
console.log("You can never see me");
}
//METHOD 2:
var print;
if(true){
print="It will always be true";
}else{
print="You can never see me";
}
console.log(print);
//METHOD 3:
console.log((true)? "It will always be true" : "You can never see me");
If different people were to be asked which of these methods will perform faster than the other. I am sure that different individuals will come up with different ideas. But I need a more reliable way to know about resource usage both on desktop and mobile applications. Thanks.

First of all, Debugging is done when you know the exact bug or wrong functionality.
It is done for functional testing i.e. to check defects in application.
For performance, profilers are used to find out resource intensive methods. they will provide heavy modules/functions/DB queries etc. so that after analysis you can tune and improve your system performance.If after modification some defect arises then you debugger can be used to pinpoint and correct the issue.
There are many opensource as well as paid profilers for java(I am saying java because you pasted javascript code).
Please have a look at them and use them to tune your system.
IMHO downvoting was for 2 reasons bad english and very basic question.

In the examples you gave, it simply doesn't matter. That's like asking which is faster, a snail or a worm BUT, if by
"profiling" you mean "measurements of time taken by functions", and if by
"debugging performance problem" you mean "the bug is that time is being spent unnecessarily, and I need to find out why",
then I strongly agree with the premise of your question.
Debugging works much better.
A performance problem consists of excess time being spent for unnecessary reasons.
The way to find it is by breaking into the execution at random times, which will naturally gravitate toward whatever is taking the most time.
The more time the problem takes, the more likely the break is to land in the problem.
Just do it several times, and each time look carefully at the program's state, to understand why it's doing whatever it's doing.
If you see it doing something that can be avoided, and you see it doing it on more than one break, you've found a performance problem.
Fix it and observe the speedup.
Then repeat the process, because what was the next biggest problem is now the biggest one.
The speedups multiply together.
Here's an example where the time goes from 2700us to 1800, then to 1500, 1300, 440, 170, and finally 3.7us. None of the speedups were all that big as fractions of the original time, but in the aggregate, they were stunning.
This is very much different from measuring.
In measuring, the first thing that happens is you assume numerical accuracy is important, so you assume you either need lots of samples or you have to instrument the code to measure time precisely.
As if numerical accuracy helps to find the problem.
This false assumption entered programmers' consciousness around 1982, when GPROF appeared.
In measuring, you assume the problem can be localized to a function, when in fact you need to see all of what's happening at a point in time to know if it can be avoided.
IMHO, the best profilers sample the stack, on wall-clock time, and report for each line of code that appears on samples, the percent of samples it appears on.
However, even these profilers don't tell you the context that you can get by simply looking carefully at individual stack samples, and data as well.
(Other forms of eye-candy have the same problem: call-graph, hot-path, flame-graph, etc.)

Most hazardous performance bottleneck misconceptions

The guys who wrote Bespin (cloud-based canvas-based code editor [and more]) recently spoke about how they re-factored and optimize a portion of the Bespin code because of a misconception that JavaScript was slow. It turned out that when all was said and done, their optimization produced no significant improvements.
I'm sure many of us go out of our way to write "optimized" code based on misconceptions similar to that of the Bespin team.
What are some common performance bottleneck misconceptions developers commonly subscribe to?

In no particular order:
"Ready, Fire, Aim" - thinking you know what needs to be optimized without proving it (i.e. guessing) and then acting on that, and since it doesn't help much, therefore assuming the code must have been optimal to begin with.
"Penny Wise, Pound Foolish" - thinking that optimization is all about compiler optimization, fussing about ++i vs. i++ while mountains of time are being spent needlessly in overblown designs, especially of data structures and databases.
"Swat Flies With a Bazooka" - being so enamored of the fanciest ideas heard in classrooms that they are just used for everything, regardless of scale.
"Fuzzy Thinking about Performance" - throwing around terms like "hotspot" and "bottleneck" and "profiler" and "measure" as if these things were well understood and / or relevant. (I bet I get clobbered for that!) OK, one at a time:
hotspot - What's the definition? I have one: it is a region of physical addresses where the PC register is found a significant fraction of time. It is the kind of thing PC samplers are good at finding. Many performance problems exhibit hotspots, but only in the simplest programs is the problem in the same place as the hotspot is.
bottleneck - A catch-all term used for performance problems, it implies a limited channel constraining how fast work can be accomplished. The unstated assumption is that the work is necessary. In my decades of performance tuning, I have in fact found a few problems like that - very few. Nearly all are of a very different nature. Rather than taking the shortest route from point A to B, little detours are taken, in the form of function calls which take little code, but not little time. Then those detours take further nested detours, sometimes 30 levels deep. The more the detours are nested, the more likely it is that some of them are less than necessary - wasteful, in fact - and it nearly always arises from galloping generality - unquestioning over-indulgence in "abstraction".
profiler - a universal good thing, right? All you gotta do is get a profiler and do some profiling, right? Ever think about how easy it is to fool a profiler into telling you a lot of nothing, when your goal is to find out what you need to fix to get better performance? Somewhere in your call tree, tuck a little file I/O, or a little call to some system routine, or have your evil twin do it without your knowledge. At some point, that will be your problem, and most profilers will miss it completely because the only inefficiency they contemplate is algorithmic inefficiency. Or, not all your routines will be little, and they may not call another routine in a small number of places, so your call graph says there's a link between the two routines, but which call? Or suppose you can figure out that some big percent of time is spent in a path A calls B calls C. You can look at that and think there's not much you can do about it, when if you could also look at the data being passed in those calls, you could see if it's necesssary. Here's a fun project - pick your favorite profiler, and then see how many ways you could fool it. That is, find ways to make the program take longer without the profiler being able to tell what you did, because if you can do it intentionally, you can also do it without intending to.
measure - (i.e. measure time) that's what profilers have done for decades, and they take pride in the accuracy and precision with which they measure. But measure what time? and why with accuracy? Remember, the goal is to precisely locate performance problems, such that you could fruitfully optimize them to gain a speedup. When you get that speedup, it is what it is, regardless of how precisely you estimated it beforehand. If that precision of measurement is bought at the expense of precision of location, then you've just bought apples when what you needed was oranges.
Here's a list of myths about performance.

And this is what happens when one optimizes without a valid profile in hand. All you are doing without a profile is guessing and probably wasting time and money. I could list a bunch of other misconceptions, but many come down to the fact that if the code in question isn't a top resource consumer, it is probably fine as is. Kinda like unrolling a for loop that is doing disk I/O...

If I convert the whole code base over to [Insert xxx latest technology here], it'll be much faster.

Relational databases are slow.
I'm smarter than the optimizer.
This should be optimized.
Java is slow
And, unrelated:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-jwz

Optimizing the WRONG part of the code (people, use your profiler!).
The optimizer in my compiler is smart, so I don't have to help it.
Java is fast (LOL)
Relational databases are fast (ROTFL LOL LMAO)

"This has to be as fast as possible."
If you don't have a performance problem, you don't need to worry about optimizing performance (beyond just paying attention to using good algorithms).
This misconception also manifests in attempts to optimize performance for every single aspect of your program. I see this most often with people trying to shave every last millisecond off of a low-volume web application's execution time while failing to take into account that network latency will take orders of magnitude longer than their code's execution time, making any reduction in execution time irrelevant anyhow.

My rules of optimization.
Don't optimize
Don't optimize now.
Profile to identify the problem.
Optimize the component that is taking at least 80% of the time.
Find an optimization that is 10 times faster.
My best optimization has been reducing a report from 3 days to 9 minutes. The optimized code was sped up from three days to three minutes.
In my carreer I have met three people who had been tasked with producing a faster sort on VAX than the native sort. They invariably had been able to produce sorts that took only three times longer.

The rules are simple:
Try to use standard library functions first.
Try to use brute-force and ignorance second.
Prove you've got a problem before trying to do any optimization.

When is performance gain significant enough to implement that optimization?

following the text book, I do measure performance whenever I try optimizing my code. Sometimes, however, the performance gain is rather small and I can't decisively decide whether I should implement that optimization.
For example, when a fix shortens an average response time of 100ms to 90ms under some conditions, should I implement that fix? What if it shortens 200ms to 190ms? How many condition should I try before I can conclude that it will be beneficial overall?
I guess it's not possible to give a straight forward answer to this, as it depends on too many things, but is there a good rule of thumb that I should follow? Are there any guideline/best-practices?
EDIT:Thanks for the great answers! I guess the moral of the story is, there is no easy way to tell whether you should, but there ARE guidelines that can aid that process.. Things you should consider, things you shouldn't do etc. This particular time I ended up implementing the fix, even though it made a few line of code into 20-30 lines of code. Because our app. is very performance critical, and it was a consistent 10% gain in various realistic cases.

I think the rule of thumb (at least for me) is two-fold:
"It matters if it matters"--in the business world, this generally means that it matters if the clients care. That is, if the end users will "notice" the difference between 100ms and 90ms (I'm not being facetious here), then it matters.
If "it matters," then you will want to test your code thoroughly against a realistic variety of use cases that are likely to arise or at least may arise. If an optimization speeds up code in 50% of cases, but actually runs slower than what you previously had the other 50% of the time, obviously, it may not be worth implementing.
Regarding point 1 above: by suggesting an end user of your software might "notice" a 10ms difference, I don't mean to suggest that they will actually visibly see a difference. But if your app runs on a server with millions of connections and every little speed increase takes a substantial load off the server, that might matter to the client running the server. Or if your app performs extremely time-critical work, this is another case where the result of a 10ms speedup might be noticeable, even if the speedup itself isn't.

The only sensible approach to your question is something along the lines of "when the benefit is large enough to warrant the time you invest in exploring, implementing and testing the optimization."
The "benefit is large enough" is extremely subjective. Can you or your employer sell more units of software if you make this change? Will your user base notice? Will it give you personal gratification to have the fastest-possible code? Which of those or similar questions apply is something only you can know.
By and large, most of the software I have written (in a 20+ year career) has been "fast enough" out of the box, and the code I cared to optimize presented itself as an obvious bottleneck to the end users: Queries taking a long time, scrolling too slow, that sort of thing.

Donald Knuth made the following two statements on optimization:
"We should forget about small
efficiencies, say about 97% of the
time: premature optimization is the
root of all evil" [2]
and
"In established engineering
disciplines a 12 % improvement, easily
obtained, is never considered marginal
and I believe the same viewpoint
should prevail in software
engineering"[5]
src: http://en.wikipedia.org/wiki/Program_optimization

Is the optimization obfuscating your code too much?
Do you really need an optimization? if your app just runs fine then readability of the code is probably more important
Did you work on the general design and algorithms of your application before trying small hacky optimizations?

You should focus optimisation efforts on the parts of code that account for the most runtime. If a particular piece of code takes up 80% of the total runtime, then optimising it to reduce the time is takes by 5% will have as much impact as reducing the time of the rest of the code by 20%.
In general, optimisations make code less readable (not always, but often). Therefore you should avoid optimising until you are sure that there is a problem.

If it speeds up your program at all, why not implement it? You have already done the work by creating the new implementation, so you are not doing extra work by applying the new implementation.
Unless the code is THAT much harder to understand.
Also, 100 ms to 90 ms is a 10% gain in performance. A 10% gain should not be taken lightly.
The real question is, if it only took 100 ms to run in the first place, what was the point in trying to optimize it?

As long as it's fast enough, then you don't need to optimise any more. But then, you wouldn't even bother profiling if that was the case...

If the performance gain is small, consider the other factors: maintainability, risk of making the change, understandability, etc. If it reduces the ability to maintain or understand the code, it probably isn't worth doing. If it improves those attributes, then it's more reason to implement the change.

In most cases, your time is more valuable than the computer's. If you think it'll take you half an hour longer to work out what the code is doing later (say if there's a bug in it), and it's only saved you a few seconds, ever, you're at a net loss.

It depends very much on the usage scenario. I'll assume here that the code in question has been profiled and thus it is known to be the bottleneck--i.e. not just "this could be faster", but "the program would give results/finish running faster if this were faster". In situations where this is not the case--e.g. if you spend 99% of your time waiting for more data to come over an ethernet connection--then you should care about correctness but not optimize for speed.
If you are writing a piece of user interface code, what you care about is perceived speed. Generally anything under ~100 ms is perceived as "instant"--no point speeding it up.
If you are writing a piece of code for a giant server farm, then if the cost of your salary to make the code fast is less than the cost of the extra electricity for the server farm, it's worthwhile. (But be sure to prioritize your time.)
If you are writing a piece of code that is used rarely or when unattended, as long as it completes in a semi-sane duration, don't worry about it. Install scripts tend to be of this sort (unless you start running into many minutes, at which point users might start abandoning the install because it's taking too long).
If you are writing code to automate a task for someone else, then if (your time spent coding + their time spent using the optimized code) is less than (their time spent using the slow code), it's worthwhile. If you're doing this in a commercial setting, weight this by your respective salaries.
If you are writing library code that will be used by many thousands of people, always make it faster if you have time to.
If you are under time pressure to simply have something working e.g. as a demo, don't optimize (except through sensible choice of algorithms from libraries) unless the result would be so slow that it isn't even "working".
One of the biggest annoyances for me personally is finding software which perhaps initially fell into one category and then later fell into another, but for which nobody went back to do needed optimizations. Until recently Javascript performance was a great example of this. Moral of the story is: don't just decide once; revisit the issue as the situation demands.

When is the optimization really worth the time spent on it?

After my last question, I want to understand when an optimization is really worth the time a developer spends on it.
Is it worth spending 4 hours to have queries that are 20% quicker? Yes, no, maybe, yes if...?
Is 'wasting' 7 hours to switch a task to another language to save about 40% of the CPU usage 'worth it'?
My normal iteration for a new project is:
Understand what the customer wants, and what he needs;
Plan the project: what languages and where, database design;
Develop the project;
Test and bug-fix;
Final analysis of the running project and final optimizations;
If the project requires it, a further analysis on the real-life usage of the resources followed by further optimization;
"Write good, maintainable code" is implied.
Obviously, the big 'optimization' part happens at point #2, but often when reviewing the code after the project is over I find some sections that, even if they do their job well, could be improved. This is the rationale for point #5.
To give a concrete example of the last point, a simple example is when I expect 90% of queries to be SELECT and 10% to be INSERT/UPDATE, so I charge a DB table with indexes. But after 6 months, I see that in real-life there are 10% SELECT queries and 90% INSERT/UPDATEs, so the query speed is not optimized. This is the first example that comes to my mind (and obviously this is more a 'patch' to an initial mis-design than an optimization ;).
Please note that I'm a developer, not a business-man - but I like to have a clear conscience by giving my clients the best, where possible.
I mean, I know that if I lose 50 hours to gain 5% of an application's total speed-up, and the application is used by 10 users, then it maybe isn't worth the time... but what about when it is?
When do you think that an optimization is crucial?
What formula do you usually apply, aware that the time spent optimizing (and the final gain) is not always quantifiable on paper?
EDIT: sorry, but i cant accept an answer like 'untill people dont complain about id, optimization is not needed'; It can be a business-view (questionable, imho), but not an developer or (imho too) a good-sense answer. I know, this question is really subjective.
I agree with Cheeso, the performance optimization should be deferred, after some analysis about the real-life usage and load of the project, but a small'n'quick optimization can be done immediatly after the project is over.
Thanks to all ;)

YAGNI. Unless people complain, a lot.
EDIT : I built a library that was slightly slower than the alternatives out there. It was still gaining usage and share because it was nicer to use and more powerful. I continued to invest in features and capability, deferring any work on performance.
At some point, there were enough features and performance bubbled to the top of the list I finally spent some time working on perf improvement, but only after considering the effort for a long time.
I think this is the right way to approach it.

There are (at least) two categories of "efficiency" to mention here:
UI applications (and their dependencies), where the most important measure is the response time to the user.
Batch processing, where the main indicator is total running time.
In the first case, there are well-documented rules about response times. If you care about product quality, you need to keep response times short. The shorter the better, of course, but the breaking points are about:
100 ms for an "immediate" response; animation and other "real-time" activities need to happen at least this fast;
1 second for an "uninterrupted" response. Any more than this and users will be frustrated; you also need to start think about showing a progress screen past this point.
10 seconds for retaining user focus. Any worse than this and your users will be pissed off.
If you're finding that several operations are taking more than 10 seconds, and you can fix the performance problems with a sane amount of effort (I don't think there's a hard limit but personally I'd say definitely anything under 1 man-month and probably anything under 3-4 months), then you should definitely put the effort into fixing it.
Similarly, if you find yourself creeping past that 1-second threshold, you should be trying very hard to make it faster. At a minimum, compare the time it would take to improve the performance of your app with the time it would take to redo every slow screen with progress dialogs and background threads that the user can cancel - because it is your responsibility as a designer to provide that if the app is too slow.
But don't make a decision purely on that basis - the user experience matters too. If it'll take you 1 week to stick in some async progress dialogs and 3 weeks to get the running times under 1 second, I would still go with the latter. IMO, anything under a man-month is justifiable if the problem is application-wide; if it's just one report that's run relatively infrequently, I'd probably let it go.
If your application is real-time - graphics-related for example - then I would classify it the same way as the 10-second mark for non-realtime apps. That is, you need to make every effort possible to speed it up. Flickering is unacceptable in a game or in an image editor. Stutters and glitches are unacceptable in audio processing. Even for something as basic as text input, a 500 ms delay between the key being pressed and the character appearing is completely unacceptable unless you're connected via remote desktop or something. No amount of effort is too much for fixing these kinds of problems.
Now for the second case, which I think is mostly self-evident. If you're doing batch processing then you generally have a scalability concern. As long as the batch is able to run in the time allotted, you don't need to improve it. But if your data is growing, if the batch is supposed to run overnight and you start to see it creeping into the wee hours of the morning and interrupting people's work at 9:15 AM, then clearly you need to work on performance.
Actually, you really can't wait that long; once it fails to complete in the required time, you may already be in big trouble. You have to actively monitor the situation and maintain some sort of safety margin - say a maximum running time of 5 hours out of the available 6 before you start to worry.
So the answer for batch processes is obvious. You have a hard requirement that the bast must finish within a certain time. Therefore, if you are getting close to the edge, performance must be improved, regardless of how difficult/costly it is. The question then becomes what is the most economical means of improving the process?
If it costs significantly less to just throw some more hardware at the problem (and you know for a fact that the problem really does scale with hardware), then don't spend any time optimizing, just buy new hardware. Otherwise, figure out what combination of design optimization and hardware upgrades is going to get you the best ROI. It's almost purely a cost decision at this point.
That's about all I have to say on the subject. Shame on the people who respond to this with "YAGNI". It's your professional responsibility to know or at least find out whether or not you "need it." Assuming that anything is acceptable until customers complain is an abdication of this responsibility.
Simply because your customers don't demand it doesn't mean you don't need to consider it. Your customers don't demand unit tests, either, or even reasonably good/maintainable code, but you provide those things anyway because it is part of your profession. And at the end of the day, your customers will be a lot happier with a smooth, fast product than with any of those other developer-centric things.

Optimization is worth it when it is necessary.
If we have promised the client response times on holiday package searches that are 5 seconds or less and that the system will run on a single Oracle server (of whatever spec) and the searches are taking 30 seconds at peak load, then the optimization is definitely worth it because we're not going to get paid otherwise.
When you are initially developing a system, you (if you are a good developer) are designing things to be efficient without wasting time on premature optimization. If the resulting system isn't fast enough, you optimize. But your question seems to be suggesting that there's some hand-wavey additional optimization that you might do if you feel that it's worth it. That's not a good way to think about it because it implies that you haven't got a firm target in mind for what is acceptable to begin with. You need to discuss it with the stakeholders and set some kind of target before you start worrying about what kind of optimizations you need to do.

Like everyone said in the other questions answers is, when it makes monetary sense to change something then it needs changing. In most cases good enough wins the day. If the customers aren't complaining then it is good enough. If they are complaining then fix it enough so they stop complaining. Agile methodologies will give you some guidance on how to know when enough is enough. Who cares if something is using 40% CPU more CPU than you think it needs to be, if it is working and the customers are happy then it is good enough. Really simple, get it working and maintainable and then wait for complaints that probably will never come.
If what you are worried about was really a problem, NO ONE would ever have started using Java to build mission critical server side applications. Or Python or Erlang or anything else that isn't C for that matter. And if they did that, nothing would get done in a time frame to even acquire that first customer that you are so worried about losing. You will know well in advance that you need to change something well before it becomes a problem.

Good posting everyone.
Have you looked at the unnecessary usage of transactions for simple SELECT(s)? I got burned on that one a few times... I also did some code cleanup and found MANY graphs being returned for maybe 10 records needed.... on and on... sometimes it's not YOUR code per se, but someone cutting corners.... Good luck!

If the client doesn't see a need to do performance optimization, then there's no reason to do it.
Defining a measurable performance requirements SLA with the client (i.e., 95% of queries complete in under 2 seconds) near the beginning of the project lets you know if you're meeting that goal, or if you have more optimization to do. Performance testing at current and estimated future loads gives you the data that you need to see if you're meeting the SLA.

Optimization is rarely worth it before you know what needs to be optimized. Always remember that if I/O is basically idle and CPU is low, you're not getting anything out of the computer. Obviously you don't want the CPU pegged all the time and you don't want to be running out of I/O bandwidth, but realize that trying to have the computer basically idle all day while it performs intense operations is unrealistic.
Wait until you start to reach a predefined threshold (80% utilization is the mark I usually use, others think that's too high/low) and then optimize at that point if necessary. Keep in mind that the best solution may be scaling up or scaling out and not actually optimizing the software.

Use Amdal's law. It shows you which is the overall improvement when optimizing a certain part of a system.
Also: If it ain't broke, don't fix it.

Optimization is worth the time spent on it when you get good speedups for spending little time in optimizing (obviously). To get that, you need tools/techniques that lead you very quickly to the code whose optimization yields the most benefit.
It is common to think that the way to find that code is by measuring the time spent by functions, but to my mind that only provides clues - you still have to play detective. What takes me straight to the code is stackshots. Here is an example of a 40-times speedup, achieved by finding and fixing several problems. Others on SO have reported speedup factors from 7 to 60, achieved with little effort.*
*(7x: Comment 1. 60x: Comment 30.)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio