All too often I read statements about some new framework and their "benchmarks." My question is a general one but to the specific points of:
What approach should a developer take to effectively instrument code to measure performance?
When reading about benchmarks and performance testing, what are some red-flags to watch out for that might not represent real results?

There are two methods of measuring performance: using code instrumentation and using sampling.
The commercial profilers (Hi-Prof, Rational Quantify, AQTime) I used in the past used code instrumentation (some of them could also use sampling), and in my experience this gives the best, most detailed results. Rational Quantify especially allows you to zoom in on results, focus on subtrees, remove complete call trees to simulate an improvement, ...
The downside of these instrumenting profilers is that they:
tend to be slow (your code runs about 10 times slower)
take quite some time to instrument your application
don't always correctly handle exceptions in the application (in C++)
can be hard to set up if you have to disable the instrumentation of DLLs (we had to disable instrumentation for Oracle DLLs)
The instrumentation also sometimes skews the times reported for low-level functions like memory allocations, critical sections, ...
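To see why instrumentation can skew the times of small, frequently called functions, here is a minimal Python sketch of hand-rolled instrumentation. It is not how any of the profilers named above work internally; the `instrument` decorator and the `stats` dictionary are hypothetical names made up for illustration.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical store: function name -> [call count, total seconds].
stats = defaultdict(lambda: [0, 0.0])

def instrument(func):
    """Wrap func so every call is counted and timed."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper

@instrument
def tiny(x):
    # A cheap function: the clock reads and bookkeeping in the wrapper
    # likely cost more than the work itself, so its share is inflated.
    return x + 1

def run(n=10_000):
    total = 0
    for _ in range(n):
        total = tiny(total)
    return total
```

Calling `run()` and then inspecting `stats["tiny"]` gives an exact call count, but the recorded time includes part of the wrapper's own cost, which is the kind of skew described above for memory allocations and critical sections.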
The free profilers that I use (Very Sleepy, Luke Stackwalker) rely on sampling, which makes it much easier to do a quick performance test and see where the problem lies. These free profilers don't have the full functionality of the commercial profilers (although I submitted the "focus on subtree" functionality for Very Sleepy myself), but since they are fast, they can be very useful.
At this time, my personal favorite is Very Sleepy, with Luke Stackwalker coming second.
In both cases (instrumenting and sampling), my experience is that:
It is very difficult to compare the results of profilers over different releases of your application. If you have a performance problem in your release 2.0, profile your release 2.0 and try to improve it, rather than looking for the exact reason why 2.0 is slower than 1.0.
You must never compare the profiling results with the timing (real time, cpu time) results of an application that is run outside the profiler. If your application consumes 5 seconds CPU time outside the profiler, and when run in the profiler the profiler reports that it consumes 10 seconds, there's nothing wrong. Don't think that your application actually takes 10 seconds.
That's why you must consistently check results in the same environment. Consistently compare results of your application when run outside the profiler, or when run inside the profiler. Don't mix the results.
Also use a consistent environment and system. If you get a faster PC, your application could still run slower, e.g. because the screen is larger and more needs to be updated on screen. If moving to a new PC, retest the last (one or two) releases of your application on the new PC so you get an idea on how times scale to the new PC.
This also means: use fixed data sets and check your improvements on these data sets. It could be that a change in your application improves the performance with dataset X, but makes it slower with dataset Y. In some cases this may be acceptable.
Discuss with the testing team what results you want to obtain beforehand (see Oded's answer on my own question What's the best way to 'indicate/numerate' performance of an application?).
Realize that a faster application can still use more CPU time than a slower application, if the faster one uses multi-threading and the slower one doesn't. Discuss (as said before) with the testing team what needs to be measured and what doesn't (in the multi-threading case: real time instead of CPU time).
Realize that many small improvements may add up to one big improvement. If you find 10 parts in your application that each take 3% of the time and you can reduce each to 1%, your application will take 20% less time.
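The arithmetic behind that last point, as a small Python sketch. The function name `remaining_time` is just an illustrative helper; the numbers are the ones from the example above.

```python
def remaining_time(parts, before, after):
    """Fraction of the original run time left after shrinking `parts`
    hot spots from `before` to `after` (both as fractions of the total)."""
    return 1.0 - parts * (before - after)

left = remaining_time(parts=10, before=0.03, after=0.01)
saved = 1.0 - left      # fraction of the run time removed
speedup = 1.0 / left    # how many times faster the program runs
print(f"time saved: {saved:.0%}, speedup: {speedup:.2f}x")
```

Ten savings of 2 percentage points each remove 20% of the run time, which corresponds to a speedup factor of 1.25.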

It depends what you're trying to do.
1) If you want to maintain general timing information, so you can be alert to regressions, various instrumenting profilers are the way to go. Make sure they measure all kinds of time, not just CPU time.
2) If you want to find ways to make the software faster, that is a distinctly different problem.
You should put the emphasis on the find, not on the measure.
For this, you need something that samples the call stack, not just the program counter (over multiple threads, if necessary). That rules out profilers like gprof.
Importantly, it should sample on wall-clock time, not CPU time, because you are every bit as likely to lose time due to I/O as due to crunching. This rules out some profilers.
It should be able to take samples only when you care, such as not when waiting for user input. This also rules out some profilers.
Finally, and very important, is the summary you get.
It is essential to get per-line percent of time.
The percent of time used by a line is the percent of stack samples containing the line.
Don't settle for function-only timings, even with a call graph.
This rules out still more profilers.
(Forget about "self time", and forget about invocation counts. Those are seldom useful and often misleading.)
Accuracy of finding the problems is what you're after, not accuracy of measuring them. That is a very important point. (You don't need a large number of samples, though it does no harm. The harm is in your head, making you think about measuring, rather than about what it is doing.)
One good tool for this is RotateRight's Zoom profiler. Personally I rely on manual sampling.
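As a rough illustration of the rule that a line's percent of time is the percent of stack samples containing it, here is a minimal Python sketch. The sample data is made up, and `line_percentages` is a hypothetical helper, not part of Zoom or any other profiler.

```python
from collections import Counter

def line_percentages(samples):
    """samples: list of stack samples, each a list of (function, line) frames.
    Returns {frame: percent of samples whose stack contains that frame}."""
    counts = Counter()
    for stack in samples:
        # A set, so recursion does not count a line twice in one sample.
        for frame in set(stack):
            counts[frame] += 1
    return {frame: 100.0 * n / len(samples) for frame, n in counts.items()}

# Made-up samples for illustration; not real profiler output.
samples = [
    [("main", 10), ("load", 42), ("parse", 7)],
    [("main", 10), ("load", 42), ("read", 3)],
    [("main", 10), ("render", 99)],
    [("main", 10), ("load", 42), ("parse", 7)],
]
ranked = sorted(line_percentages(samples).items(), key=lambda kv: -kv[1])
for frame, pct in ranked:
    print(frame, f"{pct:.0f}%")
```

Note that the percentages do not sum to 100: a line is charged for every sample in which it appears anywhere on the stack, which is exactly what makes an expensive call site stand out.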


dotTrace - what profiling settings should I use for my desktop app?

When using dotTrace, I have to pick a profiling mode and a time measurement method. Profiling modes are:
Sampling
Tracing
Line-by-line
And time measurement methods are:
Wall time (performance counter)
Thread time
Wall time (CPU instruction)
Tracing and line-by-line can't use thread time measurement. But that still leaves me with seven different combinations to try. I've now read the dotTrace help pages on these well over a dozen times, and I remain no more knowledgeable than I started out about which one to pick.
I'm working on a WPF app that reads Word docs, extracts all the paragraphs and styles, and then loops through that extracted content to pick out document sections. I'm trying to optimize this process. (Currently it takes well over an hour to complete, so I'm trying to profile it for a given length of time rather than until it finishes.)
Which profiling and time measurement types would give me the best results? Or if the answer is "It depends", then what does it depend on? What are the pros and cons of a given profiling mode or time measurement method?
Profiling types:
Sampling: fastest but least accurate profiling-type, minimum profiler overhead. Essentially equivalent to pausing the program many times a second and viewing the stacktrace; thus the number of calls per method is approximate. Still useful for identifying performance bottlenecks at the method-level.
Snapshots captured in sampling mode occupy a lot less space on disk (I'd say 5-6 times less space.)
Use for initial assessment or when profiling a long-running application (which sounds like your case.)
Tracing: Records the duration taken for each method. App under profiling runs slower but in return, dotTrace shows exact number of calls of each function, and function timing info is more accurate. This is good for diving into details of a problem at the method-level.
Line-by-line: Profiles the program on a per-line basis. Largest resource hog but most fine-grained profiling results. Slows the program way down. The preferred tactic here is to initially profile using another type, and then hand-pick functions for line-by-line profiling.
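For a feel of what a tracing-style profiler reports, here is a small Python sketch using the standard library's `cProfile`, which also records every call and therefore gives exact call counts. This is only an analogy and says nothing about how dotTrace is implemented; `helper`, `workload` and `call_count` are made-up names.

```python
import cProfile
import pstats

def helper(n):
    return n * n

def workload():
    return sum(helper(i) for i in range(500))

def call_count(func_name):
    """Profile workload() with cProfile and return how many times
    the function named func_name was called."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats is the dictionary pstats keeps internally:
    # (file, line, name) -> (primitive calls, total calls, ...).
    for (_file, _line, name), (_cc, ncalls, *_rest) in stats.stats.items():
        if name == func_name:
            return ncalls
    return 0
```

`call_count("helper")` reports the exact number of calls, at the price of the program running noticeably slower while it is being traced.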
As for meter kinds, I think they are described quite well in Getting started with dotTrace Performance by the great Hadi Hariri.
Wall time (CPU Instruction): This is the simplest and fastest way to measure wall time (that is, the time we observe on a wall clock). However, on some older multi-core processors this may produce incorrect results due to the cores timers being desynchronized. If this is the case, it is recommended to use Performance Counter.
Wall time (Performance Counter): Performance counters is part of the Windows API and it allows taking time samples in a hardware-independent way. However, being an API call, every measure takes substantial time and therefore has an impact on the profiled application.
Thread time: In a multi-threaded application concurrent threads contribute to each other's wall time. To avoid such interference we can use thread time meter which makes system API calls to get the amount of time given by the OS scheduler to the thread. The downsides are that taking thread time samples is much slower than using CPU counter and the precision is also limited by the size of quantum used by thread scheduler (normally 10ms). This mode is only supported when the Profiling Type is set to Sampling.
However they don't differ too much.
I'm not a wizard in profiling myself but in your case I'd start with sampling to get a list of functions that take ridiculously long to execute, and then I'd mark them for line-by-line profiling.
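To make the difference between a wall time meter and a thread time meter concrete, here is a minimal Python sketch (Python 3.7+ for `time.thread_time`). It is an analogy only, not dotTrace code; `measure`, `waiting` and `crunching` are made-up names.

```python
import time

def measure(func):
    """Return (wall seconds, CPU seconds used by this thread) for one call."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()
    func()
    cpu = time.thread_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def waiting():
    time.sleep(0.1)   # stands in for I/O or waiting on another thread

def crunching():
    sum(i * i for i in range(200_000))

for name, func in (("waiting", waiting), ("crunching", crunching)):
    wall, cpu = measure(func)
    print(f"{name}: wall={wall:.3f}s thread={cpu:.3f}s")
```

The waiting function shows up almost entirely in wall time and hardly at all in thread time, so which meter you pick decides whether time spent blocked (for example on file I/O while reading documents) is visible in the snapshot.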

Do all profilers significantly slow execution?

The profilers I have experience with (mainly the Digital Mars D profiler that comes with the compiler) seem to massively slow down the execution of the program being profiled. This has a major effect on my willingness to use a profiler, as it makes profiling a "real" run of a lot of my programs, as opposed to testing on a very small input, impractical. I don't know much about how profilers are implemented. Is a major (>2x) slowdown when profiling pretty much a fact of life, or are there profilers that avoid it? If it can be avoided, are there any fast profilers available for D, preferably for D2 and preferably for free?
I don't know about D profilers, but in general there are two different ways a profiler can collect profiling information.
The first is by instrumentation, by injecting logging calls all over the place. This slows down the application more or less. Typically more.
The second is sampling. Then the profiler breaks the application at regular intervals and inspects the call stack. This does not slow down the application very much at all.
The downside of a sampling profiler is that the result is not as detailed as with an instrumenting profiler.
Check the documentation for your profiler if you can run with sampling instead of instrumentation. Otherwise you have some new Google terms in "sampling" and "instrumenting".
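To illustrate the sampling approach, here is a toy sampling profiler in Python: a background thread wakes up at a fixed interval and records which functions are on the main thread's call stack. It is a sketch of the idea only, not how any D profiler works; `sample_stacks`, `busy_work` and `profile` are made-up names, and the 5 ms interval is an arbitrary choice.

```python
import sys
import threading
import time
from collections import Counter

def sample_stacks(target_thread_id, interval, stop, counts):
    """Every `interval` seconds, record which functions are on the
    target thread's call stack."""
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        names = set()
        while frame is not None:
            names.add(frame.f_code.co_name)
            frame = frame.f_back
        counts.update(names)          # one hit per function per sample
        counts["<samples>"] += 1
        time.sleep(interval)

def busy_work(duration):
    """Spin for `duration` seconds to give the sampler something to see."""
    end = time.perf_counter() + duration
    x = 0
    while time.perf_counter() < end:
        x += 1
    return x

def profile(func, *args, interval=0.005):
    """Run func(*args) while a sampler thread watches the calling thread."""
    counts = Counter()
    stop = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), interval, stop, counts),
    )
    sampler.start()
    try:
        func(*args)
    finally:
        stop.set()
        sampler.join()
    return counts
```

For example, `counts = profile(busy_work, 0.3)` followed by `counts["busy_work"] / counts["<samples>"]` gives the approximate fraction of time spent in `busy_work`. The profiled code itself is untouched, which is why the overhead stays low, and also why call counts are not available.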
My favorite method of profiling slows the program way way down, and that's OK. I run the program under the debugger, with a realistic load, and then I manually interrupt it. Then I copy the call stack somewhere, like to Notepad. So it takes on the order of a minute to collect one sample. Then I can either resume execution, or it's even OK to start it over from the beginning to get another sample.
I do this 10 or 20 times, long enough to see what the program is actually doing from a wall-clock perspective. When I see something that shows up a lot, then I take more samples until it shows up again. Then I stop and really study what it is in the process of doing and why, which may take 10 minutes or more. That's how I find out if that activity is something I can replace with more efficient code, i.e. it wasn't totally necessary.
You see, I'm not interested in measuring how fast or slow it's going. I can do that separately with maybe only a watch. I'm interested in finding out which activities take a large percentage of time (not amount, percentage), and if something takes a large percentage of time, that is the probability that each stackshot will see it.
By "activity" I don't necessarily mean where the PC hangs out. In realistic software the PC is almost always off in a system or library routine somewhere. Typically more important are the call sites in our code. If I see, for example, a string of 3 calls showing up on half of the stack samples, that represents very good hunting, because if any one of those isn't truly necessary and can be done away with, execution time will drop by half.
If you want a grinning manager, just do that once or twice.
Even in what you would think would be math-heavy scientific number crunching apps where you would think low-level optimization and hotspots would rule the day, you know what I often find? The math library routines are checking arguments, not crunching. Often the code is not doing what you think it's doing, and you don't have to run it at top speed to find that out.
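The reason 10 or 20 manual samples are enough can be put in numbers. The sketch below assumes the samples are independent and that the activity takes a fixed fraction f of wall-clock time; both are simplifying assumptions, and `prob_seen_at_least` is a made-up helper name.

```python
from math import comb

def prob_seen_at_least(k, n, f):
    """Probability that an activity occupying fraction f of wall-clock time
    appears on at least k of n independent stack samples."""
    return sum(comb(n, i) * f**i * (1 - f)**(n - i) for i in range(k, n + 1))

for f in (0.1, 0.3, 0.5):
    p = prob_seen_at_least(2, 10, f)
    print(f"f={f:.0%}: P(seen on >=2 of 10 samples) = {p:.2f}")
```

Under those assumptions, something that costs 30% of the time shows up on two or more of ten samples about 85% of the time, so large problems are very unlikely to hide from a handful of stackshots.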
I'd say yes, both sampling and instrumenting forms of profiling will tax your program heavily - regardless of whose profiler you are using, and on what language.
You could try h3r3tic's xfProf, which is a sampling profiler. Haven't tried it myself, but that guy always makes cool stuff :)
From the description:
If the program is sampled only a few hundred (or thousand)
times per second, the performance overhead will not be noticeable.

Is log4net much slower than System.Diagnostics.Trace?

I'm investigating the differences between using log4net and System.Diagnostics.Trace for logging, and I'm curious about the performance differences I've observed.
I created a test application to compare the performance of both logging methods in several scenarios, and I'm finding that log4net is significantly slower than the Trace class. For example, in a scenario where I log 1,000 messages with no string formatting, log4net's mean execution time over 1,000 trials is 9.00ms. Trace executes with a mean of 1.13ms. A lot of my test cases have a relatively large amount of variance in the log4net execution times; the periodic nature of outlier long executions seems to suggest GC interference. Poking around with CLR Profiler confirms there are a large amount of collections for a ton of log4net.Core.LoggingEvent objects that are generated (to be fair, it looks like Trace generates a ton of Char[] objects as well, but it doesn't display the large variance that log4net does.)
One thing I'm keeping in mind here is that even though log4net seems roughly 8 times slower than Trace, the difference is 8ms over 1,000 iterations; this isn't exactly a significant performance drain. Still, some of my expected use cases might be calling methods that are logging things hundreds of thousands of times, and these numbers are from my fast machine. On a slower machine more typical of our users' configurations the difference is 170ms to 11ms, which is a tiny bit more alarming.
Is this performance typical of log4net, or are there some gotchas that can significantly increase log4net's performance?
(NOTE: I am aware that string formatting can alter the execution time; I am trying to compare apples to apples and I have test cases with no formatting and test cases with formatting; log4net stays as proportionally slow whether string formatting is used or not.)
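For readers who want to reproduce this kind of comparison, here is a minimal sketch of a timing harness in Python. It is not the test application described above; `time_trials` is a made-up helper and `sink.append` is only a stand-in for whatever logging call is under test. It reports the spread as well as the mean, since the variance is part of what is being asked about.

```python
import statistics
import time

def time_trials(log_call, messages=1000, trials=100):
    """Time `messages` calls to log_call, repeated `trials` times.
    Returns (mean, standard deviation, worst case) in milliseconds."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        for _ in range(messages):
            log_call("message")
        durations.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(durations), statistics.stdev(durations), max(durations)

# Hypothetical stand-in for a logger call; swap in the real call under test.
sink = []
mean_ms, stdev_ms, worst_ms = time_trials(sink.append, messages=1000, trials=50)
print(f"mean={mean_ms:.3f}ms stdev={stdev_ms:.3f}ms worst={worst_ms:.3f}ms")
```

Comparing the worst case against the mean is a quick way to spot the periodic outliers that garbage collection tends to produce.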
The story so far:
Robert Gould has the best answer to the question; I was mainly curious if it was typical to see log4net perform much slower than the Trace class.
Alex Shnayder's answer is interesting information but doesn't really fall under the scope of the question. Half of the intent for introducing this logging is to assist in debugging both logical and performance problems on live systems; our customers put our products in many exotic scenarios that are often difficult to reproduce without expensive and large-scale hardware configurations. My main concern is that a large timing difference between "not logging" and "logging" could affect the system in such a way that bugs don't happen. In the end, the scale of the performance decrease is large but the magnitude is small, so I'm hoping it won't be a problem.
Yes, log4xxx is slower than Trace, since Trace is normally a near-kernel tool, while log4xxx is a much more powerful tool. Personally I prefer log4xxx because of its flexibility, but if you want something with less impact and you don't really need logs in production (say, only when debugging), Trace should be enough.
Note: I use log4xxx because the exact same applies to all languages with a log4 library not just .Net
You might be interested in the Common.Logging library. It's a thin abstraction wrapper over existing logging implementations and allows you to plug in any logging framework you like at runtime. Also, it is much faster than System.Diagnostics.Trace, as described in my blog post about performance.
From my experience log4net performance isn't an issue in most cases.
The real question is: why would you even need to be "logging things hundreds of thousands of times" in a production system?
As I see it, in production you should log only the bare minimum (info and maybe warning level), and only when you need to (debugging an issue on site) should you activate logging at debug level.
If you want the best of both worlds, log4net will allow you to log to the aspnet tracer as well. I turn this option on when I want to get performance stats that are tied in to specific events in my logging.
I have just run a test comparing sequential writes to a simple file with using log4net for the same task. log4net is about 400 times slower than a StreamWriter, so I consider log4net unusable if you are writing huge log files. But I find it very useful for small amounts of log entries and for debugging.
Maybe a solution is to isolate logging in a separate thread in some cases.
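The separate-thread idea can be sketched with Python's standard `logging` module, whose `QueueHandler` and `QueueListener` move the actual writing onto a background thread. This is an illustration of the pattern in a different ecosystem, not a log4net configuration; `make_async_logger` and `ListHandler` are made-up names.

```python
import logging
import logging.handlers
import queue

def make_async_logger(name, target_handler):
    """Return (logger, listener). The logger only enqueues records; the
    listener's background thread passes them to target_handler."""
    log_queue = queue.Queue()
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)   # bare minimum in production
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    return logger, listener

class ListHandler(logging.Handler):
    """Hypothetical slow sink, here just collecting messages in a list."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

sink = ListHandler()
logger, listener = make_async_logger("demo.async", sink)
logger.info("written on the listener thread")
logger.debug("dropped: below the INFO level")
listener.stop()   # processes what is left in the queue, then joins the thread
print(sink.messages)
```

The calling thread pays only for creating the record and putting it on the queue, so slow file writes no longer sit on the hot path. The trade-off is that messages can be lost if the process dies before the queue is drained.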
