Do all profilers significantly slow execution? - performance

The profilers I have experience with (mainly the Digital Mars D profiler that comes w/ the compiler) seem to massively slow down the execution of the program being profiled. This has a major effect on my willingness to use a profiler, as it makes profiling a "real" run of a lot of my programs, as opposed to testing on a very small input, impractical. I don't know much about how profilers are implemented. Is a major (>2x) slowdown when profiling pretty much a fact of life, or are there profilers that avoid it? If it can be avoided, are there any fast profilers available for D, preferably for D2 and preferably for free?

I don't know about D profilers, but in general there are two different ways a profiler can collect profiling information.
The first is by instrumentation, by injecting logging calls all over the place. This slows down the application more or less. Typically more.
The second is sampling. Then the profiler breaks the application at regular intervals and inspects the call stack. This does not slow down the application very much at all.
The downside of a sampling profiler is that the result is not as detailed as with an instrumenting profiler.
Check the documentation for your profiler if you can run with sampling instead of instrumentation. Otherwise you have some new Google terms in "sampling" and "instrumenting".
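To make the sampling idea concrete, here is a minimal sketch in Python (not D, and not how any particular profiler is implemented). It assumes CPython, where `sys._current_frames()` exposes the current frame of every thread; `busy_work` and `profile` are hypothetical names made up for the illustration.

```python
import collections
import sys
import threading
import time


def sample_stacks(target_thread_id, interval, stop_event, counts):
    """Periodically record which function the target thread is executing."""
    while not stop_event.is_set():
        # CPython-specific: maps thread id -> that thread's current frame.
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)


def busy_work(duration):
    """Hypothetical workload: spin for `duration` seconds."""
    end = time.perf_counter() + duration
    total = 0
    while time.perf_counter() < end:
        total += 1
    return total


def profile(func, *args, interval=0.005):
    """Run func(*args) while a background thread samples the calling thread."""
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), interval, stop_event, counts),
    )
    sampler.start()
    try:
        func(*args)
    finally:
        stop_event.set()
        sampler.join()
    return counts


counts = profile(busy_work, 0.3)
print(counts.most_common(3))
```

The profiled code is not modified at all; the only cost is the sampler waking up every `interval` seconds, which is why the overhead stays small at a few hundred samples per second.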

My favorite method of profiling slows the program way way down, and that's OK. I run the program under the debugger, with a realistic load, and then I manually interrupt it. Then I copy the call stack somewhere, like to Notepad. So it takes on the order of a minute to collect one sample. Then I can either resume execution, or it's even OK to start it over from the beginning to get another sample.
I do this 10 or 20 times, long enough to see what the program is actually doing from a wall-clock perspective. When I see something that shows up a lot, then I take more samples until it shows up again. Then I stop and really study what it is in the process of doing and why, which may take 10 minutes or more. That's how I find out if that activity is something I can replace with more efficient code, i.e. it wasn't totally necessary.
You see, I'm not interested in measuring how fast or slow it's going. I can do that separately with maybe only a watch. I'm interested in finding out which activities take a large percentage of time (not amount, percentage), and if something takes a large percentage of time, that is the probability that each stackshot will see it.
By "activity" I don't necessarily mean where the PC hangs out. In realistic software the PC is almost always off in a system or library routine somewhere. Typically more important is call sites in our code. If I see, for example, a string of 3 calls showing up on half of the stack samples, that represents very good hunting, because if any one of those isn't truly necessary and can be done away with, execution time will drop by half.
If you want a grinning manager, just do that once or twice.
Even in what you would think would be math-heavy scientific number crunching apps where you would think low-level optimization and hotspots would rule the day, you know what I often find? The math library routines are checking arguments, not crunching. Often the code is not doing what you think it's doing, and you don't have to run it at top speed to find that out.
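The claim that 10 or 20 samples are enough can be checked with a little arithmetic. This is a rough sketch that assumes the samples are independent, so the number of samples showing an activity follows a binomial distribution; `prob_seen_at_least` is a made-up helper name.

```python
from math import comb


def prob_seen_at_least(k, n, f):
    """Probability that an activity taking fraction f of wall-clock time
    appears on at least k of n independent stack samples (binomial model)."""
    return sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k, n + 1))


for f in (0.1, 0.3, 0.5):
    p = prob_seen_at_least(2, 10, f)
    print(f"activity costs {f:.0%} of time: P(seen on >= 2 of 10 samples) = {p:.3f}")
```

Under that assumption, something costing half the run time is almost certain to show up twice in ten samples, while something costing 10% usually will not, which is fine: the big problems are the ones worth finding first.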

I'd say yes, both sampling and instrumenting forms of profiling will tax your program heavily - regardless of whose profiler you are using, and on what language.

You could try h3r3tic's xfProf, which is a sampling profiler. Haven't tried it myself, but that guy always makes cool stuff :)
From the description:
If the program is sampled only a few hundred (or thousand)
times per second, the performance overhead will not be noticeable.

Related

dotTrace - what profiling settings should I use for my desktop app?

When using dotTrace, I have to pick a profiling mode and a time measurement method. Profiling modes are:
Tracing
Line-by-line
Sampling
And time measurement methods are:
Wall time (performance counter)
Thread time
Wall time (CPU instruction)
Tracing and line-by-line can't use thread time measurement. But that still leaves me with seven different combinations to try. I've now read the dotTrace help pages on these well over a dozen times, and I remain no more knowledgeable than I started out about which one to pick.
I'm working on a WPF app that reads Word docs, extracts all the paragraphs and styles, and then loops through that extracted content to pick out document sections. I'm trying to optimize this process. (Currently it takes well over an hour to complete, so I'm trying to profile it for a given length of time rather than until it finishes.)
Which profiling and time measurement types would give me the best results? Or if the answer is "It depends", then what does it depend on? What are the pros and cons of a given profiling mode or time measurement method?
Profiling types:
Sampling: fastest but least accurate profiling-type, minimum profiler overhead. Essentially equivalent to pausing the program many times a second and viewing the stacktrace; thus the number of calls per method is approximate. Still useful for identifying performance bottlenecks at the method-level.
Snapshots captured in sampling mode occupy a lot less space on disk (I'd say 5-6 times less space.)
Use for initial assessment or when profiling a long-running application (which sounds like your case.)
Tracing: Records the duration taken for each method. App under profiling runs slower but in return, dotTrace shows exact number of calls of each function, and function timing info is more accurate. This is good for diving into details of a problem at the method-level.
Line-by-line: Profiles the program on a per-line basis. Largest resource hog but most fine-grained profiling results. Slows the program way down. The preferred tactic here is to initially profile using another type, and then hand-pick functions for line-by-line profiling.
As for meter kinds, I think they are described quite well in Getting started with dotTrace Performance by the great Hadi Hariri.
Wall time (CPU Instruction): This is the simplest and fastest way to measure wall time (that is, the
time we observe on a wall clock). However, on some older multi-core processors this may produce
incorrect results due to the cores timers being desynchronized. If this is the case, it is recommended
to use Performance Counter.
Wall time (Performance Counter): Performance counters is part of the Windows API and it allows
taking time samples in a hardware-independent way. However, being an API call, every measure takes
substantial time and therefore has an impact on the profiled application.
Thread time: In a multi-threaded application concurrent threads contribute to each other's wall time.
To avoid such interference we can use thread time meter which makes system API calls to get the
amount of time given by the OS scheduler to the thread. The downsides are that taking thread time
samples is much slower than using CPU counter and the precision is also limited by the size of
quantum used by thread scheduler (normally 10ms). This mode is only supported when the Profiling
Type is set to Sampling
However they don't differ too much.
I'm not a wizard in profiling myself but in your case I'd start with sampling to get a list of functions that take ridiculously long to execute, and then I'd mark them for line-by-line profiling.
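The difference between wall time and thread time is easy to see outside dotTrace. The sketch below uses Python's standard `time.perf_counter()` (wall clock) and `time.thread_time()` (CPU time of the current thread); it only illustrates the two kinds of meter and says nothing about how dotTrace implements them. `measure`, `blocked` and `crunching` are names invented for the example.

```python
import time


def measure(work):
    """Return (wall_seconds, thread_cpu_seconds) spent in work()."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()
    work()
    return time.perf_counter() - wall_start, time.thread_time() - cpu_start


def blocked():
    time.sleep(0.2)  # stands in for waiting on I/O or a lock


def crunching():
    sum(i * i for i in range(2_000_000))  # pure CPU work


blocked_wall, blocked_cpu = measure(blocked)
crunch_wall, crunch_cpu = measure(crunching)
print(f"blocked:   wall={blocked_wall:.3f}s  thread={blocked_cpu:.3f}s")
print(f"crunching: wall={crunch_wall:.3f}s  thread={crunch_cpu:.3f}s")
```

For the blocked call the wall time is about 0.2 s while the thread time stays near zero, so a thread-time meter would hide that wait; for the CPU-bound call the two numbers are close.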

Performance profiling of Windows Apps. Better alternatives for Visual Studio Profiler?

I am impressed with Visual Studio Profiler for performance analysis. Fast for my purposes and easy to use.
I am just curious to know about the caveats in visual studio profiler. Are there any better profilers for windows applications which fare better for these caveats?
On the positive side, nobody makes great apps like Microsoft. Visual Studio is a fine product, and its profiler shares those attributes.
On the other hand, there are caveats (shared by other profilers as well).
In sampling mode, it doesn't sample when the thread is blocked. Therefore it is blind to extraneous I/O, socket calls, etc. This is an attribute that dates from the early days of prof and gprof, which started out as PC samplers, and since when blocked the PC is meaningless, sampling was turned off. The PC may be meaningless, but the stack tells exactly why the thread is blocked and, when there is much time going into that, you need to know it.
In instrumentation mode, it can include I/O, but it only gives you function-level percent of time, not line level. That may be OK if functions happen to be small, or if they only call each other in a small number of places, so finding call sites is not too hard. I work with good programmers, but our code is not all like that. In fact, often the call sites are invisible, because they are compiler-inserted. On the other hand, stack samples pinpoint those calls no matter who wrote them.
The profiler does a nice job of showing you the split between activity of different threads. Then what you need to know is, if a thread is suspended or showing a low processor activity, is that because it is blocking for something that it doesn't really have to? Stack samples could tell you that if they could be taken during blocking. On the other hand, if a thread is cranking heavily, do you know if what it is doing is actually necessary or could be reduced? Stack samples will tell you that also.
Many people think the primary job of a profiler is to measure. Personally, I want something that pinpoints code that costs a lot of time and can be done more efficiently. Most of the time these are function call sites, not "hot spots". I don't need to know "a lot of time" with any precision. If I know it is, say, 60% +/- 20% that's perfectly fine with me because I'm looking for the problem, not the measurement. If because of this imprecision, I fix a problem which is not the largest, that's OK, because when I repeat the process, the largest problem will be even bigger, as a percent, so I won't miss it.

Effective Code Instrumentation?

All too often I read statements about some new framework and their "benchmarks." My question is a general one but to the specific points of:
What approach should a developer take to effectively instrument code to measure performance?
When reading about benchmarks and performance testing, what are some red-flags to watch out for that might not represent real results?
There are two methods of measuring performance: using code instrumentation and using sampling.
The commercial profilers (Hi-Prof, Rational Quantify, AQTime) I used in the past used code instrumentation (some of them could also use sampling) and in my experience, this gives the best, most detailed result. Rational Quantify in particular allows you to zoom in on results, focus on sub trees, remove complete call trees to simulate an improvement, ...
The downside of these instrumenting profilers is that they:
tend to be slow (your code runs about 10 times slower)
take quite some time to instrument your application
don't always correctly handle exceptions in the application (in C++)
can be hard to set up if you have to disable the instrumentation of DLL's (we had to disable instrumentation for Oracle DLL's)
The instrumentation also sometimes skews the times reported for low-level functions like memory allocations, critical sections, ...
The free profilers (Very Sleepy, Luke Stackwalker) that I use use sampling, which means that it is much easier to do a quick performance test and see where the problem lies. These free profilers don't have the full functionality of the commercial profilers (although I submitted the "focus on subtree" functionality for Very Sleepy myself), but since they are fast, they can be very useful.
At this time, my personal favorite is Very Sleepy, with Luke StackWalker coming second.
In both cases (instrumenting and sampling), my experience is that:
It is very difficult to compare the results of profilers over different releases of your application. If you have a performance problem in your release 2.0, profile your release 2.0 and try to improve it, rather than looking for the exact reason why 2.0 is slower than 1.0.
You must never compare the profiling results with the timing (real time, cpu time) results of an application that is run outside the profiler. If your application consumes 5 seconds CPU time outside the profiler, and when run in the profiler the profiler reports that it consumes 10 seconds, there's nothing wrong. Don't think that your application actually takes 10 seconds.
That's why you must consistently check results in the same environment. Consistently compare results of your application when run outside the profiler, or when run inside the profiler. Don't mix the results.
Also use a consistent environment and system. If you get a faster PC, your application could still run slower, e.g. because the screen is larger and more needs to be updated on screen. If moving to a new PC, retest the last (one or two) releases of your application on the new PC so you get an idea on how times scale to the new PC.
This also means: use fixed data sets and check your improvements on these datasets. It could be that an improvement in your application improves the performance of dataset X, but makes it slower with dataset Y. In some cases this may be acceptable.
Discuss with the testing team what results you want to obtain beforehand (see Oded's answer on my own question What's the best way to 'indicate/numerate' performance of an application?).
Realize that a faster application can still use more CPU time than a slower application, if the faster one uses multi-threading and the slower one doesn't. Discuss (as said before) with the testing team what needs to be measured and what doesn't (in the multi-threading case: real time instead of CPU time).
Realize that many small improvements may lead to one big improvement. If you find 10 parts in your application that each take 3% of the time and you can reduce each to 1%, your application will take 20% less time (10 × 2 percentage points), which is a 1.25x speedup.
It depends what you're trying to do.
1) If you want to maintain general timing information, so you can be alert to regressions, various instrumenting profilers are the way to go. Make sure they measure all kinds of time, not just CPU time.
2) If you want to find ways to make the software faster, that is a distinctly different problem.
You should put the emphasis on the find, not on the measure.
For this, you need something that samples the call stack, not just the program counter (over multiple threads, if necessary). That rules out profilers like gprof.
Importantly, it should sample on wall-clock time, not CPU time, because you are every bit as likely to lose time due to I/O as due to crunching. This rules out some profilers.
It should be able to take samples only when you care, such as not when waiting for user input. This also rules out some profilers.
Finally, and very important, is the summary you get.
It is essential to get per-line percent of time.
The percent of time used by a line is the percent of stack samples containing the line.
Don't settle for function-only timings, even with a call graph.
This rules out still more profilers.
(Forget about "self time", and forget about invocation counts. Those are seldom useful and often misleading.)
Accuracy of finding the problems is what you're after, not accuracy of measuring them. That is a very important point. (You don't need a large number of samples, though it does no harm. The harm is in your head, making you think about measuring, rather than what is it doing.)
One good tool for this is RotateRight's Zoom profiler. Personally I rely on manual sampling.
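The "percent of stack samples containing the line" rule is simple enough to sketch. The samples below are hypothetical and the `(function, line)` representation is just an assumption for the example; a real tool would get them from the debugger or a sampling profiler.

```python
from collections import Counter


def line_percentages(samples):
    """samples: list of stacks; each stack is a list of (function, line) call sites.
    Returns {call_site: percent of samples on which it appears}."""
    counts = Counter()
    for stack in samples:
        # Use a set so recursion does not count a line twice within one sample.
        counts.update(set(stack))
    return {site: 100.0 * n / len(samples) for site, n in counts.items()}


# Hypothetical stack samples, outermost frame first.
samples = [
    [("main", 10), ("load", 42), ("parse", 7)],
    [("main", 10), ("load", 42), ("parse", 9)],
    [("main", 10), ("render", 88)],
    [("main", 10), ("load", 42), ("parse", 7)],
]
percent = line_percentages(samples)
for site, pct in sorted(percent.items(), key=lambda kv: -kv[1]):
    print(site, f"{pct:.0f}%")
```

Here line 42 of `load` is on 3 of 4 samples, so removing that call would save roughly 75% of the time, whatever the function-level "self time" happens to say.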

Compact Framework and JIT. How long could it take

We have/had a phantom delay in our app. This was traced to the initialisation of a singleton when the object was touched for the first time and was blamed on JIT. I'm not utterly convinced by this as there is no mechanism for measuring JIT (or is there?) and the entire delay was seven seconds. Seven seconds of JIT?!? Could that be for real?
Either way I have difficulty in blaming things that one cannot easily measure. When I had a glance at the issue a while back I commented out a bunch of code and watched the seven second delay "jump" elsewhere in the app, suggesting it is somehow happening on a background process somewhere (and I guess this would count JIT in as a potential cause).
Just for fun if there was a static object that happened to reference a lot of other objects does anyone have a rule of thumb for how long the JIT might take? Does anyone have further references so I can understand more about the JIT so I stand a chance of learning whether or not JIT is/was to blame for this slow down?
I've only seen JIT take a really long time (greater than 1 second) in a weird bug that had to do with templated items inside a templated collection (see edit below).
At any rate, the fact you see it "move" definitely indicates to me that it probably isn't the issue. To try to determine this definitively I'd look at using RPM to see what's happening right before and after the delay.
Expected JIT time is a really nebulous thing, since there are so many factors that can affect it. Processor speed is an obvious one, but less obvious might be things like app storage media and device memory pressure.
Storage media can affect JIT speed because the JITter has to pull the IL from the media when it needs to JIT it, and if pulling it is slow, then JITting it will be slow.
Memory pressure is a tough one, and can have serious repercussions on a CE device. The issue here is that when you start running out of memory, the EE will start pitching JITted code during collection - everything but the call stack. Now if you're in a routine that, for example, calls out to some worker or helper stuff, or has a thread running, then that helper method could be getting pitched, JITted, pitched, JITted, etc. This is referred to as "thrash."
Identifying the latter is fairly easy with RPM (fixing it may not be so easy). Look for the amount of code pitched rising frequently, and for a strong correlation between a rise in the number of pitches and your perceived lock-ups.
Edit: I finally found the bug description here.
JIT (and GC) timers etc. can be found here:
Performance Counters in the .NET Compact Framework
(http://msdn.microsoft.com/en-us/library/ms172525.aspx)
Monitoring Application Performance on the .NET Compact Framework Part I - Enabling performance counters (http://blogs.msdn.com/davidklinems/archive/2005/10/04/476988.aspx)
Analyzing Device Application Performance with the .Net Compact Framework Remote Performance Monitor (http://blogs.msdn.com/stevenpr/archive/2006/04/17/577636.aspx)
Performance Counters in the .NET Framework
(http://msdn.microsoft.com/en-us/library/w8f5kw2e(VS.80).aspx)
Regards,
tamberg

Power Efficient Software Coding

In a typical handheld/portable embedded system device Battery life is a major concern in design of H/W, S/W and the features the device can support. From the Software programming perspective, one is aware of MIPS, Memory(Data and Program) optimized code.
I am aware of the H/W deep sleep and standby modes that are used to clock the hardware at lower cycles or turn off the clock entirely to some unused circuits to save power, but I am looking for some ideas from this point of view:
Wherein my code is running and it needs to keep executing, given this how can I write the code "power" efficiently so as to consume minimum watts?
Are there any special programming constructs, data structures, or control structures which I should look at to achieve minimum power consumption for a given functionality?
Are there any s/w high level design considerations which one should keep in mind at time of code structure design, or during low level design to make the code as power efficient(Least power consuming) as possible?
Like 1800 INFORMATION said, avoid polling; subscribe to events and wait for them to happen
Update window content only when necessary - let the system decide when to redraw it
When updating window content, ensure your code recreates as little of the invalid region as possible
With quick code the CPU goes back to deep sleep mode faster and there's a better chance that such code stays in L1 cache
Operate on small data at one time so data stays in caches as well
Ensure that your application doesn't do any unnecessary action when in background
Make your software not only power efficient, but also power aware - update graphics less often when on battery, disable animations, less hard drive thrashing
And read some other guidelines. ;)
Recently a series of posts called "Optimizing Software Applications for Power", started appearing on Intel Software Blogs. May be of some use for x86 developers.
Zeroth, use a fully static machine that can stop when idle. You can't beat zero Hz.
First up, switch to a tickless operating system scheduler. Waking up every millisecond or so wastes power. If you can't, consider slowing the scheduler interrupt instead.
Secondly, ensure your idle thread executes a power-saving, wait-for-next-interrupt instruction.
You can do this in the sort of under-regulated "userland" most small devices have.
Thirdly, if you have to poll or perform user confidence activities like updating the UI, sleep, do it, and get back to sleep.
Don't trust GUI frameworks that you haven't checked for "sleep and spin" kind of code.
Especially the event timer you may be tempted to use for #2.
Block a thread on read instead of polling with select()/epoll()/WaitForMultipleObjects().
Puts stress on the thread scheduler (and your brain) but the devices generally do okay.
This ends up changing your high-level design a bit; it gets tidier!
A main loop that polls all the things you Might do ends up slow and wasteful on CPU, but does guarantee performance. ( Guaranteed to be slow)
Cache results, lazily create things. Users expect the device to be slow so don't disappoint them. Less running is better. Run as little as you can get away with.
Separate threads can be killed off when you stop needing them.
Try to get more memory than you need, then you can insert into more than one hashtable and save ever searching. This is a direct tradeoff if the memory is DRAM.
Look at a realtime-ier system than you think you might need. It saves time (sic) later.
They cope better with threading too.
Do not poll. Use events and other OS primitives to wait for notifiable occurrences. Polling ensures that the CPU will stay active and use more battery life.
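A small sketch of the difference, using Python's `threading.Event` as a stand-in for whatever wait primitive your OS or RTOS offers. The 10 ms poll period and the 300 ms delay are arbitrary values chosen for the illustration, and the wakeup counter is only there to make the cost visible.

```python
import threading
import time

data_ready = threading.Event()
wakeups = {"polling": 0, "event": 0}


def polling_consumer():
    # Wakes every 10 ms to check, whether or not anything has happened.
    while not data_ready.is_set():
        wakeups["polling"] += 1
        time.sleep(0.01)


def event_consumer():
    # Blocks until signalled; it runs exactly once, when there is work to do.
    data_ready.wait()
    wakeups["event"] += 1


threads = [
    threading.Thread(target=polling_consumer),
    threading.Thread(target=event_consumer),
]
for t in threads:
    t.start()
time.sleep(0.3)   # the "event" arrives after about 300 ms
data_ready.set()
for t in threads:
    t.join()
print(wakeups)
```

The polling thread wakes dozens of times to learn nothing, and each of those wakeups is a chance lost for the CPU to stay in a low-power state; the event-driven thread wakes once.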
From my work using smart phones, the best way I have found of preserving battery life is to ensure that everything you do not need for your program to function at that specific point is disabled.
For example, only switch Bluetooth on when you need it, similarly the phone capabilities, turn the screen brightness down when it isn't needed, turn the volume down, etc.
The power used by these functions will generally far outweigh the power used by your code.
To avoid polling is a good suggestion.
A microprocessor's power consumption is roughly proportional to its clock frequency, and to the square of its supply voltage. If you have the possibility to adjust these from software, that could save some power. Also, turning off the parts of the processor that you don't need (e.g. floating-point unit) may help, but this very much depends on your platform. In any case, you need a way to measure the actual power consumption of your processor, so that you can find out what works and what not. Just like speed optimizations, power optimizations need to be carefully profiled.
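As a back-of-the-envelope sketch of that proportionality (dynamic power only; leakage and everything outside the CPU are ignored, and the scale factors are made-up examples):

```python
def relative_dynamic_power(freq_scale, voltage_scale):
    """Dynamic power relative to the baseline, using P proportional to f * V**2."""
    return freq_scale * voltage_scale ** 2


half_clock = relative_dynamic_power(0.5, 1.0)           # half the clock, same voltage
half_clock_low_volt = relative_dynamic_power(0.5, 0.8)  # half the clock, 80% of the voltage
print(f"half clock:                {half_clock:.2f} of baseline power")
print(f"half clock, 80% voltage:   {half_clock_low_volt:.2f} of baseline power")
```

Note that halving the clock alone also doubles the time a fixed amount of work takes, so under this simple model the energy per task does not improve; it is the lower voltage that a lower clock permits which brings the saving. That is one more reason to measure rather than guess.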
Consider using the network interfaces the least you can. You might want to gather information and send it out in bursts instead of constantly sending it.
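A minimal sketch of the burst idea. `BurstSender` is a hypothetical class, and `send` stands in for whatever actually powers up the radio and transmits; the batch size of 8 is arbitrary.

```python
class BurstSender:
    """Buffers readings and hands them to `send` in batches.

    `send` is a hypothetical callable that transmits a list of readings.
    """

    def __init__(self, send, batch_size):
        self._send = send
        self._batch_size = batch_size
        self._buffer = []

    def add(self, reading):
        self._buffer.append(reading)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self._send(self._buffer)
            self._buffer = []


transmissions = []  # each entry represents one use of the network interface
sender = BurstSender(transmissions.append, batch_size=8)
for reading in range(20):
    sender.add(reading)
sender.flush()  # send whatever is left over
print(len(transmissions), "transmissions for 20 readings")
```

The trade-off is latency: a reading may wait in the buffer until the batch fills, so a real version would probably also flush on a timeout.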
Look at what your compiler generates, particularly for hot areas of code.
If you have low priority intermittent operations, don't use specific timers to wake up to deal with them, but deal with when processing other events.
Use logic to avoid stupid scenarios where your app might go to sleep for 10 ms and then have to wake up again for the next event. For the kind of platform mentioned it shouldn't matter if both events are processed at the same time.
Having your own timer & callback mechanism might be appropriate for this kind of decision making. The trade off is in code complexity and maintenance vs. likely power savings.
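One way such a timer mechanism could decide what to handle together is sketched below. `coalesce` is a hypothetical helper and the 20 ms slack is an arbitrary tolerance; the idea is to wake once per group, at the latest deadline in it, so no event is handled more than `slack` late.

```python
def coalesce(deadlines, slack):
    """Group deadlines (in seconds) so that any deadline within `slack` of the
    first one in a group is handled in the same wakeup. Returns a list of groups."""
    groups = []
    for deadline in sorted(deadlines):
        if groups and deadline - groups[-1][0] <= slack:
            groups[-1].append(deadline)
        else:
            groups.append([deadline])
    return groups


deadlines = [1.000, 1.010, 1.004, 2.500, 2.505, 4.000]
groups = coalesce(deadlines, slack=0.020)
for group in groups:
    print(f"wake at {max(group):.3f}s and handle {len(group)} event(s)")
```

With these made-up deadlines the device wakes three times instead of six, which is exactly the sleep-for-10-ms-then-wake-again scenario being avoided.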
Simply put, do as little as possible.
Well, to the extent that your code can execute entirely in the processor cache, you'll have less bus activity and save power. To the extent that your program is small enough to fit code+data entirely in the cache, you get that benefit "for free". OTOH, if your program is too big, and you can divide your programs into modules that are more or less independent of the other, you might get some power saving by dividing it into separate programs. (I suppose it's also possible to make a toolchain that spreads out related bundles of code and data into cache-sized chunks...)
I suppose that, theoretically, you can save some amount of unnecessary work by reducing the number of pointer dereferencing, and by refactoring your jumps so that the most likely jumps are taken first -- but that's not realistic to do as a programmer.
Transmeta had the idea of letting the machine do some instruction optimization on-the-fly to save power... But that didn't seem to help enough... And look where that got them.
Set unused memory or flash to 0xFF not 0x00. This is certainly true for flash and EEPROM; not sure about SRAM or DRAM. For the PROMs there is an inversion, so a 0 is stored as a 1 and takes more energy, while a 1 is stored as a 0 and takes less. This is why you read 0xFFs after erasing a block.
Rather timely, this: an article on Hackaday today about measuring the power consumption of various commands:
Hackaday: the-effect-of-code-on-power-consumption
Aside from that:
- Interrupts are your friends
- Polling / wait() aren't your friends
- Do as little as possible
- make your code as small/efficient as possible
- Turn off as many modules, pins, peripherals as possible in the micro
- Run as slowly as possible
- If the micro has settings for pin drive strength, slew rate, etc. check them & configure them, the defaults are often full power / max speed.
- returning to the article above, go back and measure the power & see if you can drop it by altering things.
Also, something that is not trivial to do is to reduce the precision of the mathematical operations: go for the smallest dataset available and, if your development environment supports it, pack data and aggregate operations.
Knuth's books could give you all the variants of specific algorithms you need to save memory or CPU, or to work with reduced precision while minimizing the rounding errors.
Also, spend some time checking all the embedded device APIs; for example, most Symbian phones could do audio encoding via specialized hardware.
Do your work as quickly as possible, and then go to some idle state waiting for interrupts (or events) to happen. Try to make the code run out of cache with as little external memory traffic as possible.
On Linux, install powertop to see how often which piece of software wakes up the CPU. And follow the various tips that the powertop site links to, some of which are probably applicable to non-Linux, too.
http://www.lesswatts.org/projects/powertop/
Choose efficient algorithms that are quick and have small basic blocks and minimal memory accesses.
Understand the cache size and functional units of your processor.
Don't access memory. Don't use objects or garbage collection or any other high level constructs if they expand your working code or data set outside the available cache. If you know the cache size and associativity, lay out the entire working data set you will need in low power mode and fit it all into the dcache (forget some of the "proper" coding practices that scatter the data around in separate objects or data structures if that causes cache thrashing). Same with all the subroutines. Put your working code set all in one module if necessary to stripe it all in the icache. If the processor has multiple levels of cache, try to fit in the lowest level of instruction or data cache possible. Don't use floating point unit or any other instructions that may power up any other optional functional units unless you can make a good case that use of these instructions significantly shortens the time that the CPU is out of sleep mode.
etc.
Don't poll, sleep
Avoid using power hungry areas of the chip when possible. For example multipliers are power hungry, if you can shift and add you can save some Joules (as long as you don't do so much shifting and adding that actually the multiplier is a win!)
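For illustration only, here is what "shift and add" means for a multiplication by the constant 10, written in Python for readability (on a real micro this would be C or assembly, and many compilers already do this kind of strength reduction for constant multipliers). `times_ten` is a made-up name.

```python
def times_ten(x):
    """Multiply a non-negative integer by 10 without using a multiplier:
    10*x = 8*x + 2*x = (x << 3) + (x << 1)."""
    return (x << 3) + (x << 1)


for x in (0, 1, 7, 123):
    print(x, times_ten(x))
```

Two shifts and one add replace one multiply here; for constants with many set bits the shift-and-add sequence grows, which is the point at which the multiplier becomes the win again.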
If you are really serious, get a power-aware debugger, which can correlate power usage with your source code. Like this

Resources