I want to profile some kernel modules (for example, the network subsystem module).
Can we profile the time / CPU utilization of a function in a kernel module?
I heard about some profilers such as:
perf for system-wide profiling
valgrind -- application level
Is there any profiler that best suits my use case above?
I really appreciate your time, thanks
You had it right! Perf is the tool for you. Since you're going to profile a kernel module, there's no point in using any userland tools such as valgrind.
Usually when monitoring software you care about how much time your system spends in each function; perf top will give you a good estimate of that.
Functions that you're spending a lot of time in can be very good pointers for optimization.
I'm not sure I understand the time / cpu model you require, but I think the above should meet your needs.
You can read more about how to use perf here.
[EDIT]
Like @myaut said, there are other kernel profiling tools. While I have very good experience with perf and disagree with @myaut about the quality of the results, it is well worth mentioning some of the other tools. If you're just interested in getting the job done, perf will do just fine, but if you want to learn about other profiling tools and their abilities, I found this nice reference here
I doubt that profiling by itself will reveal useful results -- you will need the function to be called very often, or to spend significant time in it. Otherwise you will get a very small amount of data, since perf profiles all modules.
If you want to measure the real time spent executing a function, I suggest you look at SystemTap:
    stap -e 'global tms;
             probe kernel.function("dev_queue_xmit") {
                 tms[cpu()] = local_clock_ns();
             }
             probe kernel.function("dev_queue_xmit").return {
                 println(local_clock_ns() - tms[cpu()]);
             }'
This script saves the local CPU time in nanoseconds into the tms associative array on entry to dev_queue_xmit(). When the CPU leaves dev_queue_xmit(), the second probe calculates the delta. Note that if the code gets moved to another CPU while inside dev_queue_xmit(), the results can be messy.
To measure times for a whole module, replace kernel.function("dev_queue_xmit") with module("NAME").function("*"), but attaching to many functions may affect performance. You may also use get_cycles() instead of local_clock_ns() to get CPU cycles.
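If you would rather instrument the module's code directly instead of attaching external probes, the same entry/exit timing can be done in C with ktime_get_ns(). This is only a minimal sketch under my own assumptions (a Linux module, a dummy loop standing in for the real function you want to time), not something taken from the answer above:

    /* Minimal sketch: manual timing inside a Linux kernel module using
     * ktime_get_ns(). The loop below is a stand-in for the code you
     * actually want to measure. */
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/timekeeping.h>

    static int __init timing_demo_init(void)
    {
        u64 t0, elapsed;
        volatile u64 sink = 0;
        u64 i;

        t0 = ktime_get_ns();
        for (i = 0; i < 1000000; i++)   /* pretend this is the function under test */
            sink += i;
        elapsed = ktime_get_ns() - t0;

        pr_info("timing_demo: work took %llu ns (sink=%llu)\n",
                (unsigned long long)elapsed, (unsigned long long)sink);
        return 0;
    }

    static void __exit timing_demo_exit(void)
    {
    }

    module_init(timing_demo_init);
    module_exit(timing_demo_exit);
    MODULE_LICENSE("GPL");

For a function called from many places you would wrap the call sites (or use a kprobe) instead; the SystemTap script above is usually the less intrusive option.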
Related
How do I measure the performance impact of a kext in OS X in terms of CPU, memory, or thread usage during some user-defined activities? Is there a particular method or tool that can be used from user land, or any other approach/method that can be considered?
You've essentially got 2 options:
Instrumenting your kext with time measurements. Take stamps before and after the operation you're trying to measure using mach_absolute_time(), convert to a human-readable unit using absolutetime_to_nanoseconds() and take the difference, then collect that information somewhere in your kext where it can be extracted from userspace.
Sampling kernel stacks using dtrace (iprofiler -kernelstacks -timeprofiler from the command line, or using Instruments.app)
Personally, I've had a lot more success with the former method, although it's definitely more work. Most kext code runs so briefly that a sampling profiler barely catches any instances of it executing, unless you reduce the sampling interval so far that measurements start interfering with the system, or your kext is seriously slow. It's pretty easy to do though, so it's often a valid sanity check.
You can also get your compiler to instrument your code with counters (-fprofile-arcs), which in theory will allow you to combine the sampling statistics with the branch counters to determine the runtime of each branch. Extracting this data is a pain though (my code may help) and again, the statistical noise has made this useless for me in practice.
The explicit method also allows you to measure asynchronous operations, etc., but of course it also comes with some intrinsic overhead. Accumulating the data safely is also a little tricky. (I use atomic operations, but you could use spinlocks too. Don't just record means; also record the standard deviation and minimum/maximum times.) And extracting the data can be a pain, because you have to add a userspace interface to your kext for it. But it's definitely worth it!
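To make the first option concrete, here is a minimal sketch of what the explicit timing might look like inside a kext. It is only an illustration under my own assumptions: my_operation() is a hypothetical stand-in for whatever you are measuring, and the userspace export (sysctl, IOUserClient, ...) is left out:

    /* Sketch of explicit kext instrumentation: timestamp with
     * mach_absolute_time(), convert with absolutetime_to_nanoseconds(),
     * and accumulate atomically so userspace can read the totals later. */
    #include <stdint.h>
    #include <mach/mach_time.h>
    #include <kern/clock.h>
    #include <libkern/OSAtomic.h>

    extern void my_operation(void);       /* hypothetical code under test */

    static volatile SInt64 g_total_ns;    /* running sum of elapsed time */
    static volatile SInt64 g_calls;       /* number of measured calls    */

    static void timed_my_operation(void)
    {
        uint64_t start, elapsed_abs, elapsed_ns;

        start = mach_absolute_time();
        my_operation();
        elapsed_abs = mach_absolute_time() - start;

        absolutetime_to_nanoseconds(elapsed_abs, &elapsed_ns);
        OSAddAtomic64((SInt64)elapsed_ns, &g_total_ns);
        OSAddAtomic64(1, &g_calls);
    }

Tracking minimum/maximum values and a sum of squares (for the standard deviation) follows the same pattern, and the counters can then be exposed through whatever userspace interface the kext already has.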
For example, I have a large linear function (1 basic block, ~1000 instructions)
which is called many times. After some fiddling with compiler options I've got
an unexpected 10% performance degradation on Cortex-A57. Presumably it is due to
slightly different instruction scheduling. I'd like to investigate the problem
more deeply and find out which instruction combinations cause unnecessary pipeline
stalls. But I have no idea how I could do that. I guess I need a very detailed
execution trace to understand what happens, though I'm not sure if it is
possible to get such a trace.
So, the question is: What tools can I use to investigate such low-level
performance problems? How can I determine what prevents the CPU from executing
the maximum number of instructions every cycle?
PS I'm mostly interested in Cortex-A57 cores, but I'd appreciate useful
information on any other core or even a different architecture.
PPS The function accesses memory, but it is expected that almost all memory
accesses hit the cache. The assumption is confirmed by perf stat -e r42,r43
(L1D_CACHE_REFILL_LD and L1D_CACHE_REFILL_ST events).
Tools: I'm most familiar with Intel compilers and tools but notice there are several similar tools out there for the ARM ecosystem. Here are some techniques I recommend.
USE YOUR COMPILER
It has many options that can give you a very good idea of what is going on.
Disable any optimizations (compiler option) while compiling your original code. This will tell you if the issue is related to code generation optimizations.
Do a before and after ASM dump, and compare. You may find code differences that you already know are suspect.
Make sure you are not including any debugging information. Debugging inserts checkpoints and other things that can potentially impact the performance of your code. These bits of code will also change how the code executes through the pipeline.
Change the compiler options one at a time to identify if the issue is related to data or code alignment enforcement, etc. I'm sure you've already done this but am mentioning it for completeness.
Enable any compiler performance monitoring options that can be dumped to a log file. A lot of useful information can be found in compiler log files. On the other hand, they also contain info that can only be interpreted by those that live on a higher plane of existence, i.e. compiler writers.
USE A TOOL THAT DUMPS PMU EVENTS
I saw quite a few out there. My apologies for not giving references but you can do a simple search "tool arm pmu events". These can be extremely sophisticated and powerful, e.g. Intel VTune, or very basic and still very powerful, e.g. the command line SEP for x86.
Take a look at the performance events (PMU events) available to you and figure out which events you want to monitor. You can get these events from the ARM Cortex-A57 processor tech reference (Chapter 11, Performance Monitoring Unit).
USE A PMU DUMPING SDK
Use an SDK that has functions for acquiring the ARM PMU events. These SDKs provide you with APIs for selecting and acquiring PMU events, giving you very precise control. Inserting this monitoring code may impact the execution of your code, so be careful of its placement. Again, you can find plenty of such SDKs out there with a simple search.
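I can't vouch for any particular ARM SDK, but the general pattern is always the same: program an event, enable the counter, run the region of interest, then read the count. Purely as an illustration (my assumption, not a vendor SDK), here is a sketch using Linux's generic perf_event_open() interface with the raw event number 0x42 that the question already used for L1D_CACHE_REFILL_LD:

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Thin wrapper: glibc does not export perf_event_open() directly. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_RAW;      /* raw PMU event number */
        attr.size = sizeof(attr);
        attr.config = 0x42;             /* L1D_CACHE_REFILL_LD on Cortex-A57 */
        attr.disabled = 1;              /* start disabled, enable around the region */
        attr.exclude_kernel = 1;

        fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the code region you want to measure here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count))
            printf("raw event 0x42: %llu\n", (unsigned long long)count);

        close(fd);
        return 0;
    }

The vendor SDKs typically wrap these same steps (select event, reset, enable, read, disable) behind their own API.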
STUDY UP ON PIPELINE DEBUGGING (IF YOU ARE REALLY INTO THIS TYPE OF STUFF)
Find a good architectural description of the pipeline, including reservation stations, # of ALUs, etc.
Find a good reference on how to figure out what is going on in the pipeline. Here's an example for x86. ARM is a different beast, but x86 articles will give you the basics (and more) of what you need to analyze and what you can do with what you find.
Good luck. Pipeline debugging can be fun but time consuming.
The profilers I have experience with (mainly the Digital Mars D profiler that comes w/ the compiler) seem to massively slow down the execution of the program being profiled. This has a major effect on my willingness to use a profiler, as it makes profiling a "real" run of a lot of my programs, as opposed to testing on a very small input, impractical. I don't know much about how profilers are implemented. Is a major (>2x) slowdown when profiling pretty much a fact of life, or are there profilers that avoid it? If it can be avoided, are there any fast profilers available for D, preferably for D2 and preferably for free?
I don't know about D profilers, but in general there are two different ways a profiler can collect profiling information.
The first is by instrumentation, by injecting logging calls all over the place. This slows down the application more or less. Typically more.
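To make the idea concrete (this is just a toy of my own in C, not how any specific D profiler works), GCC and Clang can inject such calls for you with -finstrument-functions; the hooks below only count entries and exits, where a real profiler would also record timestamps and call sites:

    #include <stdio.h>

    static unsigned long enters, exits;

    /* The compiler calls these hooks on every function entry/exit when the
     * code is built with -finstrument-functions. They must not be
     * instrumented themselves, hence the attribute. */
    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *fn, void *call_site)
    {
        (void)fn; (void)call_site;
        enters++;
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *fn, void *call_site)
    {
        (void)fn; (void)call_site;
        exits++;
    }

    static int work(int n)            /* instrumented: every call is counted */
    {
        return n ? work(n - 1) + 1 : 0;
    }

    int main(void)
    {
        printf("work(10) = %d\n", work(10));
        printf("instrumented entries so far: %lu, exits: %lu\n", enters, exits);
        return 0;
    }

    /* Build with: gcc -finstrument-functions toy_instrumentation.c */

Every single call pays for the extra hook calls, which is exactly where the slowdown comes from.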
The second is sampling. Here the profiler interrupts the application at regular intervals and inspects the call stack. This does not slow down the application very much at all.
The downside of a sampling profiler is that the result is not as detailed as with an instrumenting profiler.
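Again as a toy sketch of my own in C (real sampling profilers capture whole call stacks, usually from a separate thread or the kernel), the sampling idea boils down to a periodic timer interrupting the program and recording where it currently is:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    static volatile sig_atomic_t current_phase;   /* set by the main program    */
    static volatile long samples[2];              /* samples observed per phase */

    static void on_sample(int sig)
    {
        (void)sig;
        samples[current_phase]++;       /* a real profiler would record the call stack */
    }

    static void burn(long n)            /* stand-in for real work */
    {
        volatile long x = 0;
        while (n--)
            x += n;
    }

    int main(void)
    {
        struct sigaction sa;
        struct itimerval it;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sample;
        sa.sa_flags = SA_RESTART;
        sigaction(SIGPROF, &sa, NULL);

        /* fire SIGPROF roughly every millisecond of consumed CPU time */
        it.it_interval.tv_sec = 0;
        it.it_interval.tv_usec = 1000;
        it.it_value = it.it_interval;
        setitimer(ITIMER_PROF, &it, NULL);

        current_phase = 0;
        burn(300000000L);               /* roughly 3x the work of phase 1 */
        current_phase = 1;
        burn(100000000L);

        printf("phase 0: %ld samples, phase 1: %ld samples\n",
               (long)samples[0], (long)samples[1]);
        return 0;
    }

Phase 0 does about three times the work of phase 1, and the sample counts will reflect roughly that ratio, even though the program itself is barely slowed down.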
Check the documentation for your profiler if you can run with sampling instead of instrumentation. Otherwise you have some new Google terms in "sampling" and "instrumenting".
My favorite method of profiling slows the program way way down, and that's OK. I run the program under the debugger, with a realistic load, and then I manually interrupt it. Then I copy the call stack somewhere, like to Notepad. So it takes on the order of a minute to collect one sample. Then I can either resume execution, or it's even OK to start it over from the beginning to get another sample.
I do this 10 or 20 times, long enough to see what the program is actually doing from a wall-clock perspective. When I see something that shows up a lot, then I take more samples until it shows up again. Then I stop and really study what it is in the process of doing and why, which may take 10 minutes or more. That's how I find out if that activity is something I can replace with more efficient code, i.e. it wasn't totally necessary.
You see, I'm not interested in measuring how fast or slow it's going. I can do that separately with maybe only a watch. I'm interested in finding out which activities take a large percentage of time (not amount, percentage), and if something takes a large percentage of time, that is the probability that each stackshot will see it.
By "activity" I don't necessarily mean where the PC hangs out. In realistic software the PC is almost always off in a system or library routine somewhere. Typically more important is call sites in our code. If I see, for example, a string of 3 calls showing up on half of the stack samples, that represents very good hunting, because if any one of those isn't truly necessary and can be done away with, execution time will drop by half.
If you want a grinning manager, just do that once or twice.
Even in what you would think would be math-heavy scientific number crunching apps where you would think low-level optimization and hotspots would rule the day, you know what I often find? The math library routines are checking arguments, not crunching. Often the code is not doing what you think it's doing, and you don't have to run it at top speed to find that out.
I'd say yes, both sampling and instrumenting forms of profiling will tax your program heavily - regardless of whose profiler you are using, and in what language.
You could try h3r3tic's xfProf, which is a sampling profiler. Haven't tried it myself, but that guy always makes cool stuff :)
From the description:
If the program is sampled only a few hundred (or thousand)
times per second, the performance overhead will not be noticeable.
All too often I read statements about some new framework and its "benchmarks." My question is a general one, but it comes down to these specific points:
What approach should a developer take to effectively instrument code to measure performance?
When reading about benchmarks and performance testing, what are some red-flags to watch out for that might not represent real results?
There are two methods of measuring performance: using code instrumentation and using sampling.
The commercial profilers (Hi-Prof, Rational Quantify, AQTime) I used in the past used code instrumentation (some of them could also use sampling) and in my experience, this gives the best, most detailed result. Especially Rational Quantify allows you to zoom in on results, focus on subtrees, remove complete call trees to simulate an improvement, ...
The downside of these instrumenting profilers is that they:
tend to be slow (your code runs about 10 times slower)
take quite some time to instrument your application
don't always correctly handle exceptions in the application (in C++)
can be hard to set up if you have to disable the instrumentation of DLL's (we had to disable instrumentation for Oracle DLL's)
The instrumentation also sometimes skews the times reported for low-level functions like memory allocations, critical sections, ...
The free profilers that I use (Very Sleepy, Luke Stackwalker) rely on sampling, which means that it is much easier to do a quick performance test and see where the problem lies. These free profilers don't have the full functionality of the commercial profilers (although I submitted the "focus on subtree" functionality for Very Sleepy myself), but since they are fast, they can be very useful.
At this time, my personal favorite is Very Sleepy, with Luke StackWalker coming second.
In both cases (instrumenting and sampling), my experience is that:
It is very difficult to compare the results of profilers over different releases of your application. If you have a performance problem in your release 2.0, profile your release 2.0 and try to improve it, rather than looking for the exact reason why 2.0 is slower than 1.0.
You must never compare the profiling results with the timing (real time, cpu time) results of an application that is run outside the profiler. If your application consumes 5 seconds CPU time outside the profiler, and when run in the profiler the profiler reports that it consumes 10 seconds, there's nothing wrong. Don't think that your application actually takes 10 seconds.
That's why you must consistently check results in the same environment. Consistently compare results of your application when run outside the profiler, or when run inside the profiler. Don't mix the results.
Also use a consistent environment and system. If you get a faster PC, your application could still run slower, e.g. because the screen is larger and more needs to be updated on screen. If moving to a new PC, retest the last (one or two) releases of your application on the new PC so you get an idea on how times scale to the new PC.
This also means: use fixed data sets and check your improvements on these datasets. It could be that an improvement in your application improves the performance of dataset X, but makes it slower with dataset Y. In some cases this may be acceptable.
Discuss with the testing team what results you want to obtain beforehand (see Oded's answer on my own question What's the best way to 'indicate/numerate' performance of an application?).
Realize that a faster application can still use more CPU time than a slower application, if the faster one uses multi-threading and the slower one doesn't. Discuss (as said before) with the testing team what needs to be measured and what doesn't (in the multi-threading case: real time instead of CPU time).
Realize that many small improvements may add up to one big improvement. If you find 10 parts in your application that each take 3% of the time and you can reduce each of them to 1%, your application will take 20% less time.
It depends what you're trying to do.
1) If you want to maintain general timing information, so you can be alert to regressions, various instrumenting profilers are the way to go. Make sure they measure all kinds of time, not just CPU time.
2) If you want to find ways to make the software faster, that is a distinctly different problem.
You should put the emphasis on the find, not on the measure.
For this, you need something that samples the call stack, not just the program counter (over multiple threads, if necessary). That rules out profilers like gprof.
Importantly, it should sample on wall-clock time, not CPU time, because you are every bit as likely to lose time due to I/O as due to crunching. This rules out some profilers.
It should be able to take samples only when you care, such as not when waiting for user input. This also rules out some profilers.
Finally, and very important, is the summary you get.
It is essential to get per-line percent of time.
The percent of time used by a line is the percent of stack samples containing the line.
Don't settle for function-only timings, even with a call graph.
This rules out still more profilers.
(Forget about "self time", and forget about invocation counts. Those are seldom useful and often misleading.)
Accuracy of finding the problems is what you're after, not accuracy of measuring them. That is a very important point. (You don't need a large number of samples, though it does no harm. The harm is in your head, making you think about measuring, rather than about what the program is doing.)
One good tool for this is RotateRight's Zoom profiler. Personally I rely on manual sampling.
This is a rather general question ..
What hardware setup is best for large C/C++ compile jobs, like the Linux kernel or large applications?
I remember reading a post by Joel Spolsky on experiments with solid state disks and stuff like that.
Do I need more CPU power, more RAM, or a fast hard disk I/O solution like a solid state drive? Would it, for example, be convenient to have a 'normal' hard disk for the standard system and then use a solid state drive for the compilation? Or can I just buy lots of RAM? And how important is the CPU, or is it just sitting around most of the compile time?
Probably it's a stupid question, but I don't have a lot of experience in that field; thanks for any answers.
Here's some info on the SSD issue
Linus Torvalds on the topic
How to get the most out of it (Standard settings are sort of slowing it down)
Interesting article on Coding Horror about it, with a tip for an affordable chipset
I think you need enough of everything. CPU is very important, and compiles can easily be parallelised (with make -j), so you want as many CPU cores as possible. Then, RAM is probably just as important, since it provides more 'working space' for the compiler and allows your I/O to be buffered. Finally, of course, drive speed is probably the least important of the three - kernel code is big, but not that big.
Definitely not a stupid question, getting the build-test environment tuned correctly will make a lot of headaches go away.
Hard disk performance would probably top the list. I'd stay well away from solid state drives, as they're only rated for a large-but-limited number of write operations, and make-clean-build cycles will hammer them.
More importantly, can you take advantage of a parallel or shared build environment - from memory, ClearCase and Perforce had mechanisms to handle shared builds. Unless you have a parallelising build system, having multiple CPUs will be fairly pointless.
Last but not least, I would doubt that the build time would be the limiting factor - more likely you should focus on the needs of your test system. Before you look at the actual metal though, try to design a build-test system that's appropriate to how you'll actually be working - how often are your builds, how many people are involved, how big is your test system ....