How can I get stats when using make to compile gcc?

I would like to get statistics about CPU usage, memory consumption, filesystem activity, and the time spent compiling the various stages and components/sublibraries (plus other important bits) after a successful build of gcc done with make.
Is it possible to get stats out of make?

I don't know of any tool that can do everything.
For a very basic overview (time spent in the process), use time make ....
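A slightly richer variant of the same idea, as a minimal sketch on a GNU/Linux machine (the -j value and log file name are just examples): the standalone GNU time binary (/usr/bin/time, not the shell builtin) with -v also reports maximum resident set size, page faults, and filesystem I/O counters for the whole invocation.

# GNU time, verbose resource summary for the entire build
/usr/bin/time -v make -j"$(nproc)" 2>&1 | tee build-stats.log

To break the figures down per stage, you could wrap the same command around individual top-level targets such as make all-gcc or make all-target-libgcc instead of the full build.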
If you need more details or exact figures, you need a profiler. For CPU usage, use gprof. For memory usage, you can use valgrind. For I/O, you can use ioprofile or iogrind.

Related

Benchmarking of the CPU overhead introduced by .so library or Linux executable

I'm looking for a reliable way to quantitatively measure the CPU overhead introduced by my shared library. The library is loaded in the context of a php-fpm process (as a PHP extension). Ultimately, I'd like to run a series of tests for two versions of the .so library, collect stats, and compare them, to see to what extent the current version of the library is better or worse than the previous one.
I have already tried a few approaches:
"perf stat" attached to the PHP-FPM process, measuring cpu-cycles, instructions, and cpu-time.
"perf record" attached to the PHP-FPM process, collecting CPU cycles. I then extract data from the collected perf.data, considering only the data related to my .so file together with everything it calls (related syscalls and kernel code), which gives me the inclusive CPU overhead of the .so.
valgrind on a few scripts run from the CLI, measuring "instruction requests".
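For reference, the first two approaches look roughly like this (the PID and sampling durations are placeholders):

# counters for a running php-fpm worker over 30 seconds
perf stat -e cycles,instructions,task-clock -p 12345 -- sleep 30
# sampled call stacks for the same worker, inspected afterwards
perf record -g -p 12345 -- sleep 30
perf report

For CLI runs, perf stat -r 10 -- php myscript.php repeats the measurement and reports the standard deviation, which at least quantifies the run-to-run spread (the script name is a placeholder).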
All three options work but don't provide a reliable way to compare overhead because of run-to-run deviations (the error can be up to 30%, which is not acceptable). I tried doing multiple runs and calculating the average, but the accuracy of that is also questionable.
Valgrind is the most accurate of the three, but it provides only "instructions", which do not reflect actual CPU overhead. Perf is better (considering cycles, cpu-time, and instructions), but its results vary too much from run to run.
Has anyone got experience with similar tasks? Could you recommend other approaches or Linux profilers to measure overhead accurately and quantitatively?
Thanks!

How to measure load balancing in GCC OpenMP

I am writing a program with GCC OpenMP, and now I want to check whether my OpenMP program has a well-balanced load. Are there methods to do this?
BTW, what is a good method to measure load balancing? (I don't want to use the Intel VTune tool.)
I am not sure if this is the right place for my question; any replies are appreciated. I have made the question more detailed below.
I am writing OpenMP programs with the GCC compiler, and I want to know the details of the overhead of GCC-OpenMP. My concerns are given below.
1) What is a good way to optimize my OpenMP program? There are many aspects that affect performance, such as load balancing, locality, scheduling overhead, synchronization, and so on. In which order should I check these aspects?
2) I want to know how to measure the load balancing of my application under GCC-OpenMP. How should I instrument my application and the OpenMP runtime to extract load-balancing information?
3) I guess OpenMP spends some time on scheduling. Which runtime APIs should I instrument to measure the scheduling overhead?
4) Can I measure the time that an OpenMP program spends on synchronization, critical sections, locks and atomic operations?
ompP is a profiler for OpenMP applications. It reports the percentage of execution time spent in critical sections and also measures the imbalance at the implicit barriers of the different OpenMP constructs.
A different approach to measuring load imbalance is to use the likwid-perfctr tool, which reports the number of instructions executed on each core. For applications with the same amount of work per thread, variance in the number of instructions executed on different cores is an indicator of load imbalance.
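A minimal sketch, assuming the OpenMP threads are pinned to cores 0-3 and that the FLOPS_DP performance group exists on your CPU (the group name is an assumption; likwid-perfctr -a lists the groups your machine actually supports):

# pin the threads to cores 0-3 and read per-core hardware counters for the run
likwid-perfctr -C 0-3 -g FLOPS_DP ./my_omp_app

A large spread in the retired-instruction counts across the cores, for a code in which every thread should do the same amount of work, points to load imbalance.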

Kernel module profilers

I want to profile some kernel modules (for example, the network subsystem module).
Can we profile time / cpu utilization of a function in kernel module?
I heard about some profilers such as:
perf for system-wide profiling
valgrind -- application level
Is there a profiler that best suits my use case above?
I really appreciate your time, thanks
You had it right! Perf is the tool for you. Since you're going to profile a kernel module, there's no point in using userland tools such as valgrind.
Usually when monitoring software you care about how much time your system spends in each function; perf top gives you a good estimate of how much time the system is spending in each function.
Functions that you're spending a lot of time in can be very good pointers for optimization.
I'm not sure I understand the time / cpu model you require, but I think the above should meet your needs.
You can read more about how to use perf here.
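As a minimal sketch (the sampling period is just an example), perf can also record system-wide, kernel code included, and the report can then be narrowed down to your module's functions:

# sample all CPUs, kernel included, for 10 seconds with call graphs
perf record -a -g -- sleep 10
# interactive report; symbols from a loaded module typically appear under that module's name
perf report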
[EDIT]
Like @myaut said, there are other kernel profiling tools. While I have very good experience with perf and I disagree with @myaut about the quality of the results, it is well worth mentioning some of the other tools. If you're just interested in getting the job done, perf will do just fine, but if you want to learn about other profiling tools and their abilities, I found a nice reference here.
I doubt that profiling by itself will reveal useful results -- you would need the function to be called very often or to spend significant time in it. Otherwise you will get a very small amount of data, since perf samples across all modules.
If you want to measure the real time spent executing a function, I suggest you look at SystemTap:
stap -e 'global tms;
    probe kernel.function("dev_queue_xmit") {
        tms[cpu()] = local_clock_ns();
    }
    probe kernel.function("dev_queue_xmit").return {
        println(local_clock_ns() - tms[cpu()]);
    }'
This script saves the local CPU time in nanoseconds into the tms associative array on entry to dev_queue_xmit(). When the CPU leaves dev_queue_xmit(), the second probe computes the delta. Note that if execution migrates to another CPU while inside dev_queue_xmit(), the results can be messy.
To measure times for a whole module, replace kernel.function("dev_queue_xmit") with module("NAME").function("*"), but attaching to many functions may affect performance. You can also use get_cycles() instead of local_clock_ns() to count CPU cycles.
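For instance, the module-wide variant might look like this (NAME is a placeholder for the module name; probefunc() labels each measurement with the function that fired the probe):

stap -e 'global tms;
    probe module("NAME").function("*") {
        tms[cpu()] = local_clock_ns();
    }
    probe module("NAME").function("*").return {
        printf("%s %d\n", probefunc(), local_clock_ns() - tms[cpu()]);
    }'

Keep in mind that nested calls within the module overwrite the per-CPU timestamp, so the printed deltas are only a rough per-function estimate.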

CUDA profiler says that my two kernels are expensive, but their execution time seems to be small

I use two kernels; let's call them A and B.
I ran the CUDA profiler and this is what it returned:
The first kernel has 44% overhead while the second has 20%.
However, if I decide to find out the actual execution time by following this logic:
timeval tim;
gettimeofday(&tim, NULL);
double before = tim.tv_sec + (tim.tv_usec / 1000000.0);
runKernel<<<...>>>(...);
gettimeofday(&tim, NULL);
double after = tim.tv_sec + (tim.tv_usec / 1000000.0);
totalTime = totalTime + after - before;
The totalTime will be very small, somewhere around 0.0001 seconds.
I'm new to CUDA and I don't understand exactly what's going on. Should I try and make the kernels more efficient or are they already efficient?
Kernel calls are asynchronous from the point of view of the CPU (see this answer). If you time your kernel the way you do without any synchronization (i.e. without calling cudaDeviceSynchronize()), your timings will not mean anything since computation is still in progress on the GPU.
You can trust NVIDIA's profilers when it comes to timing your kernels (nvprof / nvvp). The NVIDIA Visual Profiler can also analyze your program and give some advice on what may be wrong with your kernels: uncoalesced memory accesses, an inefficient number of threads/blocks, etc. You also need to compile your code in release mode with optimization flags (e.g. -O3) to get relevant timings.
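As a minimal sketch (file and binary names are placeholders):

# release build with optimization, then per-kernel timings from nvprof
nvcc -O3 -o my_app my_app.cu
nvprof ./my_app

nvprof prints a summary with the time spent in, and the number of calls to, each kernel and each CUDA API call, which is far more reliable than host-side gettimeofday() around an asynchronous launch.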
Concerning kernel optimization, you need to find your bottleneck (e.g. your 44% kernel), analyze it, and apply the usual optimization techniques:
Use the effective bandwidth of your device to work out what the upper bound on performance ought to be for your kernel
Minimize memory transfers between host and device - even if that means doing calculations on the device which are not efficient there
Coalesce all memory accesses
Prefer shared memory access to global memory access
Avoid code execution branching within a single warp as this serializes the threads
You can also use instruction-level parallelism (you should read these slides).
It is, however, hard to know when you cannot optimize your kernels any further. Saying that the execution time of your kernels is small does not mean much: small compared to what? Are you trying to do some real-time computation? Is scalability an issue? These are some of the questions you need to answer before trying to optimize your kernels.
On a side note, you should also use error checking extensively, and rely on cuda-memcheck/cuda-gdb to debug your code.
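For example (the binary name is a placeholder), a plain run under the memory checker is often enough to surface out-of-bounds or misaligned accesses in your kernels:

# report invalid global memory accesses and misaligned addresses
cuda-memcheck ./my_app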

How do I figure out whether my process is CPU bound, I/O bound, memory bound or something else?

I'm trying to reduce the time it takes to compile my application, and one thing I'm investigating is what resources, if any, I can add to the build machine to speed things up. To that end, how do I figure out whether I should invest in more CPU, more RAM, a better hard disk, or whether the process is bound by some other resource? I already saw this (How to check if app is cpu-bound or memory-bound?) and am looking for more tips and pointers.
What I've tried so far:
Time the process on the build machine vs. on my local machine. I found that the build machine takes twice as long as my machine.
Run "Resource Monitor" and look at the CPU, memory and disk usage while the process is running. While doing this, I have trouble interpreting the numbers, mainly because I don't understand what each column means, how that translates to a virtual machine vs. a physical box, and what it means on multi-CPU boxes.
Start > Run > perfmon.exe
Performance Monitor can graph many system metrics that you can use to deduce where the bottlenecks are, including CPU load, I/O operations, pagefile hits and so on.
Additionally, the Platform SDK now includes a tool called XPerf that can provide information more relevant to developers.
Random-pausing will tell you your percentage split between CPU and I/O time.
Basically, if you grab 10 random stackshots, and if 80% (for example) of the time is in I/O, then roughly 8 ± 1.3 of those samples will show the stack reaching into the system routine that reads or writes a buffer.
If you want higher precision, take more samples.
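If the build runs on Linux, one crude way to grab such a stackshot of a compiler process is with gdb; the process name below (cc1plus, GCC's C++ compiler proper) is just an example:

# attach to the most recently started cc1plus, print one backtrace, detach
gdb -p "$(pgrep -n cc1plus)" -batch -ex "bt"

Repeat this a handful of times while the build is running; the fraction of backtraces that sit inside I/O routines rather than compilation code approximates the split described above.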
