What does Callgrind measure when profiling a parallel code?

I would like to profile my parallel code (both MPI and OpenMP).
I found that Callgrind is very easy to use and to analyze (with KCachegrind) for serial code, as it gives you the relative time spent in different functions.
What would it give me when running a parallel code?
Would it only monitor the master process, or would it sum over all processes?
Can it detect deadlocks, or places where one process is waiting for another?
Is there a better tool for profiling parallel code?

According to the Valgrind User Manual, Valgrind runs all threads serially, as detailed in the section 'Support for Threads'. Callgrind can still record which functions are called and how often, but its timing profile will be inadequate as a picture of parallel execution.
Regarding mpirun, there is a complete section in the manual describing Valgrind Memcheck with MPI programs: '4.9. Debugging MPI Parallel Programs with Valgrind'. I don't know its exact effect on Callgrind, but I suspect this kind of profiling remains serial.
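For what it's worth, Callgrind can be started under mpirun just like Memcheck; each rank then writes its own callgrind.out.<pid> file, which you can open separately in KCachegrind. A minimal hybrid MPI+OpenMP sketch to try this on (assumes an MPI installation and an OpenMP-capable compiler; names and counts are illustrative):

```c
/* Build (for example):  mpicc -fopenmp -g hybrid.c -o hybrid
   Profile:              mpirun -np 2 valgrind --tool=callgrind ./hybrid
   Each rank writes a callgrind.out.<pid> file for KCachegrind. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Some CPU work so there is something to profile. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= 10000000; i++)
        sum += 1.0 / (double)i;

    printf("rank %d: sum = %f\n", rank, sum);
    MPI_Finalize();
    return 0;
}
```

Note that, because Valgrind serializes threads, the per-function relative costs are still meaningful, but anything that depends on threads truly overlapping (waiting, contention, deadlocks) is not captured.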

Related

Benchmarking the CPU overhead introduced by a .so library or Linux executable

I'm looking for a reliable way to quantitatively measure the CPU overhead introduced by my shared library. The library is loaded into the context of a php-fpm process (as a PHP extension). Ultimately, I'd like to run a series of tests for two versions of the .so library, collect stats, and compare them, to see to what extent the current version of the library is better or worse than the previous one.
I have already tried a few approaches.
"perf stat" connected to PHP-FPM process and measuring cpu-cycles, instructions, and cpu-time
"perf record" connected to PHP-FPM process and collecting CPU cycles. Then I extract data from the collected perf.data. I consider only data related to my .so file and all-inclusive invocations (itself + related syscalls and kernel). So I can get CPU overhead for .so (inclusively).
valgrind on a few scripts running from CLI measuring "instruction requests".
All three options work, but they don't provide a reliable basis for comparing overhead because of run-to-run deviations (the error can be up to 30%, which is not acceptable). I tried doing multiple runs and averaging, but the accuracy of that is also questionable.
Valgrind is the most accurate of the three, but it only reports instruction counts, which don't translate directly into actual CPU overhead. Perf is better (it covers cycles, cpu-time, and instructions), but its results vary too much from run to run.
Has anyone had experience with similar tasks? Could you recommend other approaches or Linux profilers that measure overhead accurately and quantitatively?
Thanks!
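One way to cut run-to-run noise is to count events only around the region of interest instead of profiling the whole process. A minimal sketch using the perf_event_open(2) syscall, counting retired instructions for a single region (do_work is a hypothetical stand-in for a call into your .so):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Hypothetical stand-in for a call into the .so under test. */
static void do_work(void) {
    volatile double x = 0.0;
    for (long i = 1; i < 5000000; i++)
        x += 1.0 / (double)i;
}

static int open_counter(unsigned long long config) {
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = config;          /* e.g. PERF_COUNT_HW_INSTRUCTIONS */
    pe.disabled = 1;             /* start stopped; enable around the region */
    pe.exclude_kernel = 1;       /* drop this to include syscall/kernel work */
    pe.exclude_hv = 1;
    /* measure this process/thread, on any CPU */
    return (int)syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0);
}

int main(void) {
    int fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    do_work();                               /* the region being measured */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count))
        perror("read");
    printf("instructions: %lld\n", count);
    close(fd);
    return 0;
}
```

Instruction counts measured this way are close to deterministic; cycles and cpu-time will still vary with frequency scaling and cache state, so pinning CPU frequency and process affinity helps for those.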

Finding out where an application waits

After an update in Debian 10, the PDF viewer Atril (a fork of Evince) takes about 25 seconds to launch, even with no document. Previously it was almost instantaneous. Now I need to find out what causes this delay. When I run Atril under strace, it pauses at different system calls each time, so I cannot draw any conclusions from that. Next I built Atril from source and ran it in the gdb debugger, but there I can only see a couple of threads being created and exiting. How can I find out where in the source code the delay is?
You could run the program in the debugger and then interrupt it a few times using Ctrl+C. Each time, use where or bt to see where you are in the execution. There is a good explanation of this approach here. You probably want to enable debugging symbols when compiling (GCC: -g). Also, the compiled code can differ significantly from your original source code if you have optimizations turned on, so it might make sense to only use a debugging optimization level (GCC: -Og).
If there is a single step that causes this huge delay, this should allow you to instantly identify it.
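The same idea also works non-interactively by attaching to the running process (the PID here is hypothetical):

```
# Attach, dump all thread backtraces, detach; repeat a few times.
# Frames that keep showing up are where the time is going.
gdb -p 1234 -batch -ex "thread apply all bt"
```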
Otherwise, you could check out the other answers to this question. Profiling tools on Linux include valgrind and the perf tools.
For perf, keep in mind that its sampling-based approach can mislead you if your program is not actually executing but waiting, e.g. on IO.

Detecting AES-NI CPU instructions

Recent Intel and AMD CPUs support specific AES instructions that increase the performance of encryption and decryption.
Is it possible to detect when these instructions are called? For example, by writing a kernel module that monitors the instructions that are sent to the CPU? Or is the kernel still too high-level?
My understanding is that instructions like AESENC require no special privileges, so you won't be able to trap them with one of the usual fault handlers even in the kernel. You might be able to do it with a modified version of QEMU or VirtualBox, but that would be a bit of a pain.
I'm guessing you are trying to see if a software package supports AES-NI?
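If the goal is just to check whether the CPU offers AES-NI at all, that part is easy from user space: CPUID leaf 1 reports it in bit 25 of ECX. A minimal sketch using the cpuid.h helper shipped with GCC/Clang:

```c
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    /* Leaf 1 = processor feature flags; AES-NI is ECX bit 25 (bit_AES). */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 1 not supported");
        return 1;
    }
    puts((ecx & bit_AES) ? "AES-NI supported" : "AES-NI not supported");
    return 0;
}
```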
The currently accepted answer is, in some sense, technically correct. It does not, however, answer your question.
Theoretically it would be possible to monitor when these instructions are used by any process, but it would create an infeasible amount of overhead: you would essentially have to take a conditional branch every time any instruction is executed (!). It can be achieved, for example, with an in-target probe, but that is a very costly and non-scalable solution.
A more realistic but less precise solution is program counter sampling. You can check which instruction is being executed at a given moment by looking at a process's program counter, and the program counter of a process can indeed be accessed from a kernel module. If you sample at a high enough resolution, you get a decent indicator of whether AES-NI instructions are being used.

Goroutines and the scheduler

I didn't understand this sentence. Please explain it to me in detail, in easy English:
Go routines are cooperatively scheduled, rather than relying on the kernel to manage their time sharing.
Disclaimer: this is a rough and inexact description of scheduling in the kernel and in the Go runtime, aimed at explaining the concepts rather than being an exact or detailed account of the real system.
As you may (or may not) know, a CPU core can't actually run two programs at the same time: it has a single execution thread, which can execute one instruction at a time. The direct consequence on early systems was that you couldn't run two programs at the same time, since each program needed a dedicated hardware thread.
The solution currently adopted is called pseudo-parallelism: given a number of logical threads (e.g. multiple programs), the system executes one of them for a certain amount of time, then switches to the next one. Using really small time slices (on the order of milliseconds), it gives the human user the illusion of parallelism. This operation is called scheduling.
The Go language doesn't use this system directly: it implements its own scheduler, which runs on top of the system scheduler and schedules the execution of goroutines itself, avoiding the cost of using a real kernel thread for each goroutine. This kind of system is called light (or green) threads.
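To make "cooperatively scheduled" concrete, here is a toy cooperative scheduler in C (using ucontext; all names are illustrative, and the real Go runtime is far more sophisticated). Each task keeps the CPU until it voluntarily calls yield(), which is the defining difference from kernel preemption:

```c
#include <stdio.h>
#include <ucontext.h>

#define NTASKS 2
#define STEPS  3

static ucontext_t sched_ctx, task_ctx[NTASKS];
static char stacks[NTASKS][64 * 1024];
static int current;

/* Cooperative: a task keeps the CPU until it calls yield(). */
static void yield(void) {
    swapcontext(&task_ctx[current], &sched_ctx);
}

static void task(void) {
    for (int step = 0; step < STEPS; step++) {
        printf("task %d, step %d\n", current, step);
        yield();                          /* voluntarily give up the CPU */
    }
}

int main(void) {
    for (int t = 0; t < NTASKS; t++) {
        getcontext(&task_ctx[t]);
        task_ctx[t].uc_stack.ss_sp = stacks[t];
        task_ctx[t].uc_stack.ss_size = sizeof(stacks[t]);
        task_ctx[t].uc_link = &sched_ctx;   /* where control goes on return */
        makecontext(&task_ctx[t], task, 0);
    }
    /* Round-robin "scheduler": each task needs STEPS + 1 resumes
       (STEPS yields plus the final return through uc_link). */
    for (int round = 0; round <= STEPS; round++)
        for (current = 0; current < NTASKS; current++)
            swapcontext(&sched_ctx, &task_ctx[current]);
    return 0;
}
```

Go's scheduler works on the same principle, except that goroutines yield implicitly at points such as channel operations and blocking calls, so you never write yield() yourself.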

How to measure load balancing in OpenMP with GCC

I am writing a program with GCC's OpenMP, and now I want to check whether my OpenMP program has a well-balanced load. Are there methods to do this?
By the way, what is a good way to measure load balancing? (I don't want to use the Intel VTune tool.)
I am not sure if this is the right place for my question; any replies are appreciated. Let me make the question more detailed.
I am writing OpenMP programs with the GCC compiler, and I want to know the details of the overhead of GCC's OpenMP. My concerns are given below.
1) What is a good way to optimize my OpenMP program? Many aspects affect performance, such as load balancing, locality, scheduling overhead, and synchronization. In which order should I check them?
2) How can I measure the load balancing of my application under GCC's OpenMP? How should I instrument my application and the OpenMP runtime to extract load-balancing information?
3) I guess OpenMP spends some time on scheduling. Which runtime APIs should I instrument to measure the scheduling overhead?
4) Can I measure the time an OpenMP program spends on synchronization, critical sections, locks, and atomic operations?
ompP is a profiler for OpenMP applications. It reports the percentage of execution time spent in critical sections, and it also measures the imbalance at the implicit barriers of the different OpenMP constructs.
A different approach to measuring load imbalance is the likwid-perfctr tool, which reports the number of instructions executed on each core. For applications with the same amount of work per thread, variance in the instruction counts across cores is an indicator of load imbalance.
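If you just want a quick, tool-free estimate of imbalance, you can also time each thread's share of the work with omp_get_wtime; a minimal sketch (the loop body is a placeholder for your real kernel):

```c
/* Build: gcc -fopenmp balance.c -o balance */
#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 100000000 };
    double sum = 0.0;
    double work_time[256] = {0};   /* assumes at most 256 threads */
    int nthreads = 1;

    #pragma omp parallel reduction(+:sum)
    {
        int id = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();   /* implicit barrier syncs the start */

        double t0 = omp_get_wtime();
        /* 'nowait' removes the loop's implicit barrier, so the timer
           measures work done rather than time spent waiting for others. */
        #pragma omp for schedule(static) nowait
        for (long i = 0; i < N; i++)
            sum += 1.0 / (double)(i + 1);   /* placeholder for the real kernel */
        work_time[id] = omp_get_wtime() - t0;
    }

    double tmin = work_time[0], tmax = work_time[0];
    for (int i = 1; i < nthreads; i++) {
        if (work_time[i] < tmin) tmin = work_time[i];
        if (work_time[i] > tmax) tmax = work_time[i];
    }
    printf("sum = %f; per-thread work time: min %.3f s, max %.3f s "
           "(imbalance %.1f%%)\n", sum, tmin, tmax,
           100.0 * (tmax - tmin) / tmax);
    return 0;
}
```

The nowait clause matters here: without it, each thread's timer would also include the time spent waiting at the loop's implicit barrier, hiding exactly the imbalance you are trying to see.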
