Linux perf and MKL - openmp

I've been trying to profile our app (amd64, RHEL 7.6, built with GCC 5.3, using MKL + OMP). I used perf record, but all I see is a small number of samples in the OMP library. Nothing in main() or below. This is with one 10-minute run and also another that only lasts a second or so.
Is MKL + OMP doing some non-standard threading that perf can't follow?
I'll try running the test and then separately running perf record -p.
Does anyone have experience with perf record and MKL? Maybe VTune will work better!

It turned out the problem was -f(no-)omit-frame-pointer. I was building with -O3 -g3, and perf record failed to get the stacks; I had assumed -g3 would inhibit -fomit-frame-pointer, but debug info does not prevent frame-pointer omission at -O3. Presumably MKL still keeps its frame pointers, which is why perf could get stack traces inside the MKL/OMP library.
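For anyone hitting the same thing, a minimal sketch of the workaround (the source/binary names and the MKL link line are placeholders; the compiler and perf options are standard):
gcc -O3 -g -fno-omit-frame-pointer -fopenmp myapp.c -o myapp -lmkl_rt   # keep frame pointers despite -O3
perf record --call-graph fp ./myapp                                      # frame-pointer based call graphs
perf report
perf record --call-graph dwarf ./myapp                                   # alternative: DWARF unwinding, no frame pointers needed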

Related

ALSA external plugin and openmp in C++

I'm creating an ALSA external module using the gtkIOStream ALSAExternalPlugin class.
In my external plugin code, I am making the necessary OpenMP calls:
omp_set_num_threads(omp_get_max_threads());
printf("omp_get_num_threads()=%d\n", omp_get_num_threads());
I am also compiling with the necessary OpenMP flags and libraries (-fopenmp and -lgomp).
However, when I run my code using "aplay -DexternalPlugin file", the system reports only one thread in use instead of the expected 20 threads.
Am I missing something?
The linker flags for building the external plugin are as follows:
-fopenmp -lgomp -module -avoid-version -export-dynamic -no-undefined
-fopenmp is also in the CPP flags, and I can see it being passed at compile time.
Setting the number of threads does not make your code go parallel. As written, you are setting the number of threads that the next parallel region will use, and then printing the number of threads currently in use, which is indeed one, since you have not entered a parallel region.
In general, there is no point in forcing the number of threads, since any sane OpenMP runtime (certainly GCC and LLVM) will use all of the available threads by default.
Just print omp_get_max_threads() to see what will be used.
Of course, looking at machine load externally when running your code is also a way to check this!
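To illustrate the point, a minimal sketch (file name and compile line are arbitrary): the count is 1 outside a parallel region and only becomes the full team size inside one.
#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("outside: omp_get_num_threads()=%d\n", omp_get_num_threads()); /* prints 1 */
    #pragma omp parallel
    {
        #pragma omp single
        printf("inside:  omp_get_num_threads()=%d\n", omp_get_num_threads()); /* full team size */
    }
    return 0;
}
Build with: gcc -fopenmp demo.c -o demo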

OpenCL clCreateContextFromType function results in memory leaks

I ran valgrind on one of my open-source OpenCL codes (https://github.com/fangq/mmc), and it detected a lot of memory leaks in the OpenCL host code. Most of them point back to the line where I created the context object using clCreateContextFromType.
I double-checked all my OpenCL variables, command queues, kernels and programs, and made sure that they are all properly released, but still, when testing on sample programs, every call to the mmc_run_cl() function bumps memory up by 300MB-400MB that is not released at return.
You can reproduce the valgrind report by running the commands below in a terminal:
git clone https://github.com/fangq/mmc.git
cd mmc/src
make clean
make all
cd ../examples/validation
valgrind --show-leak-kinds=all --leak-check=full ../../src/bin/mmc -f cube2.inp -G 1 -s cube2 -n 1e4 -b 0 -D TP -M G -F bin
assuming your system has gcc/git/libOpenCL and valgrind installed. Change the -G 1 input to a different number if you want to run it on other OpenCL devices (add -L to list them).
In the table below, I list the repeat count of each valgrind-detected leak on an NVIDIA GPU (Titan V) on a Linux box (Ubuntu 16.04) with the latest driver + CUDA 9.
Again, most leaks are associated with the clCreateContextFromType line, which I assume means some GPU memory is not being released, but I did release all GPU resources at the end of the host code.
Do you notice anything that I missed in my host code? Your input is much appreciated.
counts | error message
------------------------------------------------------------------------------------
380 ==27828== by 0x402C77: main (mmc.c:67)
Code: entry point to the below errors
64 ==27828== by 0x41CF02: mcx_list_gpu (mmc_cl_utils.c:135)
Code: OCL_ASSERT((clGetPlatformIDs(0, NULL, &numPlatforms)));
4 ==27828== by 0x41D032: mcx_list_gpu (mmc_cl_utils.c:154)
Code: context=clCreateContextFromType(cps,devtype[j],NULL,NULL,&status);
58 ==27828== by 0x41DF8A: mmc_run_cl (mmc_cl_host.c:111)
Code: entry point to the below errors
438 ==27828== by 0x41E006: mmc_run_cl (mmc_cl_host.c:124)
Code: OCL_ASSERT(((mcxcontext=clCreateContextFromType(cprops,CL_DEVICE_TYPE_ALL,...));
13 ==27828== by 0x41E238: mmc_run_cl (mmc_cl_host.c:144)
Code: OCL_ASSERT(((mcxqueue[i]=clCreateCommandQueue(mcxcontext,devices[i],prop,&status),status)));
1 ==27828== by 0x41E7A6: mmc_run_cl (mmc_cl_host.c:224)
Code: OCL_ASSERT(((gprogress[0]=clCreateBufferNV(mcxcontext,CL_MEM_READ_WRITE, NV_PIN, ...);
1 ==27828== by 0x41E7F9: mmc_run_cl (mmc_cl_host.c:225)
Code: progress = (cl_uint *)clEnqueueMapBuffer(mcxqueue[0], gprogress[0], CL_TRUE, ...);
10 ==27828== by 0x41EDFA: mmc_run_cl (mmc_cl_host.c:290)
Code: status=clBuildProgram(mcxprogram, 0, NULL, opt, NULL, NULL);
7 ==27828== by 0x41F95C: mmc_run_cl (mmc_cl_host.c:417)
Code: OCL_ASSERT((clEnqueueReadBuffer(mcxqueue[devid],greporter[devid],CL_TRUE,0,...));
Update [04/11/2020]:
Following @doqtor's comment, I ran the same test on 5 different devices: 2 NVIDIA GPUs, 2 AMD GPUs, and 1 Intel CPU. What he said was correct - the memory leak does not happen with the Intel OpenCL library, and I found that the AMD OpenCL driver is fine too. The only problem is that the NVIDIA OpenCL library seems to leak on both GPUs I tested (Titan V and RTX 2080).
My test results are below; memory/CPU profiling was done using psrecord, as introduced in this post.
I will open a new question and bounty on how to reduce this memory leak with the NVIDIA OpenCL library. If you have any experience with this, please share; I will post the link below. Thanks.
I double checked all my OpenCL variables, command queues, kernels and programs, and made sure that they are all properly released...
Well I still found one (tiny) memory leak in mmc code:
==15320== 8 bytes in 1 blocks are definitely lost in loss record 14 of 1,905
==15320== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15320== by 0x128D48: mmc_run_cl (mmc_cl_host.c:137)
==15320== by 0x11E71E: main (mmc.c:67)
Memory allocated for greporter isn't freed, so that is for you to fix.
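Only a hypothetical sketch of the missing cleanup, assuming greporter is a small host-side array allocated with malloc() in mmc_run_cl() (mmc_cl_host.c:137); I have not checked the surrounding mmc code:
/* the OpenCL buffers it refers to appear to be released already,
   but the host-side array itself is not */
free(greporter);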
The rest are potential memory leaks in the OpenCL library. They may or may not be real leaks; for example, the library may use custom memory allocators which valgrind does not recognize, or it may do other tricks. There are a lot of threads about that:
clGetPlatformIDs Memory Leak
https://software.intel.com/en-us/forums/opencl/topic/753786
https://github.com/KhronosGroup/OpenCL-ICD-Loader/issues/13
OpenCL clGetPlatformIDs gives around 230 valgrind memcheck errors
In general you can't do much about those unless you want to dive into the library code yourself.
I would suggest carefully suppressing the reported errors that come from the library. A suppression file can be generated as described in the valgrind manual: https://valgrind.org/docs/manual/manual-core.html#manual-core.suppress
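A sketch of how that could look (the suppression file name is a placeholder; --gen-suppressions and the suppression syntax are standard valgrind):
valgrind --leak-check=full --gen-suppressions=all ../../src/bin/mmc -f cube2.inp -G 1 -s cube2 -n 1e4 -b 0 -D TP -M G -F bin
Paste the entries that come from the driver into e.g. opencl.supp; a hand-written entry matching anything allocated inside the NVIDIA OpenCL library might look like:
{
   nvidia_opencl_leaks
   Memcheck:Leak
   ...
   obj:*libnvidia-opencl*
}
Then re-run with: valgrind --suppressions=opencl.supp --leak-check=full ...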
... but still, when testing on sample programs, every call to the mmc_run_cl() function bumps up memory by 300MB-400MB and won't release at return
How did you check that? I haven't seen memory growing suspiciously. I set -n 1000e4, which made it run for about 2 minutes, and the allocated memory stayed at ~0.6% of my RAM size the whole time. Note that I didn't use NVIDIA CUDA but POCL on an Intel GPU and CPU, linked with the libOpenCL installed from the ocl-icd-libopencl1:amd64 package on Ubuntu 18.04. So you may try giving that a go and checking whether it changes anything.
======== Update ================================
I've re-run it as you described in the comment: after the first iteration the memory usage was 0.6%, after the 2nd iteration it increased to 0.9%, and the following iterations didn't increase memory usage further. Valgrind also didn't report anything new beyond what I observed earlier. So I would suggest linking against a libOpenCL other than the NVIDIA CUDA one and retesting.
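If you want to try that, a sketch (the package names are the Ubuntu ones; adjust for your distribution):
ldd ../../src/bin/mmc | grep -i opencl                       # see which libOpenCL.so the binary actually loads
sudo apt install ocl-icd-libopencl1 pocl-opencl-icd clinfo   # generic ICD loader + POCL
clinfo                                                       # confirm which platforms/devices are now visible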

Why doesn't the Linux kernel see the cache sizes in the gem5 emulator in full system mode?

I want to play around with cache sizes in my gem5 simulator to see how they affect the performance of programs, and possibly to tune programs at runtime.
As a sanity check, I wanted to verify that the command-line arguments I used were taking effect, so I tried the various methods proposed at: https://superuser.com/questions/55776/finding-l2-cache-size-in-linux/1298808#1298808
cat /sys/devices/system/cpu/cpu0/cache/index2/size
getconf LEVEL2_CACHE_SIZE
But I observed that:
the file /sys/devices/system/cpu/cpu0/cache/index2/size does not exist
the getconf output is empty
Why is that?
I am certain, however, that the caches are being used, since I've benchmarked simple programs and the cycle counts increase when I decrease the cache sizes.
For example, my base command is:
M5_PATH='/data/git/linux-kernel-module-cheat/gem5/gem5-system' '/data/git/linux-kernel-module-cheat/gem5/gem5/build/ARM/gem5.opt' '/data/git/linux-kernel-module-cheat/gem5/gem5/configs/example/fs.py' --command-line='earlyprintk=pl011,0x1c090000 console=ttyAMA0 lpj=19988480 rw loglevel=8 mem=512MB root=/dev/sda nokaslr norandmaps printk.devkmsg=on printk.time=y' --disk-image='/data/git/linux-kernel-module-cheat/buildroot/output.arm-gem5~/images/rootfs.ext2' --dtb-file='/data/git/linux-kernel-module-cheat/gem5/gem5/system/arm/dt/armv7_gem5_v1_1cpu.dtb' --kernel='/data/git/linux-kernel-module-cheat/buildroot/output.arm-gem5~/build/linux-custom/vmlinux' --machine-type=VExpress_GEM5_V1 --num-cpus=1 --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024 --cpu-type=HPI
With those tiny caches, running the following:
m5 resetstats && dhrystone 10000 && m5 dumpstats
takes 175M cycles, and only 16M cycles if I use the exact same command but with huge caches of size 1024MB.
I observe a similar behavior for x86.
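Independently of what the guest kernel reports, the configured caches can also be confirmed from the statistics gem5 dumps; a rough sketch (the exact stat names vary between gem5 versions and config scripts):
grep -E '(l2|dcache|icache)' m5out/stats.txt | grep -E '(miss|hit)' | head   # after an m5 dumpstats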
I'm using this testing infrastructure: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/05d8a324f74849f03404eb847f8da748e2e4502c#gem5-change-system-parameters which implies:
gem5 commit: fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4
Linux kernel v4.15 with configuration: https://github.com/cirosantilli/linux-kernel-module-cheat/blob/05d8a324f74849f03404eb847f8da748e2e4502c/kernel_config_arm-gem5
Related thread on the mailing list: http://gem5-users.gem5.narkive.com/4xVBlf3c/verify-cache-configuration
For comparison, QEMU v2.11.0 does show the cache sizes for x86, but not for ARM.
Maybe for ARM we would need to modify the bootloaders to pass that information to the kernel? But I don't know those mechanisms very well:
https://github.com/gem5/gem5/blob/fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4/system/arm/simple_bootloader/simple.S
https://github.com/gem5/gem5/blob/fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4/system/arm/aarch64_bootloader/boot.S
I have been told that:
gem5 doesn't implement the cache size discovery registers.
The problem is that it is really hard to configure them in the general case, and they might not even be able to represent the hierarchy in gem5.

Profiling OpenMP: update_cfs_rq_blocked_load takes majority of runtime

I am profiling some OpenMP code using Intel's amplxe (VTune Amplifier). It's giving me some confusing results, and I am not sure if something weird is going on with the profiler.
I use __itt_pause(); and __itt_resume(); to isolate the OpenMP section of the code, but my bottom-up view looks like this:
_raw_spin_lock vmlinux 23.474s
update_cfs_rq_blocked_load vmlinux 15.215s
__schedule vmlinux 10.695s
put_prev_task_fair vmlinux 7.918s
update_cfs_shares vmlinux 7.447s
Could this be the effect of an over-parallelised OpenMP region, or is the profiler reporting something weird? I was expecting to see more OpenMP-related information in the profiler result, but it is all at the bottom, and the runtime is dominated by kernel routines.
Could someone explain whether these are things one is likely to see in OpenMP profiles, and what exactly the CFS functions do? (My understanding so far is that there's a lot of locking going on in the scheduler!?)
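One way to test the over-parallelisation/oversubscription hypothesis (these are standard OpenMP environment variables, not anything amplxe-specific; the binary name is a placeholder) is to cap and pin the threads and see whether the kernel scheduler time drops:
export OMP_NUM_THREADS=8        # try fewer threads than hardware cores
export OMP_PROC_BIND=true       # pin threads so they stop migrating between cores
export OMP_WAIT_POLICY=passive  # sleep instead of spinning at barriers/locks
./my_openmp_app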

C++ Compilation flags for an R package in Windows/Mac

I developed an R package which calls C++ code through Rcpp and RcppEigen. My Makevars.win looks like this (the enumeration refers to my questions below):
1. CXX_STD = CXX11
2. PKG_CPPFLAGS = -fopenmp -O3 -Wall -ftree-vectorize -march=native -mavx -mfma
3. PKG_CXXFLAGS += $(SHLIB_OPENMP_CXXFLAGS)
4. PKG_LIBS = -fopenmp
5. PKG_LIBS += $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CXXFLAGS)
6. PKG_CPPFLAGS += -I../inst/include/
since I want to use OpenMP and link the R package against the Intel MKL library. I am also adding the plugins // [[Rcpp::plugins(cpp11)]] and // [[Rcpp::plugins(openmp)]] in my source files.
When I compile the package, everything works fine, but I am still getting the default compilation flags -O2 and -std=c++0x. So my questions are:
A. Isn't 1. supposed to force -std=c++11? (By the way, using the same Makevars yields the right C++ version, so there must be something specific to Windows.)
B. Does 3. repeat the -fopenmp already in 2.?
C. How can I check whether 5. has been taken into account? I am asking because the same package built on Mac is much faster than on Windows, even though the configurations are the same. I have also benchmarked the same code on Windows using Microsoft R Open against Mac, and Windows was faster in that case.
Thank you very much for your help.
Where to start?
First off, compilation and linking options are based on the union of R's Makeconf and your package's src/Makevars. You can add to the values; you cannot replace them.
Second, and related, which BLAS you get is a system setup issue. You cannot generally govern that from your package.
Third, plugins are for sourceCpp() and cppFunction(). In packages you make direct declarations, i.e. CXX_STD = CXX11.
Fourth, there are almost 1000 packages on CRAN using Rcpp. Sometimes it helps just to look at what some of these do. Many employ OpenMP.
Fifth, OpenMP is severely challenging on OS X thanks to Apple. I've forgotten what the Windows situation is. It just works on Linux.
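As a practical check for question C (hedged; assuming a reasonably recent R), you can ask R itself which C++11 flags it will use and which BLAS/LAPACK the running session is linked against:
R CMD config CXX11FLAGS        # the optimisation flags R appends for CXX_STD = CXX11
R CMD config CXX11STD          # the -std=... switch R will pass
Rscript -e 'sessionInfo()'     # recent R prints the BLAS/LAPACK libraries in use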
