OpenCL clCreateContextFromType function results in memory leaks

I ran valgrind on one of my open-source OpenCL codes (https://github.com/fangq/mmc), and it detected a lot of memory leaks in the OpenCL host code. Most of them pointed back to the line where I created the context object using clCreateContextFromType.
I double checked all my OpenCL variables, command queues, kernels and programs, and made sure that they are all properly released, but still, when testing on sample programs, every call to the mmc_run_cl() function bumps up memory by 300MB-400MB which is not released at return.
You can reproduce the valgrind report by running the below commands in a terminal:
git clone https://github.com/fangq/mmc.git
cd mmc/src
make clean
make all
cd ../examples/validation
valgrind --show-leak-kinds=all --leak-check=full ../../src/bin/mmc -f cube2.inp -G 1 -s cube2 -n 1e4 -b 0 -D TP -M G -F bin
assuming your system has gcc/git/libOpenCL and valgrind installed. Change the -G 1 input to a different number if you want to run it on other OpenCL devices (add -L to list them).
In the table below, I list the repeated count of each valgrind-detected leak on an NVIDIA GPU (TitanV) on a Linux box (Ubuntu 16.04) with the latest driver + CUDA 9.
Again, most leaks are associated with the clCreateContextFromType line, which makes me assume that some GPU memory is not released, but I did release all GPU resources at the end of the host code.
Do you notice anything that I missed in my host code? Your input is much appreciated.
counts | error message
------------------------------------------------------------------------------------
380 ==27828== by 0x402C77: main (mmc.c:67)
Code: entry point to the below errors
64 ==27828== by 0x41CF02: mcx_list_gpu (mmc_cl_utils.c:135)
Code: OCL_ASSERT((clGetPlatformIDs(0, NULL, &numPlatforms)));
4 ==27828== by 0x41D032: mcx_list_gpu (mmc_cl_utils.c:154)
Code: context=clCreateContextFromType(cps,devtype[j],NULL,NULL,&status);
58 ==27828== by 0x41DF8A: mmc_run_cl (mmc_cl_host.c:111)
Code: entry point to the below errors
438 ==27828== by 0x41E006: mmc_run_cl (mmc_cl_host.c:124)
Code: OCL_ASSERT(((mcxcontext=clCreateContextFromType(cprops,CL_DEVICE_TYPE_ALL,...));
13 ==27828== by 0x41E238: mmc_run_cl (mmc_cl_host.c:144)
Code: OCL_ASSERT(((mcxqueue[i]=clCreateCommandQueue(mcxcontext,devices[i],prop,&status),status)));
1 ==27828== by 0x41E7A6: mmc_run_cl (mmc_cl_host.c:224)
Code: OCL_ASSERT(((gprogress[0]=clCreateBufferNV(mcxcontext,CL_MEM_READ_WRITE, NV_PIN, ...);
1 ==27828== by 0x41E7F9: mmc_run_cl (mmc_cl_host.c:225)
Code: progress = (cl_uint *)clEnqueueMapBuffer(mcxqueue[0], gprogress[0], CL_TRUE, ...);
10 ==27828== by 0x41EDFA: mmc_run_cl (mmc_cl_host.c:290)
Code: status=clBuildProgram(mcxprogram, 0, NULL, opt, NULL, NULL);
7 ==27828== by 0x41F95C: mmc_run_cl (mmc_cl_host.c:417)
Code: OCL_ASSERT((clEnqueueReadBuffer(mcxqueue[devid],greporter[devid],CL_TRUE,0,...));
Update [04/11/2020]:
Reading @doqtor's comment, I ran the following test on 5 different devices: 2 NVIDIA GPUs, 2 AMD GPUs and 1 Intel CPU. What he said is correct - the memory leak does not happen with the Intel OpenCL library, and I found that the AMD OpenCL driver is fine too. The only problem is that the NVIDIA OpenCL library seems to leak on both GPUs I tested (Titan V and RTX2080).
My test results are below. Memory/CPU profiling was done with psrecord, introduced in this post.
I will open a new question with a bounty on how to reduce this memory leak with NVIDIA OpenCL. If you have any experience with this, please share; I will post the link below. Thanks.

I double checked all my OpenCL variables, command queues, kernels and
programs, and made sure that they are all properly released...
Well, I still found one (tiny) memory leak in the mmc code:
==15320== 8 bytes in 1 blocks are definitely lost in loss record 14 of 1,905
==15320== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15320== by 0x128D48: mmc_run_cl (mmc_cl_host.c:137)
==15320== by 0x11E71E: main (mmc.c:67)
The memory allocated for greporter isn't freed. So that's for you to fix.
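Assuming greporter is the plain malloc'ed host-side array from mmc_cl_host.c:137 (holding the per-device cl_mem handles), the fix is simply to free it once the cl_mem objects themselves have been released, along the lines of:

/* after clReleaseMemObject() has been called on each greporter[i] */
free(greporter);
greporter = NULL;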
The rest are potential memory leaks in the OpenCL library. They may or may not be real leaks; for example, the library may use custom memory allocators that valgrind does not recognize, or it may do some other tricks. There are a lot of threads about that:
clGetPlatformIDs Memory Leak
https://software.intel.com/en-us/forums/opencl/topic/753786
https://github.com/KhronosGroup/OpenCL-ICD-Loader/issues/13
OpenCL clGetPlatformIDs gives around 230 valgrind memcheck errors
In general you can't do much about those unless you want to dive into the library code and do something about it yourself.
I would suggest carefully suppressing the reported leaks that come from the library. The suppression file can be generated as described in the valgrind manual: https://valgrind.org/docs/manual/manual-core.html#manual-core.suppress
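For example, running valgrind with --gen-suppressions=all prints ready-to-use suppression entries that you can paste into a file and pass back with --suppressions=mmc.supp. An entry covering the NVIDIA OpenCL library would look roughly like this (the name and frames below are just a template; use whatever --gen-suppressions prints on your system):

{
   nvidia_opencl_leaks
   Memcheck:Leak
   ...
   obj:*libnvidia-opencl.so*
}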
... but still, when testing on sample programs, every call to the
mmc_run_cl() function bumps up memory by 300MB-400MB and won't release
at return
How did you check that? I haven't seen the memory growing suspiciously. I set -n 1000e4, which made it run for about 2 minutes, and the allocated memory stayed constant the whole time at ~0.6% of my RAM. Note that I didn't use NVIDIA CUDA but POCL on an Intel GPU and CPU, linked with the libOpenCL installed from the ocl-icd-libopencl1:amd64 package on Ubuntu 18.04. So you may try to give that a go and check if that changes anything.
======== Update ================================
I've re-run it as you described in the comment; after the first iteration the memory usage was 0.6%, after the 2nd iteration it increased to 0.9%, and the following iterations didn't increase memory usage any further. Valgrind also didn't report anything new beyond what I observed earlier. So I would suggest linking against a libOpenCL other than the nvidia-cuda one and retesting.
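As a quick sanity check of which OpenCL library the binary actually resolves at run time (i.e. whether you are testing the NVIDIA implementation or another ICD loader), something like this should tell you:

ldd ../../src/bin/mmc | grep -i opencl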

Related

Arduino Verify Issue storage

I am trying to practice a bit using FreeRTOS on my Arduino. I believe I installed the libraries correctly and executed the code. When I try to verify in the Arduino IDE, I get the following error. At first, I thought
I needed to update the IDE on my macOS and all the libraries to help with storage, but I am still getting the error.
"text section exceeds available space in boardSketch uses 46768 bytes (144%) of program storage space. Maximum is 32256 bytes.
Global variables use 1572 bytes (76%) of dynamic memory, leaving 476 bytes for local variables. Maximum is 2048 bytes.
Sketch too big; see https://support.arduino.cc/hc/en-us/articles/360013825179 for tips on reducing it.
Error compiling for board Arduino Uno."

Julia 1.1 with JLD HDF5 package and memory release in Windows

I'm using Julia 1.1 with JLD and HDF5 to save a file to disk, and I ran into a couple of questions about the memory usage.
Issue 1:
First, I defined a 4 GB matrix A.
A = zeros(ComplexF64,(243,243,4000));
When I type the following command and look at the Windows Task Manager:
A=nothing
It took several minutes for Julia to release that memory back to me. Most of the time (in Task Manager), Julia just doesn't release the memory at all, even though varinfo() instantly reports that A occupies 0 bytes:
varinfo()
name size summary
–––––––––––––––– ––––––––––– –––––––
A 0 bytes Nothing
Base Module
Core Module
InteractiveUtils 162.930 KiB Module
Main Module
ans 0 bytes Nothing
Issue 2:
Further, I tried to use JLD and HDF5 to save the file to disk. This time, the Task Manager told me that an extra 4GB of memory was used when running the save("test.jld", "A", A) command.
using JLD,HDF5
A = zeros(ComplexF64,(243,243,4000));
save("test.jld", "A", A)
Then, after I typed
A=nothing
Julia won't release the 8 GB of memory back to me.
Finding 3:
An interesting thing I found was that, if I retype the command
A = zeros(ComplexF64,(243,243,4000));
The Task Manager would tell me that the cached memory was released, and the total memory usage was again only 4GB.
Question 1:
What's going on with memory management in Julia? Is this just Windows misreporting, or something in Julia? How can I check Julia's memory usage instantly?
Question 2:
How can I tell Julia to release the memory instantly?
Question 3:
Is there a way to tell the JLD package not to use that extra 4GB of memory?
(Better yet, could someone tell me how to create A directly on disk without even creating it in memory? I know there is memory-mapped I/O in the JLD package. I have tried it, but it seemed to require creating matrix A in memory and saving it to disk first, before I could load the memory-mapped A again.)
This is a long question, so thanks ahead!
Julia uses a garbage collector to de-allocate memory. Usually the garbage collector does not run after every line of code but only when needed.
Try to force garbage collection by running the command:
GC.gc()
This releases memory held by unreferenced Julia objects. In this way you can check whether the memory actually has been released.
Side note: JLD used to be somewhat not-always-working (I do not know the current status). Hence your first consideration for non-cross-platform object persistence should always be the serialize function from the built-in Serialization package - check the documentation at https://docs.julialang.org/en/v1/stdlib/Serialization/index.html#Serialization.serialize

Why doesn't the Linux kernel see the cache sizes in the gem5 emulator in full system mode?

I want to play around with cache sizes in my gem5 simulator to see how it affects performance of programs, and possibly tune programs at runtime.
As a sanity check, I tried to verify that the command-line arguments I used were working, and so I tried the various methods proposed at: https://superuser.com/questions/55776/finding-l2-cache-size-in-linux/1298808#1298808
cat /sys/devices/system/cpu/cpu0/cache/index2/size
getconf LEVEL2_CACHE_SIZE
But I observed that:
the file /sys/devices/system/cpu/cpu0/cache/index2/size does not exist
getconf is empty
Why is that?
I am certain, however, that the caches are being used, since I've benchmarked simple programs, and the cycle counts increase when I decrease the cache sizes.
For example, my base command is:
M5_PATH='/data/git/linux-kernel-module-cheat/gem5/gem5-system' '/data/git/linux-kernel-module-cheat/gem5/gem5/build/ARM/gem5.opt' '/data/git/linux-kernel-module-cheat/gem5/gem5/configs/example/fs.py' --command-line='earlyprintk=pl011,0x1c090000 console=ttyAMA0 lpj=19988480 rw loglevel=8 mem=512MB root=/dev/sda nokaslr norandmaps printk.devkmsg=on printk.time=y' --disk-image='/data/git/linux-kernel-module-cheat/buildroot/output.arm-gem5~/images/rootfs.ext2' --dtb-file='/data/git/linux-kernel-module-cheat/gem5/gem5/system/arm/dt/armv7_gem5_v1_1cpu.dtb' --kernel='/data/git/linux-kernel-module-cheat/buildroot/output.arm-gem5~/build/linux-custom/vmlinux' --machine-type=VExpress_GEM5_V1 --num-cpus=1 --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024 --cpu-type=HPI
With those tiny caches, running the following:
m5 resetstats && dhrystone 10000 && m5 dumpstats
takes 175M cycles, and only 16M cycles if I use the exact same command but with huge caches of size 1024MB.
I observe a similar behavior for x86.
I'm using this testing infrastructure: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/05d8a324f74849f03404eb847f8da748e2e4502c#gem5-change-system-parameters which implies:
gem5 commit: fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4
Linux kernel v4.15 with configuration: https://github.com/cirosantilli/linux-kernel-module-cheat/blob/05d8a324f74849f03404eb847f8da748e2e4502c/kernel_config_arm-gem5
Related thread on the mailing list: http://gem5-users.gem5.narkive.com/4xVBlf3c/verify-cache-configuration
For comparison, QEMU v2.11.0 x86 did show the cache sizes, but not the ARM one.
Maybe for ARM we would need to modify the bootloaders to pass that information to the kernel? But I don't know how those things work very well:
https://github.com/gem5/gem5/blob/fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4/system/arm/simple_bootloader/simple.S
https://github.com/gem5/gem5/blob/fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4/system/arm/aarch64_bootloader/boot.S
I have been told that:
gem5 doesn't implement the cache size discovery registers.
The problem is that it is really hard to configure them correctly in the general case, and they might not even be able to represent the hierarchy configured in gem5.
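As a side note, getconf LEVEL2_CACHE_SIZE is just a thin wrapper around glibc's sysconf(3), so the same check can be done programmatically; on a system where the kernel does not expose the cache geometry (as in this gem5 setup), the calls return 0 or -1. A minimal sketch, assuming the glibc-specific _SC_LEVEL*_CACHE_SIZE extensions:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* glibc extensions: cache size in bytes, or 0/-1 when the kernel does not report it */
    printf("L1d: %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2:  %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    return 0;
}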

Why are OpenGL and CUDA contexts memory greedy?

I develop software which usually includes both OpenGL and Nvidia CUDA SDK. Recently, I also started to seek ways to optimize run-time memory footprint. I noticed the following (Debug and Release builds differ only by 4-7 Mb):
Application startup - Less than 1 Mb total
OpenGL 4.5 context creation ( + GLEW loader init) - 45 Mb total
CUDA 8.0 context (Driver API) creation - 114 Mb total.
If I create the OpenGL context in "headless" mode, the GL context uses 3 Mb less, which probably goes to the default framebuffer allocation. That makes sense, as the window size is 640x360.
So after the OpenGL and CUDA contexts are up, the process already consumes 114 Mb.
Now, I don't have deep knowledge of the OS-specific stuff that happens under the hood during GL and CUDA context creation, but 45 Mb for GL and 68 Mb for CUDA seem like a whole lot to me. I know that usually several megabytes go to system framebuffers and function pointers (probably the bulk of the allocations happens on the driver side). But hitting over 100 Mb with just "empty" contexts looks like too much.
I would like to know:
Why GL/CUDA context creation consumes such a considerable amount of memory?
Are there ways to optimize that?
The system setup under test:
Windows 10 64bit. NVIDIA GTX 960 GPU (Driver Version:388.31). 8 Gb RAM. Visual Studio 2015, 64bit C++ console project.
I measure memory consumption using Visual Studio built-in Diagnostic Tools -> Process Memory section.
UPDATE
I tried Process Explorer, as suggested by datenwolf. Here is a screenshot of what I got (my process at the bottom, marked in yellow):
I would appreciate some explanation of that info. I was always looking at "Private Bytes" in the "VS Diagnostic Tools" window. But here I also see "Working Set", "WS Private", etc. Which one correctly shows how much memory my process currently uses? 281,320K looks way too much, because, as I said above, the process does nothing at startup but create the CUDA and OpenGL contexts.
Partial answer: This is an OS-specific issue; on Linux, CUDA takes 9.3 MB.
I'm using CUDA (not OpenGL) on GNU/Linux:
CUDA version: 10.2.89
OS distribution: Devuan GNU/Linux Beowulf (~= Debian Buster without systemd)
Kernel: Linux 5.2.0
Processor: Intel x86_64
To check how much memory gets used by CUDA when creating a context, I ran the following C program (which also checks what happens after context destruction):
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <cuda.h>

static void print_allocation_stats(const char* s)
{
    printf("%s:\n", s);
    printf("--------------------------------------------------\n");
    malloc_stats();
    printf("--------------------------------------------------\n\n");
}

int main()
{
    print_allocation_stats("Initially");

    int status = cuInit(0);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After CUDA driver initialization");

    int device_id = 0;
    unsigned flags = 0;
    CUcontext context_id;
    status = cuCtxCreate(&context_id, flags, device_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context creation");

    status = cuCtxDestroy(context_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context destruction");

    return EXIT_SUCCESS;
}
(Note that malloc_stats() is a glibc-specific function, not part of the standard library.)
Summarizing the results and snipping irrelevant parts:
Point in program                 | Total bytes | In-use    | Max MMAP Regions | Max MMAP bytes
------------------------------------------------------------------------------------------------
Initially                        |      135168 |      1632 |                0 |              0
After CUDA driver initialization |      552960 |    439120 |                2 |         307200
After context creation           |     9314304 |   6858208 |                8 |        6643712
After context destruction        |     7016448 |    580688 |                8 |        6643712
So CUDA starts with 0.5 MB and after allocating a context takes up 9.3 MB (going back down to 7.0 MB on destroying the context). 9 MB is still a lot of memory for not having done anything; but - maybe some of it is all-zeros, or uninitialized, or copy-on-write, in which case it doesn't really take up that much memory.
It's possible that memory use improved dramatically over the two years between the driver release with CUDA 8 and with CUDA 10, but I doubt it. So - it looks like your problem is Windows specific.
Also, I should mention I did not create an OpenGL context - which is another part of OP's question; so I haven't estimated how much memory that takes. OP brings up the question of whether the sum is greater than its parts, i.e. whether a CUDA context would take more memory if an OpenGL context existed as well; I believe this should not be the case, but readers are welcome to try and report...

What causes this error: "address already known to kernel for another [busy] synchronizer type"?

I have a customer who is getting their system log flooded with thousands of copies of this message:
Jul 25 11:21:33 athayer-mbp13 kernel[0]: PSYNCH: pid[52893]: address already known to kernel for another [busy] synchronizer type
The culprit is my app, but I can’t reproduce the problem and don’t have much of a clue to its cause. My app does disk searching, and this error happens about 15 hours into the life of the process. There is no excessive memory usage or file descriptor leakage. The app continues to operate normally, it’s just that these messages cause the system log to blow up to gigabyte proportions and fill up the boot disk.
I found the Darwin kernel code where the message is printed, but it’s only a clue, it doesn’t show the smoking gun:
http://opensource.apple.com//source/xnu/xnu-1699.32.7/bsd/kern/pthread_support.c
FAILEDUSERTEST("address already known to kernel for another (busy) synchronizer type\n");
It’s in this function:
/* find kernel waitqueue, if not present create one. Grants a reference */
int
ksyn_wqfind(user_addr_t mutex, uint32_t mgen, uint32_t ugen, uint32_t rw_wc, uint64_t tid, int flags, int wqtype, ksyn_wait_queue_t * kwqp)
Can anyone provide any insight into what’s going on?
Here’s the profile for the machine:
Model Name: MacBook Pro
Model Identifier: MacBookPro12,1
Processor Name: Intel Core i5
Processor Speed: 2.7 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 3 MB
Memory: 8 GB
Boot ROM Version: MBP121.0167.B16
SMC Version (system): 2.28f7
Hardware UUID: 9205D058-90BF-541E-8E61-E75259ABC11F
System Software Overview:
System Version: OS X 10.11.4 (15E65)
Kernel Version: Darwin 15.4.0
Boot Volume: Macintosh HD
Boot Mode: Normal
Computer Name: athayer-mbp13
User Name: System Administrator (root)
Secure Virtual Memory: Enabled
system_integrity: integrity_enabled
Time since boot: 9 days 18:55
Possible Explanation
It's possible that you're being affected by an old kernel bug. If a pthread condition variable (the main component of a standard pthread_mutex family object) is allocated, but never waited on, there is a situation in which its object is never removed from a pthreads-internal registry on OSX.
If that happens, and if another mutex is later allocated that happens to end up in the same space in memory, and if that mutex is waited on, this error can occur, since the new mutex's ID will not match the one already present in its space. This is distinct from a reallocation issue where garbled/meaningless info is found instead of a valid ID.
Workaround
The workaround is to ensure that you are calling a wait function on all mutexes/condvars you create. Even a nanosecond wait will trigger "correct" destruction when it completes on a no-longer-used mutex. An example of the fix by the Chromium devs is linked below.
For example, you could wait one nanosecond/tick on a lock thus:
struct timespec time = { .tv_sec = 0, .tv_nsec = 1 };
pthread_cond_timedwait_relative_np(
    &some_condition_handle,
    &some_lock_handle,
    &time
);
Confounding Factors
The kernel bug may not be the real issue. There are a lot of confounding factors here:
The kernel source hasn't been published for 10.10 or 10.11, so the code being called that generates that error may not be the code that you found online.
As a result of that, the kernel bug I mentioned may not still exist, or may not be reachable in the same way.
The error line you published has square brackets ([]) around the word "busy", but the source you found uses parentheses (()). The places in code that print out the two different messages are distinct from each other, so the problem lines might not be the ones you pointed out in your question.
Relevant Links
Article by the first (only?) person who has diagnosed this issue: http://rayne3d.com/blog/02-27-2014-rayne-weekly-devblog-4
The problem gets exhibited in the pthread source (or it was, in pthread 105.1.4), visible at this link (search in the page for 13782056): https://opensource.apple.com/source/libpthread/libpthread-105.1.4/src/pthread_cond.c
An example fix like the workaround listed above was made by the Chromium team when they were affected by a similar (the same?) issue: https://codereview.chromium.org/1323293005
The original Apple Developer Forum link appears to be defunct, though I might just be unable to access it: https://devforums.apple.com/thread/220316?tstart=0
