I develop software that usually uses both OpenGL and the NVIDIA CUDA SDK. Recently I started looking for ways to reduce the run-time memory footprint. I noticed the following (Debug and Release builds differ by only 4-7 MB):
Application startup - less than 1 MB total
OpenGL 4.5 context creation (+ GLEW loader init) - 45 MB total
CUDA 8.0 context (Driver API) creation - 114 MB total
If I create the OpenGL context in "headless" mode, the GL context uses 3 MB less, which probably corresponds to the default framebuffer allocation. That makes sense, as the window size is 640x360.
So once the OpenGL and CUDA contexts are up, the process already consumes 114 MB.
Now, I don't have deep knowledge of the OS-specific work that happens under the hood during GL and CUDA context creation, but 45 MB for GL and about 69 MB for CUDA seems like a whole lot to me. I know that several megabytes usually go to system framebuffers and function pointers (and probably the bulk of the allocations happens on the driver side). But hitting over 100 MB with just "empty" contexts looks like too much.
I would like to know:
Why does GL/CUDA context creation consume such a considerable amount of memory?
Are there ways to reduce it?
The system setup under test:
Windows 10 64-bit. NVIDIA GTX 960 GPU (driver version 388.31). 8 GB RAM. Visual Studio 2015, 64-bit C++ console project.
I measure memory consumption using the Visual Studio built-in Diagnostic Tools -> Process Memory section.
UPDATE
I tried Process Explorer, as suggested by datenwolf. Here is a screenshot of what I got (my process is at the bottom, marked in yellow):
I would appreciate some explanation of that info. I was always looking at "Private Bytes" in the "VS Diagnostic Tools" window, but here I also see "Working Set", "WS Private", etc. Which one correctly shows how much memory my process currently uses? 281,320K looks like way too much because, as I said above, the process does nothing at startup but create the CUDA and OpenGL contexts.
Partial answer: This is an OS-specific issue; on Linux, CUDA takes 9.3 MB.
I'm using CUDA (not OpenGL) on GNU/Linux:
CUDA version: 10.2.89
OS distribution: Devuan GNU/Linux Beowulf (~= Debian Buster without systemd)
Kernel: Linux 5.2.0
Processor: Intel x86_64
To check how much memory gets used by CUDA when creating a context, I ran the following C program (which also checks what happens after context destruction):
#include <stdio.h>
#include <cuda.h>
#include <malloc.h>
#include <stdlib.h>
/* Print glibc allocator statistics under a heading. */
static void print_allocation_stats(const char* s)
{
    printf("%s:\n", s);
    printf("--------------------------------------------------\n");
    malloc_stats();
    printf("--------------------------------------------------\n\n");
}

int main()
{
    print_allocation_stats("Initially");

    CUresult status = cuInit(0);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After CUDA driver initialization");

    int device_id = 0;
    unsigned flags = 0;
    CUcontext context_id;
    status = cuCtxCreate(&context_id, flags, device_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context creation");

    status = cuCtxDestroy(context_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context destruction");

    return EXIT_SUCCESS;
}
(Note that this uses malloc_stats(), a glibc-specific function that is not in the C standard library.)
Summarizing the results and snipping irrelevant parts:
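Point in program                 | Total bytes | In-use  | Max MMAP regions | Max MMAP bytes
---------------------------------------------------------------------------------------------
Initially                        | 135168      | 1632    | 0                | 0
After CUDA driver initialization | 552960      | 439120  | 2                | 307200
After context creation           | 9314304     | 6858208 | 8                | 6643712
After context destruction        | 7016448     | 580688  | 8                | 6643712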
So CUDA starts with 0.5 MB and, after creating a context, takes up 9.3 MB (going back down to 7.0 MB when the context is destroyed). 9 MB is still a lot of memory for not having done anything, but maybe some of it is all-zeros, uninitialized, or copy-on-write, in which case it doesn't really take up that much physical memory.
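One way to cross-check this is to look at resident memory rather than heap statistics, since malloc_stats() only reports what glibc's allocator handed out. A minimal, Linux-only sketch (the helper name resident_kib is made up here) reads the VmRSS field from /proc/self/status. Calling it before and after cuCtxCreate should give a rough idea of how much of the allocation is actually backed by physical pages.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the resident set size of the calling
 * process in KiB, or -1 if /proc/self/status cannot be read or has no
 * VmRSS line. Linux-only; not part of the original measurement. */
long resident_kib(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) {
        return -1;
    }

    char line[256];
    long kib = -1;
    while (fgets(line, sizeof line, f)) {
        /* The line looks like "VmRSS:     1234 kB". */
        if (strncmp(line, "VmRSS:", 6) == 0) {
            if (sscanf(line + 6, "%ld", &kib) != 1) {
                kib = -1;
            }
            break;
        }
    }
    fclose(f);
    return kib;
}
```

Printing resident_kib() next to each print_allocation_stats() call would put both views side by side.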
It's possible that memory use improved dramatically over the two years between the driver release with CUDA 8 and the one with CUDA 10, but I doubt it. So it looks like your problem is Windows-specific.
Also, I should mention that I did not create an OpenGL context, which is another part of OP's question, so I haven't estimated how much memory that takes. OP brings up the question of whether the whole is greater than the sum of its parts, i.e. whether a CUDA context would take more memory if an OpenGL context existed as well. I believe this should not be the case, but readers are welcome to try and report...
Related
I ran valgrind on one of my open-source OpenCL codes (https://github.com/fangq/mmc), and it detected a lot of memory leaks in the OpenCL host code. Most of those pointed back to the line where I created the context object using clCreateContextFromType.
I double-checked all my OpenCL variables, command queues, kernels and programs, and made sure that they are all properly released, but still, when testing on sample programs, every call to the mmc_run_cl() function bumps up memory by 300MB-400MB and won't release it on return.
You can reproduce the valgrind report by running the commands below in a terminal:
git clone https://github.com/fangq/mmc.git
cd mmc/src
make clean
make all
cd ../examples/validation
valgrind --show-leak-kinds=all --leak-check=full ../../src/bin/mmc -f cube2.inp -G 1 -s cube2 -n 1e4 -b 0 -D TP -M G -F bin
This assumes your system has gcc, git, libOpenCL and valgrind installed. Change the -G 1 input to a different number if you want to run it on other OpenCL devices (add -L to list them).
In the table below, I list how many times each valgrind-detected leak is repeated on an NVIDIA GPU (Titan V) on a Linux box (Ubuntu 16.04) with the latest driver + CUDA 9.
Again, most leaks are associated with the clCreateContextFromType line, which I assume means some GPU memory is not released, even though I did release all GPU resources at the end of the host code.
Do you notice anything that I missed in my host code? Your input is much appreciated.
counts | error message
------------------------------------------------------------------------------------
380 ==27828== by 0x402C77: main (mmc.c:67)
Code: entry point to the below errors
64 ==27828== by 0x41CF02: mcx_list_gpu (mmc_cl_utils.c:135)
Code: OCL_ASSERT((clGetPlatformIDs(0, NULL, &numPlatforms)));
4 ==27828== by 0x41D032: mcx_list_gpu (mmc_cl_utils.c:154)
Code: context=clCreateContextFromType(cps,devtype[j],NULL,NULL,&status);
58 ==27828== by 0x41DF8A: mmc_run_cl (mmc_cl_host.c:111)
Code: entry point to the below errors
438 ==27828== by 0x41E006: mmc_run_cl (mmc_cl_host.c:124)
Code: OCL_ASSERT(((mcxcontext=clCreateContextFromType(cprops,CL_DEVICE_TYPE_ALL,...));
13 ==27828== by 0x41E238: mmc_run_cl (mmc_cl_host.c:144)
Code: OCL_ASSERT(((mcxqueue[i]=clCreateCommandQueue(mcxcontext,devices[i],prop,&status),status)));
1 ==27828== by 0x41E7A6: mmc_run_cl (mmc_cl_host.c:224)
Code: OCL_ASSERT(((gprogress[0]=clCreateBufferNV(mcxcontext,CL_MEM_READ_WRITE, NV_PIN, ...);
1 ==27828== by 0x41E7F9: mmc_run_cl (mmc_cl_host.c:225)
Code: progress = (cl_uint *)clEnqueueMapBuffer(mcxqueue[0], gprogress[0], CL_TRUE, ...);
10 ==27828== by 0x41EDFA: mmc_run_cl (mmc_cl_host.c:290)
Code: status=clBuildProgram(mcxprogram, 0, NULL, opt, NULL, NULL);
7 ==27828== by 0x41F95C: mmc_run_cl (mmc_cl_host.c:417)
Code: OCL_ASSERT((clEnqueueReadBuffer(mcxqueue[devid],greporter[devid],CL_TRUE,0,...));
Update [04/11/2020]:
After reading @doqtor's comment, I ran the following test on 5 different devices: 2 NVIDIA GPUs, 2 AMD GPUs and 1 Intel CPU. What he said was correct: the memory leak does not happen with the Intel OpenCL library, and I also found that the AMD OpenCL driver is fine too. The only problem is that the NVIDIA OpenCL library seems to leak on both GPUs I tested (Titan V and RTX 2080).
My test results are below. Memory/CPU profiling was done using psrecord, introduced in this post.
I will open a new question and bounty on how to reduce this memory leak with NVIDIA OpenCL. If you have any experience with this, please share. I will post the link below. Thanks.
I double checked all my OpenCL variables, command queues, kernels and
programs, and made sure that they are all properly released...
Well I still found one (tiny) memory leak in mmc code:
==15320== 8 bytes in 1 blocks are definitely lost in loss record 14 of 1,905
==15320== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15320== by 0x128D48: mmc_run_cl (mmc_cl_host.c:137)
==15320== by 0x11E71E: main (mmc.c:67)
Memory allocated for greporter isn't freed. So that one is for you to fix.
The rest are potential memory leaks in the OpenCL library. They may or may not be real memory leaks; for example, the library may use custom memory allocators that valgrind does not recognize, or it may do some other tricks. There are a lot of threads about that:
clGetPlatformIDs Memory Leak
https://software.intel.com/en-us/forums/opencl/topic/753786
https://github.com/KhronosGroup/OpenCL-ICD-Loader/issues/13
OpenCL clGetPlatformIDs gives around 230 valgrind memcheck errors
In general, you can't do much about that unless you want to dive into the library code and fix it there.
I would suggest carefully suppressing those reports that come from the library. The suppression file can be generated as described in the valgrind manual: https://valgrind.org/docs/manual/manual-core.html#manual-core.suppress
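As a rough illustration of that suggestion, a suppression entry for leaks whose call stack passes through the vendor library might look like the fragment below. The entry name and the obj: pattern (libnvidia-opencl.so) are assumptions for illustration only; it is safer to let valgrind emit real entries with --gen-suppressions=all and trim those down.

```
{
   nvidia_opencl_library_leak
   Memcheck:Leak
   match-leak-kinds: all
   ...
   obj:*/libnvidia-opencl.so*
}
```

Saved to a file (say nvidia-ocl.supp, a hypothetical name), it would be passed to valgrind with --suppressions=nvidia-ocl.supp in addition to the options used in the question.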
... but still, when testing on sample programs, every call to the
mmc_run_cl() function bumps up memory by 300MB-400MB and won't release
at return
How did you check that? I haven't seen memory growing suspiciously. I set -n 1000e4, which made it run for about 2 minutes, and the allocated memory stayed at ~0.6% of my RAM the whole time. Note that I didn't use NVIDIA CUDA but POCL on an Intel GPU and CPU, linked against the libOpenCL installed from the ocl-icd-libopencl1:amd64 package on Ubuntu 18.04. You may want to give that a go and check whether it changes anything.
======== Update ================================
I've re-run it as you described in the comment. After the first iteration the memory usage was 0.6%, after the 2nd iteration it increased to 0.9%, and the following iterations didn't increase memory usage any further. Valgrind also didn't report anything new beyond what I observed earlier. So I would suggest linking against a libOpenCL other than the nvidia-cuda one and retesting.
I have an x86 device and a kernel module that tries to allocate DMA memory. It has a parameter called dmasize that controls the size of the allocated memory.
I've noticed that the allocation succeeds when dmasize=2M but not when it is larger, even at boot time.
I heard there was a limitation imposed by CONSISTENT_DMA_SIZE, but looking at lxr, I can't find it for arch x86 in kernel 3.2.
Not sure if it is relevant, but this is a 32-bit machine with 8GB of RAM and a PAE-enabled kernel.
This is the call to dma_alloc_coherent:
dma_addr_t dma_handle;
if (!(_dma_vbase = dma_alloc_coherent(0, alloc_size, &dma_handle, GFP_KERNEL)) || !dma_handle) {
gprintk("_alloc_mpool: Kernel failed to allocate the memory pool of size 0x%lx\n", (unsigned long)alloc_size);
return;
}
Appreciate anyone who can help with this.
Just in case anyone comes across this, the answer is as follows:
The config flag CONFIG_FORCE_MAX_ZONEORDER, which defaults to 11 on most architectures, is the cause of this limitation.
Increasing it to 12 (and recompiling the kernel) fixes the problem.
I suspect using CMA would also work, but since my kernel doesn't support it, I cannot say for sure.
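The effect of the zone order can be sketched with a little arithmetic. This assumes the convention of kernels from that era, where the largest buddy-allocator block has order (max zone order - 1), and a 4 KiB page size; the function name is made up for illustration and is not a kernel API.

```c
/* Hypothetical helper: largest physically contiguous block, in bytes,
 * that the page allocator can return for a given zone order setting.
 * Assumes the highest usable order is max_zone_order - 1. */
unsigned long long max_contiguous_bytes(unsigned max_zone_order,
                                        unsigned long page_size)
{
    /* A block of order k spans 2^k pages. */
    return (unsigned long long)page_size << (max_zone_order - 1);
}
```

Under those assumptions, max_contiguous_bytes(11, 4096) gives 4 MiB and max_contiguous_bytes(12, 4096) gives 8 MiB, so raising the option by one doubles the largest request that can succeed. Requests are also rounded up to a power-of-two number of pages, so anything slightly over a power of two needs the next order up.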
I'm using CUDA 5.5 under Windows, with VS2010, Nsight 3.1 and the bundled Visual Profiler.
I have a toy kernel which only does stores, and I see different data from Nsight and the Visual Profiler. Which should I trust, and why do I get different views?
Nsight says 4.21MB of stores and the Visual Profiler says 71402 transactions, which represents 8.9MB (assuming all of them are 128B). Consequently, Nsight says the bandwidth is 277GB/s and the Visual Profiler says 126.69GB/s.
The Nsight data looks closer to reality to me, since my dataset is 1024x1024.
EDIT
I have deleted a lot of bad assumptions from my original question. I was thinking too much in terms of CPUs and cache coherence.
Access pattern:
Each thread performs 4 consecutive 1-byte stores, like this (dst is char*):
for (int i = 0; i < 4; i++) {
dst[offset+i] = 0;
}
There is a difference between Device memory and Global memory. In the programming guide, it says that device memory includes "global, local, shared, constant, or texture memory" (see 5.3.2).
In your first picture, Global loads and stores should be in the first table named L1/Shared Memory (which is not visible in your capture).
I'm having trouble with OpenCL-GL shared memory.
I have an application that works on both Linux and Windows. The CL-GL sharing works on Linux, but not on Windows.
The Windows driver says that it supports sharing, and the examples from AMD work, so it should work. My code for creating the context on Windows is:
cl_context_properties properties[] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platform_(),
CL_WGL_HDC_KHR, (intptr_t) wglGetCurrentDC(),
CL_GL_CONTEXT_KHR, (intptr_t) wglGetCurrentContext(),
0
};
platform_.getDevices(CL_DEVICE_TYPE_GPU, &devices_);
context_ = cl::Context(devices_, properties, &CL::cl_error_callback, nullptr, &err);
err = clGetGLContextInfoKHR(properties, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR, sizeof(device_id), &device_id, NULL);
context_device_ = cl::Device(device_id);
queue_ = cl::CommandQueue(context_, context_device_, 0, &err);
My problem is that the CL and GL memory in a shared buffer is not the same. I print the contents out (by memory mapping) and notice that they differ. Changing the data works in both CL and GL, but each change only affects its own copy, not both (that is, both buffers seem intact, but not shared).
Also, clGetGLObjectInfo on the CL buffer returns the correct GL buffer.
Update: I have found that if I create the OpenCL context on the CPU, it works. This seems weird, as I'm not using integrated graphics, and I don't believe the CPU is handling OpenGL. I'm using SDL to create the window; could that have something to do with this?
I have now confirmed that the OpenGL context is running on the GPU, so the problem lies elsewhere.
Update 2: OK, so this is weird. I tried again today, and suddenly it works. As far as I know, I didn't install any new drivers before I shut down the computer yesterday, so I don't know what could have brought this about.
Update 3: Right, I noticed that changing the number of particles caused this to work. When I allocate so many particles that the shared buffer is slightly above one MB, it suddenly starts to work.
I solved the problem.
The OpenGL buffer object must be created after the OpenCL context has been created.
If it is created before, the OpenGL data can't be shared.
I use a Radeon HD 5670 with ATI Catalyst 12.10.
It may be an ATI driver problem, because the NVIDIA Computing SDK samples don't depend on the order.
My program has a custom allocator which gets memory from the OS using mmap(MAP_ANON | MAP_PRIVATE). When it no longer needs memory, the allocator calls either munmap or madvise(MADV_FREE). MADV_FREE keeps the mapping around, but tells the OS that it can throw away the physical pages associated with the mapping.
Calling MADV_FREE on pages you're going to need again eventually is much faster than calling munmap and later calling mmap again.
This almost works perfectly for me. The only problem is that, on MacOS, MADV_FREE is very lazy about getting rid of the pages I've asked it to free. In fact, it only gets rid of them when there's memory pressure from another application. Until it gets rid of the pages I've freed, MacOS reports that my program is still using that memory; in the Activity Monitor, its "Real Memory" column doesn't reflect the freed memory.
This makes it difficult for me to measure how much memory my program is actually using. (This difficulty in measuring RSS is keeping us from landing the custom allocator on 10.5.)
I could allocate a whole bunch of memory to force the OS to free up these pages, but in addition to taking a long time, that could have other side-effects, such as causing parts of my program to be paged out to disk.
On a lark, I tried the purge command, but that has no effect.
How can I force MacOS to clean out these MADV_FREE'd pages? Or, how can I ask MacOS how many MADV_FREE'd pages my process has in memory?
Here's a test program, if it helps. The Activity Monitor's "Real Memory" column shows 512MB after the program goes to sleep. On my Linux box, top shows 256MB of RSS, as desired.
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#define SIZE (512 * 1024 * 1024)
// We use MADV_FREE on Mac and MADV_DONTNEED on Linux.
#ifndef MADV_FREE
#define MADV_FREE MADV_DONTNEED
#endif
int main()
{
char *x = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
// Touch each page we mmap'ed so it gets a physical page.
int i;
for (i = 0; i < SIZE; i += 1024) {
x[i] = i;
}
madvise(x, SIZE / 2, MADV_FREE);
fprintf(stderr, "Sleeping. Now check my RSS. Hopefully it's %dMB.\n", SIZE / (2 * 1024 * 1024));
sleep(1024);
return 0;
}
mprotect(addr, length, PROT_NONE);
mprotect(addr, length, PROT_READ | PROT_WRITE);
Note that, as you say, madvise is lazier, and that is probably better for performance (just in case anyone is tempted to use this for performance rather than measurement).
Use MADV_FREE_REUSABLE on macOS. According to Apple's magazine_malloc implementation:
On OS X we use MADV_FREE_REUSABLE, which signals the kernel to remove the given pages from the memory statistics for our process. However, on returning that memory to use we have to signal that it has been reused.
https://opensource.apple.com/source/libmalloc/libmalloc-53.1.1/src/magazine_malloc.c.auto.html
Chromium, for example, also uses it:
MADV_FREE_REUSABLE is similar to MADV_FREE, but also marks the pages with the reusable bit, which allows both Activity Monitor and memory-infra to correctly track the pages.
https://github.com/chromium/chromium/blob/master/base/memory/discardable_shared_memory.cc#L377
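A minimal sketch of how an allocator might wrap these two hints is shown below. The function names are made up for illustration. MADV_FREE_REUSE is the matching "reused" signal mentioned in the libmalloc comment, and the fallback defines are an assumption so that the code still compiles on systems that lack the macOS-specific constants.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes madvise() on glibc in strict modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Assumed fallbacks for non-macOS systems, only so the sketch builds. */
#ifndef MADV_FREE_REUSABLE
#define MADV_FREE_REUSABLE MADV_DONTNEED
#endif
#ifndef MADV_FREE_REUSE
#define MADV_FREE_REUSE MADV_NORMAL
#endif

/* Hypothetical wrapper: tell the kernel the pages may be discarded and
 * should no longer be counted against this process. */
int release_pages(void *addr, size_t len)
{
    return madvise(addr, len, MADV_FREE_REUSABLE);
}

/* Hypothetical wrapper: tell the kernel the range is in use again.
 * Call this before handing the pages back to the program. */
int reuse_pages(void *addr, size_t len)
{
    return madvise(addr, len, MADV_FREE_REUSE);
}
```

In the test program from the question, this would mean replacing the madvise(x, SIZE / 2, MADV_FREE) call with release_pages(x, SIZE / 2), and calling reuse_pages on the same range before touching it again.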
I've looked and looked, and I don't think this is possible. :\
We're solving the problem by adding code to the allocator which explicitly decommits MADV_FREE'd pages when we ask it to.