What is the best way to check for successful memory allocation when using new inside a CUDA kernel? Is there anything similar to (nothrow)? If there isn't, is there a way to continue executing the kernel even if the allocation fails?
Thanks!
I don't think that new is officially supported on the device side. Moreover, to my knowledge, there is no support for exceptions on the device side, so specifiers like nothrow have no effect.
What you can do in the kernel is call malloc. On failure it simply returns NULL, which you can check for as usual.
Do note that:
Device-side malloc is supported only on devices of compute capability 2.0 (Fermi) and higher.
By default you have only 8MB of heap memory. If you want more, you need to raise the limit with cudaDeviceSetLimit (cudaLimitMallocHeapSize) before launching any kernel that allocates.
Further reading: CUDA C Programming Guide, v.5.0, chapter B.17 - Dynamic Global Memory Allocation
Update: Tests have shown that new does seem to be supported on the device and seems to work the same way as malloc, i.e. it returns NULL on failure.
Related
This code fills some GPU memory and doesn't let it go:
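import torch

def checkpoint_mem(model_name):
    # Load the checkpoint, then drop the only reference to it
    checkpoint = torch.load(model_name)
    del checkpoint
    # Ask the caching allocator to release its unused cached blocks
    torch.cuda.empty_cache()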
Printing memory with the following code:
print(torch.cuda.memory_reserved(0))
print(torch.cuda.memory_allocated(0))
shows BEFORE running checkpoint_mem:
0
0
and AFTER:
121634816
97332224
This is with torch.__version__ 1.11.0+cu113 on Google colab.
Does torch.load leak memory? How can I get the GPU memory completely cleared?
It probably doesn't, though it depends on what you call a memory leak. All memory should be freed once the program ends. Python has a garbage collector, so memory may not be released immediately at your del or when the object leaves scope, the way it would be in C++ or similar languages with RAII.
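The timing point can be illustrated in plain Python, with no PyTorch involved. This is only a sketch: the Node class is hypothetical, and the cyclic collector is disabled temporarily so that the result is deterministic on CPython. An object caught in a reference cycle survives del and is reclaimed only when the garbage collector runs.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for any object that ends up in a reference cycle."""


gc.disable()                 # keep the cyclic collector from running on its own
node = Node()
node.self_ref = node         # the object now refers to itself (a cycle)
probe = weakref.ref(node)    # lets us observe the object without keeping it alive

del node                     # removes the name, but the cycle keeps the object alive
alive_before = probe() is not None

gc.collect()                 # the cyclic collector finds and frees the cycle
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after)
```

Objects that are not part of a cycle are normally freed by reference counting as soon as the last reference disappears, so this delay applies only to cyclic garbage.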
del
del is called by Python and only removes the reference (same as when the object goes out of scope in your function).
torch.nn.Module does not implement __del__, hence its reference is simply removed.
All of the elements within torch.nn.Module have their references removed recursively (so for each CUDA torch.Tensor instance their __del__ is called).
__del__ on each tensor is a call to release its memory.
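A minimal sketch of the difference between del and __del__, assuming CPython's reference counting. The Tracked class and its log list are hypothetical stand-ins for an object that owns a resource; this is not PyTorch's tensor implementation.

```python
class Tracked:
    """Hypothetical resource owner that records when it is finalized."""

    def __init__(self, log):
        self.log = log

    def __del__(self):
        # Runs when the last reference to the object goes away
        self.log.append("released")


log = []
first = Tracked(log)
second = first                 # a second reference to the same object

del first                      # removes one name; the object is still alive
after_first_del = list(log)

del second                     # last reference gone; CPython calls __del__ now
after_second_del = list(log)

print(after_first_del, after_second_del)
```

So del by itself frees nothing while any other reference to the object remains, which is worth checking for when memory appears to stay allocated after a del.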
More about __del__
Caching allocator
Another thing: the caching allocator holds on to part of the memory so that it doesn't have to compete with other applications for CUDA memory when you need it again.
Also, I assume PyTorch initializes CUDA lazily, which is why you see 0 bytes used at the very beginning, but AFAIK PyTorch itself reserves some CUDA memory during startup.
The short story is given here, longer one here in case you didn’t see it already.
Possible experiments
You may try to run time.sleep(5) after your function and measure afterwards.
You can get snapshot of the allocator state via torch.cuda.memory_snapshot to get more info about allocator’s reserved memory and inner workings.
You might set the environment variable PYTORCH_NO_CUDA_MEMORY_CACHING=1 and see whether anything changes.
Disclaimer
Not a CUDA expert by any means, so someone with more insight could probably expand (and/or correct) my current understanding as I am sure way more things happen under the hood.
It is not possible. See the following issue for the same question and the response from a PyTorch developer:
https://github.com/pytorch/pytorch/issues/37664
I am writing a custom Linux driver that needs to DMA memory between multiple PCIe devices. I have the following situation:
I'm using dma_alloc_coherent to allocate memory for DeviceA.
I then use DeviceA to fill the memory buffer.
Everything is fine so far, but at this point I would like to DMA the memory to DeviceB and I'm not sure of the proper way of doing it.
For now I am calling dma_map_single for DeviceB using the address returned from dma_alloc_coherent called on DeviceA. This seems to work fine on x86_64, but it feels like I'm breaking the rules because:
1. dma_map_single is supposed to be called with memory allocated from kmalloc ("and friends"). Is it a problem to call it with an address returned from another device's dma_alloc_coherent call?
2. If #1 is "ok", then I'm still not sure whether it is necessary to call the dma_sync_* functions which are needed for dma_map_single memory. Since the memory was originally allocated from dma_alloc_coherent, it should be uncached memory, so I believe the answer is "dma_sync_* calls are not necessary", but I am not sure.
I'm worried that I'm just getting lucky having this work and that a future kernel update will break me, since it is unclear whether I'm following the API rules correctly.
My code eventually will have to run on ARM and PPC too, so I need to make sure I'm doing things in a platform independent manner instead of getting by with some x86_64 architecture hack.
I'm using this as a reference:
https://www.kernel.org/doc/html/latest/core-api/dma-api.html
dma_alloc_coherent() acts similarly to __get_free_pages(), but takes a size rather than a page order, so I would guess there is no issue here.
First, call dma_mapping_error() after dma_map_single() to catch any platform-specific mapping failure. The dma_sync_*() helpers are used by streaming DMA operations to keep the device and the CPU in sync. At a minimum, dma_sync_single_for_cpu() is required, because a buffer modified by the device must be synced before the CPU reads it.
I am having some issues with my virtual HBA driver on Windows Server 2016. I ran the HLK crashdump support test. 3 times out of 10 the test passed. In the failing runs, the crash dump hangs at 0% while taking a complete dump, a kernel dump, or a minidump.
By kernel debugging my code, I found that the call to ExAllocatePoolWithTag() for buffer allocation never actually returns.
Below is the statement which never returns.
pDeviceExtension->pcmdbuf = (struct mycmdrsp *)ExAllocatePoolWithTag(NonPagedPoolCacheAligned, pcmdqSignalSize, ((ULONG)'TA1'));
I searched the web for this. However, all of the pages I found deal with this function returning NULL, whereas in my case it never returns at all.
Any help on how to move forward would be highly appreciated.
Thanks in advance.
You can't allocate memory in crash dump mode. You're running at HIGH_LEVEL with interrupts disabled and so you're calling this API at the wrong IRQL.
The typical solution for a hardware adapter is to set the RequestedDumpBufferSize in the PORT_CONFIGURATION_INFORMATION structure during the normal HwFindAdapter call. Then when you're called again in crash dump mode you use the CrashDumpRegion field to get your dump buffer allocation. You then need to write your own "crash dump mode only" allocator to allocate buffers out of this memory region.
It's a huge pain, especially given that it's difficult/impossible to know how much memory you're ultimately going to need. I usually calculate some minimal configuration overhead (i.e. 1 channel, 8 I/O requests at a time, etc.) and then add in a registry configurable slush. The only benefit is that the environment is stripped down so you don't need to be in your all singing, all dancing configuration.
I am working on a legacy code base into which a memory leak was introduced a long time ago.
I am using
_CrtSetDbgFlag ( _CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF );
which implicitly calls _CrtDumpMemoryLeaks() on termination or exit from anywhere in the code. I am aware that if _CrtDumpMemoryLeaks() is called explicitly while global objects still exist, their allocations are reported as leaks. That is why I am not calling _CrtDumpMemoryLeaks() directly.
I get memory leak reports, but since I am using "new" throughout the code, the exact line number and file name are not shown (in spite of declaring #define _CRTDBG_MAP_ALLOC). I have 3 options:
1. Use VLD, but it does not report any leaks.
2. Override operator new so that the CRT reporting works. But I get an error that new is redefined. Since it is a huge code base, some other place has already overridden it, so conflicts arise.
3. Use the allocation number shown in curly braces together with _crtBreakAlloc, but that number is not stable across runs, so I cannot use this strategy either.
Please help me resolve this issue. Are there any better memory leak detection tools? I used Valgrind on Linux as well. It also does not report any memory leak. Only the CRT debug heap reports one.
I'm continuing my work on the FPGA driver.
Now I'm adding OpenCL support, so I have the following test.
It just issues write and read requests for the same buffers NUM_OF_EXEC times and then waits for completion.
Each write/read request is serialized in the driver and executed sequentially as a DMA transaction. The DMA-related code can be viewed here.
So the driver takes a transaction, executes it (rsp_setup_dma and fpga_push_data_to_device), waits for an interrupt from the FPGA (fpga_int_handler), releases resources (fpga_finish_dma_write) and begins a new one. When NUM_OF_EXEC equals 1, everything seems to work, but if I increase it, a problem appears. At some point get_user_pages (in rsp_setup_dma) returns -EFAULT. Debugging the kernel, I found that the allocated vma doesn't have the VM_GROWSDOWN flag set (in find_extend_vma in mmap.c). At this point I am stuck, because I am neither sure that I understand why this flag is needed, nor do I have any idea why it is not set. Why can get_user_pages fail with the above symptoms? How can I debug this?
On some architectures the stack grows up and on others the stack grows down. See hppa and hppa64 for the weirdos that created the need for such a flag.
So whenever you have to deal with setting up the stack for a kernel thread or process you'll have to provide the direction in which the stack grows as well.