How to free all GPU memory from torch.load? - memory-management

This code fills some GPU memory and doesn't let it go:
import torch

def checkpoint_mem(model_name):
    checkpoint = torch.load(model_name)
    del checkpoint
    torch.cuda.empty_cache()
Printing memory with the following code:
print(torch.cuda.memory_reserved(0))
print(torch.cuda.memory_allocated(0))
shows BEFORE running checkpoint_mem:
0
0
and AFTER:
121634816
97332224
This is with torch.__version__ 1.11.0+cu113 on Google Colab.
Does torch.load leak memory? How can I get the GPU memory completely cleared?

It probably doesn't. It also depends on what you call a memory leak. In this case, all memory should be freed after the program ends. Python uses a garbage collector, so the release might not happen immediately (after your del, or after leaving the scope) the way it does in C++ or similar languages with RAII.
del
del is called by Python and only removes the reference (same as when the object goes out of scope in your function).
torch.nn.Module does not implement __del__, hence its reference is simply removed.
All of the elements within torch.nn.Module have their references removed recursively (so for each CUDA torch.Tensor instance, its __del__ is called).
__del__ on each tensor is a call to release its memory.
More about __del__
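As a minimal sketch (assuming a CUDA-capable machine), you can watch del merely drop a reference; the memory is released back to the allocator only once the last reference disappears:

import torch

t = torch.ones(1024, 1024, device="cuda")  # ~4 MB float32 tensor on the GPU
alias = t                                   # second reference to the same storage
del t                                       # drops one reference; nothing is freed yet
print(torch.cuda.memory_allocated(0))       # still ~4 MB: alias keeps the tensor alive
del alias                                   # last reference gone, so the tensor's __del__ runs
print(torch.cuda.memory_allocated(0))       # 0, though the caching allocator still reserves the block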
Caching allocator
Another thing: the caching allocator holds on to part of the memory so it doesn't have to compete with other applications for CUDA memory every time you need it.
Also, I assume PyTorch is loaded lazily, hence you get 0 MB used at the very beginning; but AFAIK PyTorch itself, during startup, reserves some part of CUDA memory.
The short story is given here, longer one here in case you didn’t see it already.
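To make the reserved-versus-allocated distinction concrete, here is a rough sketch (sizes are illustrative; assumes a CUDA device):

import torch

x = torch.zeros(256, 1024, 1024, device="cuda")  # ~1 GiB of float32 zeros
print(torch.cuda.memory_allocated(0))  # bytes held by live tensors
print(torch.cuda.memory_reserved(0))   # bytes held by the caching allocator (>= allocated)
del x
print(torch.cuda.memory_allocated(0))  # 0: no live tensors remain
print(torch.cuda.memory_reserved(0))   # still ~1 GiB, kept cached for reuse
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved(0))   # cache handed back to the driver; the CUDA context itself stays resident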
Possible experiments
You may try to run time.sleep(5) after your function and measure memory afterwards.
You can get a snapshot of the allocator state via torch.cuda.memory_snapshot to get more info about the allocator's reserved memory and inner workings.
You might set the environment variable PYTORCH_NO_CUDA_MEMORY_CACHING=1 and see whether anything changes (a sketch combining these experiments follows this list).
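A sketch combining these experiments (the checkpoint path is a placeholder, and the environment variable must be set before CUDA is first touched; memory_snapshot returns per-segment dictionaries):

import os
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"  # disable the caching allocator (set before any CUDA work)

import gc
import time
import torch

def checkpoint_mem(model_name):
    checkpoint = torch.load(model_name)
    del checkpoint
    torch.cuda.empty_cache()

checkpoint_mem("model.pt")  # placeholder checkpoint file
gc.collect()                # nudge the garbage collector explicitly
time.sleep(5)               # ...and give it a moment to settle
print(torch.cuda.memory_reserved(0))
print(torch.cuda.memory_allocated(0))
for segment in torch.cuda.memory_snapshot():  # inspect the allocator's segments
    print(segment["total_size"], segment["allocated_size"])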
Disclaimer
Not a CUDA expert by any means, so someone with more insight could probably expand (and/or correct) my current understanding as I am sure way more things happen under the hood.

It is not possible; see here for the same question and the response from a PyTorch developer:
https://github.com/pytorch/pytorch/issues/37664

Related

Julia 1.1 with JLD HDF5 package and memory release in Windows

I'm using Julia 1.1 with JLD and HDF5 to save a file onto the disk, and I ran into a couple of questions about memory usage.
Issue 1:
First, I defined a 4 GB matrix A.
A = zeros(ComplexF64,(243,243,4000));
When I typed the command below and watched the Windows Task Manager, it took several minutes for Julia to release that memory back to me:
A=nothing
Most of the time (in Task Manager), Julia just doesn't release the memory at all, even though the command instantly reports that A occupies 0 bytes:
varinfo()
name size summary
–––––––––––––––– ––––––––––– –––––––
A 0 bytes Nothing
Base Module
Core Module
InteractiveUtils 162.930 KiB Module
Main Module
ans 0 bytes Nothing
Issue 2:
Further, I tried to use JLD and HDF5 to save the file onto the disk. This time, Task Manager told me that the save("test.jld", "A", A) command used an extra 4 GB of memory.
using JLD,HDF5
A = zeros(ComplexF64,(243,243,4000));
save("test.jld", "A", A)
Further, after I typed
A=nothing
Julia wouldn't release the 8 GB of memory back to me.
Finding 3:
An interesting thing I found was that, if I retyped the command
A = zeros(ComplexF64,(243,243,4000));
Task Manager would tell me the cached memory was released, and the total memory usage was again only 4 GB.
Question 1:
What's going on with memory management in Julia? Is it just Windows misreporting, or some command in Julia? How can I check Julia's memory usage instantly?
Question 2:
How can I tell Julia to release the memory instantly?
Question 3:
Is there a way to tell the JLD package not to use that extra 4 GB of memory?
(Better yet, could someone tell me how to create A directly on the disk without even creating it in memory first? I know there's memory-mapped I/O in the JLD package. I have tried it, but it seemed to require me to create matrix A in memory and save it onto the disk first, before I could recall the memory-mapped A again.)
This is a long question, so thanks in advance!
Julia uses a garbage collector to deallocate memory. Usually the garbage collector does not run after every line of code, but only when needed.
Try to force garbage collection by running the command:
GC.gc()
This releases memory space for unreferenced Julia objects. In this way you can check whether the memory actually has been released.
Side note: JLD used to be somewhat not-always-working (I do not know the current status). Hence your first consideration for non-cross-platform object persistence should always be the serialize function from the built-in Serialization package - see the documentation at https://docs.julialang.org/en/v1/stdlib/Serialization/index.html#Serialization.serialize

Call to ExAllocatePoolWithTag never returns

I am having some issues with my virtual HBA driver on Windows Server 2016. I ran the HLK crashdump support test; 3 times out of 10 the test failed. In those 3 failing tests, the crash dump hangs at 0% while taking a Complete dump, Kernel dump, or minidump.
By kernel debugging my code, I found that the call to ExAllocatePoolWithTag() for buffer allocation never actually returns.
Below is the statement which never returns.
pDeviceExtension->pcmdbuf = (struct mycmdrsp *)ExAllocatePoolWithTag(NonPagedPoolCacheAligned, pcmdqSignalSize, ((ULONG)'TA1'));
I searched on the web regarding this. However, all of the pages I found focus on this function returning NULL, whereas in my case it never returns at all.
Any help on how to move forward would be highly appreciated.
Thanks in advance.
You can't allocate memory in crash dump mode. You're running at HIGH_LEVEL with interrupts disabled and so you're calling this API at the wrong IRQL.
The typical solution for a hardware adapter is to set the RequestedDumpBufferSize in the PORT_CONFIGURATION_INFORMATION structure during the normal HwFindAdapter call. Then when you're called again in crash dump mode you use the CrashDumpRegion field to get your dump buffer allocation. You then need to write your own "crash dump mode only" allocator to allocate buffers out of this memory region.
It's a huge pain, especially given that it's difficult or impossible to know how much memory you're ultimately going to need. I usually calculate some minimal configuration overhead (e.g. 1 channel, 8 I/O requests at a time, etc.) and then add in a registry-configurable slush. The only benefit is that the environment is stripped down, so you don't need to be in your all-singing, all-dancing configuration.

Detection of freed memory usage (FPC -> heaptrc -> keepreleased)

Free Pascal heaptrc with keepreleased is described as "useful if you suspect that the same memory block is released twice", but is it possible to detect usage of previously freed memory (an object method call on a freed object) with it? If it is impossible, can it be detected with other tools?
Yes, it should do that. The idea is the following:
a used allocation has a different .sig than $AAAAAAAA or $DEADBEEF. On freemem the sig is checked (see around line 593 in trunk) against $AAAAAAAA if useCRC is false.
keepreleased prevents blocks from being reused, which would change the signature to something other than $AAAAAAAA. It will print something like:
Marked memory at $12345678 released
to the file descriptor ptext. The standard error files can be set and redirected using various other variables. It looks fairly complicated, but that is probably to deal with console-less GUI applications.
Some other variables (like haltonerror) govern whether the application is halted on such corruption.
An alternative (but very slow) way is using Valgrind (FPC option -gv), but I have only run Valgrind on *nix, and as said it is extremely slow, so it is not for very processing-heavy apps.

GlobalFree() causing user breakpoint... memory block is fixed, not locked, single module, no DLL

At one point in my program I am calling GlobalFree() to release a memory buffer I allocated with GlobalAlloc() using the GMEM_FIXED flag. There is nothing that could be locking this block. However, when I call GlobalFree() after referencing the data (and all of the internal data is still the same as it was), the program stops and says it has encountered a user breakpoint in the GlobalFree() code.
Any ideas what could cause this?
The heap functions typically call DebugBreak() - which implements a user breakpoint - when they detect that the heap structures have been damaged.
The implication is you have written past the end (or the beginning) of the allocated area.

Memory Usage in R

After creating large objects and running out of RAM, I try to delete the objects in my current environment using
rm(list=ls())
When I check my RAM usage, nothing has changed. Even after calling gc(), nothing has changed. I can only get my RAM back by quitting R.
Anybody have advice for dealing with memory-intensive objects within R?
Memory for deleted objects is not released immediately. R uses a technique called "garbage collection" to reclaim memory for deleted objects. Periodically, it cycles through the list of accessible objects (basically, those that have names and have not been deleted and can therefore be accessed by the user), and "tags" them for retention. The memory for any untagged objects is returned to the operating system after the garbage-collection sweep.
Garbage collection happens automatically, and you don't have any direct control over this process. But you can force a sweep by calling the command gc() from the command line.
Even then, on some operating systems garbage collection might not reclaim memory (as reported by the OS). Older versions of Windows, for example, could increase but not decrease the memory footprint of R. Garbage collection would only make space for new objects in the future, but would not reduce the memory use of R.
On Windows, the technique you describe works for me. Try the following example.
Open the Windows Task Manager (CTRL+SHIFT+ESC).
Start RGui. RGui.exe mem usage is 27 460K.
Type
gcinfo(TRUE)
x <- rnorm(1e8)
RGui.exe mem usage is now 811 100K.
Type rm("x"). RGui.exe mem usage is still 811 100K.
Type gc(). RGui.exe mem usage is now 28 332K.
Note that gc() should be called automatically if you have removed objects from your workspace and then try to allocate more memory for new variables.
My impression is that multiple forms of gc() are tried before R reports failed memory allocation. I'm not aware of a solution for this at present, other than restarting R as you suggest. It appears that R does not defragment memory.
An old question, I realize, but I've found that (on macOS Mojave) invoking pryr::mem_used() in the R session causes Activity Monitor to immediately update the reported memory usage to reflect only the objects retained in the R environment.
