Why does transferring a tensor to the GPU result in additional memory allocation in RAM? - memory-management

I am trying to calculate the memory footprint of my fine-tuned model at inference time. I want to know how much RAM the model will need on a system with no GPU, and how much GPU memory it needs on a system with a GPU.
While measuring this, I observed that when I transfer my fine-tuned (PyTorch) model from CPU to GPU, some additional RAM is allocated as well. I am not able to understand why that happens. This answer is not comprehensive enough.
To reproduce the problem, use this code:
import time
import torch
import psutil

def stathere():
    # Record currently allocated GPU memory (GB), available RAM (GB) and the start time.
    av = []
    if torch.cuda.is_available():
        av.append(torch.cuda.memory_allocated(torch.device("cuda"))/(1024*1024*1024))
    else:
        av.append(0)
    av.append(psutil.virtual_memory().available/(1024*1024*1024))
    a = time.time()
    return av, a

def statnow(av, a):
    # Report how much GPU memory and RAM were consumed since stathere() was called.
    if torch.cuda.is_available():
        print("Memory taken on GPU", round(torch.cuda.memory_allocated(torch.device("cuda"))/(1024*1024*1024)-av[0], 3), "GB")
    print("Memory taken on RAM", round(av[1]-(psutil.virtual_memory().available/(1024*1024*1024)), 3), "GB")
    print(round(time.time()-a), "seconds taken")
    return

av, a = stathere()
print('Tensor on RAM')
g = torch.rand(20000, 20000)
statnow(av, a)

del g
av, a = stathere()
print('Tensor transferred on GPU')
g = torch.rand(20000, 20000).to(torch.device("cuda:0"))
statnow(av, a)
Output
Tensor on RAM
Memory taken on GPU 0.0 GB
Memory taken on RAM 1.566 GB
5 seconds taken
Tensor transferred on GPU
Memory taken on GPU 1.49 GB
Memory taken on RAM 4.024 GB
17 seconds taken
EDIT: Moreover, the additional RAM allocation is not additive. For example, when I send a different tensor (g2 = torch.rand(10000,15000)) to the GPU, I get a different RAM consumption (0.9 GB). But when I send both tensors (g and g2) to the GPU, the measured RAM consumption is negative (-1.4 GB).
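A hedged sketch that may help separate the contributions: the first CUDA operation in a process initializes the CUDA context (driver, kernels, caching allocator), which by itself consumes host RAM, and torch.rand(20000,20000) also materializes the 1.49 GB tensor on the CPU before it is copied to the GPU. Reusing the stathere()/statnow() helpers above, something like the following measures the context cost in isolation:

# Sketch only: assumes the stathere()/statnow() helpers defined above and a CUDA build of PyTorch.
av, a = stathere()
print('CUDA context initialization only')
warmup = torch.zeros(1, device="cuda:0")    # first CUDA call triggers context setup
torch.cuda.synchronize()
statnow(av, a)

av, a = stathere()
print('Large tensor transferred to GPU (context already initialized)')
g = torch.rand(20000, 20000).to(torch.device("cuda:0"))
torch.cuda.synchronize()
statnow(av, a)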

Related

RuntimeError: CUDA out of memory. Tried to allocate... but memory is empty

I'm trying to run the train file from this Unet with its default hyperparameters, batch size = 1.
I have a GTX 970 with 4 GB and made Windows use the integrated graphics.
When I run nvidia-smi, it says the GPU memory is almost free (52MiB / 4096MiB) and "No running processes found", and PyTorch uses the GPU, not the integrated graphics.
I do not understand what is using the memory:
RuntimeError: CUDA out of memory. Tried to allocate 150.00 MiB (GPU 0; 4.00 GiB total capacity; 2.77 GiB already allocated; 72.46 MiB free; 2.82 GiB reserved in total by PyTorch).
GPU memory allocation is not done all at once. As the program loads the data and the model, GPU memory usage gradually increases until the training actually starts. In your case, the program has allocated 2.7GB and tries to get more memory before training starts, but there is not enough space. 4GB GPU memory is usually too small for CV deep learning algorithms.
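Not the repository's code, but a generic sketch of two things that often help on a 4 GB card: inspecting what PyTorch has actually allocated versus reserved, and training with automatic mixed precision (available in recent PyTorch releases) so activations are stored in half precision. The toy model and random data below are placeholders:

import torch
from torch import nn

# Make sure a CUDA context exists, then inspect allocated vs. reserved memory.
torch.zeros(1, device="cuda")
print(torch.cuda.memory_summary(device=0, abbreviated=True))

# Toy stand-ins for the Unet, its loss and its data; the point is the AMP pattern.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

images = torch.rand(1, 3, 256, 256, device="cuda")
targets = torch.rand(1, 3, 256, 256, device="cuda")

optimizer.zero_grad(set_to_none=True)       # release old gradient tensors
with torch.cuda.amp.autocast():             # run the forward pass in mixed precision
    loss = criterion(model(images), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()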

Unable to allocate GPU memory, when there is enough of cached memory

I am training a vgg16 model from scratch on an AWS EC2 Deep Learning AMI machine (Ubuntu 18.04.3 LTS (GNU/Linux 4.15.0-1054-aws x86_64v)) with Python3 (CUDA 10.1 and Intel MKL) (PyTorch 1.3.1), and I get the error below while updating the model parameters.
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 11.17 GiB total capacity; 10.76 GiB already allocated; 4.81 MiB free; 119.92 MiB cached)
Code for updating parameters:
def _update_fisher_params(self, current_ds, batch_size, num_batch):
    dl = DataLoader(current_ds, batch_size, shuffle=True)
    log_liklihoods = []
    for i, (input, target) in enumerate(dl):
        if i > num_batch:
            break
        output = F.log_softmax(self.model(input.cuda().float()), dim=1)
        log_liklihoods.append(output[:, target])
    log_likelihood = torch.cat(log_liklihoods).mean()
    grad_log_liklihood = autograd.grad(log_likelihood, self.model.parameters())
    _buff_param_names = [param[0].replace('.', '__') for param in self.model.named_parameters()]
    for _buff_param_name, param in zip(_buff_param_names, grad_log_liklihood):
        self.model.register_buffer(_buff_param_name + '_estimated_fisher', param.data.clone() ** 2)
After debugging, I found that the log_liklihoods.append(output[:, target]) line throws the error after 157 iterations.
I have the required memory, but it does not allocate. I don't understand why updating the gradients causes the memory problem, as gradients should be de-referenced and released automatically on each iteration. Any idea?
I have tried the following solutions, but no luck.
Lowering batch size
Freeing cache with torch.cuda.empty_cache()
Reducing the number of filters to reduce the memory footprint
Machine Specs:
Finally, I solved the memory problem! I realized that in each iteration I put the input data into a new tensor, and PyTorch generates a new computation graph.
That causes the used RAM to grow forever. Then I used the .detach() function, and the RAM always stays at a low level.
self.model(input.cuda().float()).detach().requires_grad_(True)
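For context, a minimal sketch of the general pattern (not the exact code from this question): keeping whole output tensors in a Python list keeps their computation graphs alive, while detaching (or calling .item()) lets each graph be freed after the iteration.

import torch
from torch import nn

model = nn.Linear(10, 1)

# Graph-retaining version: every stored tensor still references its graph,
# so memory grows with the number of iterations.
kept = []
for _ in range(100):
    out = model(torch.rand(32, 10))
    kept.append(out)                 # holds activations and graph alive

# Memory-friendly version: detach when only the values are needed.
kept = []
for _ in range(100):
    out = model(torch.rand(32, 10))
    kept.append(out.detach())        # graph can be freed each iteration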

Tensorflow/Keras performance better with less GPU memory?

Reducing the GPU memory allocation increases the training speed of my CNN model. I expected the opposite: that speed would decrease when less memory is allocated. I suspect this has to do with the time required for transferring data in and out of the GPU.
I ran cases allocating a) all GPU memory and b) approximately half (46%) of GPU memory. With all GPU memory (no restriction), training speed is 153 epochs/hour; when allocating 46% of GPU memory, training speed is 199 epochs/hour.
I set GPU memory via:
gpu_memory_fraction = 0.46 # Values tested: 0.46 and 1
config = tf.compat.v1.ConfigProto()
if gpu_memory_fraction < 1.0:
    config.gpu_options.per_process_gpu_memory_fraction = gpu_memory_fraction
K.tensorflow_backend.set_session(tf.Session(config=config))
I use the keras model.fit() method for training. Batch size = 64.
Environment:
python 3.7.5
keras 2.2.4
tensorflow 1.15.0
conda 4.7.12
CUDA 10.2
Nvidia driver 440.44
Ubuntu 18.04.3
GPU: Nvidia GTX 1080Ti, 11GB memory
CPU Intel i9-9900K 3.6 GHz, RAM 64 GB, SSD 120 GB
Can anyone offer an explanation?
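Not an explanation of the speed difference, but for comparison, a hedged sketch of the other common way to cap TensorFlow 1.x's allocator, assuming the same standalone Keras / tf.compat.v1 session setup used in the question: letting the allocator grow on demand instead of pre-reserving a fixed fraction.

import tensorflow as tf
from keras import backend as K

# Alternative to per_process_gpu_memory_fraction: let TensorFlow grow its GPU
# allocation on demand instead of reserving a fixed share up front.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
K.tensorflow_backend.set_session(tf.compat.v1.Session(config=config))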

Why is the CPU slower for calculations than the GPU when only memory should matter?

A modern CPU has an ethash hashrate of under 1 MH/s (source: https://ethereum.stackexchange.com/questions/2325/is-cpu-mining-even-worth-the-ether ), while GPUs easily mine at over 20 MH/s. With overclocked memory they reach rates of up to 30 MH/s.
GPUs have GDDR memory with clock rates of about 1000 MHz, while DDR4 runs at higher clock speeds. The bandwidth of DDR4 also seems to be higher (sources: http://www.corsair.com/en-eu/blog/2014/september/ddr3_vs_ddr4_synthetic and https://en.wikipedia.org/wiki/GDDR5_SDRAM )
It is said that for Dagger-Hashimoto/ethash, memory bandwidth is what matters (which also matches my experience from overclocking GPUs). I find that reasonable, since the CPU/GPU only has to do 2x SHA-3 (1x Keccak256 + 1x Keccak512) operations (source: https://github.com/ethereum/wiki/wiki/Ethash#main-loop ).
A modern Skylake processor can compute over 100M Keccak512 operations per second (see here: https://www.cryptopp.com/benchmarks.html ), so the core-count difference between GPUs and CPUs should not be the problem.
But why don't we get about ~50 MH/s from 2x Keccak operations plus memory loads on a CPU?
See http://www.nvidia.com/object/what-is-gpu-computing.html for an overview of the differences between CPU and GPU programming.
In short, a CPU has a very small number of cores, each of which can do different things, and each of which can handle very complex logic.
A GPU has thousands of cores that operate pretty much in lockstep, but can only handle simple logic.
Therefore the overall processing throughput of a GPU can be massively higher. But it isn't easy to move logic from the CPU to the GPU.
If you want to dive in deeper and actually write code for both, one good starting place is https://devblogs.nvidia.com/gpu-computing-julia-programming-language/.
"A modern Skylake processor can compute over 100M of Keccak512 operations per second" is incorrect, it is 140 MiB/s. That is MiBs per second and a hash operation is more than 1 byte, you need to divide the 140 MiB/s by the number of bytes being hashed.
I found an article addressing my problem (the influence of memory on the algorithm).
It's not only the computation problem (mentioned here: https://stackoverflow.com/a/48687460/2298744 ); the memory bandwidth also bottlenecks the CPU.
As described in the article, every round fetches 8 KB of data for the calculation. This results in the following formula:
(memory bandwidth) / (DAG memory fetched per hash) = max theoretical hashrate
(memory bandwidth) / (8 KB / hash) = max theoretical hashrate
For a graphics card like the RX 470 mentioned there, this gives:
(211 GB/s) / (8 KB/hash) = ~26 Mhashes/s
While for CPUs with DDR4 this gives:
(12.8 GB/s) / (8 KB/hash) = ~1.6 Mhashes/s
or (depending on the clock speed of the RAM)
(25.6 GB/s) / (8 KB/hash) = ~3.2 Mhashes/s
To sum up, a CPU (or even a GPU) with DDR4 RAM cannot get more than ~3.2 MH/s, since it cannot fetch the data it needs for processing fast enough.
Source:
https://www.vijaypradeep.com/blog/2017-04-28-ethereums-memory-hardness-explained/
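As a quick check of that arithmetic, a small sketch (the bandwidth figures are the ones quoted above, not independently measured):

# Max theoretical ethash rate = memory bandwidth / DAG data fetched per hash (8 KB).
def max_hashrate(bandwidth_gb_per_s, kb_per_hash=8):
    return bandwidth_gb_per_s * 1e9 / (kb_per_hash * 1e3)   # hashes per second

for name, bw in [("RX 470 GDDR5", 211.0), ("DDR4 @ 12.8 GB/s", 12.8), ("DDR4 @ 25.6 GB/s", 25.6)]:
    print(f"{name}: {max_hashrate(bw) / 1e6:.1f} MH/s")
# Prints roughly 26.4, 1.6 and 3.2 MH/s, matching the figures above.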

Calculating actual flop/core when using actual memory bandwidth

I want to calculate the actual MFLOP/s/core using the following information:
I have measured the actual memory bandwidth of each core in one node, which is 4371 MB/s.
I have also measured the MFLOP/s/core on one node when using only one core (in this case the whole memory bandwidth of the node is available to that core); the result is 2094.45. The memory bandwidth available to that core was 10812.3 MB/s.
Now I want to calculate the actual MFLOP/s/core when the core has its real memory bandwidth (4371 MB/s).
Do you think it would be correct to calculate it like this:
actual MFLOP/s/core = (measured MFLOP/s/core * actual memory bandwidth) / (memory bandwidth used in the measurement)
Any help would be appreciated.
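A tiny sketch of the proposed scaling, valid only under the assumption that the FLOP rate scales linearly with available memory bandwidth (i.e. the code is fully memory-bound):

# Linear-scaling estimate: performance assumed proportional to memory bandwidth.
measured_mflops = 2094.45      # MFLOP/s/core with the whole node's bandwidth
measured_bw = 10812.3          # MB/s available during that measurement
actual_bw = 4371.0             # MB/s per core when all cores are active

estimated_mflops = measured_mflops * actual_bw / measured_bw
print(f"estimated MFLOP/s/core: {estimated_mflops:.1f}")   # ~846.7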
