Tensorflow/Keras performance better with less GPU memory? - performance

Reducing the GPU memory allocation increases the training speed of my CNN model. I expected the opposite: that restricting memory would, if anything, slow training down. I suspect this has to do with the time required for transferring data in and out of the GPU.
I ran two cases: (a) allocating all GPU memory and (b) allocating approximately half (46%) of it. With all GPU memory available (no restriction), training speed is 153 epochs/hour; with 46% of GPU memory, it is 199 epochs/hour.
I set the GPU memory fraction via:
import tensorflow as tf
from keras import backend as K

gpu_memory_fraction = 0.46  # values tested: 0.46 and 1
config = tf.compat.v1.ConfigProto()
if gpu_memory_fraction < 1.0:
    config.gpu_options.per_process_gpu_memory_fraction = gpu_memory_fraction
K.tensorflow_backend.set_session(tf.Session(config=config))
I use the keras model.fit() method for training. Batch size = 64.
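For reference, an alternative I considered is letting the allocator grow on demand instead of reserving a fixed fraction; a minimal sketch, assuming the same TF 1.15 / Keras 2.2.4 stack as above:
import tensorflow as tf
from keras import backend as K

# Grow GPU memory on demand instead of reserving a fixed fraction up front.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
K.tensorflow_backend.set_session(tf.Session(config=config))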
Environment:
python 3.7.5
keras 2.2.4
tensorflow 1.15.0
conda 4.7.12
CUDA 10.2
Nvidia driver 440.44
Ubuntu 18.04.3
GPU: Nvidia GTX 1080Ti, 11GB memory
CPU Intel i9-9900K 3.6 GHz, RAM 64 GB, SSD 120 GB
Can anyone offer an explanation?

Related

Why does transferring a tensor to the GPU result in additional memory allocation in RAM?

I am trying to calculate the memory footprint of my fine-tuned model at inference time: how much RAM the model will need on a system without a GPU, and how much GPU memory it needs on a system with one.
While calculating this, I observed that when I transfer my fine-tuned (PyTorch) model from CPU to GPU, some additional RAM is allocated, and I cannot understand why. This answer is not comprehensive enough.
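For the static part of the footprint, the parameter and buffer sizes can be summed directly; a minimal sketch (here model stands for whatever fine-tuned module is being measured):
import torch

def model_size_gb(model: torch.nn.Module) -> float:
    # parameters plus registered buffers, in GiB; runtime activations come on top of this
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    total_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return total_bytes / 1024**3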
To replicate the problem use this code:
import time
import torch
import psutil

def stathere():
    # record GPU memory allocated (GiB), available RAM (GiB) and a start time
    av = []
    if torch.cuda.is_available():
        av.append(torch.cuda.memory_allocated(torch.device("cuda")) / (1024 * 1024 * 1024))
    else:
        av.append(0)
    av.append(psutil.virtual_memory().available / (1024 * 1024 * 1024))
    a = time.time()
    return av, a

def statnow(av, a):
    # report the change in GPU memory, RAM and elapsed time since stathere()
    if torch.cuda.is_available():
        print("Memory taken on GPU", round(torch.cuda.memory_allocated(torch.device("cuda")) / (1024 * 1024 * 1024) - av[0], 3), "GB")
    print("Memory taken on RAM", round(av[1] - (psutil.virtual_memory().available / (1024 * 1024 * 1024)), 3), "GB")
    print(round(time.time() - a), "seconds taken")
    return

av, a = stathere()
print('Tensor on RAM')
g = torch.rand(20000, 20000)
statnow(av, a)
del g

av, a = stathere()
print('Tensor transferred on GPU')
g = torch.rand(20000, 20000).to(torch.device("cuda:0"))
statnow(av, a)
Output
Tensor on RAM
Memory taken on GPU 0.0 GB
Memory taken on RAM 1.566 GB
5 seconds taken
Tensor transferred on GPU
Memory taken on GPU 1.49 GB
Memory taken on RAM 4.024 GB
17 seconds taken
EDIT: Moreover, the (additional) memory allocation in RAM is not additive. When I send a different tensor (g2 = torch.rand(10000,15000)) to the GPU, I get a different RAM consumption (0.9 GB), but when I send both tensors (g and g2) to the GPU, the RAM consumption is negative (-1.4 GB).
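One plausible contributor (my assumption, not verified) is the host-side CUDA context and staging buffers that PyTorch creates on first CUDA use; a quick sketch to measure that overhead in isolation:
import torch
import psutil

def avail_gb():
    return psutil.virtual_memory().available / (1024 * 1024 * 1024)

before = avail_gb()
torch.cuda.init()              # create the CUDA context without any large tensor
torch.zeros(1, device="cuda")  # tiny allocation to make sure the context is live
print("RAM used by CUDA context:", round(before - avail_gb(), 3), "GB")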

RuntimeError: CUDA out of memory. Tried to allocate... but memory is empty

I'm trying to run the train file from this Unet with their default hyperparameters, batch size = 1.
I have a GTX970 with 4GB and made Windows use the integrated graphics.
When I run nvidia-smi, it reports that the GPU memory is almost entirely free (52MiB / 4096MiB) with "No running processes found", and PyTorch uses the GPU, not the integrated graphics.
I do not understand what is using the memory:
RuntimeError: CUDA out of memory. Tried to allocate 150.00 MiB (GPU 0; 4.00 GiB total capacity; 2.77 GiB already allocated; 72.46 MiB free; 2.82 GiB reserved in total by PyTorch).
GPU memory allocation is not done all at once. As the program loads the data and the model, GPU memory usage gradually increases until training actually starts. In your case, the program has already allocated 2.77 GiB and tries to get more memory before training starts, but there is not enough space left. 4 GB of GPU memory is usually too small for computer-vision deep learning models.
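To see where the 2.77 GiB goes before training even starts, you can print allocator statistics at each stage; a sketch below, where UNet and loader are placeholders for your own model and data loader:
import torch

def report(stage):
    # memory_reserved() is named memory_cached() on very old PyTorch versions
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{stage}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

report("start")
model = UNet().cuda()               # placeholder model
report("after moving the model to the GPU")
images, masks = next(iter(loader))  # placeholder data loader
output = model(images.cuda())
report("after the first forward pass")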

Unable to allocate GPU memory when there is enough cached memory

I am training a VGG16 model from scratch on an AWS EC2 Deep Learning AMI machine (Ubuntu 18.04.3 LTS (GNU/Linux 4.15.0-1054-aws x86_64v)) with Python 3 (CUDA 10.1 and Intel MKL, PyTorch 1.3.1) and I am facing the error below while updating model parameters.
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 11.17 GiB total capacity; 10.76 GiB already allocated; 4.81 MiB free; 119.92 MiB cached)
Code for updating parameters:
def _update_fisher_params(self, current_ds, batch_size, num_batch):
    dl = DataLoader(current_ds, batch_size, shuffle=True)
    log_liklihoods = []
    for i, (input, target) in enumerate(dl):
        if i > num_batch:
            break
        output = F.log_softmax(self.model(input.cuda().float()), dim=1)
        log_liklihoods.append(output[:, target])
    log_likelihood = torch.cat(log_liklihoods).mean()
    grad_log_liklihood = autograd.grad(log_likelihood, self.model.parameters())
    _buff_param_names = [param[0].replace('.', '__') for param in self.model.named_parameters()]
    for _buff_param_name, param in zip(_buff_param_names, grad_log_liklihood):
        self.model.register_buffer(_buff_param_name + '_estimated_fisher', param.data.clone() ** 2)
After debugging, the line log_liklihoods.append(output[:, target]) throws the error after 157 iterations.
I seem to have the required memory, but it is not allocated. I do not understand why updating the gradients causes the memory problem, since gradients should be de-referenced and released automatically on each iteration. Any idea?
I have tried the following solutions, with no luck:
Lowering batch size
Freeing cache with torch.cuda.empty_cache()
Reducing the number of filters to reduce the memory footprint
Finally I solved the memory problem! I realized that in each iteration I put the input data into a new tensor, and PyTorch generates a new computation graph.
That causes the used RAM to grow forever. Then I used the .detach() function, and the RAM always stays at a low level.
self.model(input.cuda().float()).detach().requires_grad_(True)
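Applied to the loop from the question, the change would look roughly like this (a sketch following the answer above; variable names come from the earlier snippet):
for i, (input, target) in enumerate(dl):
    if i > num_batch:
        break
    # detach the forward output so the graphs of earlier iterations
    # are not kept alive by the stored slices
    logits = self.model(input.cuda().float()).detach().requires_grad_(True)
    output = F.log_softmax(logits, dim=1)
    log_liklihoods.append(output[:, target])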

Tensorflow Object Detection Training Performance

I am training the ssd_mobilenet_v1_coco network on my own custom classes using the TensorFlow Object Detection API.
I have used the CPU (i7-6700) and GPU (NVIDIA Quadro K620) to train:
Processor   Batch size   sec/step   sec/image
K620        1            0.45       0.450
K620        10           2.22       0.222
i7-6700     1            0.66       0.660
i7-6700     24           9.30       0.388
However, the GPU is only about 70% faster than the CPU.
I expected the GPU to be significantly faster.
Is this performance adequate for my hardware or is there something wrong?
Maybe you can try TensorFlow Serving.

Building a Roofline Model

I'm trying to build a roofline model for a node in a supercomputer that I'm running simulations on. The node has 2x Intel Xeon E5-2650 v2 (Ivy Bridge) 8-core 2.6 GHz processors (16 cores per node), with 64 GB RAM total (4 GB per core). The maximum memory bandwidth for the Intel Xeon E5-2650 is shown here as 59.7 GB/s.
Achieved GFLOPS = max mem bandwidth x arithmetic intensity.
Max GFLOPS = num cores x clock frequency in GHz x ops/cycle.
My code has arithmetic intensity of 1/3 and uses double precision floating point.
Here are my calculations of peak GFLOPs for the different types of program:
Sequential program (single core) no vectorisation:
1x2.6x1 (I assume without vectorisation, we can only achieve 1 op/cycle?) = 2.6 GFLOPs
Sequential program (single core) with vectorisation (SSE):
1x2.6x8 = 20.8 GFLOPs
All cores on one Xeon with vectorisation (SSE):
8x2.6x8 = 166.4 GFLOPs
All cores on both Xeons with vectorisation (SSE):
2x 8x2.6x8 = 332.8 GFLOPs
How does the memory bandwidth available to the program change between the different types of program shown above? I know that the max memory bandwidth for one Xeon E5-2650 is 59.7 GB/s, but is this achievable on a single core? Does this become 119.4 GB/s with two Xeon E5-2650s?
So would the achieved GFLOPs (using peak bandwidth x arithmetic intensity) be:
Sequential program w/o vectorisation:
59.7 * 1/3 = 19.9 GFLOPs, however because our roofline is 2.6 GFLOPs, we are limited to 2.6 GFLOPs?
Sequential program with vectorisation:
59.7 * 1/3 = 19.9 GFLOPs. This is achievable because our roofline is 20.8 GFLOPs.
One Xeon (using all 8 cores) with vectorisation:
59.7 * 1/3 = 19.9 GFLOPs. I am suspicious of this, because surely our parallel program is capable of generating more memory requests than the sequential program, and surely the sequential program doesn't saturate the memory system?
Two Xeons (total of 16 cores) with vectorisation:
119.4 * 1/3 = 39.8 GFLOPs.
I feel like something is wrong with the achieved GFLOPs. Have I made a mistake somewhere?
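To keep the arithmetic straight, the min(peak compute, bandwidth x intensity) rule can be written out as a small script using the figures assumed above:
# Roofline sanity check using the peak figures assumed in the question.
def roofline(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    # attainable GFLOP/s = min(peak compute, memory bandwidth * arithmetic intensity)
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

ai = 1.0 / 3.0  # FLOPs per byte, as stated for the code in question

cases = {
    "1 core, no vectorisation": (2.6, 59.7),
    "1 core, vectorised":       (20.8, 59.7),
    "8 cores, vectorised":      (166.4, 59.7),
    "16 cores, two sockets":    (332.8, 119.4),
}
for name, (peak_gflops, bw) in cases.items():
    print(f"{name}: {roofline(peak_gflops, bw, ai):.1f} GFLOP/s")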
