Error using Tensorflow with GPU - gpgpu

I've tried a bunch of different TensorFlow examples, which work fine on the CPU but generate the same error when I try to run them on the GPU. One small example is this:
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
The error is always the same, CUDA_ERROR_OUT_OF_MEMORY:
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcublas.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcudnn.so.6.5 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcufft.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcurand.so.7.0 locally
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 24
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:0a:00.0
Total memory: 11.25GiB
Free memory: 105.73MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:0b:00.0
Total memory: 11.25GiB
Free memory: 133.48MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0: Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 1: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:0b:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 105.48MiB bytes.
E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 105.48M (110608384 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Check failed: gpu_mem != nullptr Could not allocate GPU device memory for device 0. Tried to allocate 105.48MiB
Aborted (core dumped)
I guess that the problem has to do with my configuration rather than the memory usage of this tiny example. Does anyone have any idea?
Edit:
I've found out that the problem may be as simple as someone else running a job on the same GPU, which would explain the little amount of free memory. In that case: sorry for taking up your time...

There appear to be two issues here:
1. By default, TensorFlow allocates a large fraction (95%) of the available GPU memory (on each GPU device) when you create a tf.Session. It uses a heuristic that reserves 200MB of GPU memory for "system" uses, but doesn't set this aside if the amount of free memory is smaller than that.
2. It looks like you have very little free GPU memory on either of your GPU devices (105.73MiB and 133.48MiB). This means that TensorFlow will attempt to allocate memory that should probably be reserved for the system, and hence the allocation fails.
Is it possible that you have another TensorFlow process (or some other GPU-hungry code) running while you attempt to run this program? For example, a Python interpreter with an open session—even if it is not using the GPU—will attempt to allocate almost the entire GPU memory.
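One way to check whether something else is already occupying the GPUs is to ask nvidia-smi for the free and total memory per device before starting TensorFlow. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the function names parse_gpu_memory and query_gpu_memory are hypothetical helpers, not part of any library.

```python
import subprocess

# nvidia-smi query; with "nounits" the memory values are reported in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.free,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, free, total' CSV lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, free, total = (field.strip() for field in line.split(","))
        rows.append(
            {"index": int(index), "free_mib": int(free), "total_mib": int(total)}
        )
    return rows


def query_gpu_memory():
    """Run nvidia-smi (assumed to be installed) and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling query_gpu_memory() on the affected machine and seeing only about a hundred MiB free out of roughly 11.5 GiB would point to another process holding the memory rather than to a configuration problem.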
Currently, the only way to restrict the amount of GPU memory that TensorFlow uses is the following configuration option (from this question):
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
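The 0.333 in the snippet above is simply the desired budget divided by the total memory of the device. A small sketch of that arithmetic, where memory_fraction is a hypothetical helper name introduced only for illustration:

```python
def memory_fraction(budget_gib, total_gib):
    """Return the fraction of GPU memory that corresponds to a budget in GiB."""
    if total_gib <= 0:
        raise ValueError("total_gib must be positive")
    if not 0 < budget_gib <= total_gib:
        raise ValueError("budget_gib must be in (0, total_gib]")
    return budget_gib / total_gib


# About 4 GiB out of 12 GiB, as in the comment above.
fraction = memory_fraction(4, 12)
print(round(fraction, 3))
```

The resulting value would be passed as per_process_gpu_memory_fraction. Note that the fraction applies to each visible GPU, so devices of different sizes get different absolute budgets.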

This can happen because your TensorFlow session cannot get a sufficient amount of GPU memory. Perhaps other processes have left little free memory, or another TensorFlow session is running on your system. In that case, you have to configure the amount of memory the TensorFlow session will use.
If you are using TensorFlow 1.x:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
TensorFlow 2.x has undergone major changes from 1.x. If you want to use TensorFlow 1.x methods and functions, TensorFlow 2.x keeps a compatibility module for them, so TensorFlow 2.x users can use this piece of code:
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
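As an alternative to the compatibility module, TensorFlow 2.x can also cap GPU memory through its own tf.config API. The sketch below assumes TensorFlow 2.4 or newer and must run before any GPU has been initialized; limit_gpu_memory is a hypothetical wrapper name, and the limit is given in megabytes rather than as a fraction.

```python
def limit_gpu_memory(memory_limit_mb):
    """Cap every visible GPU at memory_limit_mb; return the number of GPUs found."""
    if memory_limit_mb <= 0:
        raise ValueError("memory_limit_mb must be positive")

    # Imported here so the argument check above works without TensorFlow installed.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # One logical device per physical GPU, with a hard memory limit.
        tf.config.set_logical_device_configuration(
            gpu,
            [tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit_mb)],
        )
    return len(gpus)
```

For example, limit_gpu_memory(4096) at the top of a script would restrict the process to roughly 4 GB per GPU. TensorFlow raises a RuntimeError if the devices were already initialized when this is called.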

Related

GPU usage shows zero when using CUDA with PyTorch on Windows

I have a PyTorch script.
import torch
torch.cuda.is_available()
# True
device=torch.device('cuda:0')
# I moved my tensors to device
But Windows Task Manager shows zero GPU (NVIDIA GTX 1050 Ti) usage while the PyTorch script is running.
The speed of my script is fine, and if I change torch.device to the CPU instead of the GPU it becomes slower, so CUDA (the GPU) is working. Why doesn't Windows Task Manager show the GPU usage?
Sample of my code:
import torch
from PIL import Image
from torchvision import transforms
device=torch.device("cuda:0")
model=torch.load('mymodel.pth', map_location=torch.device(device))
image=Image.open('picture.png').convert('RGB')
transform=transforms.Compose([
transforms.Resize(224),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
input=transform(image)
input=torch.unsqueeze(input, 0)
input=input.to(device)
output=model(input)
The overall utilization figure in Windows Task Manager does not seem to include CUDA usage. Make sure you select the Cuda option in the graphs.
For details see: https://medium.com/@michaelceber/gpu-monitoring-on-windows-10-for-machine-learning-cuda-41088de86d65
Just calling torch.device('cuda:0') doesn't actually use the GPU. It's just an identifier for a device.
Instead, following the documentation, you should move your tensors and models to the GPU.
torch.randn((2,3), device=torch.device('cuda:0'))
# Or
tensor = torch.randn((2,3))
cuda0 = torch.device('cuda:0')
tensor = tensor.to(cuda0)  # .to() returns a new tensor; it does not move the original in place
Please install GPU-Z and then you will be able to see the correct GPU load in Windows.

Julia 1.1 with JLD HDF5 package and memory release in Windows

I'm using Julia 1.1 with JLD and HDF5 to save a file to disk, and I ran into a couple of questions about memory usage.
Issue 1:
First, I defined a 4 GB matrix A.
A = zeros(ComplexF64,(243,243,4000));
Then I typed the following command while watching Windows Task Manager:
A=nothing
It took several minutes for Julia to release that memory. Most of the time (according to Task Manager), Julia doesn't release the memory at all, even though varinfo() immediately reports that A occupies 0 bytes.
varinfo()
name                    size summary
–––––––––––––––– ––––––––––– –––––––
A                    0 bytes Nothing
Base                         Module
Core                         Module
InteractiveUtils 162.930 KiB Module
Main                         Module
ans                  0 bytes Nothing
Issue 2:
Further, when I used JLD and HDF5 to save the file to disk, Task Manager showed that the save("test.jld", "A", A) command used an extra 4 GB of memory.
using JLD,HDF5
A = zeros(ComplexF64,(243,243,4000));
save("test.jld", "A", A)
Further, after I typed
A=nothing
Julia didn't release the 8 GB of memory.
Finding 3:
An interesting thing I found was that, if I retype the command
A = zeros(ComplexF64,(243,243,4000));
Task Manager showed that the cached memory was released, and the total memory usage was again only 4 GB.
Question 1:
What's going on with memory management in Julia? Is this just a misreport by Windows, or is it caused by some command in Julia? How can I check Julia's current memory usage?
Question 2:
How can I tell Julia to release the memory immediately?
Question 3:
Is there a way to tell the JLD package not to use that extra 4 GB of memory?
(Better, could someone tell me how to create A directly on disk without creating it in memory at all? I know there's memory-mapped I/O in the JLD package. I have tried it, but it seemed to require me to create the matrix A in memory and save it to disk first, before I could load A again as a memory-mapped array.)
This is a long question, so thanks ahead!
Julia uses a garbage collector to deallocate memory. Usually a garbage collector does not run after every line of code, but only when needed.
Try to force garbage collection by running the command:
GC.gc()
This releases memory space for unreferenced Julia objects. In this way you can check whether the memory actually has been released.
Side note: JLD used to be somewhat unreliable (I do not know the current status). Hence your first consideration for non-cross-platform object persistence should always be the serialize function from the built-in Serialization package. See the documentation at https://docs.julialang.org/en/v1/stdlib/Serialization/index.html#Serialization.serialize

Why are OpenGL and CUDA contexts memory greedy?

I develop software which usually includes both OpenGL and the Nvidia CUDA SDK. Recently, I also started looking for ways to reduce the run-time memory footprint. I noticed the following (Debug and Release builds differ only by 4-7 MB):
Application startup: less than 1 MB total
OpenGL 4.5 context creation (+ GLEW loader init): 45 MB total
CUDA 8.0 context (Driver API) creation: 114 MB total
If I create the OpenGL context in "headless" mode, the GL context uses 3 MB less, which probably goes to default framebuffer allocation. That makes sense as the window size is 640x360.
So after the OpenGL and CUDA contexts are up, the process already consumes 114 MB.
Now, I don't have deep knowledge of the OS-specific work that occurs under the hood during GL and CUDA context creation, but 45 MB for GL and 68 for CUDA seems like a whole lot to me. I know that usually several megabytes go to system framebuffers and function pointers (and probably the bulk of the allocations happens on the driver side). But hitting over 100 MB with just "empty" contexts looks like too much.
I would like to know:
Why does GL/CUDA context creation consume such a considerable amount of memory?
Are there ways to optimize that?
The system setup under test:
Windows 10 64-bit. NVIDIA GTX 960 GPU (Driver Version: 388.31). 8 GB RAM. Visual Studio 2015, 64-bit C++ console project.
I measure memory consumption using Visual Studio built-in Diagnostic Tools -> Process Memory section.
UPDATE
I tried Process Explorer, as suggested by datenwolf. Here is the screenshot of what I got, (my process at the bottom marked with yellow):
I would appreciate some explanation of that info. I was always looking at "Private Bytes" in the "VS Diagnostic Tools" window. But here I also see "Working Set", "WS Private" etc. Which one correctly shows how much memory my process currently uses? 281,320K looks like way too much because, as I said above, the process does nothing at startup except create CUDA and OpenGL contexts.
Partial answer: This is an OS-specific issue; on Linux, CUDA takes 9.3 MB.
I'm using CUDA (not OpenGL) on GNU/Linux:
CUDA version: 10.2.89
OS distribution: Devuan GNU/Linux Beowulf (~= Debian Buster without systemd)
Kernel: Linux 5.2.0
Processor: Intel x86_64
To check how much memory gets used by CUDA when creating a context, I ran the following C program (which also checks what happens after context destruction):
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>  /* malloc_stats() */
#include <cuda.h>

/* Print the allocator statistics under a heading. */
static void print_allocation_stats(const char* s)
{
    printf("%s:\n", s);
    printf("--------------------------------------------------\n");
    malloc_stats();
    printf("--------------------------------------------------\n\n");
}

int main()
{
    print_allocation_stats("Initially");
    int status = cuInit(0);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After CUDA driver initialization");
    int device_id = 0;
    unsigned flags = 0;
    CUcontext context_id;
    status = cuCtxCreate(&context_id, flags, device_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context creation");
    status = cuCtxDestroy(context_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context destruction");
    return EXIT_SUCCESS;
}
(Note that malloc_stats() is a glibc-specific function, not part of the C standard library.)
Summarizing the results and snipping irrelevant parts:
Point in program                  Total bytes     In-use  Max MMAP Regions  Max MMAP bytes
Initially                              135168       1632                 0               0
After CUDA driver initialization       552960     439120                 2          307200
After context creation                9314304    6858208                 8         6643712
After context destruction             7016448     580688                 8         6643712
So CUDA starts with 0.5 MB and after allocating a context takes up 9.3 MB (going back down to 7.0 MB on destroying the context). 9 MB is still a lot of memory for not having done anything; but - maybe some of it is all-zeros, or uninitialized, or copy-on-write, in which case it doesn't really take up that much memory.
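The megabyte figures quoted above can be reproduced from the "Total bytes" column. The following sketch only redoes that arithmetic, assuming decimal megabytes (1 MB = 10^6 bytes); the dictionary and helper names are introduced here for illustration.

```python
# "Total bytes" values copied from the malloc_stats() summary above.
total_bytes = {
    "Initially": 135168,
    "After CUDA driver initialization": 552960,
    "After context creation": 9314304,
    "After context destruction": 7016448,
}


def to_mb(num_bytes):
    """Convert bytes to decimal megabytes (1 MB = 10**6 bytes)."""
    return num_bytes / 1e6


for point, size in total_bytes.items():
    print(f"{point}: {to_mb(size):.2f} MB")

# Extra memory attributable to creating the context itself.
context_cost = (
    total_bytes["After context creation"]
    - total_bytes["After CUDA driver initialization"]
)
print(f"Context creation added {to_mb(context_cost):.2f} MB")
```

These numbers cover only what glibc's allocator reports for the process, so they are a lower bound on what the driver may map by other means.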
It's possible that memory use improved dramatically over the two years between the driver release with CUDA 8 and with CUDA 10, but I doubt it. So - it looks like your problem is Windows specific.
Also, I should mention I did not create an OpenGL context, which is another part of OP's question, so I haven't estimated how much memory that takes. OP brings up the question of whether the sum is greater than its parts, i.e. whether a CUDA context would take more memory if an OpenGL context existed as well; I believe this should not be the case, but readers are welcome to try and report...

Tensorflow, Keras and GPUS: logs show Resource Exhausted Error before simply loading up model weights

I am new to Ubuntu and I am setting up a new machine for deep learning using Keras and Tensorflow. I am fine tuning VGG16 on a set of pretty complex medical images. My machine specifications are:-
i7-6900K CPU @ 3.20GHz × 16
GeForce GTX 1080 Ti x 4
62.8 GiB of RAM
My previous machine was an iMac with no GPU but an i7 quad core processor and 32GB of RAM. The iMac ran the following model although it took 32 hours to complete it.
Here is the code:-
img_width, img_height = 512, 512
top_model_weights_path = '50435_train_uip_possible_inconsistent.h5'
train_dir = '../../MasterHRCT/50435/Three-Classes/train'
validation_dir = '../../MasterHRCT/50435/Three-Classes/validation'
nb_train_samples = 50435
nb_validation_samples = 12600
epochs = 200
batch_size = 16
datagen = ImageDataGenerator(rescale=1. / 255)
model = applications.VGG16(include_top=False, weights='imagenet')
Then:-
generator_train = datagen.flow_from_directory(
train_dir,
target_size=(img_width, img_height),
shuffle=False,
class_mode=None,
batch_size=batch_size
)
bottleneck_features_train = model.predict_generator(
generator=generator_train,
steps=nb_train_samples // batch_size,
verbose=1
)
np.save(file="50435_train_uip_possible_inconsistent.npy", arr=bottleneck_features_train)
print("Completed train data")
generator_validation = datagen.flow_from_directory(
validation_dir,
target_size=(img_width, img_height),
shuffle=False,
class_mode=None,
batch_size=batch_size
)
bottleneck_features_validation = model.predict_generator(
generator=generator_validation,
steps=nb_validation_samples // batch_size,
verbose=1
)
np.save(file="12600_validate_uip_possible_inconsistent.npy", arr=bottleneck_features_validation)
print("Completed validation data")
Yesterday, I ran this code and it was super fast (nvidia-smi suggested that only one GPU was being used, which I believe is expected for TF). The CPU hit 56% of maximum. Then it crashed with a CUDA_OUT_OF_MEMORY error. So I lowered the batch size to 4. Again, it started really fast, but then the CPU jumped to 100% and my system froze. I had to hard reboot.
I have tried again today, and the first time I get this error when simply trying to load the ImageNet Weights...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,3,512,512]
[[Node: block4_conv2_2/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, dtype=DT_FLOAT, seed=87654321, seed2=5932420, _device="/job:localhost/replica:0/task:0/gpu:0"](block4_conv2_2/random_uniform/shape)]]
On the command line it says:-
2017-08-08 06:13:57.937723: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 71.99MiB
2017-08-08 06:13:57.937739: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 80150528
InUse: 75491072
MaxInUse: 80069120
NumAllocs: 177
MaxAllocSize: 11985920
Now clearly this is a memory issue, but why would it fail to even load the weights? My Mac can run this entire code, albeit intractably slowly. I should note that this morning I did get this code running once, but this time it was ridiculously slow, slower than my Mac. My ignorant view is that something is chewing up memory, but I can't debug this. Being new to Ubuntu, I am uncertain where to begin. Having seen the code run super fast yesterday (and then crash toward the end), I wonder whether the system has 'reset' or disabled something.
Help!
EDIT:
I cleared all the variables in jupyter notebook, dropped the batch size to 1 and reloaded and I managed to load the weights, but on running the first generator I get:
ResourceExhaustedError: OOM when allocating tensor with shape[1,512,512,64]
[[Node: block1_conv1/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_input_1_0_0/_105, block1_conv1/kernel/read)]]
I am not clear why I can successfully run this on my Mac but not on a machine with more RAM, a faster CPU and 4 GPUs...
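The allocator statistics in the log above can be read with a little arithmetic. The sketch below assumes 4-byte float32 elements (the ops are DT_FLOAT) and converts the logged Limit and InUse values and the two failing tensor shapes into MiB; tensor_bytes is a hypothetical helper name.

```python
MIB = 1024 ** 2

# Values copied from the bfc_allocator statistics in the log.
limit_bytes = 80150528
in_use_bytes = 75491072


def tensor_bytes(shape, bytes_per_element=4):
    """Size of a dense tensor, assuming float32 (4 bytes per element)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


kernel = tensor_bytes([3, 3, 512, 512])       # weights of one 3x3 conv layer
activation = tensor_bytes([1, 512, 512, 64])  # block1_conv1 output, batch size 1

print(f"Limit:      {limit_bytes / MIB:.2f} MiB")
print(f"InUse:      {in_use_bytes / MIB:.2f} MiB")
print(f"Kernel:     {kernel / MIB:.2f} MiB")
print(f"Activation: {activation / MIB:.2f} MiB")
print("Kernel fits:", in_use_bytes + kernel <= limit_bytes)
```

The limit works out to about 76 MiB, which is far below the roughly 11 GB on a GTX 1080 Ti. That would be consistent with most of the GPU memory being held by another process (for example an earlier notebook kernel) when this session started, although the question itself does not confirm this.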

Unpredictable CUDNN_STATUS_NOT_INITIALIZED on Windows

I am running Keras neural network training and prediction on a GTX 1070 on Windows 10. Most of the time it works, but from time to time it complains:
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_dnn.cc:359] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_dnn.cc:366] error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_dnn.cc:326] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\kernels\conv_ops.cc:659] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
It cannot be explained either by the literal meaning of the error or by an OOM error.
How can I fix this?
Try limiting your GPU memory usage by setting the GPU option per_process_gpu_memory_fraction.
Fiddle around with it to see what works and what doesn't.
I recommend using .7 as a starting baseline.
I sometimes ran into this problem on Windows 10 with Keras.
Rebooting solves the problem for a short time, but then it happens again.
I referred to https://github.com/fchollet/keras/issues/1538:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))
These settings fixed the halts for me.
I found a solution to this problem.
I had the same problem on Windows 10 with an Nvidia GeForce 920M.
Look for the correct version of the cuDNN library. If the version is not compatible with the CUDA version, no error is raised during TensorFlow installation, but it will interfere with memory allocation on the GPU.
Do check your CUDA and cuDNN versions. Also follow the instructions about session creation mentioned above.
The issue is finally resolved for me, after many hours of struggling with it.
I recommend following all the installation steps properly, as described at these links:
TensorFlow: https://www.tensorflow.org/install/install_windows
cuDNN: https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#install-windows
For me this wasn't enough. I tried updating my GeForce Game Ready Driver from the GeForce Experience window, and after a restart it started working.
The driver can also be downloaded from https://www.geforce.com/drivers
Similar to what other people are saying, enabling memory growth for your GPUs can resolve this issue.
The following works for me by adding to the beginning of the training script:
# Using TensorFlow 2.4.x
import tensorflow as tf
try:
    tf_gpus = tf.config.list_physical_devices('GPU')
    for gpu in tf_gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
except RuntimeError as e:
    # Memory growth must be set before the GPUs have been initialized.
    print(e)
The TF docs helped me a lot: Allowing GPU memory growth
The first is the allow_growth option, which attempts to allocate only as much GPU memory based on runtime allocations: it starts out allocating very little memory, and as Sessions get run and more GPU memory is needed, we extend the GPU memory region needed by the TensorFlow process. Note that we do not release memory, since that can lead to even worse memory fragmentation. To turn this option on, set the option in the ConfigProto by:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
or
with tf.Session(graph=graph_node, config=config) as sess:
    ...
The second method is the per_process_gpu_memory_fraction option, which determines the fraction of the overall amount of memory that each visible GPU should be allocated. For example, you can tell TensorFlow to only allocate 40% of the total memory of each GPU by:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
