GPU usage shows zero when using CUDA with PyTorch on Windows

I have a PyTorch script.
import torch
torch.cuda.is_available()
# True
device=torch.device('cuda:0')
# I moved my tensors to device
But Windows Task Manager shows zero GPU (NVIDIA GTX 1050 Ti) usage while the PyTorch script is running.
The speed of my script is fine, and if I switch torch.device to the CPU instead of the GPU it becomes slower, so CUDA (the GPU) must be working. Why doesn't Windows Task Manager show any GPU usage?
Sample of my code:
import torch
from PIL import Image
from torchvision import transforms

device = torch.device("cuda:0")
model = torch.load('mymodel.pth', map_location=device)

image = Image.open('picture.png').convert('RGB')
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
input = transform(image)
input = torch.unsqueeze(input, 0)
input = input.to(device)
output = model(input)

Windows Task Manager's overall utilization graph does not seem to include CUDA usage. Make sure you select the "Cuda" option in one of the per-engine GPU graphs.
For details see: https://medium.com/#michaelceber/gpu-monitoring-on-windows-10-for-machine-learning-cuda-41088de86d65
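If you want to confirm from inside Python that the GPU is really being used, rather than relying on Task Manager, PyTorch can report which device a tensor lives on and how much device memory it has allocated. A minimal, self-contained sketch (the dummy tensor just stands in for your own data):
import torch

device = torch.device('cuda:0')
x = torch.randn(1, 3, 224, 224, device=device)  # dummy tensor allocated on the GPU

print(torch.cuda.get_device_name(0))    # e.g. the GTX 1050 Ti
print(x.device)                         # cuda:0
print(torch.cuda.memory_allocated(0))   # non-zero once tensors live on the GPU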

Just calling torch.device('cuda:0') doesn't actually use the GPU. It's just an identifier for a device.
Instead, following the documentation, you should move your tensors and models to the GPU.
torch.randn((2, 3), device=torch.device('cuda:0'))
# Or
tensor = torch.randn((2, 3))
cuda0 = torch.device('cuda:0')
tensor = tensor.to(cuda0)  # Tensor.to() returns a copy; reassign it
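The same applies to the model: its parameters and the input tensor must end up on the same CUDA device before the forward pass. A minimal sketch with a placeholder model (note that nn.Module.to() moves the module in place, while Tensor.to() returns a new tensor that must be reassigned):
import torch
import torch.nn as nn

device = torch.device('cuda:0')

model = nn.Linear(10, 2)   # placeholder model
model.to(device)           # modules are moved in place

x = torch.randn(4, 10)
x = x.to(device)           # tensors must be reassigned

output = model(x)
print(output.device)       # cuda:0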

Please install GPU-Z and then you will be able to see the correct GPU load in Windows.

Related

Unpredictable CUDNN_STATUS_NOT_INITIALIZED on Windows

I am running Keras neural-network training and prediction on a GTX 1070 on Windows 10. Most of the time it works, but from time to time it fails with
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_dnn.cc:359] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_dnn.cc:366] error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_dnn.cc:326] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\kernels\conv_ops.cc:659] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
The failure can be explained neither by the literal meaning of the error nor by an out-of-memory condition.
How can I fix this?
Try limiting your GPU memory usage with the per_process_gpu_memory_fraction option.
Fiddle around with it to see what works and what doesn't.
I recommend 0.7 as a starting baseline.
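A minimal sketch of that option with TF 1.x and the standalone Keras backend, using the 0.7 baseline suggested above (adjust the fraction for your GPU):
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7  # starting baseline; tune as needed
set_session(tf.Session(config=config))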
I hit this problem occasionally with Keras on Windows 10.
A reboot solved it for a short time, but it kept coming back.
Following https://github.com/fchollet/keras/issues/1538:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))
This setting solved the hang problem for me.
Got the solution for this problem.
I had the same problem on Windows 10 with an NVIDIA GeForce 920M.
Check that you have the correct version of the cuDNN library: if it is not compatible with your CUDA version, TensorFlow will not complain during installation but will fail later when allocating memory on the GPU.
Do check your CUDA and cuDNN versions, and also follow the instructions about session creation mentioned above.
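If you are on TensorFlow 2.x, a quick way to see which CUDA and cuDNN versions your TensorFlow build was compiled against is tf.sysconfig.get_build_info() (available in recent 2.x releases); a minimal sketch:
import tensorflow as tf

print(tf.__version__)
build = tf.sysconfig.get_build_info()  # build-time configuration of this TensorFlow wheel
print(build.get('cuda_version'), build.get('cudnn_version'))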
The issue is finally resolved for me; I spent many hours struggling with it.
I recommend following all the installation steps properly, as described in these links:
TensorFlow -
https://www.tensorflow.org/install/install_windows
and for cuDNN -
https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#install-windows
For me this wasn't enough. I also updated my GeForce Game Ready Driver from the GeForce Experience window, and after a restart it started working.
The driver can also be downloaded from https://www.geforce.com/drivers
Similar to what other people are saying, enabling memory growth for your GPUs can resolve this issue.
The following works for me when added to the beginning of the training script:
# Using Tensorflow-2.4.x
import tensorflow as tf

try:
    tf_gpus = tf.config.list_physical_devices('GPU')
    for gpu in tf_gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
except:
    pass
The TensorFlow documentation on "Allowing GPU memory growth" helped me a lot:
The first is the allow_growth option, which attempts to allocate only as much GPU memory based on runtime allocations: it starts out allocating very little memory, and as Sessions get run and more GPU memory is needed, we extend the GPU memory region needed by the TensorFlow process. Note that we do not release memory, since that can lead to even worse memory fragmentation. To turn this option on, set the option in the ConfigProto by:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
or
with tf.Session(graph=graph_node, config=config) as sess:
    ...
The second method is the per_process_gpu_memory_fraction option, which determines the fraction of the overall amount of memory that each visible GPU should be allocated. For example, you can tell TensorFlow to only allocate 40% of the total memory of each GPU by:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
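For TensorFlow 2.x code, where Session and ConfigProto no longer exist, the rough equivalents of these two options are tf.config.experimental.set_memory_growth and a logical device with a fixed memory_limit. A hedged sketch, assuming TF 2.4+ and that it runs before any GPU work starts (the two options cannot be combined on the same GPU, so the second is shown commented out):
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Equivalent of allow_growth: allocate GPU memory on demand.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Equivalent of per_process_gpu_memory_fraction: cap this process at
    # roughly 4 GB on the first GPU (memory_limit is in MB).
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])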

How to pick the right Metal Device for GPU processing on a Mac Pro

When creating a new CIContext with a Metal device, one has to specify which device (a GPU) to use:
let context = CIContext(
    mtlDevice: device
)
On my MacBook Pro, for development purposes, I always pick the device associated with the screen via the MTLCreateSystemDefaultDevice() method:
guard
    let device: MTLDevice = MTLCreateSystemDefaultDevice()
else {
    exit(EXIT_FAILURE)
}
However, on a Mac Pro that will be used in production in headless mode, there are two GPU cards I can target. To get all available devices one can use the MTLCopyAllDevices() method, which gives the following output on my Mac Pro:
[
<MTLDebugDevice: 0x103305450> -> <BronzeMtlDevice: 0x10480a200>
name = AMD Radeon HD - FirePro D700
<MTLDebugDevice: 0x103307730> -> <BronzeMtlDevice: 0x104814800>
name = AMD Radeon HD - FirePro D700
]
This Mac Pro will be used heavily, with hundreds of small tasks per second, and every time a new task comes in I need to select a GPU device on which it will be processed.
Now the question is: is picking a random device from the above array a good idea?
let devices = MTLCopyAllDevices() // get all available devices
let rand = Int(arc4random_uniform(UInt32(devices.count))) // random index
let device = devices[rand] // randomly selected GPU to use
let context = CIContext(
    mtlDevice: device
)
Since there are two identical GPU devices in the Mac Pro, always targeting one of them would be a waste of resources. Logic tells me that with the above code both GPUs will be utilised equally, but maybe I'm wrong and macOS offers some kind of abstraction layer that will intelligently pick whichever GPU is less utilised at the time of execution?
Thank you in advance.
Why not just alternate between them? Even if you're committing command buffers from multiple threads, the work should be spread roughly evenly:
device = devices[taskIndex % devices.count]
Also, make sure to avoid creating CIContexts for every operation; those are expensive, so you should keep a list of contexts (one per device) instead.
Note that if you're doing any of your own Metal work (as opposed to just Core Image filtering), you'll need to have a command queue for each device, and any resources you want to use will need to be allocated by their respective device (resources can't be shared by MTLDevices).

How to request use of integrated GPU when using Metal API?

According to Apple documentation, when adding the value "YES" (or true) for key "NSSupportsAutomaticGraphicsSwitching" to the Info.plist file for an OSX app, the integrated GPU will be invoked on dual-GPU systems (as opposed to the discrete GPU). This is useful as the integrated GPU -- while less performant -- is adequate for my app's needs and consumes less energy.
Unfortunately, building as per above and subsequently inspecting the Activity Monitor (Energy tab: "Requires High Perf GPU" column) reveals that my Metal API-enabled app still uses the discrete GPU, despite requesting the integrated GPU.
Is there any way I can give a hint to the Metal system itself to use the integrated GPU?
The problem was that the Metal API defaults to using the discrete GPU. Using the following code, along with the correct Info.plist configuration detailed above, results in the integrated GPU being used:
NSArray<id<MTLDevice>> *devices = MTLCopyAllDevices();
gpu_ = nil;

// Low power device is sufficient - try to use it!
for (id<MTLDevice> device in devices) {
    if (device.isLowPower) {
        gpu_ = device;
        break;
    }
}

// Below: probably not necessary since there is always an
// integrated GPU, but it doesn't hurt.
if (gpu_ == nil)
    gpu_ = MTLCreateSystemDefaultDevice();
If you're using an MTKView, remember to pass gpu_ to its initWithFrame:device: method.

Error using Tensorflow with GPU

I've tried a bunch of different TensorFlow examples, which work fine on the CPU but generate the same error when I try to run them on the GPU. One small example is this:
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
The error is always the same, CUDA_ERROR_OUT_OF_MEMORY:
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcublas.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcudnn.so.6.5 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcufft.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcurand.so.7.0 locally
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 24
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:0a:00.0
Total memory: 11.25GiB
Free memory: 105.73MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:0b:00.0
Total memory: 11.25GiB
Free memory: 133.48MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0: Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 1: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:0b:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 105.48MiB bytes.
E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 105.48M (110608384 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Check failed: gpu_mem != nullptr Could not allocate GPU device memory for device 0. Tried to allocate 105.48MiB
Aborted (core dumped)
I guess that the problem has to do with my configuration rather than the memory usage of this tiny example. Does anyone have any idea?
Edit:
I've found out that the problem may be as simple as someone else running a job on the same GPU, which would explain the small amount of free memory. In that case: sorry for taking up your time...
There appear to be two issues here:
By default, TensorFlow allocates a large fraction (95%) of the available GPU memory (on each GPU device) when you create a tf.Session. It uses a heuristic that reserves 200MB of GPU memory for "system" uses, but doesn't set this aside if the amount of free memory is smaller than that.
It looks like you have very little free GPU memory on either of your GPU devices (105.73MiB and 133.48MiB). This means that TensorFlow will attempt to allocate memory that should probably be reserved for the system, and hence the allocation fails.
Is it possible that you have another TensorFlow process (or some other GPU-hungry code) running while you attempt to run this program? For example, a Python interpreter with an open session—even if it is not using the GPU—will attempt to allocate almost the entire GPU memory.
Currently, the only way to restrict the amount of GPU memory that TensorFlow uses is the following configuration option (from this question):
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
This can happen when your TensorFlow session is unable to get a sufficient amount of GPU memory, for example because there is little free memory left for TensorFlow or because another TensorFlow session is already running on your system. In that case you have to configure how much memory the TensorFlow session will use.
If you are using TensorFlow 1.x:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
TensorFlow 2.x has undergone major changes from 1.x. If you want to use the TensorFlow 1.x methods/functions, there is a compatibility module in TensorFlow 2.x, so TensorFlow 2.x users can use this piece of code:
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))

Querying the set of active CUDA kernels on a GPU

Is there a way to ask the GPU (or driver) to list the set of active (or dispatched or issued) CUDA kernels on a GPU, without attaching cuda-gdb to the owning CPU process and suspending it?
I'm imagining something like pstack, where the interface might look like:
> list-cuda-kernels $pid
gpu 0: kernel_foo
gpu 0: kernel_bar
gpu 1: kernel_baz
There is no tool or API to fetch the list of currently running kernels other than cuda-gdb (or any other CUDA debugger, for that matter).
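As a partial workaround, you can at least list the processes that currently hold a compute (CUDA) context on each GPU, though not the individual kernels, for example via NVML. A hedged sketch, assuming the pynvml (nvidia-ml-py) Python bindings are installed:
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    # Processes with an active compute (CUDA) context on this GPU
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        print("gpu %d: pid %d, %s bytes of device memory" % (i, proc.pid, proc.usedGpuMemory))
pynvml.nvmlShutdown()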
