With
import pyopencl as cl
platform = cl.get_platforms()[1]
device = platform.get_devices()
I have two devices:
<pyopencl.Device 'Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz' on 'Intel(R) OpenCL' at 0x9f7a0e0>,
<pyopencl.Device 'Intel(R) HD Graphics 4000' on 'Intel(R) OpenCL' at 0x9efdab0>
with
device = platform.get_devices()[0]
I chose the CPU as my device. However, when I use the CPU to create a context (ctx = cl.Context([device])), I get this error:
RuntimeError: clCreateContext failed: DEVICE_NOT_AVAILABLE
But creating a context works for both of my GPUs (the Intel HD Graphics here, and a GTX 970 on platform[0]).
Any idea why this error happens?
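For reference, here is a small diagnostic snippet to list every platform and device together with its availability flag (a minimal sketch; it uses only standard pyopencl device-info attributes):

import pyopencl as cl

# Walk every platform and report each device's CL_DEVICE_AVAILABLE flag;
# an unavailable device cannot be used to create a context.
for p in cl.get_platforms():
    for d in p.get_devices():
        print(p.name, '|', d.name, '| available:', d.available)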
Many thanks
Jiajun
I'm trying to write a kernel module for an Intel FPGA design that supports PCIe SR-IOV and is placed in the x16 PCIe slot of an IBase M991 mainboard (Q170 PCH, VT-d activated in the BIOS, integrated-graphics-only mode enabled).
The CPU is an Intel Core i7-6700TE, which also supports virtualization.
Furthermore, I'm using a Yocto Morty distribution (Linux kernel 4.19) with the following Kconfig options enabled:
CONFIG_PCI_IOV=y
CONFIG_PCI_DEBUG=y
CONFIG_INTEL_IOMMU_SVM=y
CONFIG_PCI_REALLOC_ENABLE_AUTO=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_IRQ_REMAP=y
CONFIG_IOMMU_DEFAULT_PASSTHROUGH=y
CONFIG_DYNAMIC_DEBUG=y
With all of this in place I see my driver loading (the probe function gets called), but after calling pci_enable_sriov() with the number of VFs I want to activate, I get the kernel message:
not enough MMIO resources for SR-IOV
What am I doing wrong here? Is there an init function I need to call?
Many thanks for your help.
Edit: More information about the PCIe device:
1 PF, 8 VF
2 BARs (BAR0 and BAR2)
non-prefetchable, 32-bit BARs
each BAR is 4 KiB (12-bit) in size
I have a PyTorch script:
import torch
torch.cuda.is_available()
# True
device=torch.device('cuda:0')
# I moved my tensors to device
But Windows Task Manager shows zero GPU (NVIDIA GTX 1050 Ti) usage while the PyTorch script is running.
The speed of my script is fine, and if I change torch.device to the CPU instead of the GPU it becomes slower, so CUDA (the GPU) is clearly being used. Why doesn't Windows Task Manager show any GPU usage?
Sample of my code:
import torch
from PIL import Image
from torchvision import transforms

device = torch.device("cuda:0")
model = torch.load('mymodel.pth', map_location=device)  # load the saved model onto the GPU
image = Image.open('picture.png').convert('RGB')
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
input = transform(image)
input = torch.unsqueeze(input, 0)  # add a batch dimension
input = input.to(device)
output = model(input)
Windows Task Manager's overall utilization graph does not seem to include CUDA usage. Make sure you select the "Cuda" option in one of the GPU graphs.
For details see: https://medium.com/@michaelceber/gpu-monitoring-on-windows-10-for-machine-learning-cuda-41088de86d65
Just calling torch.device('cuda:0') doesn't actually use the GPU. It's just an identifier for a device.
Instead, following the documentation, you should move your tensors and models to the GPU.
torch.randn((2, 3), device=torch.device('cuda:0'))
# Or
tensor = torch.randn((2, 3))
cuda0 = torch.device('cuda:0')
tensor = tensor.to(cuda0)  # note: .to() returns a new tensor, so reassign it
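If you want a quick sanity check from inside Python rather than relying on Task Manager, the standard torch.cuda helpers report which device is in use and how much memory your tensors currently occupy (a minimal sketch, just for illustration):

import torch

if torch.cuda.is_available():
    # name of the active GPU and memory currently held by tensors on it
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.memory_allocated(0) / 1024 ** 2, "MiB allocated")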
Please install GPU-Z and then you will be able to see the correct GPU load in Windows.
I am new to Ubuntu and I am setting up a new machine for deep learning using Keras and TensorFlow. I am fine-tuning VGG16 on a set of pretty complex medical images. My machine's specifications are:-
i7-6900K CPU @ 3.20GHz × 16
GeForce GTX 1080 Ti x 4
62.8 GiB of RAM
My previous machine was an iMac with no GPU but an i7 quad core processor and 32GB of RAM. The iMac ran the following model although it took 32 hours to complete it.
Here is the code:-
from keras.preprocessing.image import ImageDataGenerator
from keras import applications
import numpy as np

img_width, img_height = 512, 512
top_model_weights_path = '50435_train_uip_possible_inconsistent.h5'
train_dir = '../../MasterHRCT/50435/Three-Classes/train'
validation_dir = '../../MasterHRCT/50435/Three-Classes/validation'
nb_train_samples = 50435
nb_validation_samples = 12600
epochs = 200
batch_size = 16
datagen = ImageDataGenerator(rescale=1. / 255)
model = applications.VGG16(include_top=False, weights='imagenet')
Then:-
generator_train = datagen.flow_from_directory(
train_dir,
target_size=(img_width, img_height),
shuffle=False,
class_mode=None,
batch_size=batch_size
)
bottleneck_features_train = model.predict_generator(
generator=generator_train,
steps=nb_train_samples // batch_size,
verbose=1
)
np.save(file="50435_train_uip_possible_inconsistent.npy", arr=bottleneck_features_train)
print("Completed train data")
generator_validation = datagen.flow_from_directory(
validation_dir,
target_size=(img_width, img_height),
shuffle=False,
class_mode=None,
batch_size=batch_size
)
bottleneck_features_validation = model.predict_generator(
generator=generator_validation,
steps=nb_validation_samples // batch_size,
verbose=1
)
np.save(file="12600_validate_uip_possible_inconsistent.npy", arr=bottleneck_features_validation)
print("Completed validation data")
Yesterday, I ran this code and it was super fast (nvidia-smi suggested that only one GPU was being used, which I believe is expected for TF). The CPU hit 56% of maximum. Then it crashed with a CUDA_OUT_OF_MEMORY error. So I lowered the batch size to 4. Again, it started really fast, but then the CPU jumped to 100% and my system froze. I had to hard reboot.
I have tried again today, and the first time I get this error when simply trying to load the ImageNet weights...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,3,512,512]
[[Node: block4_conv2_2/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, dtype=DT_FLOAT, seed=87654321, seed2=5932420, _device="/job:localhost/replica:0/task:0/gpu:0"](block4_conv2_2/random_uniform/shape)]]
On the command line it says:-
2017-08-08 06:13:57.937723: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 71.99MiB
2017-08-08 06:13:57.937739: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 80150528
InUse: 75491072
MaxInUse: 80069120
NumAllocs: 177
MaxAllocSize: 11985920
Now clearly this is a memory issue, but why would it fail to even load the weights? My Mac can run this entire code, albeit intractably slowly. I should note that this morning I did get the code running once, but this time it was ridiculously slow, slower than my Mac. My ignorant view is that something is chewing up memory, but I can't debug it and, being new to Ubuntu, I am uncertain where to begin. Having seen the code run super fast yesterday (and then crash toward the end), I wonder whether the system has 'reset' or disabled something.
Help!
EDIT:
I cleared all the variables in the Jupyter notebook, dropped the batch size to 1, and reloaded; I managed to load the weights, but on running the first generator I get:
ResourceExhaustedError: OOM when allocating tensor with shape[1,512,512,64]
[[Node: block1_conv1/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_input_1_0_0/_105, block1_conv1/kernel/read)]]
I am not clear why I can successfully run this on my Mac but not on a machine with more RAM, a faster CPU, and 4 GPUs...
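In case it is relevant, this is what I run to double-check which devices TensorFlow sees and how much memory it reports for each (a minimal sketch based on the TF 1.x device_lib helper):

from tensorflow.python.client import device_lib

# Lists CPU and GPU devices together with the memory_limit TensorFlow
# managed to reserve on each one.
for d in device_lib.list_local_devices():
    print(d.name, d.device_type, d.memory_limit)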
When creating a new CIContext backed by a Metal device, one has to provide which device (a GPU) to use:
let context = CIContext(
mtlDevice: device
)
On my MacBook Pro, for development purposes, I always pick the device associated with the screen using the MTLCreateSystemDefaultDevice() method:
guard
let device:MTLDevice = MTLCreateSystemDefaultDevice()
else {
exit(EXIT_FAILURE)
}
However, on a Mac Pro that will be used in production in headless mode, there are two GPU cards I can target. To get all available devices one can use the MTLCopyAllDevices() method, which gives the following output on my Mac Pro:
[
<MTLDebugDevice: 0x103305450> -> <BronzeMtlDevice: 0x10480a200>
name = AMD Radeon HD - FirePro D700
<MTLDebugDevice: 0x103307730> -> <BronzeMtlDevice: 0x104814800>
name = AMD Radeon HD - FirePro D700
]
This Mac Pro will be utilised heavily, with hundreds of small tasks per second, and every time a new task comes in I need to select a GPU device on which the task will be processed.
Now the question is: is picking a random device from the above array, as in the following code, a good idea?
let devices = MTLCopyAllDevices() // get all available devices
let rand = Int(arc4random_uniform(UInt32(devices.count))) // random index
let device = devices[rand] // randomly selected GPU to use
let context = CIContext(
mtlDevice: device
)
Since there are two identical GPU devices in the Mac Pro, always targeting one of them would be a waste of resources. Logic tells me that with the above code both GPUs will be utilised roughly equally, but maybe I'm wrong and macOS offers some kind of abstraction layer that intelligently picks the GPU that is less utilised at the time of execution?
Thank you in advance.
Why not just alternate between them? Even if you're committing command buffers from multiple threads, the work should be spread roughly evenly:
device = devices[taskIndex % devices.count]
Also, make sure to avoid creating CIContexts for every operation; those are expensive, so you should keep a list of contexts (one per device) instead.
Note that if you're doing any of your own Metal work (as opposed to just Core Image filtering), you'll need to have a command queue for each device, and any resources you want to use will need to be allocated by their respective device (resources can't be shared by MTLDevices).
I've tried a bunch of different TensorFlow examples, which work fine on the CPU but generate the same error when I try to run them on the GPU. One little example is this:
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
The error is always the same, CUDA_ERROR_OUT_OF_MEMORY:
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcublas.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcudnn.so.6.5 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcufft.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcurand.so.7.0 locally
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 24
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:0a:00.0
Total memory: 11.25GiB
Free memory: 105.73MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:0b:00.0
Total memory: 11.25GiB
Free memory: 133.48MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0: Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 1: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:0b:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 105.48MiB bytes.
E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 105.48M (110608384 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Check failed: gpu_mem != nullptr Could not allocate GPU device memory for device 0. Tried to allocate 105.48MiB
Aborted (core dumped)
I guess that the problem has to do with my configuration rather than the memory usage of this tiny example. Does anyone have any idea?
Edit:
I've found out that the problem may be as simple as someone else running a job on the same GPU, which would explain the small amount of free memory. In that case, sorry for taking up your time...
There appear to be two issues here:
By default, TensorFlow allocates a large fraction (95%) of the available GPU memory (on each GPU device) when you create a tf.Session. It uses a heuristic that reserves 200MB of GPU memory for "system" uses, but doesn't set this aside if the amount of free memory is smaller than that.
It looks like you have very little free GPU memory on either of your GPU devices (105.73MiB and 133.48MiB). This means that TensorFlow will attempt to allocate memory that should probably be reserved for the system, and hence the allocation fails.
Is it possible that you have another TensorFlow process (or some other GPU-hungry code) running while you attempt to run this program? For example, a Python interpreter with an open session—even if it is not using the GPU—will attempt to allocate almost the entire GPU memory.
Currently, the only way to restrict the amount of GPU memory that TensorFlow uses is the following configuration option (from this question):
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
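A related option, not part of the original suggestion but from the same ConfigProto API, is allow_growth, which makes TensorFlow start small and grow its GPU allocation on demand instead of reserving a fixed fraction up front:

# Sketch (TF 1.x): allocate GPU memory incrementally rather than up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)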
This can happen because your TensorFlow session is not able to get a sufficient amount of memory on the GPU. Maybe there is little free memory left because other processes are using the GPU, or another TensorFlow session is running on your system, so you have to configure the amount of memory your TensorFlow session will use.
If you are using TensorFlow 1.x:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
TensorFlow 2.x has undergone major changes from 1.x. If you want to use a TensorFlow 1.x method/function, there is a compatibility module kept in TensorFlow 2.x, so TensorFlow 2.x users can use this piece of code:
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
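For completeness, TensorFlow 2.x also offers a native way to do this without the compat module; a minimal sketch using the tf.config API (the exact submodule path has shifted slightly between 2.x releases):

import tensorflow as tf

# Enable on-demand GPU memory growth (TF 2.x); must run before the GPUs
# are initialised by any other TensorFlow call.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)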