TensorFlow - Low GPU usage on Titan X - performance

For a while, I have noticed that TensorFlow (v0.8) does not seem to fully use the computational power of my Titan X. For several CNNs that I have been running, the GPU usage does not seem to exceed ~30%, and typically the utilization is even lower, more like 15%. One particular example of a CNN that shows this behavior is the CNN from DeepMind's Atari paper with Q-learning (see link below for code).
When other people in our lab run CNNs written in Theano or Torch, the GPU usage is typically 80%+. This makes me wonder: why are the CNNs that I write in TensorFlow so 'slow', and what can I do to make more efficient use of the GPU? Generally, I am interested in ways to profile the GPU operations and discover where the bottlenecks are. Any recommendations on how to do this are very welcome, since this does not seem to be well supported in TensorFlow at the moment.
Things I did to find out more about the cause of this problem:
Analyzing TensorFlow's device placement, everything seems to be placed on /gpu:0, so that looks OK.
Using cProfile, I have optimized the batch generation and other preprocessing steps. The preprocessing is performed on a single thread, but the actual optimization steps performed by TensorFlow take much longer (see average runtimes below). One obvious idea for increasing the speed is to use TF's queue runners, but since batch preparation is already 20x faster than the optimization step, I wonder whether this will make a big difference.
Avg. Time Batch Preparation: 0.001 seconds
Avg. Time Train Operation: 0.021 seconds
Avg. Time Total per Batch: 0.022 seconds (45.18 batches/second)
Run on multiple machines to rule out hardware issues.
Upgraded to the latest versions of CuDNN v5 (RC), CUDA Toolkit 7.5 and reinstalled TensorFlow from sources about a week ago.
An example of the Q-learning CNN for which this 'problem' occurs can be found here: https://github.com/tomrunia/DeepReinforcementLearning-Atari/blob/master/qnetwork.py
Screenshot: nvidia-smi output showing the low GPU utilization.

With more recent versions of TensorFlow (I am using TensorFlow 1.4), we can obtain runtime statistics and visualize them in TensorBoard.
These statistics include compute time and memory usage for each node in the computation graph.
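A minimal sketch of how to collect these statistics with the TF 1.x graph API follows; the toy matmul graph and the './logs' directory are placeholders for your own model and summary path:

import tensorflow as tf

# Toy stand-in graph; replace with your own model and train op.
x = tf.random_normal([1000, 1000])
y = tf.matmul(x, x)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('./logs', sess.graph)
    # Trace a single step; tracing every step adds noticeable overhead.
    sess.run(y, options=run_options, run_metadata=run_metadata)
    writer.add_run_metadata(run_metadata, 'step_0')
    writer.close()

After pointing TensorBoard at the log directory, the traced step can be inspected in the Graph tab, which then shows per-node compute time and memory.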

Related

CUDA testing in Julia - very low GPU utilization

I have been trying to set up CUDA computing under Julia for my RTX 2070 GPU and, so far, I have not gotten any errors related to failed CUDA initialization when executing CUDA-parallelized code. However, the parallelized computations seem surprisingly slow, so I launched Pkg.test("CUDA") from Julia to get some more insight into why that could be. Here is a screenshot of some of the results:
Screenshot: Julia CUDA test results. The GPU allocation appears to be negligible compared to the CPU allocation.
This is also reflected in the CUDA vs. CPU usage: running nvidia-smi shows 0% volatile GPU-util, while the resource monitor showed the CPU consistently at 80% usage or more throughout the test.
Further, the CUDA utilization graph in the task manager merely shows spikes in CUDA utilization rather than continuous usage (screenshot: CUDA utilization in the task manager).
Any suggestions for why this could be the case? I have gone through the verification of proper CUDA package and driver installation several times now, and I am unsure what to do next.
As the comments note, the tests in CUDA.jl/test are designed to exercise the compilation pipeline, not to put the GPU under any significant load. Just to complete the picture, if you do want to try loading the GPU, you might modify an example from https://cuda.juliagpu.org/stable/tutorials/introduction/, for example along the lines of the following:
using CUDA

N = 2^20
x_d = CUDA.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CUDA.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0

# Launch many broadcast kernels in a row to keep the GPU busy for a while.
for i = 1:100000
    y_d .+= sqrt.(x_d)
end

How to pick/configure AWS GPU Instances to speed up TensorFlow.keras?

I have an LSTM tf.keras model with about 600 MB of training data. Each training epoch takes about 90 seconds. I have the latest version of TensorFlow, v2.2. It runs on an AWS g3.4xlarge instance, which has an NVIDIA Tesla M60 GPU with 8 GB of GPU memory.
I want to do hyperparameter tuning, so I need to speed up execution. I moved the model and data to an AWS p3.2xlarge instance, which has a V100 GPU with 16 GB of GPU memory, and found that the training time per epoch did not change at all.
So I switched to an even larger AWS instance, p3.8xlarge, which has 4 Tesla V100 GPUs and 64 GB of GPU memory in total. In the first run it only used 1 GPU and yielded the same execution time per epoch, about 90 seconds. I then followed the guide on the TensorFlow website on how to use all GPUs: https://www.tensorflow.org/guide/gpu
With all 4 GPUs running, the execution time per epoch went from 90 seconds to 112 seconds! I used nvtop and nvidia-smi to monitor the GPU utilization (screenshots in the original post).
What do I need to do to reduce the execution time?
The time per epoch will obviously stay the same as long as you don't change anything in the network. Just running it on a bigger GPU is not the answer here.
First of all, if you can change your network to reduce the number of parameters, that is the biggest win: shrinking the model is the most direct way to make it run faster.
But if that is not possible, here are two things you can do:
Use mixed precision (tf.keras.mixed_precision) to run your model faster.
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
It offers a significant computational speedup by performing most operations in half precision, while keeping a few numerically sensitive parts of the network (such as the variables and the final outputs) in single precision to retain accuracy.
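As a hedged sketch of how this fits into a Keras model (the layer sizes and input shape below are made up for illustration), the TensorFlow mixed-precision guide recommends keeping the final activations in float32:

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.mixed_precision import experimental as mixed_precision

# Set the policy before building the model so the layers pick it up.
mixed_precision.set_policy(mixed_precision.Policy('mixed_float16'))

model = tf.keras.Sequential([
    layers.LSTM(128, input_shape=(100, 32)),        # hypothetical sequence length and feature count
    layers.Dense(10),
    layers.Activation('softmax', dtype='float32'),  # keep outputs in float32 for numerical stability
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

Note that most of the speedup from mixed precision comes from Tensor Cores, which the V100 has but the older M60 does not.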
Use XLA.
import tensorflow as tf
tf.config.optimizer.set_jit(True)
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra. It can make your network faster without any changes to your source code.
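The snippet above turns on XLA auto-clustering globally. As an additional, hedged option (exposed as an experimental flag in TF 2.1+), XLA can also be requested for a specific function; the function below is only a toy example:

import tensorflow as tf

tf.config.optimizer.set_jit(True)  # global XLA auto-clustering, as above

@tf.function(experimental_compile=True)  # compile just this function with XLA
def scaled_sum(x, y):
    return tf.reduce_sum(x * 2.0 + y)

print(scaled_sum(tf.ones([1024]), tf.ones([1024])))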
Please try both of these. I have personally used mixed precision, and it does reduce training time noticeably.
Also, next time please don't jump straight to bigger instances, as that is a waste of your money. First try to reduce the number of parameters (i.e. the network size) or use these two tricks. I'll update this answer if I find any new tricks.

Maximize Tensorflow Performance

I'm using TensorFlow 1.2 for image segmentation on an AWS p2 instance (Tesla K80). Is there an easy way for me to find out whether I can improve the performance of my code?
Here is what I know:
I measured the execution time of the various parts of my program, and 99% of the time is spent calling Session.run:
sess.run([train_op, loss, labels_modified, output_modified],
         feed_dict=feed_dict)
where feed_dict is a mapping from placeholders to tensors.
The session.run method only takes 0.43 seconds to execute for the following parameters: batch_size=1, image_height=512, image_width=512, channels=3.
The network has 14 convolutional layers (no dense layers) with a total of 11 million trainable parameters.
Because I'm doing segmentation, I use a batch size of 1 and then compute the pixel-wise loss (512*512 cross-entropy losses).
I tried to compile TensorFlow from source and got zero performance improvement.
I read through the performance guide https://www.tensorflow.org/performance/performance_guide but I don't want to spend a lot of time trying all of these suggestions. It already took me 8 hours to compile TensorFlow and it gave me zero benefit!
How can I find out which parts of the session run take most of the time? I have a feeling that it might be the loss calculation.
And is there any clear study that shows how much speedup I can expect from the things mentioned in the performance guide?
You're performing a computationally intensive task that requires a lot of calculations and a lot of memory. Your model has a lot of parameters, and each one has to be computed in the forward pass, in the backward pass, and then updated.
The suggestions on the page you linked are sound, and if you have followed them all there is not much else you can do, except creating one or more additional instances and running the training in parallel. This can give you up to an Nx speed-up (where N is the number of instances that compute the gradients for your input batches), but it's expensive and not always applicable (moreover, it requires changing your code to follow a client-server architecture for the gradient computation and weight updates).
Based on your small piece of code, I see you're using a feed dictionary. Generally it's best to avoid feed dictionaries if queues can be used (see https://github.com/tensorflow/tensorflow/issues/2919); the TensorFlow documentation has a guide on reading data with queues. Switching to queues will definitely improve your performance.
Maybe you can run your code with tfprof to do some profiling and find out where the bottleneck is.
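If tfprof is awkward to set up on TF 1.2, a hedged alternative sketch is to dump a Chrome trace of a single step using the standard RunMetadata machinery (sess, train_op, loss and feed_dict below refer to the names in the question):

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Run one training step with full tracing enabled.
sess.run([train_op, loss], feed_dict=feed_dict,
         options=run_options, run_metadata=run_metadata)

# Write a trace that can be opened in chrome://tracing to see per-op timings,
# including whether the loss computation dominates the step.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())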
Just a guess, but the performance problem may be caused by feeding data. I don't know how you prepare your feed_dict, but if you have to read your data from disk to prepare the feed_dict for every sess.run, it will slow the program down, because reading data and training happen synchronously. You can try converting your data to TFRecords and making data loading and training asynchronous by using queues such as tf.FIFOQueue; a sketch follows.
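A hedged sketch of such a queue-based input pipeline for TF 1.x (the file name, feature keys, and image shape are made-up placeholders; tf.train.shuffle_batch is used here because it wraps the queue plumbing that a hand-rolled tf.FIFOQueue would need):

import tensorflow as tf

# Decode one example from a TFRecord file; feature keys are hypothetical.
filename_queue = tf.train.string_input_producer(['train.tfrecords'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
image = tf.reshape(tf.decode_raw(features['image_raw'], tf.uint8), [512, 512, 3])
image = tf.cast(image, tf.float32)
label = features['label']

# Background threads fill the batch queue, so sess.run no longer waits on Python-side I/O.
images, labels = tf.train.shuffle_batch(
    [image, label], batch_size=1, capacity=64, min_after_dequeue=16, num_threads=2)

# ... build the model and train_op on `images`/`labels` instead of placeholders ...

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # ... training loop calling sess.run(train_op) without a feed_dict ...
    coord.request_stop()
    coord.join(threads)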

What is the optimum hardware requirements to run MXNET smoothly

I am using my MacBook Pro. I am trying to run the MXNet Python demo code and the execution time is extremely slow; it takes a long time to execute the code. Is this normal? I also want to run MXNet on a Raspberry Pi 3.
Almost all deep learning frameworks (MXNet included) will run much faster with a CUDA-capable GPU from NVIDIA. GPUs often speed up the kinds of vector math needed for deep learning by 100x. Apple stopped building machines with NVIDIA GPUs several years ago (2012, IIRC). If you have one of those, make sure you have CUDA working on your Mac. I'm not aware of any way right now to get MXNet to make use of the AMD or Intel GPUs that ship with Apple machines. Also know that even with the fastest GPUs, deep learning jobs will often take hours, days, or even weeks to complete. So patience is definitely part of the game, regardless of what hardware you're using.
That said, GPUs aren't the only way to run deep learning systems. Particularly for making predictions (inference) with pre-trained models, CPUs are often just fine, so a CPU can be useful for a task like semantic image processing.
Or, when training, using smaller datasets and smaller models can make them run faster. Also, to make sure you're getting the most out of your CPU, check that you have installed a good BLAS library such as Intel's MKL.
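As a rough, hedged sanity check (not an MKL-specific test), you can time a large CPU matrix multiply; it should be markedly faster when MXNet is linked against a good BLAS:

import time
import mxnet as mx

# A 2000x2000 matrix multiply as a crude CPU-throughput check.
a = mx.nd.ones((2000, 2000))
b = mx.nd.ones((2000, 2000))
mx.nd.waitall()                      # MXNet ops are asynchronous; sync before timing
start = time.time()
c = mx.nd.dot(a, b)
mx.nd.waitall()                      # wait until the multiply has actually finished
print('2000x2000 matmul took %.3f s' % (time.time() - start))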
But to get any useful work out of a Raspberry Pi is going to take some careful optimization, even for inference. This is an area of active scientific research. See for example this paper. Or look at adding a USB hardware accelerator.

How to get best performance of 8 core system using INTEL fortran

Please let me know how to set Intel Fortran compiler options to get the best performance out of an 8-core system, for both IA-32 and x64 builds. I want to execute a Fortran program and take advantage of all the CPU time available on the 8-core system. Right now the program is only using 13% of the CPU.
You can learn about the autovectorization and guided auto-parallelization features of Intel Fortran in this tutorial: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/start/win/tutorial_comp_for_win.pdf
If you are doing linear algebra, solvers, or FFTs, you might get the best results by mapping your problem onto calls into the Intel Math Kernel Library (MKL): http://software.intel.com/en-us/articles/intel-mkl/ which is already multithreaded, vectorized, and cache-optimized.
If you are doing media or signal processing, you might map your problem onto calls into the Intel Integrated Performance Primitives (IPP) library: http://software.intel.com/en-us/articles/intel-ipp/
Happy hacking!
In my specific application, a computational network model with several loops running through 20k iterations, each iteration going through a number of nested ifs, simply enabling /O2-level optimization in the compiler was enough to reduce the computing time drastically, while keeping the CPU load around 15%.
On a similar note, I noticed that raising the optimization setting to the highest level (/O3) did do what you are asking (running all cores at close to full load), but the computing time was NOT reduced at all.
Therefore, if you have a small problem and several cases to test, and processing capacity is the only bottleneck, it can be a good idea to open more than one Fortran solution and run those cases simultaneously.
