All,
I've recently installed TensorFlow on our cluster and have been running the MNIST example code to test the installation. I am interested in seeing how an increase in the number of CPUs affects TensorFlow's performance. To that end I ran the same example code with different numbers of cores, but TensorFlow doesn't seem to be scaling well.
I found from another Stack Overflow question that TensorFlow automatically uses all of the available CPUs, and assumed that to be true while running the jobs.
Find below the plot of number of cores vs. time taken. Each point is the mean of 5 trials, with the number of cores going from 1 to 28.
Is there any explanation for why the scaling is so bad? I did not expect a continuous speedup as the number of cores increases, since I understand that communication between cores has to be taken into account, but I was expecting to see a 'U-shaped' curve with an optimum number of cores for this particular code.
Appreciate any help I can get on this.
How do you run with different numbers of cores? By default TensorFlow uses two thread pools, each with a number of threads related to the total number of cores in the machine, so what you are seeing is probably just noise: TensorFlow is most likely always running with the same number of threads (look at its CPU usage to confirm).
If you want to restrict the number of cores used by TensorFlow, you can do this via the config proto.
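For example, a minimal sketch with the TF 1.x config proto (NUM_CORES is a placeholder for whichever core count you want to test):

import tensorflow as tf

NUM_CORES = 4  # placeholder: the number of cores TensorFlow should use
config = tf.ConfigProto(intra_op_parallelism_threads=NUM_CORES,
                        inter_op_parallelism_threads=NUM_CORES)
sess = tf.Session(config=config)  # ops run in this session respect the thread limits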
Related
I have an LSTM tf.keras model with about 600 MB of training data. Each training epoch takes about 90 seconds. I have the latest version of TensorFlow, which is v2.2. It runs on an AWS g3.4xlarge instance, which has an Nvidia Tesla M60 GPU with 8 GB of GPU memory.
I want to do hyperparameter tuning, so I need to speed up execution. I therefore moved the model and data to an AWS p3.2xlarge instance, which has a P100 GPU with 16 GB of GPU memory. The training time per epoch did not change at all.
So I switched to an even larger AWS instance, p3.8xlarge, which has 4 Tesla V100 GPUs and 64 GB of GPU memory in total. On the first run it only used 1 GPU and gave the same execution time per epoch, about 90 seconds. So I found an article on the TensorFlow website on how to use all GPUs: https://www.tensorflow.org/guide/gpu
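(For reference, the multi-GPU approach in that guide is based on tf.distribute.MirroredStrategy; a minimal sketch with a placeholder model, not the exact code from this question:)

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Number of devices:', strategy.num_replicas_in_sync)

with strategy.scope():  # variables and optimizer state are mirrored on each GPU
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(32,))])  # placeholder model
    model.compile(optimizer='adam', loss='mse')

# model.fit(x, y, ...)  # batches are then split across the replicas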
With all 4 GPUs running, the execution time per epoch went from 90 seconds to 112 seconds! I used nvtop and nvidia-smi to monitor GPU performance, as shown below.
What do I need to do to reduce the execution time?
The time per epoch will obviously stay the same as long as you don't change anything in the network; just running it on a bigger GPU is not the answer here.
First of all, if you can change your network to reduce the number of parameters, that would be ideal. Shrinking the model is the most direct way to make it run faster.
But if that is not possible, here are two things you can do (a combined sketch follows below):
Use mixed precision (tf.keras.mixed_precision) to run your model faster.
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
It offers a significant computational speedup by performing most operations in half precision, while keeping a few numerically sensitive parts of the network in single precision to preserve accuracy.
Use XLA.
import tensorflow as tf
tf.config.optimizer.set_jit(True)
Accelerated Linear Algebra (XLA) is a domain-specific compiler for linear algebra that can make your network faster without any changes to your source code.
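For reference, a minimal combined sketch for TF 2.2 (the layer sizes and shapes below are placeholders, not taken from your model); both settings have to be applied before the model is built, and the mixed-precision guide recommends keeping the final activation in float32:

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.mixed_precision import experimental as mixed_precision

tf.config.optimizer.set_jit(True)                                    # enable XLA
mixed_precision.set_policy(mixed_precision.Policy('mixed_float16'))  # enable mixed precision

model = tf.keras.Sequential([
    layers.LSTM(128, input_shape=(None, 32)),        # placeholder layer sizes
    layers.Dense(10),
    layers.Activation('softmax', dtype='float32'),   # keep the output in float32 for numerical stability
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')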
Please try both of these. I have personally used mixed precision and it noticeably reduces training time.
Also, next time please don't jump straight to bigger instances, as it is a waste of your money. First try to reduce the number of parameters (i.e. the network size) or use these two tricks. I'll update this answer if I find any new tricks.
For a while, I have been noticing that TensorFlow (v0.8) does not seem to fully use the computation power of my Titan X. For several CNNs that I have been running, the GPU usage does not seem to exceed ~30%; typically the utilization is even lower, more like 15%. One particular example of a CNN that shows this behavior is the CNN from DeepMind's Atari paper with Q-learning (see link below for the code).
When I see other people in our lab running CNNs written in Theano or Torch, the GPU usage is typically 80%+. This makes me wonder: why are the CNNs that I write in TensorFlow so 'slow', and what can I do to make more efficient use of the GPU? In general, I am interested in ways to profile the GPU operations and discover where the bottlenecks are. Any recommendations on how to do this are very welcome, since this does not seem to be really possible with TensorFlow at the moment.
Things I did to find out more about the cause of this problem:
Analyzing TensorFlow's device placement, everything seems to be on /gpu:0, so that looks OK.
Using cProfile, I have optimized the batch generation and other preprocessing steps. The preprocessing is performed on a single thread, but the actual optimization steps performed by TensorFlow take much longer (see average runtimes below). One obvious idea to increase the speed is to use TF's queue runners, but since batch preparation is already 20x faster than the optimization step, I wonder whether this will make a big difference.
Avg. Time Batch Preparation: 0.001 seconds
Avg. Time Train Operation: 0.021 seconds
Avg. Time Total per Batch: 0.022 seconds (45.18 batches/second)
Ran on multiple machines to rule out hardware issues.
Upgraded to the latest versions of cuDNN (v5 RC) and the CUDA Toolkit (7.5), and reinstalled TensorFlow from source about a week ago.
An example of the Q-learning CNN for which this 'problem' occurs can be found here: https://github.com/tomrunia/DeepReinforcementLearning-Atari/blob/master/qnetwork.py
Example of NVIDIA SMI displaying the low GPU utilization: NVIDIA-SMI
With more recent versions of TensorFlow (I am using TensorFlow 1.4), you can obtain runtime statistics and visualize them in TensorBoard.
These statistics include compute time and memory usage for each node in the computation graph.
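For example, a minimal sketch in TF 1.x graph mode (the matmul graph here is just a placeholder for your own ops, and './logs' is an arbitrary log directory):

import tensorflow as tf

# placeholder graph; substitute your own training ops
a = tf.random_normal([1000, 1000])
b = tf.matmul(a, a)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('./logs', sess.graph)
    sess.run(b, options=run_options, run_metadata=run_metadata)
    # attach the metadata so TensorBoard can show per-node compute time and memory usage
    writer.add_run_metadata(run_metadata, 'step_0')
    writer.close()

Point TensorBoard at the same log directory and the statistics show up in the Graphs tab.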
I'm new to Elixir, and I'm starting to read through Dave Thomas's excellent Programming Elixir. I was curious how far I could take the concurrency of the "pmap" function, so I iteratively boosted the number of items to square from 1,000 to 10,000,000. Out of curiosity, I watched the output of htop as I did so, usually peaking out with CPU usage similar to that shown below:
After showing the example in the book, Dave says:
And, yes, I just kicked off 1,000 background processes, and I used all the cores and processors on my machine.
My question is, how come only cores 1, 3, 5, and 7 are lighting up on my machine? My guess would be that it has to do with my iex process being only a single OS-level process, and OS X limiting what that process can reach. Is that what's going on here? Is there some way to ensure all cores get utilized for performance-intensive tasks?
Great comment by @Thiago Silveira about the first line of iex's output. The part [smp:8:8] says how many operating-system-level schedulers Erlang is using. You can control this with the -smp flag (passed via --erl) if you want to disable it:
iex --erl '-smp disable'
This will ensure that you have only one scheduler. You can achieve a similar result by leaving symmetric multiprocessing enabled but setting the scheduler counts directly as NumberOfSchedulers:NumberOfSchedulersOnline:
iex --erl '+S 1:1'
Each scheduler is an operating-system-level thread that runs Erlang processes, so you can easily check how many of them you currently have:
:erlang.system_info(:schedulers_online)
To answer your question about performance: if your processors are not working at full capacity (100%) and none of them is completely idle (0%), then distributing the load more evenly will probably not speed things up. Why?
CPU usage is measured by sampling the processor state at many points in time; each sample is either "working" or "idle". 82% CPU usage means that you could still run a couple more tasks on that CPU without slowing the others down.
Erlang schedulers try to be smart and avoid migrating Erlang processes between cores unless they have to, because migration requires copying. Migration occurs, for example, when one of the schedulers is idle; it can then take a process from another scheduler's run queue.
The next thing that may cause such a big discrepancy between odd and even cores is Hyper-Threading. On my dual-core processor, htop shows 4 logical cores. In your case you probably have 4 physical cores and 8 logical ones because of HT, so it might be that you are already utilizing your physical cores at 100%.
Another thing: pmap calculates each result in a separate process, but in the end every result is sent back to the caller, which can become a bottleneck. The more messages you send, the lower the CPU utilization you can achieve. For fun, you can try giving the processes a task that is really CPU intensive, like calculating the Ackermann function. You can even work out how much of your job is sequential and how much is parallel using Amdahl's law, by measuring execution times for different numbers of cores.
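(For reference, Amdahl's law: if a fraction p of the work can be parallelized, the best possible speedup on n cores is 1 / ((1 - p) + p / n), so measuring runtimes at a few core counts lets you estimate p.)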
To sum up: the CPU utilization in your screenshot looks really good! You don't have to change anything for more performance-intensive tasks.
Concurrency is not Parallelism
In order to get good parallel performance out of Elixir/BEAM coding you need to have some understanding of how the BEAM scheduler works.
This is a very simplistic model, but the BEAM scheduler gives each process about 2000 reductions before it swaps it out for the next process. Reductions can be thought of as function calls. By default a process runs on the core/scheduler that spawned it, and processes are only moved between schedulers if the queue of outstanding processes builds up on a given scheduler. By default the BEAM runs one scheduler thread on each available core.
What this implies is that, to make the most of the processors, you need to break your work into pieces large enough to exceed the standard "reduction" slice. In general, pmap-style parallelism only gives a significant speedup when you chunk many items into a single task.
The other thing to be aware of is that some parts of the BEAM use a spin/wait loop when awaiting work, and that can skew usage readings when you use a tool like htop to examine CPU usage. You'll get a much better understanding of your program's performance by using :observer (started from iex with :observer.start()).
If I use the function:
net = feedforwardnet([60 60])
net2=train(net,x,t)
It takes about 20 minutes to train. (I have done this on multiple computers with the same specs, and the average time is always around 20 minutes.)
If I use the function:
parpool  % starts a local parallel pool connected to 2 workers
net2=train(net,x,t,'useParallel','yes')
It takes around 40 minutes to complete training. I have two cores, so this is counter-intuitive: it should be twice as fast, not twice as slow. I am using the same starting network and the same training inputs and targets.
Also, when I open the Task Manager during NN training, it shows that both CPUs are working at 100%, even when parpool and useParallel are turned off.
This page of the Mathworks website says that "Parallel Computing Toolbox™ allows Neural Network Toolbox™ to simulate and train networks faster and on larger datasets than can fit on one PC. Parallel training is currently supported for backpropagation training only, not for self-organizing maps."
I am using 2000 training examples in the data set. There are 32 inputs and 3 outputs, so this is definitely a large data set. The parallel pool is also definitely turned off when I just use the net2=train(net,x,t) call.
I have tested the use of parpool with other functions (ones containing parfor loops), and the calculation is usually twice as fast. It just seems to be the neural network training that goes slower.
Is there any reason for this?
I am using an Intel Core 2 Duo E8400 CPU @ 3 GHz and MATLAB R2013b. I am also using a computer on a network (inside a university); I'm not sure if this makes a difference.
More info on the university computer network: I am using multiple computers on the network at the same time. I have not connected them together to do distributed computing; each one is just doing its own thing, using parallel computing on its own 2 cores. However, I am not sure whether the computers are still interfering with each other in some way, because they are logged in with the same user. I load the training input and target data into the MATLAB workspace on each computer using:
load('H:\18-03-14\x.mat')
load('H:\18-03-14\net.mat')
load('H:\18-03-14\t.mat')
where H: is the network drive. I am not sure whether, once these are in the MATLAB workspace, they are still somehow connected and interfere with each other across the different computers. Do they?
There are two types of parallelism going on: multicore and multithread. MATLAB implements basic matrix operations with multicore support, so even without the Parallel Computing Toolbox all your cores are being utilized.
The Parallel Computing Toolbox additionally allows for multiple threads. Multiple threads have two advantages. The first is that they allow calculations to be spread across multiple PCs using MATLAB Distributed Computing Server, for a reliable linear speed-up. The second is less obvious: multiple threads on a single PC can improve on the already multithreaded multicore calculations, even though that might seem counter-intuitive. However, there is extra overhead for using parallel threads, so this is not always the case. More cores and larger problems are more likely to see a speed-up.
Your problem is not a large one. A large problem would be 10's of thousands of examples or more, whereas your problem only has 2000.
Also, two hidden layers with 60 neurons each is almost certainly a far larger network than you need. Your problem has 2000 samples * 3 outputs = 6000 constraints. Your network has 32*60 + 60*60 + 60*3 = 5700 weights and 60 + 60 + 3 = 123 biases, for a total of 5823 adjustable variables, which is almost as many as the number of constraints. I would suggest far fewer weights and probably only a single hidden layer. This will train much faster and is also likely to generalize better.
So maybe start with feedforwardnet(100) and then increase that if the desired accuracy is not found.
You can see the benefits of multiple threads on a larger problem using this Neural Network Toolbox example dataset with 68,308 examples:
[x,t] = vinyl_dataset;
net = feedforwardnet(140,'trainscg');
rng(0), tic, net2 = train(net,x,t); toc
parpool
rng(0), tic, net2 = train(net,x,t,'useParallel','yes'); toc
The 100% CPU load without the Parallel Computing Toolbox shows that the train function, or the relevant functions it calls, are already implemented using multithreading. In that case, the Parallel Computing Toolbox only adds the useless overhead of inter-process communication.
I'm working with someone who has some MATLAB code that they want sped up. They are currently trying to convert all of this code into CUDA to get it to run on a GPU. I think it would be faster to use MATLAB's Parallel Computing Toolbox to speed this up, and run it on a cluster that has MATLAB's Distributed Computing Toolbox, allowing me to run this across several different worker nodes. Now, as part of the Parallel Computing Toolbox, you can use things like gpuArray. However, I'm confused as to how this would work. Are things like parfor (parallelization) and gpuArray (GPU programming) compatible with each other? Can I use both? Can something be split across different worker nodes (parallelization) while also making use of whatever GPUs are available on each worker?
They think it's still worth exploring the time it takes to convert all of the MATLAB code to CUDA code to run on a machine with multiple GPUs... but I think the right approach would be to use the features already built into MATLAB.
Any help, advice, direction would be really appreciated!
Thanks!
When you use parfor, you are effectively dividing your for loop into tasks, with one task per loop iteration, and splitting those tasks up to be computed in parallel by several workers, where each worker can be thought of as a MATLAB session without an interactive GUI. You configure your cluster to run a specified number of workers on each node of the cluster (generally, you would choose a number of workers equal to the number of available processor cores on that node).
On the other hand, gpuArray indicates to MATLAB that you want to make a matrix available for processing by the GPU. Under the hood, MATLAB is marshalling the data from main memory to the graphics board's internal memory. Certain MATLAB functions (there's a list of them in the documentation) can operate on gpuArrays, and the computation happens on the GPU.
The key difference between the two techniques is that parfor computations happen on the CPUs of the cluster nodes, with direct access to main memory. CPU cores typically have a high clock rate, but there are usually fewer of them in a CPU cluster than there are GPU cores. Individually, GPU cores are slower than a typical CPU core, and their use requires that data be transferred from main memory to video memory and back again, but there are many more of them available. As far as I know, hybrid approaches are supposed to be possible, in which you have a cluster of PCs, each PC has one or more Nvidia Tesla boards, and you use both parfor loops and gpuArrays. However, I haven't had occasion to try this yet.
If you are mainly interested in simulations, GPU processing is the perfect choice. However, if you want to analyse (big) data, go with parallelization. The reason is that GPU processing is only faster than CPU processing if you don't have to copy data back and forth. In the case of a simulation, you can generate most of the data on the GPU and only need to copy the result back. If you try to work with bigger data on the GPU, you will very often run into out-of-memory problems.
Parallelization is great if you have big data structures and more than 2 cores in your computer's CPU.
If you write it in CUDA, it is guaranteed to run in parallel at the chip level, versus relying on MATLAB's best guess for a non-parallel architecture and your best effort to get it to run in parallel.
Kind of like drinking fresh mountain water run-off versus buying filtered water. Go with the purist solution.