Titan XP vs Quadro P400 GPU in Pytorch - performance

I tried both GPUs on my machine and expected the Titan Xp to be faster than the Quadro P400. However, both gave almost the same execution time.
I need to know whether PyTorch dynamically chooses one GPU over the other, or whether I have to specify at run time which one PyTorch should use.
Here is the code snippet used in the test:
import torch
import time
def do_something(gpu_device):
    torch.cuda.set_device(gpu_device) # torch.cuda.set_device(device_num)
    print("current GPU device ", torch.cuda.current_device())
    strt = time.time()
    a = torch.randn(100000000).cuda()
    xx = time.time() - strt
    print("execution time, to create 1E8 random numbers, is ", xx)
    # print(a)
    # print(a + 2)
no_of_GPUs= torch.cuda.device_count()
print("how many GPUs are there:", no_of_GPUs)
for i in range(0, no_of_GPUs):
    print(i, "th GPU is", torch.cuda.get_device_name(i))
    do_something(i)
Sample output:
how many GPUs are there: 2
0 th GPU is TITAN Xp COLLECTORS EDITION
current GPU device 0
execution time, to create 1E8 random numbers, is 5.527713775634766
1 th GPU is Quadro P400
current GPU device 1
execution time, to create 1E8 random numbers, is 5.511776685714722

Despite what you might believe, you see no performance difference because the random number generation runs on your host CPU, not on the GPU. If I modify your do_something routine like this:
def do_something(gpu_device, ongpu=False, N=100000000):
    torch.cuda.set_device(gpu_device)
    print("current GPU device ", torch.cuda.current_device())
    strt = time.time()
    if ongpu:
        a = torch.cuda.FloatTensor(N).normal_()
    else:
        a = torch.randn(N).cuda()
    print("execution time, to create 1E8 random no, is ", time.time() - strt)
    return a
and run it two ways, I get very different execution times:
In [4]: do_something(0)
current GPU device 0
execution time, to create 1E8 random no, is 7.736972808837891
Out[4]:
-9.3955e-01
-1.9721e-01
-1.1502e+00
......
-1.2428e+00
3.1547e-01
-2.1870e+00
[torch.cuda.FloatTensor of size 100000000 (GPU 0)]
In [5]: do_something(0,True)
current GPU device 0
execution time, to create 1E8 random no, is 0.001735687255859375
Out[5]:
4.1403e+06
5.7016e+06
1.2710e+07
......
8.9790e+06
1.3779e+07
8.0731e+06
[torch.cuda.FloatTensor of size 100000000 (GPU 0)]
i.e. your version takes 7 seconds and mine takes 1.7ms. I think it is obvious which one ran on the GPU....
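One caveat when repeating this comparison: CUDA kernels are launched asynchronously, so a wall-clock timer that stops without synchronizing may mostly measure launch overhead rather than the work itself. The sketch below is one possible way to time both variants more fairly. `timed` is a hypothetical helper (not part of PyTorch), and the GPU part only runs if PyTorch with CUDA is available. It also makes the device choice explicit, since PyTorch places CUDA tensors on the current device (device 0 unless changed) rather than picking the faster GPU for you.

```python
import time

def timed(fn, sync=lambda: None, repeats=3):
    """Return the best wall-clock time of fn() over `repeats` runs.

    sync() is called before the clock starts and before it stops, so that
    asynchronously queued device work is included in the measurement.
    """
    best = float("inf")
    for _ in range(repeats):
        sync()
        start = time.perf_counter()
        fn()
        sync()
        best = min(best, time.perf_counter() - start)
    return best

try:
    import torch  # optional: the helper above works without it
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    n = 10**8
    for i in range(torch.cuda.device_count()):
        dev = torch.device("cuda", i)
        sync = lambda: torch.cuda.synchronize(dev)
        # generate on the host, then copy to the device (the question's approach)
        via_host = timed(lambda: torch.randn(n).to(dev), sync)
        # generate directly on the selected device
        on_device = timed(lambda: torch.randn(n, device=dev), sync)
        print(f"{torch.cuda.get_device_name(i)}: host+copy {via_host:.4f}s, "
              f"on device {on_device:.4f}s")
```

Passing `device=` explicitly avoids relying on global state set by `torch.cuda.set_device`, which makes it easier to see which GPU each measurement belongs to.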

Tensorflow conv1d/Keras Conv1D strange performance variation

I am getting somewhat unexpected results when measuring the processing runtime of the Conv1D layer and wonder if anybody understands the results. Before going on I note that the observation is not only linked to the Conv1D layer but can be observed similarly for the tf.nn.conv1d function.
The code I am using is very simple:
import os
# silence verbose TF feedback
if 'TF_CPP_MIN_LOG_LEVEL' not in os.environ:
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = "3"
import tensorflow as tf
import time
def fun(sigl, cc, bs=10):
    oo = tf.ones((bs, sigl, 200), dtype=tf.float32)
    start_time = time.time()
    ss=cc(oo).numpy()
    dur = time.time() - start_time
    print(f"size {sigl} time: {dur:.3f} speed {bs*sigl / 1000 / dur:.2f}kHz su {ss.shape}")
cctf2t = tf.keras.layers.Conv1D(100,10)
for jj in range(2):
    print("====")
    for ii in range(30):
        fun(10000+ii, cctf2t, bs=10)
I was expecting to observe the first call to be slow and the others to show approximately similar runtime. It turns out that the behavior is quite different.
Assuming the code above is stored in a script called debug_conv_speed.py I get the following on an NVIDIA GeForce GTX 1050 Ti
$> ./debug_conv_speed.py
====
size 10000 time: 0.901 speed 111.01kHz su (10, 9991, 100)
size 10001 time: 0.202 speed 554.03kHz su (10, 9992, 100)
...
size 10029 time: 0.178 speed 563.08kHz su (10, 10020, 100)
====
size 10000 time: 0.049 speed 2027.46kHz su (10, 9991, 100)
...
size 10029 time: 0.049 speed 2026.87kHz su (10, 10020, 100)
where ... indicates approximately the same result. So as expected, the first time is slow, then for each input length, I get the same speed of about 550kHz. But then for the repetition, I am astonished to find all operations to run about 4 times faster, with 2MHz.
The results are even more different on a GeForce GTX 1080. There the first time a length is used it runs at about 200kHz, and for the repetitions, I find a speed of 1.8MHz.
In response to the answer https://stackoverflow.com/a/71184388/3932675, I add a second variant of the code that uses tf.function:
import os
# silence verbose TF feedback
if 'TF_CPP_MIN_LOG_LEVEL' not in os.environ:
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = "3"
import tensorflow as tf
import time
from functools import partial
print(tf.config.list_physical_devices())
class run_fun(object):
    def __init__(self, ll, channels):
        self.op = ll
        self.channels = channels
    @tf.function(input_signature=(tf.TensorSpec(shape=[None,None,None]),),
                 experimental_relax_shapes=True)
    def __call__(self, input):
        print("retracing")
        return self.op(tf.reshape(input, (tf.shape(input)[0], tf.shape(input)[1], self.channels)))
def run_layer(sigl, ll, bs=10):
    oo = tf.random.normal((bs, sigl, 200), dtype=tf.float32)
    start_time = time.time()
    ss=ll(oo).numpy()
    dur = time.time() - start_time
    print(f"len {sigl} time: {dur:.3f} speed {bs*sigl / 1000 / dur:.2f}kHz su {ss.shape}")
ww= tf.ones((10, 200, 100))
ll=partial(tf.nn.conv1d, filters=ww, stride=1, padding="VALID", data_format="NWC")
run_ll = run_fun(ll, 200)
for jj in range(2):
    print(f"=== run {jj+1} ===")
    for ii in range(5):
        run_layer(10000+ii, run_ll)
        # alternatively for eager mode run
        # run_layer(10000+ii, ll)
The result after running on Google's Colab GPU:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
=== run 1 ===
retracing
len 10000 time: 10.168 speed 9.83kHz su (10, 9991, 100)
len 10001 time: 0.621 speed 161.09kHz su (10, 9992, 100)
len 10002 time: 0.622 speed 160.80kHz su (10, 9993, 100)
len 10003 time: 0.644 speed 155.38kHz su (10, 9994, 100)
len 10004 time: 0.632 speed 158.18kHz su (10, 9995, 100)
=== run 2 ===
len 10000 time: 0.080 speed 1253.34kHz su (10, 9991, 100)
len 10001 time: 0.053 speed 1898.41kHz su (10, 9992, 100)
len 10002 time: 0.052 speed 1917.43kHz su (10, 9993, 100)
len 10003 time: 0.067 speed 1499.43kHz su (10, 9994, 100)
len 10004 time: 0.095 speed 1058.60kHz su (10, 9995, 100)
This shows that with the given tf.function args retracing is not happening and the performance shows the same difference.
Does anybody know how to explain this?
The reason for your comparatively slow first iteration is that you are feeding different shapes into cctf2t, which triggers a retracing of your compute graph.
In the second and all subsequent iterations, you no longer encounter new shapes, so no further retracing takes place.
I am fairly sure I have found the explanation in the cuDNN-related TensorFlow source, and I share the insight here for others (notably those who upvoted the question) who encounter the same problem.
CUDA supports a number of convolution kernels, which in the current version of TensorFlow (2.9.0) are obtained by means of CudnnSupport::GetConvolveRunners here:
https://github.com/tensorflow/tensorflow/blob/21368c687cafdf97fac3dd0eefaed710df0068a2/tensorflow/stream_executor/cuda/cuda_dnn.cc#L4557
This is then used in the various autotune functions here:
https://github.com/tensorflow/tensorflow/blob/21368c687cafdf97fac3dd0eefaed710df0068a2/tensorflow/core/kernels/conv_ops_gpu.cc#L365
It appears that each time a new configuration (data shape, filter shape, and maybe other parameters) is encountered, the CUDA driver tests all of the kernels and retains the most efficient one. This is a very nice optimization for most cases, notably training with constant batch shapes or inference with constant image sizes. For inference with audio signals that may have arbitrary lengths (e.g. audio signals with a 48000 Hz sample rate and durations from 1 s to 20 s have nearly 1 million different lengths), the CUDA implementation spends most of its time testing all kernel versions. It hardly ever benefits from knowing which kernel is the most efficient one for a given configuration, as the same configuration is hardly ever encountered a second time.
For my use case, I now use overlap-add-based processing with a fixed signal length, which improved inference time by about a factor of 4.
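The answer does not show its overlap-add code, so the following is only a minimal sketch of the general technique in plain Python. The names `conv_full` and `overlap_add` are hypothetical, and `conv_full` stands in for the GPU convolution (it computes a full linear convolution, whereas the question used "VALID" padding). The point it illustrates is that every block passed to the convolution has the same length, so a per-shape kernel selection would only need to happen once.

```python
def conv_full(x, h):
    """Direct full linear convolution of two sequences (reference implementation)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y

def overlap_add(x, h, block_len):
    """Convolve x with h using fixed-size blocks.

    Every block handed to conv_full has exactly block_len samples (the last
    one is zero-padded), so the expensive operation only ever sees one shape.
    """
    out_len = len(x) + len(h) - 1
    y = [0.0] * (out_len + block_len)  # slack for the zero-padded last block
    for start in range(0, len(x), block_len):
        block = list(x[start:start + block_len])
        block += [0.0] * (block_len - len(block))  # pad to the fixed length
        seg = conv_full(block, h)  # length block_len + len(h) - 1
        for j, v in enumerate(seg):
            y[start + j] += v  # tails of neighbouring blocks overlap and add up
    return y[:out_len]

# A signal whose length is not a multiple of the block length
signal = [float((7 * i) % 11 - 5) for i in range(1003)]
kernel = [0.25, 0.5, 0.25]
result = overlap_add(signal, kernel, block_len=256)
print(len(result))
```

The block length is a tuning choice: larger blocks reduce the relative cost of the overlapping tails, while smaller blocks waste less work on the zero padding of the final block.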

Temporary shared parallelism (OMP) within distributed environment (MPI)

Consider the following scenario:
We use n MPI processes to evaluate chunks of a large matrix A. Each chunk is evaluated by a unique process. Number of OMP threads is 1.
We gather these chunks in the master process (rank=0) and build the full matrix A.
The master process calls a function x = f(A) and broadcasts the result x to all processes.
The issue is that f(A) takes a long time, and other processes have to wait.
If the function f can be accelerated with OMP parallelism, does it make sense to set the number of OMP threads (on the master process) equal to n before calling f(A) and changing it to 1 afterwards?
In other words, what can be wrong with the following pseudo-code?
mpirun -n n ./exe # command for running the code
where the pseudo-code for exe looks something like
a = g(mpi_rank, ... ) # process-specific matrix
A = mpi_gather(a, mpi_rank=0) # gather all
if mpi_rank == 0
    set_num_omp_threads(n)
    x = f(A)
    set_num_omp_threads(1)
else
    x = 0
mpi_bcast(x, mpi_rank=0)
What are the possible performance issues/pitfalls?
EDIT: The original code is too complicated, but I have replicated the issue via the following code.
# run by: mpirun -n 6 python test.py
import time
import torch
import torch.distributed as dist
def get_weights(A, Y, use_omp=False):
    # maximize OMP threads
    master = dist.get_rank() == 0
    if use_omp and master:
        ws = dist.get_world_size()
        th = torch.get_num_threads()
        torch.set_num_threads(ws*th)
    # actual calculation; only master
    if dist.get_rank() == 0:
        Q, R = torch.qr(A)
        W = (R.inverse()@Q.t()@Y)
        print(torch.get_num_threads())
    else:
        W = torch.zeros(A.shape[1], 1)
    dist.broadcast(W, 0)
    # reset OMP threads
    if use_omp and master:
        torch.set_num_threads(th)
    return W
def test(n=100000, m=300, s=10, use_omp=False):
    # ...
    # normal distributed code ...
    # ...
    A = torch.rand(n, m)
    _W = torch.rand(m, 1)
    Y = A @ _W
    W = get_weights(A, Y, use_omp=use_omp) # <- this
    res = W.allclose(_W, atol=1e-3)
    return res
def timeit(repeat=10, use_omp=False):
    t1 = time.time()
    for _ in range(repeat):
        test(use_omp=use_omp)
    t2 = time.time()
    return (t2-t1)/repeat
if __name__ == '__main__':
    dist.init_process_group('mpi')
    print(timeit(use_omp=False)) # -> 3.036880087852478
    print(timeit(use_omp=True)) # -> 1.3959508180618285
In the above example, setting the number of threads to 6 improved the speed by a factor of ~2. But when I tried something similar in my actual code (and on a much larger cluster), it became almost 2 times slower!
EDIT2: The essence of this question is that (on a single node with n cores) most of the calculation (70%) is optimal with MPI using n processes, the other part (30%) is optimal with n OMP threads. I was wondering about optimal utilization of the available cores in both parts.
Although the comments were very helpful, I guess there are no easy answers. In this particular case the mentioned region is a linear algebra problem and using scalapack is probably the best solution.
But the question stands for general cases.
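To make the 70%/30% split from EDIT2 concrete, here is a rough Amdahl-style estimate. It assumes the percentages are shares of the single-core work, which the question does not state explicitly. The function `runtime` and its parameter `thread_eff` are hypothetical names for this illustration. `thread_eff` models the threaded part falling short of ideal, which may happen, for example, if waiting ranks busy-poll inside the broadcast and compete with the threads for the same cores.

```python
def runtime(n, mpi_frac=0.7, threads=1, thread_eff=1.0):
    """Normalized runtime on n cores (1.0 = all work on a single core).

    mpi_frac   : share of the single-core work that scales over n MPI ranks
    threads    : number of threads rank 0 uses for the remaining share, f(A)
    thread_eff : parallel efficiency of those threads (1.0 = ideal scaling)
    """
    serial_frac = 1.0 - mpi_frac
    return mpi_frac / n + serial_frac / (threads * thread_eff)

n = 6
scenarios = [
    ("f(A) on 1 thread", {"threads": 1}),
    ("f(A) on n threads, ideal", {"threads": n}),
    ("f(A) on n threads, 50% efficient", {"threads": n, "thread_eff": 0.5}),
]
for label, kwargs in scenarios:
    t = runtime(n, **kwargs)
    print(f"{label:34s} runtime {t:.3f}  speedup {1 / t:.2f}x")
```

Under these assumptions the single-threaded f(A) dominates the total as n grows, which is why temporarily threading it looks attractive. The model says nothing about the slowdown seen on the larger cluster, which would need measurement.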
From what I understand of your question, you want to launch the OpenMP threads for the computation and then shut down the OpenMP runtime afterwards. With your pseudo-code example, it would look like this:
a = g(mpi_rank, ... ) # process-specific matrix
A = mpi_gather(a, mpi_rank=0) # gather all
if mpi_rank == 0
    x = f(A) ! do a parallel region in f()
    omp_pause_resource_all(omp_pause_hard)
else
    x = 0
mpi_bcast(x, mpi_rank=0)
See the functions omp_pause_resource_all() and omp_pause_resource() of the OpenMP API specification. They terminate most of the OpenMP implementation, such that only a very small part that should not affect performance will remain.
Note, the functions were introduced with OpenMP API version 5.0, so you will need a fairly new OpenMP compiler for this to work.
Arrange your MPI processes so that node 0 has only one process, and all other nodes have one process per core. Then process zero can easily launch an OpenMP (or otherwise threaded) shared-memory parallel section.

Performance expectations when running caret::train() to develop a kknn model

I'm using the caret::train() function to develop a weighted knn classification model (kknn) with 10-fold cross-validation and a tuneGrid containing 15 values for kmax, one value for distance, and 3 values for kernel.
That’s 450 total iterations if I understand the process correctly (an iteration being the computation of the probability of a given outcome for a given combination of kmax, distance, and kernel). x has about 480,000 data points (6 predictors each having about 80,000 observations), and y has about 80,000 data points.
Understanding that there are innumerable variables affecting performance, how long can I reasonably expect the train function to take if run on a pc with an 8-core 3GHz Intel processor and 32GB of RAM?
It currently takes about 70 minutes per fold, which is about 1.5 minutes per iteration. Is this reasonable, or excessive?
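The counts above can be checked with a few lines of arithmetic. This is only a sketch: it mirrors the grid from the question's tuneGrid (`seq(11, 39, 2)` for kmax) and counts resampling fits only, using the 70 minutes per fold reported in the question.

```python
from itertools import product

kmax_values = range(11, 40, 2)   # 15 values, as in seq(11, 39, 2)
distance_values = [2]
kernels = ["triangular", "gaussian", "optimal"]
folds = 10

# One row per parameter combination, like expand.grid
grid = list(product(kmax_values, distance_values, kernels))
fits = len(grid) * folds         # one fit per grid row per fold

minutes_per_fold = 70            # figure reported in the question
minutes_per_fit = minutes_per_fold / len(grid)
total_hours = minutes_per_fold * folds / 60

print(len(grid), "parameter combinations,", fits, "fits in total")
print(f"{minutes_per_fit:.2f} min per fit, about {total_hours:.1f} h for all folds")
```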
This is a kknn learning exercise. I realize there are other types of algorithms that produce better results more efficiently.
Here is the essential code:
x <- as.matrix(train_set2[, c("n_launch_angle", "n_launch_speed", "n_spray_angle_Kolp", "n_spray_angle_adj", "n_hp_to_1b", "n_if_alignment")])
y <- train_set2$events
set.seed(1)
fitControl <- trainControl(method = "cv", number = 10, p = 0.8, returnData = TRUE,
                           returnResamp = "all", savePredictions = "all",
                           summaryFunction = twoClassSummary, classProbs = TRUE,
                           verboseIter = TRUE)
tuneGrid <- expand.grid(kmax = seq(11, 39, 2),
                        distance = 2,
                        kernel = c("triangular", "gaussian", "optimal"))
kknn_train <- train(x, y, method = "kknn",
                    tuneGrid = tuneGrid, trControl = fitControl)
As we have established in the comments, it is reasonable to expect this type of runtime. There are a few steps you can take to reduce it:
Running your code in parallel
Using a more efficient OS, like Linux
Being more efficient in your trainControl(): is it really necessary to have returnResamp = "all"? There are small gains to be had in controlling these.
Clearly, the first one is a no-brainer. For the second one, I can find as many computer engineers who swear by Linux as those who swear by Windows. What convinced me to switch to Linux was this particular test, which I hope will give you what it gave me.
# Calculate distance matrix
test_data <- function(dim, num, seed = 1903) {
  set.seed(seed)
  dist(
    matrix(
      rnorm(dim * num), nrow = num
    )
  )
}
# Benchmarking
microbenchmark::microbenchmark(test_data(120,4500))
This piece of code simply just runs faster on the exact same machine that runs Linux. At least this was my experience.

R caret: fully reproducible results with parallel rfe on different machines

I have the following code using random forest as method which is fully reproducible if you run it in parallel mode on the same machine:
library(doParallel)
library(caret)
recursive_feature_elimination <- function(dat){
  all_preds <- dat[,which(names(dat) %in% c("Time", "Chick", "Diet"))]
  response <- dat[,which(names(dat) == "weight")]
  sizes <- c(1:(ncol(all_preds)-1))
  # set seeds manually
  set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
  # an optional vector of integers for the size. The vector should have length of length(sizes)+1
  # length is n_repeats*nresampling+1
  seeds <- vector(mode = "list", length = 16)
  for(i in 1:15) seeds[[i]]<- sample.int(n=1000, size = length(sizes)+1)
  # for the last model
  seeds[[16]]<-sample.int(1000, 1)
  seeds_list <- list(rfe_seeds = seeds,
                     train_seeds = NA)
  # specify rfeControl
  contr <- caret::rfeControl(functions=rfFuncs, method="repeatedcv", number=3, repeats=5,
                             saveDetails = TRUE, seeds = seeds, allowParallel = TRUE)
  # recursive feature elimination caret
  results <- caret::rfe(x = all_preds,
                        y = response,
                        sizes = sizes,
                        method ="rf",
                        ntree = 250,
                        metric= "RMSE",
                        rfeControl=contr )
  return(results)
}
dat <- as.data.frame(ChickWeight)
cores <- detectCores()
cl <- makePSOCKcluster(cores, outfile="")
registerDoParallel(cl)
results <- recursive_feature_elimination(dat)
stopCluster(cl)
registerDoSEQ()
The outcome on my machine is:
Variables  RMSE Rsquared   MAE RMSESD RsquaredSD MAESD Selected
        1 39.14   0.6978 24.60  2.755    0.02908 1.697
        2 23.12   0.8998 13.90  2.675    0.02273 1.361        *
        3 28.18   0.8997 20.32  2.243    0.01915 1.225
The top 2 variables (out of 2):
Time, Chick
I am using a Windows OS with one CPU and 4 cores. If the code is run on a UNIX OS using multiple CPUs with multiple cores, the outcome is different. I think this behaviour shows up because of the random number generation, which obviously differs between my system and the multi-CPU system. The same happens with train().
How can I get fully reproducible results independent of the OS and of how many CPUs and cores are used for parallelization?
How can I ensure that the same random numbers are used for each internal process of rfe and randomForest, no matter in which order the processes run during the parallel computation?
How are the random numbers generated for each parallel process?
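As a language-neutral illustration of the idea behind these questions, the sketch below derives one seed per task up front and gives each task its own private generator, so the results do not depend on how many workers there are or in which order tasks finish. This is similar in spirit to the `seeds` list passed to rfeControl in the question, but it is a generic Python sketch with hypothetical names (`make_task_seeds`, `run_task`, `run_all`), not a description of caret's internals. It also does not address differences in R versions or RNG settings between machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def make_task_seeds(master_seed, n_tasks):
    """Derive one seed per task up front, in a fixed order."""
    rng = random.Random(master_seed)
    return [rng.randrange(2**31) for _ in range(n_tasks)]

def run_task(args):
    """Stand-in for one resampling fit: all randomness comes from the task's own seed."""
    task_id, seed = args
    rng = random.Random(seed)  # private generator, no shared global state
    return task_id, sum(rng.random() for _ in range(1000))

def run_all(master_seed, n_tasks, n_workers):
    """Run all tasks on n_workers threads and return results in task order."""
    seeds = make_task_seeds(master_seed, n_tasks)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(run_task, enumerate(seeds)))

# The same master seed gives the same results for any worker count
print(run_all(42, 16, 1) == run_all(42, 16, 4))
```

The key design choice is that the seed is tied to the task index rather than to the worker, so scheduling differences between machines cannot change which random numbers a given task sees.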

Spark scalability

I currently use one master (local machine) and two workers (2*32 cores, 2*61.9 GB memory) for Spark's standard ALS algorithm, and wrote the following code to measure the run time:
import numpy as np
from scipy.sparse.linalg import spsolve
import random
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
import hashlib
#Spark configuration settings
conf = SparkConf().setAppName("Temp").setMaster("spark://<myip>:7077").set("spark.cores.max","64").set("spark.executor.memory", "61g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
#first time
t1 = time.time()
#load the DataFrame and transform it into RDD<Rating>
rddob = sqlContext.read.json("file.json").rdd
rdd1 = rddob.map(lambda line:(line.ColOne, line.ColTwo))
rdd2 = rdd1.map(lambda line: (line, 1))
rdd3 = rdd2.reduceByKey(lambda a,b: a+b)
ratings = rdd3.map(lambda (line, rating): Rating(int(hash(line[0]) % (10 ** 8)), int(line[1]), float(rating)))
ratings.cache()
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 5
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))
#second time
t2 = time.time()
#print results
print "Time of ALS",t2-t1
In this code I hold all parameters constant except set("spark.cores.max","x"), for which I use the following values of x: 1, 2, 4, 8, 16, 32, 64. I got the following timings:
#cores time [s]
1 20722
2 11803
4 5596
8 3131
16 2125
32 2000
64 2051
The results of the evaluation look a little strange to me. I see good, nearly linear scalability for small numbers of cores. But in the range of 16, 32 and 64 cores I no longer see any scalability or improvement in run time. How is this possible? My input file is approximately 70 GB and has 200 000 000 lines.
Linear scalability in a distributed system like Spark is only to a small extent a result of increasing the number of cores. The most important part is the opportunity to distribute disk and network IO. If you have a constant number of workers and don't scale storage at the same time, you'll quickly reach the point where throughput is limited by IO.
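The flattening can be quantified directly from the timings in the question. The following sketch computes speedup and parallel efficiency relative to the single-core run. `scaling_table` is a hypothetical helper name, and the numbers are copied from the table above.

```python
# (cores, seconds) pairs copied from the question's measurements
timings = [(1, 20722), (2, 11803), (4, 5596), (8, 3131),
           (16, 2125), (32, 2000), (64, 2051)]

def scaling_table(rows):
    """Return (cores, speedup, efficiency) tuples relative to the first row."""
    c0, t0 = rows[0]
    return [(c, t0 / t, (t0 / t) / (c / c0)) for c, t in rows]

for cores, speedup, eff in scaling_table(timings):
    print(f"{cores:3d} cores: speedup {speedup:6.2f}x, efficiency {eff:6.1%}")
```

Speedup saturates at roughly 10x from 16 cores onward, and efficiency falls from about 83% at 8 cores to about 16% at 64. That pattern is consistent with a bottleneck that does not scale with core count, such as the IO limit described above, although the timings alone cannot identify the cause.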
