Problem
I noticed that one of my networks was running slow and nvidia-smi was reporting only around ~10% GPU usage. After running the profiler, I saw that TruncatedNormal process was taking the vast majority of running time (see photo). What could causing this kind of problem?
Code
Weight declaration function (from MNIST tutorial):
def weight_variable(shape):
initial = tf.truncated_normal(shape, stddev=0.1)
return tf.Variable(initial)
Code in action:
# First Layer
with tf.name_scope('input'):
x = tf.placeholder(tf.float32, [None, Nvars])
w1 = weight_variable([Nvars, 8])
b1 = bias_variable([8])
y1 = tf.nn.relu(tf.matmul(x, w1) + b1)
The issue is that the tensorflow functions that you are using have multiplicative properties. When a CPU needs to invoke the product handler, performance degrades exponentially.
The problem may be avoided if the calculation is pre-supported by the kernel. What you need to do is use compiler optimization; i.e. change some compiler settings which will cause the compiler to 'ease' the multiplication.
Or... you could just not multiply, and just add a few times.
Hope I helped.
Related
I'm relatively new with Julia and I'm currently using version 1.0. I have a code that is intendend to produce a sequence of integers, based on an input matrix. The code takes 3 hours to run on my machine (i5, dual core, 16GB ram), using 16% of CPU and 3% of memory. Is there any basic tips I can learn and apply to optimize my code in Julia to improve its performance? Does indentation have an effect on performance? Is there a package that can track my code and suggest improvements? I provide my code below. The code includes a R code that generates data to which the Julia code is applicable. If an error occurs during the R code, it's just a lack of achievement during simulations and it must be run again until simulation is complete.
using Distances
using RCall
using Distributions
using BSON: #save, #load
using StatsBase
using LinearAlgebra
R"simul<-function(m){
comb<-expand.grid(c(0.01,0.2,0.4),
c(sample(2:7,1),sample(8:12,1),sample(13:20,1)),
c(sample(2:5,1),sample(6:10,1),sample(11:20,1)),
c(150,500,1500))
gener<-function(i){
maxoverlap<-comb[i,1]
nbvar<-comb[i,2]
nbclass<-comb[i,3]
propmix<-runif(1,0.001,1/nbclass)
Q<-MixSim(MaxOmega = maxoverlap, K = nbclass, p = nbvar,PiLow = propmix,resN = 1000)
A <- simdataset(n = comb[i,4], Pi = Q$Pi, Mu = Q$Mu, S = Q$S)
results<-list(Q,A)
return(results)
}
donnees<-sapply(1:nrow(comb),gener)
}
library(MixSim)
donneesimul=simul(1)"
#rget donneesimul
function pointsdpp(t)
datasim=donneesimul[2,t][:X]
Eucldist=pairwise(Euclidean(),transpose(datasim))
D=maximum(Eucldist.^2)
sigma2hat=mean(((Eucldist.^2)./D)[tril!(trues(size((Eucldist.^2)./D)),-1)])
L=exp.(-(Eucldist.^2/D)/(2*sigma2hat))
eigenv=eigvals(L)
prob=eigenv./(eigenv.+1)
eigenvectors=eigvecs(L)
function sampledpp(m)
u=rand(size(L,1))
V=eigenvectors[:,findall(u.<=prob)]
k=size(V,2)
Y=zeros(Int64,k)
for i=k:-1:1
P=sum(V.^2,dims=2)
Pri=P / sum(P)
Cumpri=cumsum(Pri,dims=1)
u=rand()
Y[i]=findfirst(u.<=Cumpri)[1]
if i==1 break end
j=findfirst(V[Y[i],:].!=0)
Vj=V[:,j]
V=V[:,deleteat!(collect(1:1:size(V,2)),j)]
V=V-repeat(Vj,1,size(V,2)).*repeat(transpose(V[Y[i],:]/Vj[Y[i]]),size(V,1))
for a = 1:i-1
for b = 1:a-1
V[:,a] = V[:,a] - transpose(V[:,a])*V[:,b]*V[:,b]
end
V[:,a] = V[:,a] / norm(V[:,a])
end
end
Y=sort(Y)
return(Y)
end
m=collect(1:1000)
sampleY_repet=map(sampledpp,m)
end
w=collect(1:1:81)
echantdpp=map(pointsdpp,w)
#save "echantdppdatasim1.bson" echantdpp
There are many issues to be considered when evaluating Julia performance. While the code you provided is far beyond MWE (minimal working example) and is not reproducible neither.
However, here are some general guidelines:
Take some time to read carefully the Julia performance tips and apply them
Since you process some arrays your code will likely benefit from the #simd macro. Using array views is also very often a low-hanging-fruit for codes such as yours.
You use 16% of CPU power (likely you have 8 cores and your program uses just one). Consider using either multi-threading or multiprocessing - your program will run many times faster
For some scenario you might consider using GPU computing with Flux.jl
Consider moving your multi-core computation to the cloud (Julia scaling on AWS EC2 instances works fantastic)
Since each of those topics is a big area on its own work step-by-step on your code and ask questions to get help.
I want to get inverse matrix, so here is the code:
if (m.rows() == m.fullPivLu().rank())
{
res = m.inverse();
}
The dimension of m and res are all 5000 times 5000. And When I run the code on a high performance computing machine(Linux, Tianhe 2 SuperComputer), the process was kill at res = m.inverse();, and there was no core file or dump information generated. The console return killed and the process exited.
But there is nothing wrong on my Ubuntu laptop.
And the inverse()'s performance is poor and it costs a lot time.
So, why inverse() was killed on the high performance machine? Thank you!
Full-pivoting LU is known to be very slow, regardless of its implementation.
Better use PartialPivLU, which benefits from high performance matrix-matrix operations. Then to get the best of Eigen, use the 3.3-beta2 release and compile with both FMA (-mfma) and OpenMP (e.g., -fopenmp) supports, and don't forget to enable compiler optimizations -O3. This operation should not take more than a few seconds.
Finally, do you really need to explicitly compute the inverse? If you only apply it to some vectors or matrices (i.e., A^-1 * B or B * A^-1) then better apply the inverse in factorized form rather than explicitly computing it. With Eigen 3.3:
MatrixXd A = ...;
PartialPivLU<MatrixXd> lu(A);
x = lu.inverse() * b; // solve Ax=b, same as x = lu.solve(b);
x = b * lu.inverse(); // solve xA=b
In these expressions, the inverse is not explicitly computed!
function w=oja(X, varargin)
% get the dimensionality
[m n] = size(X);
% random initial weights
w = randn(m,1);
options = struct( ...
'rate', .00005, ...
'niter', 5000, ...
'delta', .0001);
options = getopt(options, varargin);
success = 0;
% run through all input samples
for iter = 1:options.niter
y = w'*X;
for ii = 1:n
% y is a scalar, not a vector
w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
end
end
if (any(~isfinite(w)))
warning('Lost convergence; lower learning rate?');
end
end
size(X)= 400 153600
This code implements oja's rule and runs slow. I am not able to vectorize it any more. To make it run faster I wanted to do computations on the GPU, therefore I changed
X=gpuArray(X)
But the code instead ran slower. The computation used seems to be compatible with GPU. Please let me know my mistake.
Profile Code Output:
Complete details:
https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing
This is not a full answer on how to solve it, but more an explanation why GPUs does not speed up, but actually enormously slow down your code.
GPUs are fantastic to speed up code that is parallel, meaning that they can do A LOT of things at the same time (i.e. my GPU can do 30070 things at the same time, while a modern CPU cant go over 16). However, GPU processors are very slow! Nowadays a decent CPU has around 2~3Ghz speed while a modern GPU has 700Mhz. This means that a CPU is much faster than a GPU, but as GPUs can do lots of things at the same time they can win overall.
Once I saw it explained as: What do you prefer, A million dollar sports car or a scooter? A million dolar car or a thousand scooters? And what if your job is to deliver pizza? Hopefully you answered a thousand scooters for this last one (unless you are a scooter fan and you answered the scooters in all of them, but that's not the point). (source and good introduction to GPU)
Back to your code: your code is incredibly sequential. Every inner iteration depends in the previous one and the same with the outer iteration. You can not run 2 of these in parallel, as you need the result from one iteration to run the next one. This means that you will not get a pizza order until you have delivered the last one, thus what you want is to deliver 1 by 1, as fast as you can (so sports car is better!).
And actually, each of these 1 line equations is incredibly fast! If I run 50 of them in my computer I get 13.034 seconds on that line which is 1.69 microseconds per iteration (7680000 calls).
Thus your problem is not that your code is slow, is that you call it A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for these kind of things.
Thus, unfortunately, GPUs suck for sequential code and your code is very sequential, therefore you can not use GPUs to speed up. An HPC will neither help, because every loop iteration depends in the previous one (no parfor :( ).
So, as far I can say, you will need to deal with it.
I've been experimenting with the GPU support of Matlab (v2014a). The notebook I'm using to test my code has a NVIDIA 840M build in.
Since I'm new to GPU computing with Matlab, I started out with a few simple examples and observed a strange scalability behavior. Instead of increasing the size of my problem, I simply put a forloop around my computation. I expected the time for the computation, to scale with the number of iterations, since the problem size itself does not increase. This was also true for smaller numbers of iterations, however at a certain point the time does not scale as expected, instead I observe a huge increase in computation time. Afterwards, the problem continues to scale again as expected.
The code example started from a random walk simulation, but I tried to produce an example that is easy and still shows the problem.
Here's what my code does. I initialize two matrices as sin(alpha)and cos(alpha). Then I loop over the number of iterations from 2**1to 2**15. I then repead the computation sin(alpha)^2 and cos(alpha)^2and add them up (this was just to check the result). I perform this calculation as often as the number of iterations suggests.
function gpu_scale
close all
NP = 10000;
NT = 1000;
ALPHA = rand(NP,NT,'single')*2*pi;
SINALPHA = sin(ALPHA);
COSALPHA = cos(ALPHA);
GSINALPHA = gpuArray(SINALPHA); % move array to gpu
GCOSALPHA = gpuArray(COSALPHA);
PMAX=15;
for P = 1:PMAX;
for i=1:2^P
GX = GSINALPHA.^2;
GY = GCOSALPHA.^2;
GZ = GX+GY;
end
end
The following plot, shows the computation time in a log-log plot for the case that I always double the number of iterations. The jump occurs when doubling from 1024 to 2048 iterations.
The initial bump for two iterations might be due to initialization and is not really relevant anyhow.
I see no reason for the jump between 2**10 and 2**11 computations, since the computation time should only depend on the number of iterations.
My question: Can somebody explain this behavior to me? What is happening on the software/hardware side, that explains this jump?
Thanks in advance!
EDIT: As suggested by Divakar, I changed the way I time my code. I wasn't sure I was using gputimeit correctly. however MathWorks suggests another possible way, namely
gd= gpuDevice();
tic
% the computation
wait(gd);
Time = toc;
Using this way to measure my performance, the time is significantly slower, however I don't observe the jump in the previous plot. I added the CPU performance for comparison and keept both timings for the GPU (wait / no wait), which can be seen in the following plot
It seems, that the observed jump "corrects" the timining in the direction of the case where I used wait. If I understand the problem correctly, then the good performance in the no wait case is due to the fact, that we do not wait for the GPU to finish completely. However, then I still don't see an explanation for the jump.
Any ideas?
[Environment: MATLAB 64 bit, Windows 7, Intel I5-2320]
I would like to RMS-fit a function to experimental data y, so I am minimizing the following function (by using fminsearch):
minfunc = rms(y - fitfunc)
From the general point of view, does it make sense to minimize:
minfunc = sum((y - fitfunc) .^ 2)
instead and then (after minimization) just do minfunc = sqrt(minfunc / N) to get the fit RMS error?
To reformulate the question, how much time (roughly, in percent) would fminsearch save by not doing sqrt and 1/(N - 1) each time? I wouldn't like to decrease readability of my code if my CPU / MATLAB are so fast that it wouldn't improve performance by at least a percent.
Update: I've tried simple tests, but the results are not clear: depending on the actual value of the minfunc, fminsearch takes more or less time.
The general answer for performance questions:
If you just want to figure out what is faster, design a benchmark and run it a few times.
By just providing general information it is not likely that you will determine which method is 1 percent faster.