Parallel computing in Colab

I'm trying to do some parallel computing (first time ever!) and I don't really know how to do it, or whether it will even speed up my computation.
I have a neural net in a Colab notebook, and I have to run the same minibatch of images through it N times in order to gather some dropout statistics.
It is probably quite a simple task, but I have no idea how to do it. The code would be as easy as:
for i in np.arange(iters):
    sample[i] = model(x)
Or something like that, you get the idea.
The thing is that model(x) takes quite a lot of time, so I would really need to run it in parallel.
Also, a somewhat related question: how many cores does Colab have? iters should be on the order of 10,000, so that is probably far more than the number of cores, isn't it?

Here is a simple example of how to parallelize in Colab:
https://colab.research.google.com/drive/1uDd7eq6dAlHGW9o7B-IDResAzShVjh1Z?usp=sharing
from multiprocessing import Pool
import time

def wait_fn(t):
    time.sleep(t / 10)
    return t  # return something so p.map collects results

data = [1] * 10

# Only one thread
start = time.time()
out = [wait_fn(t) for t in data]
end = time.time()
print("Only one thread: ", end - start, "s")

# Parallelize
start = time.time()
with Pool() as p:
    out = p.map(wait_fn, data)
end = time.time()
print("Parallelized: ", end - start, "s")
Only one thread: 1.0021319389343262 s
Parallelized: 0.6200098991394043 s
However, this seems to use only two cores: the parallel run takes roughly 10 × 0.1 s / 2 ≈ 0.5 s plus pool overhead. I'm still trying to find out how to use more cores, if that is possible.
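To answer the core-count question, you can ask the runtime directly with os.cpu_count(); a standard Colab instance typically reports 2 cores, which matches the roughly 2x speedup above. The following is a minimal sketch (not part of the original answer): Pool() already defaults to one worker per core, but passing the count explicitly makes the ceiling visible.
import os
import time
from multiprocessing import Pool

def wait_fn(t):
    time.sleep(t / 10)
    return t

print("cores available:", os.cpu_count())  # typically 2 on a standard Colab VM

data = [1] * 10
# One worker per available core; extra workers would just share
# the same cores and add no speedup for CPU-bound work.
with Pool(processes=os.cpu_count()) as p:
    out = p.map(wait_fn, data)
Note that iters being on the order of 10,000 is not a problem in itself: p.map queues the tasks and feeds them to the workers a few at a time, so the iteration count can far exceed the core count.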

Related

How can I use basic tricks for improving my Julia code?

I'm relatively new to Julia and I'm currently using version 1.0. I have some code that is intended to produce a sequence of integers, based on an input matrix. The code takes 3 hours to run on my machine (i5, dual core, 16 GB RAM), using 16% of CPU and 3% of memory. Are there any basic tips I can learn and apply to optimize my Julia code and improve its performance? Does indentation have an effect on performance? Is there a package that can analyze my code and suggest improvements? I provide my code below. It includes an R script that generates the data the Julia code is applied to; if the R code errors, it is just a failed simulation draw, and it must be rerun until the simulation completes.
using Distances
using RCall
using Distributions
using BSON: @save, @load
using StatsBase
using LinearAlgebra

R"simul <- function(m){
    comb <- expand.grid(c(0.01, 0.2, 0.4),
                        c(sample(2:7, 1), sample(8:12, 1), sample(13:20, 1)),
                        c(sample(2:5, 1), sample(6:10, 1), sample(11:20, 1)),
                        c(150, 500, 1500))
    gener <- function(i){
        maxoverlap <- comb[i, 1]
        nbvar <- comb[i, 2]
        nbclass <- comb[i, 3]
        propmix <- runif(1, 0.001, 1/nbclass)
        Q <- MixSim(MaxOmega = maxoverlap, K = nbclass, p = nbvar, PiLow = propmix, resN = 1000)
        A <- simdataset(n = comb[i, 4], Pi = Q$Pi, Mu = Q$Mu, S = Q$S)
        results <- list(Q, A)
        return(results)
    }
    donnees <- sapply(1:nrow(comb), gener)
}
library(MixSim)
donneesimul = simul(1)"
@rget donneesimul

function pointsdpp(t)
    datasim = donneesimul[2, t][:X]
    Eucldist = pairwise(Euclidean(), transpose(datasim))
    D = maximum(Eucldist.^2)
    sigma2hat = mean(((Eucldist.^2)./D)[tril!(trues(size((Eucldist.^2)./D)), -1)])
    L = exp.(-(Eucldist.^2/D)/(2*sigma2hat))
    eigenv = eigvals(L)
    prob = eigenv./(eigenv .+ 1)
    eigenvectors = eigvecs(L)
    function sampledpp(m)
        u = rand(size(L, 1))
        V = eigenvectors[:, findall(u .<= prob)]
        k = size(V, 2)
        Y = zeros(Int64, k)
        for i = k:-1:1
            P = sum(V.^2, dims=2)
            Pri = P / sum(P)
            Cumpri = cumsum(Pri, dims=1)
            u = rand()
            Y[i] = findfirst(u .<= Cumpri)[1]
            if i == 1 break end
            j = findfirst(V[Y[i], :] .!= 0)
            Vj = V[:, j]
            V = V[:, deleteat!(collect(1:1:size(V, 2)), j)]
            V = V - repeat(Vj, 1, size(V, 2)) .* repeat(transpose(V[Y[i], :]/Vj[Y[i]]), size(V, 1))
            for a = 1:i-1
                for b = 1:a-1
                    V[:, a] = V[:, a] - transpose(V[:, a])*V[:, b]*V[:, b]
                end
                V[:, a] = V[:, a] / norm(V[:, a])
            end
        end
        Y = sort(Y)
        return Y
    end
    m = collect(1:1000)
    sampleY_repet = map(sampledpp, m)
end

w = collect(1:1:81)
echantdpp = map(pointsdpp, w)
@save "echantdppdatasim1.bson" echantdpp
There are many issues to consider when evaluating Julia performance, and the code you provided is far from a minimal working example (MWE), nor is it reproducible. Still, here are some general guidelines:
- Take some time to read the Julia performance tips carefully, and apply them.
- Since you process arrays, your code will likely benefit from the @simd macro. Using array views (the @views macro) is also very often low-hanging fruit for code such as yours.
- You use 16% of CPU power (most likely your machine has several logical cores and your program uses just one). Consider either multi-threading or multiprocessing; your program could run many times faster.
- For some scenarios you might consider GPU computing with Flux.jl.
- Consider moving your multi-core computation to the cloud (Julia scales fantastically well on AWS EC2 instances).
Since each of these topics is a big area of its own, work on your code step by step and ask questions along the way to get help.

Poor performance when using TensorFlow input functions in a loop

I am currently training a set of around 10,000 images with ten epochs. My question is regarding the following code:
file_contents = cv2.imread(shuffle_image_list[i], 3)
resized_image = cv2.resize(file_contents, (72, 72), interpolation=cv2.INTER_AREA)
data = np.array(resized_image)
flattened = data.flatten()
# image_batch, label_batch = tf.train.batch([resized_image, shuffle_label_list[i]], batch_size=batch_size)  # does train.batch take individual images or final tensors?
# if i > batch_size:
#     print(label_batch.eval())
print(str(i))
imageArr.append(flattened)
labelArr.append(int(shuffle_label_list[i]))

if i % 100 == 0:
    print("....... " + str(i))
    _, c = sess.run([optimizer, cost], feed_dict={x: imageArr, y: labelArr})
    epoch_loss += c
    imageArr = []
    labelArr = []
Here I am feeding images to the neural network in mini-batches of 100. In the TensorFlow version, the code first reads the file into a raw string, file_contents, and then decodes that string as a JPEG. However, when I use TensorFlow's functions, such as tf.decode_jpeg and tf.reshape(), to accomplish the same decoding task, there is a big difference in speed: training starts out fast and then becomes incredibly slow after the first epoch.
To put this in perspective: using OpenCV, I could train this entire model for 10 epochs in around 1 hour 30 minutes. Using TensorFlow's functions, it took over 12 hours to get past the first epoch, and midway through the second I stopped training after seeing how little progress it was making.
I am not sure if this has anything to do with the concept of network slowdown, as seen here. I am simply replacing OpenCV functions with TensorFlow functions to read a file and decode an image. Why is there such a dramatic speed difference between OpenCV and TensorFlow? Why exactly do TensorFlow's functions slow down as the code progresses? Will this have any effect on the accuracy of the model?
Note: the only functions I changed were at the top. I didn't use tf.train.batch in either version; only the lines from file_contents through data.flatten() were swapped for the corresponding TensorFlow functions.
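One common cause of this exact symptom, offered here as an assumption since the TensorFlow version of the code is not shown: in TF 1.x graph mode (which the feed_dict/sess.run style implies), calling functions such as tf.read_file or tf.image.decode_jpeg inside the Python loop creates new graph nodes on every iteration, so the graph keeps growing and each sess.run gets progressively slower. A minimal sketch of building the decode ops once, outside the loop:
import tensorflow as tf  # assumes TF 1.x graph mode

# Build the decode pipeline ONCE; these calls create graph nodes,
# so creating them inside the training loop makes the graph (and
# every subsequent sess.run) slower over time.
path_ph = tf.placeholder(tf.string)
image = tf.image.decode_jpeg(tf.read_file(path_ph), channels=3)
resized = tf.image.resize_images(image, [72, 72])
flattened = tf.reshape(resized, [-1])

with tf.Session() as sess:
    for path in shuffle_image_list:  # the list of file paths from the question
        arr = sess.run(flattened, feed_dict={path_ph: path})
If that is what is happening, calling tf.get_default_graph().finalize() after building the model will raise an error at the first op created inside the loop, which makes the graph growth easy to confirm.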

polyfit on GPUArray is extremely slow [duplicate]

function w = oja(X, varargin)
    % get the dimensionality
    [m, n] = size(X);
    % random initial weights
    w = randn(m, 1);
    options = struct( ...
        'rate', .00005, ...
        'niter', 5000, ...
        'delta', .0001);
    options = getopt(options, varargin);
    success = 0;
    % run through all input samples
    for iter = 1:options.niter
        y = w' * X;
        for ii = 1:n
            % y(ii) is a scalar, not a vector
            w = w + options.rate * (y(ii) * X(:, ii) - y(ii)^2 * w);
        end
    end
    if any(~isfinite(w))
        warning('Lost convergence; lower learning rate?');
    end
end
Here size(X) = [400, 153600].
This code implements Oja's rule and runs slowly. I am not able to vectorize it any further, so to make it run faster I wanted to do the computations on the GPU. I therefore changed
X = gpuArray(X)
but the code instead ran slower. The computations used seem to be GPU-compatible. Please let me know my mistake.
Profiler output (screenshot) and complete details:
https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing
This is not a full answer on how to solve the problem, but rather an explanation of why GPUs do not speed up, but actually enormously slow down, your code.
GPUs are fantastic at speeding up parallel code, meaning they can do A LOT of things at the same time (e.g. my GPU can do 30070 things at once, while a modern CPU can't go over 16). However, individual GPU processors are very slow! Nowadays a decent CPU runs at around 2-3 GHz, while a modern GPU runs at around 700 MHz. This means that a CPU core is much faster than a GPU core, but since GPUs can do many things at once, they can win overall.
Once I saw it explained like this: what do you prefer, a million-dollar sports car or a scooter? A million-dollar car or a thousand scooters? And what if your job is to deliver pizza? Hopefully you answered a thousand scooters for this last one (unless you are a scooter fan and answered scooters for all of them, but that's not the point). (source and a good introduction to GPUs)
Back to your code: your code is incredibly sequential. Every inner iteration depends on the previous one, and the same goes for the outer iterations. You cannot run two of these in parallel, as you need the result of one iteration to run the next. This means you will not get a pizza order until you have delivered the previous one, so what you want is to deliver them one by one, as fast as you can (so the sports car is better!).
And actually, each of these one-line updates is incredibly fast! If I run 50 outer iterations on my computer, that line takes 13.034 seconds in total, which is 1.69 microseconds per iteration (7,680,000 calls).
Thus your problem is not that your code is slow; it is that you call it A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for this kind of thing.
Thus, unfortunately, GPUs are bad at sequential code, and your code is very sequential, so you cannot use a GPU to speed it up. An HPC cluster will not help either, because every loop iteration depends on the previous one (no parfor :( ).
So, as far as I can tell, you will need to live with it.
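To make the loop-carried dependence concrete, here is a rough Python/NumPy transcription of the update loop (a sketch for illustration only, not the original MATLAB):
import numpy as np

def oja(X, rate=5e-5, niter=5000):
    m, n = X.shape
    w = np.random.randn(m)
    for _ in range(niter):
        y = w @ X  # projections for the current w
        for i in range(n):
            # The w on the right-hand side is the w produced by the
            # previous sample: a loop-carried dependence, so the n
            # updates must run one after another.
            w = w + rate * (y[i] * X[:, i] - y[i] ** 2 * w)
    return w
Each update reads the w written by the previous one, which is exactly why neither a GPU nor parfor can run these iterations side by side.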

A concrete example of a parfor loop in Matlab that outperforms the for loop

I am still somewhat new to parallel computing in MATLAB. I have used OpenMP in C successfully, but could not get better performance in MATLAB.
First, since I'm on a machine at a university that is new to me, I verified that it has the Parallel Computing Toolbox by typing ver at the command prompt, which displayed: Parallel Computing Toolbox Version 5.2 (R2011b). Note that the machine has 4 cores.
I tried simple examples of parfor vs. for, but for always won, though this might be because of the overhead cost. I was doing simple things like the example here: MATLAB parfor is slower than for -- what is wrong?
Before applying parfor to my bigger, more complicated program (I need to compute 500 evaluations of a function, and each evaluation takes about a minute, so parallelizing should help), I would very much like to see a concrete example where parfor beats for. Examples are abundant for OpenMP, but I did not find a simple example I can copy and paste that shows parfor is better than for.
I use the following code (once per Matlab session) in order to use parfor:
pools = matlabpool('size');
cpus = feature('numCores');
if pools ~= (cpus - 1)
    if pools > 0
        matlabpool('close');
    end
    matlabpool('open', cpus - 1);
end
This leaves 1 core for other processes.
Note, the feature() command is undocumented.
There is an example of improved performance from parfor on Loren Shure's MATLAB blog.
Her example is simply computing the rank of a magic square matrix:
function ranks = parMagic(n)
    ranks = zeros(1, n);
    parfor (ind = 1:n)
        ranks(ind) = rank(magic(ind)); % last index could be ind, not n-ind+1
    end
Serg describes how to "enable" parallel functionality.
Here is a very simple cut-and-paste example to test it with, as requested. Simply copy and paste the following into an m-file and run it.
function parfortest()
    enable_parallel;
    pause on
    tic;
    N = 500;
    for i = 1:N
        sequential_answer = slow_fun(i);
    end
    sequential_time = toc
    tic;
    parfor i = 1:N
        sequential_answer = slow_fun(i);
    end
    parallel_time = toc
end

function result = slow_fun(x)
    pause(0.001);
    result = x;
end
If you have run the code to enable parallelism, as shown in Serg's answer, you should see a pretty obvious improvement in performance.

Non-linear performance of Java function in parallel MATLAB

Recently I implemented parallelisation in my MATLAB program, following the suggestions offered in Slow xlsread in MATLAB. However, implementing the parallelism has surfaced another problem: processing time that increases non-linearly with scale.
The culprit seems to be the java.util.concurrent.LinkedBlockingQueue method, as can be seen from the attached profiler screenshots and the corresponding condensed graphs.
Problem: how do I remove this non-linearity, given that my work involves processing more than 1,000 sheets in a single run, which would otherwise take an insanely long time?
Note: the parallelised part of the program just reads all the .xls files and stores them in matrices, after which I start the remainder of the program. dlmwrite is used towards the end of the program, and optimizing its time is not really required, although suggestions are welcome.
Code being parallelised:
parfor i = 1:runs
    sin = 'Sheet';
    sno = num2str(i);
    sna = strcat(sin, sno);
    data(i, :, :) = xlsread('Processes.xls', sna, '', 'basic');
end
Doing parallel I/O operations is likely to be the problem here (it can in fact be slower than serial reads), unless perhaps you keep everything on an SSD. If you are always reading the same file and it is not enormous, you may want to try reading it once before the loop and doing only the data manipulation in parallel.
