Is there a way to do an embarrassingly parallel run of a program in a multi-GPU environment in Python? - multiprocessing

I have a node with 4 GPUs attached. I have a Python code which consists of a loop that can be embarrassingly parallelized. Currently my program only uses 1 GPU (I use a library which runs the simulations on the GPU and does not support multi-GPU). Is there a way in Python to run my code on multiple GPUs? I want something analogous to the snippet below, but for GPUs:
from multiprocessing import Pool

def func(x):
    return x * x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(func, [1, 2, 3]))

I was also interested in this and tested it. The following code snippet works for me.
import os
import time
import ray

ray.init(num_gpus=2)

@ray.remote(num_gpus=0.25)
def func(x):
    print(x * x)

# The eight tasks created here can execute concurrently on two GPUs
ray.get([func.remote(i) for i in range(8)])

I suggest giving Ray a try. See its GPU support documentation. You can basically use @ray.remote(num_gpus=1) to annotate func(x) and then invoke it with func.remote(1).
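If you would rather stay with the standard library, the same effect can be had with multiprocessing by pinning each worker process to its own GPU via CUDA_VISIBLE_DEVICES before the GPU library initializes. A minimal sketch of that pattern (the queue-based GPU assignment and the func body are illustrative placeholders, not any particular library's API):

import os
import multiprocessing as mp

def init_worker(gpu_queue):
    # Each worker takes one GPU id off the queue and pins itself to it.
    # This must happen before the GPU library is imported/initialized.
    gpu_id = gpu_queue.get()
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

def func(x):
    # Import and call the single-GPU simulation library here;
    # it will only see the one GPU assigned to this process.
    return x * x

if __name__ == "__main__":
    num_gpus = 4
    gpu_queue = mp.Manager().Queue()
    for i in range(num_gpus):
        gpu_queue.put(i)
    with mp.Pool(num_gpus, initializer=init_worker,
                 initargs=(gpu_queue,)) as p:
        print(p.map(func, range(8)))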

Related

Parallel computing in Colab

I'm trying to do some parallel computing (first time ever!) and I don't really know how to do it, or whether it is going to speed up my computation.
I have a neural net in a Colab Notebook and I have to run through it the same minibatch of images N times, in order to do some dropout statistics.
It is probably quite a simple task, but I have no idea how to do it. The code would be as easy as:
for i in np.arange(iters):
    sample[i] = model(x)
Or something like that, you get the idea.
The thing is that model(x) consumes quite a lot of time, and I would really need to run it in parallel.
Also, a somewhat related question: how many cores does Colab have? iters should be on the order of 10,000, so that is probably way too much, isn't it?
This is a simple example of how to parallelize in Colab:
https://colab.research.google.com/drive/1uDd7eq6dAlHGW9o7B-IDResAzShVjh1Z?usp=sharing
from multiprocessing import Pool
import time

def wait_fn(t):
    time.sleep(t / 10)

data = [1] * 10

# Only one thread
start = time.time()
out = [wait_fn(t) for t in data]
end = time.time()
print("Only one thread: ", end - start, "s")

# Parallelized
start = time.time()
with Pool() as p:
    out = p.map(wait_fn, data)
end = time.time()
print("Parallelized: ", end - start, "s")
Only one thread: 1.0021319389343262 s
Parallelized: 0.6200098991394043 s
However, this seems to use only two cores. I'm still trying to find out how to use more cores, if possible.
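To check how many cores the runtime actually exposes (Pool() with no argument defaults to this number), a quick standard-library check should work:

import os
print(os.cpu_count())  # number of CPU cores visible to this runtime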

Is there a way to use torch.autograd.gradient in parallel in Pytorch?

I am trying to train a network where the loss is a function not only of the output but also of the derivative of the output w.r.t. the input. The problem is that while computing the batch output can be done in parallel with the modules in PyTorch, I can't find a way to compute the derivative in parallel. Here's the best I can do in serial:
import torch

x = torch.rand(300, 1)
dydx = torch.zeros_like(x)
fc = torch.nn.Linear(1, 1)
x.requires_grad = True
for ii in range(x.size(0)):
    xi = x[ii, 0:]
    yi = torch.tanh(fc(xi))
    dydx[ii] = torch.autograd.grad(yi, xi, create_graph=True)[0]
dydxsum = (dydx ** 2).sum()
dydxsum.backward()
In the code above, x is split to save memory and time. However, when the size of x becomes large, parallelization (in CUDA) is still necessary. If it has to be implemented by tinkering with PyTorch internals, a hint on where to start would be appreciated.
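For a model like this one, where row i of the output depends only on row i of the input (so the Jacobian is diagonal across the batch), one batched call with grad_outputs can replace the Python loop entirely and runs in parallel on the GPU. A minimal sketch of that idea:

import torch

x = torch.rand(300, 1, requires_grad=True)
fc = torch.nn.Linear(1, 1)
y = torch.tanh(fc(x))
# One backward pass with grad_outputs=ones recovers every per-row
# derivative at once, because each y[i] depends only on x[i].
dydx = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y),
                           create_graph=True)[0]
dydxsum = (dydx ** 2).sum()
dydxsum.backward()

Note this shortcut relies on the diagonal structure; for models where outputs mix across the batch it would silently sum the cross-terms.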

Calling functions inside functions vs returning, which is more efficient?

Of the 2 cases below, which one is more efficient?
Case 1:
def test(x):
    hello(x)

def hello(y):
    return world(y)

def world(z):
    return z
Case 2:
def test(x):
    a = hello(x)

def hello(y):
    b = world(y)
    return b

def world(z):
    c = z
    return z
TL;DR: both are just as fast.
Here are the timings on my machine with the CPython 3.7.6 interpreter (a loop of 10,000,000 runs, repeated 3 times):
First case: 109 ns
Second case: 111 ns
The difference between the two is negligible; they are equally fast.
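Timings like these can be reproduced with the standard timeit module; absolute numbers will of course vary by machine and interpreter:

import timeit

setup = """
def world(z):
    return z

def hello(y):
    return world(y)

def test(x):
    hello(x)
"""

# 10,000,000 calls, best of 3 repeats, reported per call in nanoseconds
times = timeit.repeat("test(1)", setup=setup, number=10_000_000, repeat=3)
print(min(times) / 10_000_000 * 1e9, "ns per call")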
Using PyPy (a JIT-based interpreter), the JIT compiler is able to optimize both cases and remove all the function calls. Thanks to the JIT, the calls cost essentially nothing and the two cases take exactly the same time.
The same kind of thing can occur if you use Cython.
Advice:
Please do not tweak Python code with micro-optimizations like this for better performance. Readability takes precedence over performance in Python.
If you really care about the performance of such a thing, use an alternative implementation like PyPy or Cython, or a Python JIT compiler integrated as a library (such as Numba), as sketched below.
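As an illustration of that last point, here is a hedged sketch with Numba (assuming numba is installed): decorating the call chain with @njit lets LLVM inline the trivial functions, so after the first compiling call the function-call overhead essentially disappears.

from numba import njit

@njit
def world(z):
    return z

@njit
def hello(y):
    return world(y)

@njit
def test(x):
    return hello(x)

test(1)  # first call triggers compilation; later calls are cheap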

How can I use basic tricks for improving my Julia code?

I'm relatively new to Julia and I'm currently using version 1.0. I have a code that is intended to produce a sequence of integers, based on an input matrix. The code takes 3 hours to run on my machine (i5, dual core, 16GB RAM), using 16% of CPU and 3% of memory. Are there any basic tips I can learn and apply to optimize my code in Julia and improve its performance? Does indentation have an effect on performance? Is there a package that can track my code and suggest improvements? I provide my code below. It includes an R block that generates the data to which the Julia code is applied. If an error occurs during the R code, it just means a simulation run failed, and it must be run again until the simulation completes.
using Distances
using RCall
using Distributions
using BSON: @save, @load
using StatsBase
using LinearAlgebra

R"simul <- function(m){
    comb <- expand.grid(c(0.01, 0.2, 0.4),
                        c(sample(2:7, 1), sample(8:12, 1), sample(13:20, 1)),
                        c(sample(2:5, 1), sample(6:10, 1), sample(11:20, 1)),
                        c(150, 500, 1500))
    gener <- function(i){
        maxoverlap <- comb[i, 1]
        nbvar <- comb[i, 2]
        nbclass <- comb[i, 3]
        propmix <- runif(1, 0.001, 1/nbclass)
        Q <- MixSim(MaxOmega = maxoverlap, K = nbclass, p = nbvar, PiLow = propmix, resN = 1000)
        A <- simdataset(n = comb[i, 4], Pi = Q$Pi, Mu = Q$Mu, S = Q$S)
        results <- list(Q, A)
        return(results)
    }
    donnees <- sapply(1:nrow(comb), gener)
}
library(MixSim)
donneesimul = simul(1)"
@rget donneesimul

function pointsdpp(t)
    datasim = donneesimul[2, t][:X]
    Eucldist = pairwise(Euclidean(), transpose(datasim))
    D = maximum(Eucldist.^2)
    sigma2hat = mean(((Eucldist.^2) ./ D)[tril!(trues(size((Eucldist.^2) ./ D)), -1)])
    L = exp.(-(Eucldist.^2 / D) / (2 * sigma2hat))
    eigenv = eigvals(L)
    prob = eigenv ./ (eigenv .+ 1)
    eigenvectors = eigvecs(L)
    function sampledpp(m)
        u = rand(size(L, 1))
        V = eigenvectors[:, findall(u .<= prob)]
        k = size(V, 2)
        Y = zeros(Int64, k)
        for i = k:-1:1
            P = sum(V.^2, dims = 2)
            Pri = P / sum(P)
            Cumpri = cumsum(Pri, dims = 1)
            u = rand()
            Y[i] = findfirst(u .<= Cumpri)[1]
            if i == 1 break end
            j = findfirst(V[Y[i], :] .!= 0)
            Vj = V[:, j]
            V = V[:, deleteat!(collect(1:1:size(V, 2)), j)]
            V = V - repeat(Vj, 1, size(V, 2)) .* repeat(transpose(V[Y[i], :] / Vj[Y[i]]), size(V, 1))
            for a = 1:i-1
                for b = 1:a-1
                    V[:, a] = V[:, a] - transpose(V[:, a]) * V[:, b] * V[:, b]
                end
                V[:, a] = V[:, a] / norm(V[:, a])
            end
        end
        Y = sort(Y)
        return(Y)
    end
    m = collect(1:1000)
    sampleY_repet = map(sampledpp, m)
end

w = collect(1:1:81)
echantdpp = map(pointsdpp, w)
@save "echantdppdatasim1.bson" echantdpp
There are many issues to consider when evaluating Julia performance, and the code you provided is far from an MWE (minimal working example) and is not reproducible either.
However, here are some general guidelines:
Take some time to read the Julia performance tips carefully and apply them.
Since you process arrays, your code will likely benefit from the @simd macro. Using array views is also very often low-hanging fruit for code such as yours.
You use 16% of CPU power (likely you have 8 cores and your program uses just one). Consider using either multi-threading or multiprocessing - your program will run many times faster.
For some scenarios you might consider GPU computing with Flux.jl.
Consider moving your multi-core computation to the cloud (Julia scaling on AWS EC2 instances works fantastically well).
Since each of those topics is a big area of its own, work on your code step-by-step and ask questions as you go to get help.

Safety of sharing a read-only scipy sparse matrix between multiple processes

I have a computation I must do which is somewhat expensive and I want to spawn multiple processes to complete it. The gist is more or less this:
1) I have a big scipy.sparse.csc_matrix (could use other sparse format if needed) from which I'm going to read (only read, never write) data for the calculation.
2) I must do lots of embarrassingly parallel calculations and return values.
So I did something like this:
import numpy as np
from multiprocessing import Process, Manager

def f(instance, big_matrix):
    """
    This is the actual thing I want to calculate. This reads lots of
    data from big_matrix but never writes anything to it.
    """
    return stuff_calculated

def do_some_work(big_matrix, instances, outputs):
    """
    This does a few chunked calculations for a few instances and
    saves the result in `outputs`, which is a memory-shared dictionary.
    """
    for instance in instances:
        x = f(instance, big_matrix)
        outputs[instance] = x

def split_work(big_matrix, instances_to_calculate):
    """
    Split do_some_work into many processes by chunking instances_to_calculate,
    creating a shared dictionary and spawning and joining the processes.
    """
    # break instance list into 4 chunks to pass to each process
    instance_sets = np.array_split(instances_to_calculate, 4)
    manager = Manager()
    outputs = manager.dict()
    processes = [
        Process(target=do_some_work, args=(big_matrix, instances, outputs))
        for instances in instance_sets
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return instance_sets, outputs
My question is: is this safe? My function f never writes anything, but I'm not taking any precaution to share big_matrix between processes, just passing it as it is. It seems to be working, but I'm concerned whether I can corrupt anything just by passing a value to multiple processes, even if I never write to it.
I tried to use the sharemem package to share the matrix between multiple processes, but it seems unable to hold scipy sparse matrices, only plain numpy arrays.
If this isn't safe, how can I share (read-only) big sparse matrices between processes without problems?
I've seen here that I can make another csc_matrix pointing to the same memory with:
other_matrix = csc_matrix(
    (big_matrix.data, big_matrix.indices, big_matrix.indptr),
    shape=big_matrix.shape,
    copy=False
)
Will this make it safer, or would it be just as safe as passing the original object?
Thanks.
As explained here, it seems your first option creates one copy of the sparse matrix per process. This is safe, but isn't ideal from a performance point of view. However, depending on the computation you perform on the sparse matrix, the overhead may not be significant.
I suspect a cleaner option using the multiprocessing lib would be to create three shared arrays (depending on the matrix format you use) and populate them with the data, indices and indptr arrays of your CSC matrix. The documentation for multiprocessing shows how this can be done using an Array or using the Manager and one of the supported types.
Afterwards I don't see how you could run into trouble using read-only operations, and it may be more efficient; a sketch of this idea follows.
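A minimal sketch of that approach, assuming CSC format (the helper names to_shared/from_shared are illustrative, not a library API):

import multiprocessing as mp
import numpy as np
from scipy.sparse import csc_matrix

def to_shared(arr):
    # Copy a 1-D numpy array into a lock-free shared-memory buffer.
    buf = mp.RawArray(np.ctypeslib.as_ctypes_type(arr.dtype), arr.size)
    np.frombuffer(buf, dtype=arr.dtype)[:] = arr
    return buf, arr.dtype

def from_shared(spec):
    # View the shared buffer as a numpy array without copying.
    buf, dtype = spec
    return np.frombuffer(buf, dtype=dtype)

def worker(data_spec, indices_spec, indptr_spec, shape):
    # Rebuild a view of the matrix on top of the shared buffers.
    m = csc_matrix((from_shared(data_spec), from_shared(indices_spec),
                    from_shared(indptr_spec)), shape=shape, copy=False)
    print(m.sum())  # read-only work on the shared matrix goes here

if __name__ == "__main__":
    big_matrix = csc_matrix(np.eye(4))
    specs = [to_shared(a) for a in
             (big_matrix.data, big_matrix.indices, big_matrix.indptr)]
    p = mp.Process(target=worker, args=(*specs, big_matrix.shape))
    p.start()
    p.join()

Since nothing writes to the buffers, no locking is needed; if workers did write, mp.Array (which wraps the buffer with a lock) would be the safer choice.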
