temperary shared parallelism (OMP) within distributed environment (MPI) - performance

Consider the following scenario:
We use n MPI processes to evaluate chunks of a large matrix A. Each chunk is evaluated by a unique process. Number of OMP threads is 1.
We gather these chunks in the master process (rank=0) and build the full matrix A.
The master process calls a function x = f(A) and broadcasts the result x to all processes.
The issue is that f(A) takes a long time, and other processes have to wait.
If the function f can be accelerated with OMP parallelism, does it make sense to set the number of OMP threads (on the master process) equal to n before calling f(A) and changing it to 1 afterwards?
In other words, what can be wrong with the following pseudo-code?
mpirun -n n ./exe # command for running the code
where the pseudo-code for exe looks something like
a = g(mpi_rank, ... ) # process-specific matrix
A = mpi_gather(a, mpi_rank=0) # gather all
if mpi_rank == 0
set_num_omp_threads(n)
x = f(A)
set_num_omp_threads(1)
else
x = 0
mpi_bcast(x, mpi_rank=0)
what are the possible performance issues/pitfalls?
EDIT: The original code is too complicated, but I have replicated the issue via the following code.
# run by: mpirun -n 6 python test.py
import time
import torch
import torch.distributed as dist
def get_weights(A, Y, use_omp=False):
# maximize OMP threads
master = dist.get_rank() == 0
if use_omp and master:
ws = dist.get_world_size()
th = torch.get_num_threads()
torch.set_num_threads(ws*th)
# actual calculation; only master
if dist.get_rank() == 0:
Q, R = torch.qr(A)
W = (R.inverse()#Q.t()#Y)
print(torch.get_num_threads())
else:
W = torch.zeros(A.shape[1], 1)
dist.broadcast(W, 0)
# reset OMP threads
if use_omp and master:
torch.set_num_threads(th)
return W
def test(n=100000, m=300, s=10, use_omp=False):
# ...
# normal distributed code ...
# ...
A = torch.rand(n, m)
_W = torch.rand(m, 1)
Y = A # _W
W = get_weights(A, Y, use_omp=use_omp) # <- this
res = W.allclose(_W, atol=1e-3)
return res
def timeit(repeat=10, use_omp=False):
t1 = time.time()
for _ in range(repeat):
test(use_omp=use_omp)
t2 = time.time()
return (t2-t1)/repeat
if __name__ == '__main__':
dist.init_process_group('mpi')
print(timeit(use_omp=False)) # -> 3.036880087852478
print(timeit(use_omp=True)) # -> 1.3959508180618285
In the above example setting threads to 6 improved the speed by a factor of ~2. But when I try sth similar to this in my actual code (and on a much larger cluster) it became almost 2 times slower!
EDIT2: The essence of this question is that (on a single node with n cores) most of the calculation (70%) is optimal with MPI using n processes, the other part (30%) is optimal with n OMP threads. I was wondering about optimal utilization of the available cores in both parts.
Although the comments were very helpful, I guess there are no easy answers. In this particular case the mentioned region is a linear algebra problem and using scalapack is probably the best solution.
But the question stands for general cases.

I think from what I understood from your question is to launch the OpenMP threads for the computation and then shutdown the OpenMP implementation. With your pseudo-code example, it would look like this:
a = g(mpi_rank, ... ) # process-specific matrix
A = mpi_gather(a, mpi_rank=0) # gather all
if mpi_rank == 0
x = f(A) ! do a parallel region in f()
omp_pause_resource_all(omp_pause_hard)
else
x = 0
mpi_bcast(x, mpi_rank=0)
See the functions omp_pause_resource_all() and omp_pause_resource() of the OpenMP API specification. They terminate most of the OpenMP implementation, such that only a very small part that should not affect performance will remain.
Note, the functions were introduced with OpenMP API version 5.0, so you will need a fairly new OpenMP compiler for this to work.

Arrange your MPI processes so that node 0 has only one process, and all other nodes have one process per core. Then process zero can easily launch an OpenMP (or otherwise threaded) shared-memory parallel section.

Related

R caret: fully reproducible results with parallel rfe on different machines

I have the following code using random forest as method which is fully reproducible if you run it in parallel mode on the same machine:
library(doParallel)
library(caret)
recursive_feature_elimination <- function(dat){
all_preds <- dat[,which(names(dat) %in% c("Time", "Chick", "Diet"))]
response <- dat[,which(names(dat) == "weight")]
sizes <- c(1:(ncol(all_preds)-1))
# set seeds manually
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
# an optional vector of integers for the size. The vector should have length of length(sizes)+1
# length is n_repeats*nresampling+1
seeds <- vector(mode = "list", length = 16)
for(i in 1:15) seeds[[i]]<- sample.int(n=1000, size = length(sizes)+1)
# for the last model
seeds[[16]]<-sample.int(1000, 1)
seeds_list <- list(rfe_seeds = seeds,
train_seeds = NA)
# specify rfeControl
contr <- caret::rfeControl(functions=rfFuncs, method="repeatedcv", number=3, repeats=5,
saveDetails = TRUE, seeds = seeds, allowParallel = TRUE)
# recursive feature elimination caret
results <- caret::rfe(x = all_preds,
y = response,
sizes = sizes,
method ="rf",
ntree = 250,
metric= "RMSE",
rfeControl=contr )
return(results)
}
dat <- as.data.frame(ChickWeight)
cores <- detectCores()
cl <- makePSOCKcluster(cores, outfile="")
registerDoParallel(cl)
results <- recursive_feature_elimination(dat)
stopCluster(cl)
registerDoSEQ()
The outcome on my machine is:
Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
1 39.14 0.6978 24.60 2.755 0.02908 1.697
2 23.12 0.8998 13.90 2.675 0.02273 1.361 *
3 28.18 0.8997 20.32 2.243 0.01915 1.225
The top 2 variables (out of 2):
Time, Chick
I am using a Windows OS with one CPU and 4 cores. If the code is run on a UNIX OS using multiple CPUs with multiple cores, the outcome is different. I think this behaviour shows up because of the random number generation, which obviously differs between my system and the multi-CPU system. The same happens with train().
How can I get fully reproducible results independent of the OS and independent of how many CPUs and cores used for parallelization?
How can I assure that the same random numbers are used for each internal process of rfe and randomForest no matter in which sequence during the parallel computing the process is run?
How are the random numbers generated for each parallel process?

Concurrence of Julia Parallel Computing

I am new to Julia and studying Julia parallel computing recently.
I am still not clear about the accurate mechanism of Julia's parallelism including macros \#sync and \#async after I read the relevant documents.
The following is the pmap function from the Julia v0.5 documentation:
function pmap(f, lst)
np = nprocs() # determine the number of processes available
n = length(lst)
results = Vector{Any}(n)
i = 1
# function to produce the next work item from the queue.
# in this case it's just an index.
nextidx() = (idx=i; i+=1; idx)
#sync begin
for p=1:np
if p != myid() || np == 1
#async begin
while true
idx = nextidx()
if idx > n
break
end
results[idx] = remotecall_fetch(f, p, lst[idx])
end
end
end
end
end
results
end
Is it possible for different two processors/workers call nextidx() at the same time getting the same idx = j? If yes, I feel results[j] will be computed twice and result[j+1] will not be computed.
Thanks very much.
More findings:
function f1()
i=1
nextidx()=(idx=i;sleep(1);i+=1;idx)
for p=1:2
#async begin
idx=nextidx()
println(idx)
end
end
end
f1()
The result is 1 1.
Through this I find the time periods during which the two tasks call the function nextidx() could overlap. So I feel that in the first code,
if np = 3 (i.e. two workers), and the length n of lst is very large, say 10^8,
it's possible for the tasks to get the same index.
It may happen just because of a coincidence in time, i.e., the two tasks take the expression idx = i at almost the same time point, so the code is not stable.
Am I right?
No, so long as you schedule jobs correctly. The pmap documentation points out that the master process schedules the jobs serially; only the parallel execution is asynchronous. pmap as coded requires no thread locks to ensure correct job scheduling. Adding sleep to nextidx deliberately breaks this feature and introduces the race condition that you observe. Assuming that all of the processes share state -- which is the purpose of the nextidx function -- then the master process will not schedule the same job twice.

Huge memory allocation running a julia function?

I try to run the following function in julia command, but when timing the function I see too much memory allocations which I can't figure out why.
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
for m = 1:length(Pf)
i = 0
for k = 1:iters
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
energy_fin = (y'*y) / L
#inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
#inbounds Pd[m] = i/iters
end
#thresh = erfcinv(2Pf) * sqrt(2/L) + 1
#Pd_the = 0.5 * erfc(((thresh - (snr + 1)) * sqrt(L)) / (2*(snr + 1)))
end
Running that function in the julia command on my laptop, I get the following shocking numbers:
julia> #time pdpf(1000, 10000)
17.621551 seconds (9.00 M allocations: 30.294 GB, 7.10% gc time)
What is wrong with my code? Any help is appreciated.
I don't think this memory allocation is so surprising. For instance, consider all of the times that the inner loop gets executed:
for m = 1:length(Pf) this gives you 100 executions
for k = 1:iters this gives you 10,000 executions based on the arguments you supply to the function.
randn(L) this gives you a random vector of length 1,000, based on the arguments you supply to the function.
Thus, just considering these, you've got 100*10,000*1000 = 1 billion Float64 random numbers being generated. Each one of them takes 64 bits = 8 bytes. I.e. 8GB right there. And, you've got two calls to randn(L) which means that you're at 16GB allocations already.
You then have y = s + n which means another 8GB allocations, taking you up to 24GB. I haven't looked in detail on the remaining code to get you from 24GB to 30GB allocations, but this should show you that it's not hard for the GB allocations to start adding up in your code.
If you're looking at places to improve, I'll give you a hint that these lines can be improved by using the properties of normal random variables:
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
You should easily be able to cut down the allocations here from 24GB to 8GB in this way. Note that y will be a normal random variable here as you've defined it, and think up a way to generate a normal random variable with an identical distribution to what y has now.
Another small thing, snr is a constant inside your function. Yet, you keep taking its sqrt 1 million separate times. In some settings, 'checking your work' can be helpful, but I think that you can be confident the computer will get it right the first time and thus you don't need to make it keep re-doing this calculation ; ). There are other similar places you can improve your code to avoid duplicate computations here that I'll leave to you to locate.
aireties gives a good answer for why you have so many allocations. You can do more to reduce the number of allocations. Using this property we know that y = s+n is really y = sqrt(snr) * randn(L) + randn(L) and so we can instead do y = rvvar*randn(L) where rvvar= sqrt(1+sqrt(snr)^2) is defined outside the loop (thanks for the fix!). This will halve the number of random variables needed.
Outside the loop you can save sqrt(2/L) to cut down a little bit of time.
I don't think transpose is special-cased yet, so try using dot(y,y) instead of y'*y. I know dot for sure is just a loop without having to transpose, while the other may transpose depending on the version of Julia.
Something that would help performance (but not allocations) would be to use one big randn(L,iters) and loop through that. The reason is because if you make all of your random numbers all at once it's faster since it can use SIMD and a bunch of other goodies. If you want to implicitly do that without changing your code much, you can use ChunkedArrays.jl where you can use rands = ChunkedArray(randn,L) to initialize it and then everytime you want a randn(L), you instead use next(rands). Inside the ChunkedArray it actually makes bigger vectors and replenishes them as needed, but like this you can just get your randn(L) without having to keep track of all of that.
Edit:
ChunkedArrays probably only save time when L is smaller. This gives the code:
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
rvvar= sqrt(1+sqrt(snr)^2)
for m = 1:length(Pf)
i = 0
for k = 1:iters
y = rvvar*randn(L)
energy_fin = (y'*y) / L
#inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
#inbounds Pd[m] = i/iters
end
end
which runs in half the time as using two randn calls. Indeed from the ProfileViewer we get:
#profile pdpf(1000, 10000)
using ProfileView
ProfileView.view()
I circled the two parts for the line y = rvvar*randn(L), so the vast majority of the time is random number generation. Last time I checked you could still get a decent speedup on random number generation by changing to to VSL.jl library, but you need MKL linked to your Julia build. Note that from the Google Summer of Code page you can see that there is a project to make a repo RNG.jl with faster psudo-rngs. It looks like it already has a few new ones implemented. You may want to check them out and see if they give speedups (or help out with that project!)

Spark example program runs very slow

I tried to use Spark to work on simple graph problem. I found an example program in Spark source folder: transitive_closure.py, which computes the transitive closure in a graph with no more than 200 edges and vertices. But in my own laptop, it runs more than 10 minutes and doesn't terminate. The command line I use is: spark-submit transitive_closure.py.
I wonder why spark is so slow even when computing just such small transitive closure result? Is it a common case? Is there any configuration I miss?
The program is shown below, and can be found in spark install folder at their website.
from __future__ import print_function
import sys
from random import Random
from pyspark import SparkContext
numEdges = 200
numVertices = 100
rand = Random(42)
def generateGraph():
edges = set()
while len(edges) < numEdges:
src = rand.randrange(0, numEdges)
dst = rand.randrange(0, numEdges)
if src != dst:
edges.add((src, dst))
return edges
if __name__ == "__main__":
"""
Usage: transitive_closure [partitions]
"""
sc = SparkContext(appName="PythonTransitiveClosure")
partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
tc = sc.parallelize(generateGraph(), partitions).cache()
# Linear transitive closure: each round grows paths by one edge,
# by joining the graph's edges with the already-discovered paths.
# e.g. join the path (y, z) from the TC with the edge (x, y) from
# the graph to obtain the path (x, z).
# Because join() joins on keys, the edges are stored in reversed order.
edges = tc.map(lambda x_y: (x_y[1], x_y[0]))
oldCount = 0
nextCount = tc.count()
while True:
oldCount = nextCount
# Perform the join, obtaining an RDD of (y, (z, x)) pairs,
# then project the result to obtain the new (x, z) paths.
new_edges = tc.join(edges).map(lambda __a_b: (__a_b[1][1], __a_b[1][0]))
tc = tc.union(new_edges).distinct().cache()
nextCount = tc.count()
if nextCount == oldCount:
break
print("TC has %i edges" % tc.count())
sc.stop()
There can many reasons why this code doesn't perform particularly well on your machine but most likely this is just another variant of the problem described in Spark iteration time increasing exponentially when using join. The simplest way to check if it is indeed the case is to provide spark.default.parallelism parameter on submit:
bin/spark-submit --conf spark.default.parallelism=2 \
examples/src/main/python/transitive_closure.py
If not limited otherwise, SparkContext.union, RDD.join and RDD.union set a number of partitions of the child to the total number of partitions in the parents. Usually it is a desired behavior but can become extremely inefficient if applied iteratively.
The useage says the command line is
transitive_closure [partitions]
Setting default parallelism will only help with the joins in each partition, not the inital distribution of work.
Im going to argue that that MORE partitions should be used. Setting the default parallelism may still help, but the code you posted sets the number explicitly (the argument passed or 2, whichever is greater). The absolute minimum should be the cores available to Spark, otherwise you're always working at less than 100%.

Parallelising gradient calculation in Julia

I was persuaded some time ago to drop my comfortable matlab programming and start programming in Julia. I have been working for a long with neural networks and I thought that, now with Julia, I could get things done faster by parallelising the calculation of the gradient.
The gradient need not be calculated on the entire dataset in one go; instead one can split the calculation. For instance, by splitting the dataset in parts, we can calculate a partial gradient on each part. The total gradient is then calculated by adding up the partial gradients.
Though, the principle is simple, when I parallelise with Julia I get a performance degradation, i.e. one process is faster then two processes! I am obviously doing something wrong... I have consulted other questions asked in the forum but I could still not piece together an answer. I think my problem lies in that there is a lot of unnecessary data moving going on, but I can't fix it properly.
In order to avoid posting messy neural network code, I am posting below a simpler example that replicates my problem in the setting of linear regression.
The code-block below creates some data for a linear regression problem. The code explains the constants, but X is the matrix containing the data inputs. We randomly create a weight vector w which when multiplied with X creates some targets Y.
######################################
## CREATE LINEAR REGRESSION PROBLEM ##
######################################
# This code implements a simple linear regression problem
MAXITER = 100 # number of iterations for simple gradient descent
N = 10000 # number of data items
D = 50 # dimension of data items
X = randn(N, D) # create random matrix of data, data items appear row-wise
Wtrue = randn(D,1) # create arbitrary weight matrix to generate targets
Y = X*Wtrue # generate targets
The next code-block below defines functions for measuring the fitness of our regression (i.e. the negative log-likelihood) and the gradient of the weight vector w:
####################################
## DEFINE FUNCTIONS ##
####################################
#everywhere begin
#-------------------------------------------------------------------
function negative_loglikelihood(Y,X,W)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here log-likelihood
ll = 0
for nn=1:N
ll = ll - 0.5*sum((Y[nn,:] - X[nn,:]*W).^2)
end
return ll
end
#-------------------------------------------------------------------
function negative_loglikelihood_grad(Y,X,W, first_index,last_index)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here gradient contributions by each data item
grad = zeros(similar(W))
for nn=first_index:last_index
grad = grad + X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
end
return grad
end
end
Note that the above functions are on purpose not vectorised! I choose not to vectorise, as the final code (the neural network case) will also not admit any vectorisation (let us not get into more details regarding this).
Finally, the code-block below shows a very simple gradient descent that tries to recover the parameter weight vector w from the given data Y and X:
####################################
## SOLVE LINEAR REGRESSION ##
####################################
# start from random initial solution
W = randn(D,1)
# learning rate, set here to some arbitrary small constant
eta = 0.000001
# the following for-loop implements simple gradient descent
for iter=1:MAXITER
# get gradient
ref_array = Array(RemoteRef, nworkers())
# let each worker process part of matrix X
for index=1:length(workers())
# first index of subset of X that worker should work on
first_index = (index-1)*int(ceil(N/nworkers())) + 1
# last index of subset of X that worker should work on
last_index = min((index)*(int(ceil(N/nworkers()))), N)
ref_array[index] = #spawn negative_loglikelihood_grad(Y,X,W, first_index,last_index)
end
# gather the gradients calculated on parts of matrix X
grad = zeros(similar(W))
for index=1:length(workers())
grad = grad + fetch(ref_array[index])
end
# now that we have the gradient we can update parameters W
W = W + eta*grad;
# report progress, monitor optimisation
#printf("Iter %d neg_loglikel=%.4f\n",iter, negative_loglikelihood(Y,X,W))
end
As is hopefully visible, I tried to parallelise the calculation of the gradient in the easiest possible way here. My strategy is to break the calculation of the gradient in as many parts as available workers. Each worker is required to work only on part of matrix X, which part is specified by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For instance, for 4 workers and N = 10000, the work should be divided as follows:
worker 1 => first_index = 1, last_index = 2500
worker 2 => first_index = 2501, last_index = 5000
worker 3 => first_index = 5001, last_index = 7500
worker 4 => first_index = 7501, last_index = 10000
Unfortunately, this entire code works faster if I have only one worker. If add more workers via addprocs(), the code runs slower. One can aggravate this issue by create more data items, for instance use instead N=20000.
With more data items, the degradation is even more pronounced.
In my particular computing environment with N=20000 and one core, the code runs in ~9 secs. With N=20000 and 4 cores it takes ~18 secs!
I tried many many different things inspired by the questions and answers in this forum but unfortunately to no avail. I realise that the parallelisation is naive and that data movement must be the problem, but I have no idea how to do it properly. It seems that the documentation is also a bit scarce on this issue (as is the nice book by Ivo Balbaert).
I would appreciate your help as I have been stuck for quite some while with this and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copying-pasting you can get the code here.
Thanks for taking the time to read this very lengthy question! Help me turn this into a model answer that anyone new in Julia can then consult!
I would say that GD is not a good candidate for parallelizing it using any of the proposed methods: either SharedArray or DistributedArray, or own implementation of distribution of chunks of data.
The problem does not lay in Julia, but in the GD algorithm.
Consider the code:
Main process:
for iter = 1:iterations #iterations: "the more the better"
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
The problem is in the above for-loop which is a must. No matter how good _gradient_descent_shared is, the total number of iterations kills the noble concept of the parallelization.
After reading the question and the above suggestion I've started implementing GD using SharedArray. Please note, I'm not an expert in the field of SharedArrays.
The main process parts (simple implementation without regularization):
run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
N = length(y)
for iter = 1:iterations
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
θ
end
_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin
if size(X,1) <= length(procs(X))
return _gradient_descent_serial(X, y, θ)
else
rrefs = map(p -> (#spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
return mapreduce(r -> fetch(r), op, rrefs)
end
end
The code common to all workers:
#= Returns the range of indices of a chunk for every worker on which it can work.
The function splits data examples (N rows into chunks),
not the parts of the particular example (features dimensionality remains intact).=#
#everywhere function _worker_range(S::SharedArray)
idx = indexpids(S)
if idx == 0
return 1:size(S,1), 1:size(S,2)
end
nchunks = length(procs(S))
splits = [round(Int, s) for s in linspace(0,size(S,1),nchunks+1)]
splits[idx]+1:splits[idx+1], 1:size(S,2)
end
#Computations on the chunk of the all data.
#everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
prange = _worker_range(X)
pX = sdata(X[prange[1], prange[2]])
py = sdata(y[prange[1],:])
tempδ = pX' * (pX * sdata(θ) .- py)
end
The data loading and training. Let me assume that we have:
features in X::Array of the size (N,D), where N - number of examples, D-dimensionality of the features
labels in y::Array of the size (N,1)
The main code might look like this:
X=[ones(size(X,1)) X] #adding the artificial coordinate
N, D = size(X)
MAXITER = 500
α = 0.01
initialθ = SharedArray(Float64, (D,1))
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)
X = nothing
y = nothing
gc()
finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);
After implementing this and run (on 8-cores of my Intell Clore i7) I got a very slight acceleration over serial GD (1-core) on my training multiclass (19 classes) training data (715 sec for serial GD / 665 sec for shared GD).
If my implementation is correct (please check this out - I'm counting on that) then parallelization of the GD algorithm is not worth of that. Definitely you might get better acceleration using stochastic GD on 1-core.
If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector, and pass it as an argument to each worker. Each worker sets a chunk of it, just as you suggested.

Resources