How can I improve the parallel performance of a graph centrality calculation? - performance

I am seeing performance degradation after parallelizing the code that calculates graph centrality. The graph is relatively large: 100K vertices. The single-threaded application takes approximately 7 minutes. As recommended on the julialang site (http://julia.readthedocs.org/en/latest/manual/parallel-computing/#man-parallel-computing), I adapted the code and used the pmap API in order to parallelize the calculations. I started the calculation with 8 processes (julia -p 8 calc_centrality.jl). To my surprise, I got a 10-fold slowdown: the parallel run now takes more than an hour. I noticed that it takes several minutes for the parallel processes to initialize and start calculating, and even after all 8 CPUs are 100% busy with the julia app, the calculation is super slow.
Any suggestion on how to improve the parallel performance is appreciated.
calc_centrality.jl:
using Graphs
require("read_graph.jl")
require("centrality_mean.jl")

function main()
    file_name = "test_graph.csv"
    println("graph creation: ", file_name)
    g = create_generic_graph_from_file(file_name)
    println("num edges: ", num_edges(g))
    println("num vertices: ", num_vertices(g))
    data = cell(8)
    data[1] = {g, 1, 2500}
    data[2] = {g, 2501, 5000}
    data[3] = {g, 5001, 7500}
    data[4] = {g, 7501, 10000}
    data[5] = {g, 10001, 12500}
    data[6] = {g, 12501, 15000}
    data[7] = {g, 15001, 17500}
    data[8] = {g, 17501, 20000}
    cm = pmap(centrality_mean, data)
    println(cm)
end

println("Elapsed: ", @elapsed main(), "\n")
centrality_mean.jl:
using Graphs

function centrality_mean(gr, start_vertex)
    centrality_cnt = Dict()
    vertex_to_visit = Set()
    push!(vertex_to_visit, start_vertex)
    cnt = 0
    while !isempty(vertex_to_visit)
        next_vertex_set = Set()
        for vertex in vertex_to_visit
            if !haskey(centrality_cnt, vertex)
                centrality_cnt[vertex] = cnt
                for neigh in out_neighbors(vertex, gr)
                    push!(next_vertex_set, neigh)
                end
            end
        end
        cnt += 1
        vertex_to_visit = next_vertex_set
    end
    mean([ v for (k,v) in centrality_cnt ])
end

function centrality_mean(data::Array{})
    gr = data[1]
    v_start = data[2]
    v_end = data[3]
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    v = vertices(gr)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        if cnt%10 == 0
            println(cnt)
        end
        cm[cnt] = centrality_mean(gr, v[i])
    end
    return cm
end

I'm guessing this has nothing to do with parallelism. Your second centrality_mean method has no clue what type of objects gr, v_start, and v_end are. So it's going to have to use non-optimized, slow code for that "outer loop."
While there are several potential solutions, probably the easiest is to break up your function that receives the "command" from pmap:
function centrality_mean(data::Array{})
    gr = data[1]
    v_start = data[2]
    v_end = data[3]
    centrality_mean(gr, v_start, v_end)
end

function centrality_mean(gr, v_start, v_end)
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    v = vertices(gr)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        if cnt%10 == 0
            println(cnt)
        end
        cm[cnt] = centrality_mean(gr, v[i])
    end
    return cm
end
All this does is create a break, and it gives Julia a chance to optimize the second part (which contains the performance-critical loop) for the actual types of the inputs.
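To see the effect in isolation, here is a minimal toy sketch of the same function-barrier pattern (illustrative only, not the poster's code; the names kernel and outer are made up). The outer method accepts the untyped container; the inner kernel receives concretely typed arguments, so its loop compiles to fast specialized code:

# Function-barrier sketch: `outer` takes an untyped container,
# `kernel` gets concretely typed arguments and is fully optimized.
function kernel(x::Vector{Float64}, n::Int)
    s = 0.0
    for i = 1:n
        s += x[i]   # x has a concrete type here: fast native loop
    end
    return s
end

outer(data) = kernel(data[1], data[2])   # the barrier: types are known past this call

data = Any[rand(10^6), 10^6]   # untyped container, like the cell array in the question
println(outer(data))

Without the barrier, every operation inside the loop would have to be dispatched dynamically, because the element types of data are unknown when the loop is compiled.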

Below is the code with @everywhere, suggested in the comments by https://stackoverflow.com/users/1409374/rickhg12hs. That fixed the performance issue!
test_parallel_pmap.jl:
using Graphs
require("read_graph.jl")
require("centrality_mean.jl")

function main()
    @everywhere file_name = "test_data.csv"
    println("graph creation from: ", file_name)
    @everywhere data_graph = create_generic_graph_from_file(file_name)
    @everywhere data_graph_vertex = vertices(data_graph)
    println("num edges: ", num_edges(data_graph))
    println("num vertices: ", num_vertices(data_graph))
    range = cell(2)
    range[1] = {1, 25000}
    range[2] = {25001, 50000}
    cm = pmap(centrality_mean_pmap, range)
    for i = 1:length(cm)
        println(length(cm[i]))
    end
end

println("Elapsed: ", @elapsed main(), "\n")
centrality_mean.jl:
using Graphs

function centrality_mean(start_vertex::ExVertex)
    centrality_cnt = Dict{ExVertex, Int64}()
    vertex_to_visit = Set{ExVertex}()
    push!(vertex_to_visit, start_vertex)
    cnt = 0
    while !isempty(vertex_to_visit)
        next_vertex_set = Set()
        for vertex in vertex_to_visit
            if !haskey(centrality_cnt, vertex)
                centrality_cnt[vertex] = cnt
                for neigh in out_neighbors(vertex, data_graph)
                    push!(next_vertex_set, neigh)
                end
            end
        end
        cnt += 1
        vertex_to_visit = next_vertex_set
    end
    mean([ v for (k,v) in centrality_cnt ])
end

function centrality_mean(v_start::Int64, v_end::Int64)
    n = v_end - v_start + 1;
    cm = Array(Float64, n)
    cnt = 0
    for i = v_start:v_end
        cnt += 1
        cm[cnt] = centrality_mean(data_graph_vertex[i])
    end
    return cm
end

function centrality_mean_pmap(range::Array{})
    v_start = range[1]
    v_end = range[2]
    centrality_mean(v_start, v_end)
end

From the Julia page on parallel computing:
Julia provides a multiprocessing environment based on message passing to allow programs to run on multiple processes in separate memory domains at once.
If I interpret this right, Julia's parallelism requires message passing to synchronize the processes. If each individual process does only a little work and then does a message pass, the computation will be dominated by message-passing overhead rather than by useful work.
I can't tell from your code, and I don't know Julia well enough, where the parallelism boundaries are. But you have a big, complicated graph that may be spread willy-nilly across multiple processes. If the processes need to interact while walking across graph links, you'll have exactly that kind of overhead.
You may be able to fix it by precomputing a partition of the graph into roughly equal-sized, highly cohesive regions. I suspect that breakup requires the same type of complex graph processing that you already want to do, so you may have a chicken-and-egg problem to boot.
It may be that Julia offers you the wrong parallelism model. You might want a shared address space so that threads walking the graph don't have to use messages to traverse arcs.
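The granularity point is easy to demonstrate (a hedged sketch, not from the answer above; the chunk sizes are arbitrary). Sending one tiny task per element pays the message round-trip cost thousands of times, while sending a few large chunks pays it only a handful of times:

# Fine-grained: one message round-trip per element (slow).
@time pmap(x -> x^2, 1:10000)

# Coarse-grained: one message per chunk (far less overhead).
chunks = Any[]
for i = 1:1250:10000
    push!(chunks, i:i+1249)
end
@time pmap(r -> [x^2 for x in r], chunks)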

Related

Speeding up simulation of the Levy motion algorithm

Here is my little script for simulating Levy motion:
clear all;
clc; close all;

t = 0; T = 1000; I = T-t;
dT = T/I; t = 0:dT:T; tau = T/I;
alpha = 1.5;
sigma = dT^(1/alpha);
mu = 0; beta = 0;
N = 1000;
X = zeros(N, length(I));

for k=1:N
    L = zeros(1,I);
    for i = 1:I-1
        L( (i + 1) * tau ) = L(i*tau) + stable2( alpha, beta, sigma, mu, 1);
    end
    X(k,1:length(L)) = L;
end

q = 0.1:0.1:0.9;
quant = qlines2(X, q, t(1:length(X)), tau);
hold all
for i = 1:length(quant)
    plot( t, quant(i) * t.^(1/alpha), ':k' );
end
Where stable2 returns a stable random variable with given parameters (you may replace it with normrnd(mu, sigma) for this case, it's not crucial); qlines2 returns quantiles needed for plotting.
But I don't want to talk about math here. My problem is that this implementation is pretty slow, and I would like to speed it up. Unfortunately, computer science is not my main field - I have heard of techniques like memoization and vectorization, and that there are many others, but I don't know how to apply them.
For example, I'm pretty sure I should replace this filthy double for-loop somehow, but I'm not sure what to do instead.
EDIT: Maybe I should use (and learn...) another language (Python, C, any functional one)? I always thought that Matlab/Octave is designed for numerical computation, but if I should change, then to which one?
The crucial bit is, as you said, the for loops; Matlab does not like those, so vectorization is indeed the keyword (together with preallocating the space).
I just altered your for-loop section somewhat so that you do not have to reset L over and over again; instead we save all the Ls in a bigger matrix (I also eliminated the length(L) call).
L = zeros(N,I);
for k=1:N
    for i = 1:I-1
        L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
    end
    X(k,1:I) = L(k,1:I);
end
Now you can already see that X(k,1:I) = L(k,1:I); in the loop is obsolete, which also means that we can switch the order of the loops. This is crucial: the i-steps are recursive (each depends on the previous step), so we cannot vectorize that loop; we can only vectorize the k-loop.
Your original code needed 9.3 seconds on my machine; the new code still needs about the same time:
L = zeros(N,I);
for i = 1:I-1
    for k=1:N
        L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
    end
end
X = L;
But now we can apply the vectorization: instead of looping through all the rows (the loop over k), we eliminate that loop and process all rows at once.
L = zeros(N,I);
for i = 1:I-1
    L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma); %<- this is not yet what you want, see comment below
end
X = L;
This code needs only 0.045 seconds on my machine. I hope you still get the same output, because I have no idea what you are calculating, but I also hope you can see how to go about vectorizing code.
PS: I just noticed that the last example uses the same random number for the whole column, which is obviously not what you want. Instead you should generate a whole vector of random numbers, e.g.:
L = zeros(N,I);
for i = 1:I-1
    L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma, N, 1);
end
X = L;
PPS: Great question!

Why is my Julia shared array code running so slow?

I'm trying to implement Smith-Waterman alignment in parallel using Julia (see: Figure 1 of http://www.cs.virginia.edu/~rl6sf/paper_dump/2011:12:33:22.pdf), but the algorithm is running much slower in Julia than the serial version. I'm using shared arrays to do this and figure I am doing something silly that is making the code run slow. Could someone take a look and see whether my code is as optimized as possible? The parallel version should run faster than the serial one…
The basic concept of it is to compute the anti-diagonal elements of a matrix in parallel from the upper left to lower right corner and to update them. I'm trying to use 32 cores on a shared array machine to do this. I have a SharedArray matrix that I am using to do this and am computing the elements of each anti-diagonal in parallel as shown below. The while loops in the spSW function submit tasks to workers in sync for each anti-diagonal using the helper function shared_get_score(). The main goal of this function is to fill in each element in the shared arrays "matrix" and "path".
function spSW(seq1,seq2,p)
    indel = -1
    match = 2
    seq1 = "^$seq1"
    seq2 = "^$seq2"
    col = length(seq1)
    row = length(seq2)
    wl = workers()
    matrix,path = shared_initialize_path(seq1,seq2)
    for j = 2:col
        jcol = j
        irow = 2
        @sync begin
            count = 0
            while jcol > 1 && irow < row + 1
                #println(j," ",irow," ",jcol)
                if seq1[jcol] == seq2[irow]
                    equal = true
                else
                    equal = false
                end
                w = wl[(count % p) + 1]
                @async remotecall_wait(w,shared_get_score!,matrix,path,equal,indel,match,irow,jcol)
                jcol -= 1
                irow += 1
                count += 1
            end
        end
    end
    for i = 3:row
        jcol = col
        irow = i
        @sync begin
            count = 0
            while irow < row+1 && jcol > 1
                #println(j," ",irow," ",jcol)
                if seq1[jcol] == seq2[irow]
                    equal = true
                else
                    equal = false
                end
                w = wl[(count % p) + 1]
                @async remotecall_wait(w,shared_get_score!,matrix,path,equal,indel,match,irow,jcol)
                jcol -= 1
                irow += 1
                count += 1
            end
        end
    end
    return matrix,path
end
The other helper functions are:
function shared_initialize_path(seq1,seq2)
    col = length(seq1)
    row = length(seq2)
    matrix = convert(SharedArray,fill(0,(row,col)))
    path = convert(SharedArray,fill(0,(row,col)))
    return matrix,path
end

@everywhere function shared_get_score!(matrix,path,equal,indel,match,i,j)
    pathvalscode = ["-","|","M"]
    pathvals = [1,2,3]
    scores = []
    push!(scores,matrix[i,j-1]+indel)
    push!(scores,matrix[i-1,j]+indel)
    if equal
        push!(scores,matrix[i-1,j-1]+match)
    else
        push!(scores,matrix[i-1,j-1]+indel)
    end
    val,ind = findmax(scores)
    if val < 0
        matrix[i,j] = 0
    else
        matrix[i,j] = val
    end
    path[i,j] = pathvals[ind]
end
Does anyone see an obvious way to make this run faster? Right now it's about 10 times slower than the serial version.

How can I vectorize these nested for-loops in Matlab?

I have a piece of code here I need to streamline as it is greatly increasing the runtime of my script:
size=300;
resultLength = (size+1)^3;
freqResult=zeros(1, resultLength);
inc=1;
for i=0:size,
    for j=0:size,
        for k=0:size,
            freqResult(inc)=(c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
            inc=inc+1;
        end
    end
end
c, L, W, and H are all constants. As the size input gets over about 400, the runtime is too long to wait for, and I can watch my disk space draining by the gigabyte. Any advice?
Thanks!
What about this:
[kT, jT, iT] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
for indx = 1:numel(iT)
    i = iT(indx) - 1;
    j = jT(indx) - 1;
    k = kT(indx) - 1;
    freqResult1(indx) = (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
end
On my PC, for size = 400, version with 3 loops takes 136s and this one takes 19s.
For a more "Matlab-y" way, you could also do the following:
[kT, jT, iT] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
func = @(i, j, k) (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
freqResult2 = arrayfun(func, iT-1, jT-1, kT-1);
But for some reason, this is slower than the above version.
A faster solution can be (based on Marcin's answer):
[k, j, i] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
freqResult = (c/2)*sqrt(((i-1)/L).^2+((j-1)/W).^2+((k-1)/H).^2);
It takes about 5 seconds to run on my PC for size = 300
The following is even faster (but it doesn't look very good):
k = repmat(0:size,[1 (size+1)^2]);
j = repmat(kron(0:size, ones(1,size+1)),[1 (size+1)]);
i = kron(0:size, ones(1,(size+1)^2));
freqResult = (c/2)*sqrt((i/L).^2+(j/W).^2+(k/H).^2);
which takes ~3.5s for size = 300

Matlab - Speeding up Nested For-Loops

I'm working on a function with three nested for loops that is way too slow for its intended use. The bottleneck is clearly the looping part - almost 100% of the execution time is spent in the innermost loop.
The function takes a 2d matrix called rM as input and returns a 3d matrix called ec:
rows = size(rM, 1);
cols = size(rM, 2);

%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;

for risk = minRisk:stepRisk:maxRisk;
    for c = 1:cols,
        for r = 2:rows+1,
            ec(r, c, risk) = ec(r-1, c, risk) * (1 + risk * rM(r-1, c));
        end
    end
end
Any help on speeding up the for loops would be appreciated...
The problem is that the inner loop is the slowest while also being near-impossible to vectorize, as every iteration depends directly on the previous one.
The outer two are possible:
clc;
rM = rand(50);
rows = size(rM, 1);
cols = size(rM, 2);
minRisk = 1;
stepRisk = 1;
maxRisk = 100;
numRiskLevels = maxRisk/stepRisk;

%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
riskArray = (minRisk:stepRisk:maxRisk)';

tic
for r = 2:rows+1
    tmp = riskArray * rM(r-1, :);
    tmp = permute(tmp, [3 2 1]);
    ec(r, :, :) = ec(r-1, :, :) .* (1 + tmp);
end
toc

%preallocate.
ec2 = zeros(rows+1, cols, numRiskLevels);
ec2(1, :, :) = 100;

tic
for risk = minRisk:stepRisk:maxRisk;
    for c = 1:cols
        for r = 2:rows+1
            ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
        end
    end
end
toc

all(all(all(ec == ec2)))
But to my surprise, the vectorized code is indeed slower. (Maybe someone can improve the code, so I figured I'd leave it here for you.)
I have just tried to vectorize the outer loop, and actually noticed a significant speed increase. Of course it is hard to judge the speed of a script without knowing (the size of) the inputs, but I would say this is a good starting point:
% Here you can change the input parameters
riskVec = 1:3:120;
rM = rand(50);

% preallocate and calculate non-vectorized solution
ec2 = zeros(size(rM,2)+1, size(rM,1), max(riskVec));
ec2(1, :, :) = 100;
tic
for risk = riskVec
    for c = 1:size(rM,2)
        for r = 2:size(rM,1)+1
            ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
        end
    end
end
t1=toc;

% preallocate and calculate vectorized solution
ec = zeros(size(rM,2)+1, size(rM,1), max(riskVec));
ec(1, :, :) = 100;
tic
for c = 1:size(rM,2)
    for r = 2:size(rM,1)+1
        ec(r, c, riskVec) = ec(r-1, c, riskVec) .* reshape(1 + riskVec * rM(r-1, c),[1 1 length(riskVec)]);
    end
end
t2=toc;

% Check whether the vectorization is done correctly and show the timing results
if ec(:) == ec2(:)
    t1
    t2
end
The given output is:
t1 =
    0.1288
t2 =
    0.0408
So for this riskVec and rM it is about 3 times as fast as the non-vectorized solution.

Averaging Matlab matrix

In the Matlab programs I use I often have to average within a matrix (interpolation). The most straightforward way is to add the matrix to a shifted copy of itself (avg). However, you can do the same operation using matrix multiplication (avg2). I noticed a considerable speed increase when using matrix multiplication on large matrices.
Could anyone explain why Matlab is able to process this multiplication faster than adding the same matrix? Also, what are the possible downsides of using avg2() compared to avg()?
The difference in runtime was a factor of ~6 for this case (n=500).
function [] = speed()
    %Speed test for averaging a matrix
    n = 500;
    A = rand(n,n);
    tic
    for i=1:100
        avg(A);
    end
    toc
    tic
    for i=1:100
        avg2(A);
    end
    toc
end

function B = avg(A,k)
    if nargin<2, k = 1; end
    if size(A,1)==1, A = A'; end
    if k<2, B = (A(2:end,:)+A(1:end-1,:))/2; else B = avg(A,k-1); end
    if size(A,2)==1, B = B'; end
end

function B = avg2(A,k)
    if nargin<2, k = 1; end
    if size(A,1)==1, A = A'; end
    if k<2
        m = size(A,1);
        e = ones(m,1);
        S = spdiags(e*[1 1],-1:0,m,m-1)'/2;
        B = S*A;
    else
        B = avg2(A,k-1);
    end
    if size(A,2)==1, B = B'; end
end
I'm afraid I can't give you an answer about the inner workings of the functions you are using. However, as they seem overly complicated, I felt I should make you aware of an easier (and a bit faster) way of doing this averaging.
You can instead use conv2 with the kernel [0.5; 0.5]. I have extended your code below:
function [A, T1, T2, T3] = speed()
    %Speed test for averaging a matrix
    n = 900;
    A = rand(n,n);
    tic
    for i=1:100
        T1 = avg(A);
    end
    toc
    tic
    for i=1:100
        T2 = avg2(A);
    end
    toc
    tic
    for i=1:100
        T3 = conv2(A,[1;1]/2,'valid');
    end
    toc
    if sum(sum(abs(T3-T2))) > 0
        warning('Method 3 not equal the other methods')
    end
end

function B = avg(A,k)
    if nargin<2, k = 1; end
    if size(A,1)==1, A = A'; end
    if k<2, B = (A(2:end,:)+A(1:end-1,:))/2; else B = avg(A,k-1); end
    if size(A,2)==1, B = B'; end
end

function B = avg2(A,k)
    if nargin<2, k = 1; end
    if size(A,1)==1, A = A'; end
    if k<2
        m = size(A,1);
        e = ones(m,1);
        S = spdiags(e*[1 1],-1:0,m,m-1)'/2;
        B = S*A;
    else
        B = avg2(A,k-1);
    end
    if size(A,2)==1, B = B'; end
end
Results:
Elapsed time is 10.201399 seconds.
Elapsed time is 1.088003 seconds.
Elapsed time is 1.040471 seconds.
Apologies if you already knew this.
