Octave Parallel Code Example - parallel-processing

Can anyone provide an example of Octave code to submit to a cluster, assuming the parallel package for Octave is installed there? Specifically, do I just use parfor as in the MATLAB Parallel Computing Toolbox, or should I use something like the code below?
n = 100000000;
s = 0;
% Initialize MPI
MPI_Init;
% Create MPI communicator
comm= MPI_COMM_WORLD;
% Get size and rank
size = MPI_Comm_size(comm);
rank= MPI_Comm_rank(comm);
master = 0;
% Split the for loop into size partitions
m = floor(n / size);
r = mod(n, size);
if ( rank == size-1 )
se = (rank + 1)*m + r;
else
se = (rank + 1)*m;
end
% Each process works on a partition of the loop independently
s1 = s;
for i=(rank * m)+1:se
s1 = s1 + i;
end
% print the partial summation on each process
disp(['Partial summation: ', num2str(s1), ' on process ', num2str(rank)]);
This code is from a tutorial on MATLAB with MPI:
http://faculty.cs.tamu.edu/wuxf/talks/IAMCS-ParallelProgramming-2013-3.pdf
Thanks a lot!
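The loop partitioning in the question can be checked in plain Octave or MATLAB before any MPI calls are involved. The sketch below is not MPI code: it simulates the ranks with an ordinary loop, uses the hypothetical names nprocs, procRank and partial, and takes a smaller n so that it runs quickly. It assumes the chunk size is floor(n / nprocs), with the remainder given to the last rank.

```matlab
n = 1000003;               % smaller than in the question, not divisible by nprocs
nprocs = 4;                % hypothetical number of MPI processes
m = floor(n / nprocs);     % chunk size per process
r = mod(n, nprocs);        % leftover iterations, given to the last rank
partial = zeros(1, nprocs);
for procRank = 0:nprocs-1  % stand-in for the real MPI ranks
    first = procRank * m + 1;
    if procRank == nprocs - 1
        last = (procRank + 1) * m + r;
    else
        last = (procRank + 1) * m;
    end
    partial(procRank + 1) = sum(first:last);
end
total = sum(partial);
disp(total == n * (n + 1) / 2);   % the partitions cover 1..n exactly once
```

On a real cluster each rank would compute only its own first:last range, and the partial sums would still have to be combined on the master process.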

Related

Matlab: How to convert nested for loop into parfor

I am having problems with the following loop because it takes too much time, so I would like to use parallel processing, specifically the parfor function.
P = numel(scaleX); % quite BIG number
sz = P;
start = 1;
sqrL = 10; % sqr len
e = 200;
A = false(sz, sz);
for m = sz-sqrL/2:(-1)*sqrL:start
for n = M(m):-sqrL:1
temp = [scaleX(m), scaleY(m); scaleX(n), scaleY(n)];
d = pdist(temp, 'euclidean');
if d < e
A(m, n) = 1;
end
end
end
Can anyone please help me convert the outer for loop into a parfor loop in this code?
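parfor needs a loop range of consecutive increasing integers and output that is indexed directly by the loop variable, so one possible approach is to loop over positions in a precomputed index vector and write one full row per iteration. The sketch below assumes M is a vector of starting column indices (it is not defined in the question), fills scaleX, scaleY and M with made-up data, and replaces the per-pair pdist call with hypot, which gives the same Euclidean distance for two points. mIdx, rows and r are hypothetical names.

```matlab
% Made-up sample data; only the names come from the question.
P = 200;
scaleX = linspace(0, 1000, P);
scaleY = linspace(0, 500, P);
M = 1:P;                          % assumed: M(m) is the first column for row m
sz = P; sqrL = 10; e = 200;

mIdx = sz - sqrL/2 : -sqrL : 1;   % the original outer-loop values of m
rows = false(numel(mIdx), sz);    % one sliced output row per iteration
parfor k = 1:numel(mIdx)          % Octave runs parfor as an ordinary for loop
    m = mIdx(k);
    n = M(m):-sqrL:1;             % the original inner-loop values of n
    d = hypot(scaleX(n) - scaleX(m), scaleY(n) - scaleY(m));
    r = false(1, sz);
    r(n(d < e)) = true;           % mark the columns that are close enough
    rows(k, :) = r;
end
A = false(sz, sz);
A(mIdx, :) = rows;                % scatter the rows back into A
```

The inner loop is vectorized at the same time, which may matter as much as the parfor itself when the per-iteration work is small.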

MATLAB program takes more than 1 hour to execute

The program below finds k-clique communities in an input graph.
The graph dataset can be found here.
The first line of the dataset contains 'number of nodes and edges' respectively. The following lines have 'node1 node2' representing an edge between node1 and node2 .
For example:
2500 6589 // number_of_nodes, number_of_edges
0 5 // edge between node[0] and node[5]
.
.
.
The k-clique( aCliqueSIZE, anAdjacencyMATRIX ) function is contained here.
The following commands are executed in command window of MATLAB:
x = textread( 'amazon.graph.small' ); %% source input file text
s = max(x(1,1), x(1,2)); %% take largest dimension
adjMatrix = sparse(x(2:end,1)+1, x(2:end,2)+1, 1, s, s); %% now matrix is square
adjMatrix = adjMatrix | adjMatrix.'; %% apply "or" with transpose to make symmetric
adjMatrix = full(adjMatrix); %% convert to full if needed
k=4;
[X,Y,Z]=k_clique(k,adjMatrix); %%
% The output can be viewed by the following commands
celldisp(X);
celldisp(Y);
Z
The above program takes more than 1 hour to execute, which I think should not be the case. While running the program on Windows, I checked Task Manager and found that only 500 MB is allocated to the program. Is this the reason for the slowness? If yes, how can I allocate more heap memory (close to 4 GB) to this program in MATLAB?
The problem does not seem to be memory-bound
A sparse, square, symmetric adjacency matrix of about 6.5k x 6.5k entries does not need much memory.
The provided code has many for loops and is heavily recursive in the function transfer_nodes() at the end of the file.
Add a "Stone-Age Profiler" to the code
To show the time spent in each CPU-bound section of the processing, wrap the main sections of the code in a construct like:
tic(); for .... end; toc()
which prints the elapsed time for the relevant sections of the k_clique.m code, showing the readings on the fly.
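As a minimal sketch of that tic/toc wrapping, the loop below is a hypothetical stand-in for one section of k_clique.m; t_section, acc and elapsed are illustrative names. Note that tic and toc report elapsed wall-clock time rather than CPU time.

```matlab
% Hypothetical stand-in for one section of k_clique.m
t_section = tic();              % start a timer and keep its handle
acc = 0;
for i = 1:1e5
    acc = acc + sqrt(i);
end
elapsed = toc(t_section);       % seconds elapsed since the matching tic
fprintf('section 1: %.3f s\n', elapsed);
```

Keeping the handle returned by tic lets several sections be timed independently, including nested ones.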
Your original code k_clique.m
function [components,cliques,CC] = k_clique(k,M)
% k-clique algorithm for detecting overlapping communities in a network
% as defined in the paper "Uncovering the overlapping
% community structure of complex networks in nature and society"
%
% [X,Y,Z] = k_clique(k,A)
%
% Inputs:
% k - clique size
% A - adjacency matrix
%
% Outputs:
% X - detected communities
% Y - all cliques (i.e. complete subgraphs that are not parts of larger
% complete subgraphs)
% Z - k-clique matrix
nb_nodes = size(M,1); % number of nodes
% Find the largest possible clique size via the degree sequence:
% Let {d1,d2,...,dk} be the degree sequence of a graph. The largest
% possible clique size of the graph is the maximum value k such that
% dk >= k-1
degree_sequence = sort(sum(M,2) - 1,'descend');
%max_s = degree_sequence(1);
max_s = 0;
for i = 1:length(degree_sequence)
if degree_sequence(i) >= i - 1
max_s = i;
else
break;
end
end
cliques = cell(0);
% Find all s-size kliques in the graph
for s = max_s:-1:3
M_aux = M;
% Looping over nodes
for n = 1:nb_nodes
A = n; % Set of nodes all linked to each other
B = setdiff(find(M_aux(n,:)==1),n); % Set of nodes that are linked to each node in A, but not necessarily to the nodes in B
C = transfer_nodes(A,B,s,M_aux); % Enlarging A by transferring nodes from B
if ~isempty(C)
for i = size(C,1)
cliques = [cliques;{C(i,:)}];
end
end
M_aux(n,:) = 0; % Remove the processed node
M_aux(:,n) = 0;
end
end
% Generating the clique-clique overlap matrix
CC = zeros(length(cliques));
for c1 = 1:length(cliques)
for c2 = c1:length(cliques)
if c1==c2
CC(c1,c2) = numel(cliques{c1});
else
CC(c1,c2) = numel(intersect(cliques{c1},cliques{c2}));
CC(c2,c1) = CC(c1,c2);
end
end
end
% Extracting the k-clique matrix from the clique-clique overlap matrix
% Off-diagonal elements <= k-1 --> 0
% Diagonal elements <= k --> 0
CC(eye(size(CC))==1) = CC(eye(size(CC))==1) - k;
CC(eye(size(CC))~=1) = CC(eye(size(CC))~=1) - k + 1;
CC(CC >= 0) = 1;
CC(CC < 0) = 0;
% Extracting components (or k-clique communities) from the k-clique matrix
components = [];
for i = 1:length(cliques)
linked_cliques = find(CC(i,:)==1);
new_component = [];
for j = 1:length(linked_cliques)
new_component = union(new_component,cliques{linked_cliques(j)});
end
found = false;
if ~isempty(new_component)
for j = 1:length(components)
if all(ismember(new_component,components{j}))
found = true;
end
end
if ~found
components = [components; {new_component}];
end
end
end
function R = transfer_nodes(S1,S2,clique_size,C)
% Recursive function to transfer nodes from set B to set A (as
% defined above)
% Check if the union of S1 and S2 or S1 is inside an already found larger
% clique
found_s12 = false;
found_s1 = false;
for c = 1:length(cliques)
for cc = 1:size(cliques{c},1)
if all(ismember(S1,cliques{c}(cc,:)))
found_s1 = true;
end
if all(ismember(union(S1,S2),cliques{c}(cc,:)))
found_s12 = true;
break;
end
end
end
if found_s12 || (length(S1) ~= clique_size && isempty(S2))
% If the union of the sets A and B can be included in an
% already found (larger) clique, the recursion is stepped back
% to check other possibilities
R = [];
elseif length(S1) == clique_size;
% The size of A reaches s, a new clique is found
if found_s1
R = [];
else
R = S1;
end
else
% Check the remaining possible combinations of the neighbors
% indices
if isempty(find(S2>=max(S1),1))
R = [];
else
R = [];
for w = find(S2>=max(S1),1):length(S2)
S2_aux = S2;
S1_aux = S1;
S1_aux = [S1_aux S2_aux(w)];
S2_aux = setdiff(S2_aux(C(S2(w),S2_aux)==1),S2_aux(w));
R = [R;transfer_nodes(S1_aux,S2_aux,clique_size,C)];
end
end
end
end
end
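If the timings point at the clique-clique overlap loop, one possible alternative is to build a clique-by-node membership matrix and multiply it by its transpose. This is a sketch under the assumption that every clique is a row vector of distinct node indices; the cliques below are made-up example values, and B is a hypothetical helper that is not part of k_clique.m.

```matlab
cliques = {[1 2 3]; [2 3 4]; [5 6 7]};   % made-up example cliques
nb_nodes = 7;
nc = numel(cliques);
B = zeros(nc, nb_nodes);                 % B(c, v) = 1 if node v is in clique c
for c = 1:nc
    B(c, cliques{c}) = 1;
end
CC = B * B.';   % CC(c1,c2) = number of shared nodes; diagonal = clique sizes
disp(CC);
```

This yields the same values as the numel(intersect(...)) double loop, with the diagonal holding the clique sizes.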

Vectorizing nested loops in matlab using bsxfun and with GPU

For loops seem to be extremely slow, so I was wondering if the nested loops in the code shown next could be vectorized using bsxfun and maybe GPU could be introduced too.
Code
%// Parameters
i = 1;
j = 3;
n1 = 1500;
n2 = 1500;
%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;
%// Nested Loops - I
for x = 1:n1
for y = 1:n1
num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
LInc(x, y) = L1(x, y) + (num/denom);
LInc(y, x) = LInc(x, y);
end
end
%// Nested Loops - II
for x = 1:n1
for y = 1:n2
num = (n1 * n * L1(x,i)) + (n2 * n * L2(y,j)) - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1)));
LInc(x, n1+y) = num/denom;
LInc(n1+y, x) = LInc(x, n1+y);
end
end
Edit 1: n and denom could be assumed as constants too.
Here are vectorized CPU and GPU codes and I am hoping that I am using at least good practices for the GPU code and the benchmarking later on.
CPU Code
%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;
%// Calculate num/denom value for stage 1 and 2
nd1 = L1 + (((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - n2*n*bsxfun(@plus,L1(:,i),L1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*L1(:,i),n2*n*L2(:,j).') - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1))))./denom; %//'
%// Plug in the values in the output matrix
LInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
LInc(n1+1:end,1:n1) = nd2.'; %//'
LInc(1:n1,n1+1:end) = nd2;
GPU Code
%// Pre-allocate for output
gLInc = zeros(n1+n2,n1+n2,'gpuArray');
%// Convert to gpu arrays
gL1 = gpuArray(L1);
gL2 = gpuArray(L2);
%// Calculate num/denom value for stage 1 and 2
nd1 = gL1 + (((n2 ^ 2) * (gL1(i, i) + gL2(j, j) + 1)) - n2*n*bsxfun(@plus,gL1(:,i),gL1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*gL1(:,i),n2*n*gL2(:,j).') - ((n1 * n2 * (gL1(i, i) + gL2(j, j) + 1))))./denom; %//'
%// Plug in the values in the output matrix
gLInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
gLInc(n1+1:end,1:n1) = nd2.'; %//'
gLInc(1:n1,n1+1:end) = nd2;
%// Gather data from GPU back to CPU
LInc = gather(gLInc);
Benchmarking
GPU benchmarking tips were taken from Measure and Improve GPU Performance.
%// Warm up GPU call with insignificant small scalar inputs, just in case
%// gputimeit doesn't do the same
temp1 = modp2(1,1,1,1,1,1,1,1); %// This is vectorized GPU code
i = 1;
j = 3;
n = 1000; %// Assumed
denom = 1e6; %// Assumed
N_arr = [50 100 200 500 1000 1500]; %// array elements for N (datasize)
timeall = zeros(3,numel(N_arr));
for k1 = 1:numel(N_arr)
N = N_arr(k1);
n1 = N; %// n1, n2 are assumed identical for less-complicated benchmarking
n2 = N;
L1 = rand(n1,n1);
L2 = rand(n2,j);
f = @() modp0(i,j,n1,n2,L1,L2,n,denom);%// Original CPU w/ preallocation
timeall(1,k1) = timeit(f);
clear f
f = @() modp1(i,j,n1,n2,L1,L2,n,denom);%// Vectorized CPU code
timeall(2,k1) = timeit(f);
clear f
f = @() modp2(i,j,n1,n2,L1,L2,n,denom);%// Vectorized GPU(GTX 750Ti) code
timeall(3,k1) = gputimeit(f);
clear f
end
%// Display benchmark results
figure,hold on, grid on
plot(N_arr,timeall(1,:),'-b.')
plot(N_arr,timeall(2,:),'-ro')
plot(N_arr,timeall(3,:),'-kx')
legend('Original CPU','Vectorized CPU','Vectorized GPU (GTX 750 Ti)')
xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')
Results
Conclusions
The results show that the vectorized GPU code performs well at larger data sizes: it goes from being slower than both the vectorized CPU code and the original code to being about twice as fast as the vectorized CPU code.
If you have not done so already, you should preallocate LInc.
LInc = zeros(n1+n2, n1+n2);
If you want to vectorize it, you don't need bsxfun. I think you can do something like
x = 1:n1;
y = 1:n1;
num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
LInc(x, y) = L1(x, y) + (num/denom);
However, this code is confusing to me because, as written, you overwrite values of LInc several times. Without knowing what your goal is, it's hard for me to help more. The above code probably will not return the same values as your function.
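The difference between the two vectorized forms can be seen on a tiny example. With x = y = 1:n1, the expression L1(x,i) + L1(y,i) adds a column to itself element by element, whereas the nested loops use every pairwise sum, which is what bsxfun with a column and a row produces. The vector v below is a made-up stand-in for L1(:,i).

```matlab
v = [1; 2; 3];                        % made-up stand-in for L1(:,i)
elementwise = v + v;                  % 3-by-1: v(x) + v(x) only
pairwise = bsxfun(@plus, v, v.');     % 3-by-3: pairwise(x,y) = v(x) + v(y)
disp(elementwise.');
disp(pairwise);
```

The diagonal of pairwise equals elementwise, which is why the shorter form only matches the loops where x equals y.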

Optimising multidimensional array performance- MATLAB

Communication overhead (parfor) and preallocating for speed (for) in Multidimensional Arrays
I am getting two warnings in the following script at the places indicated by **'s
Variable is indexed but not sliced... (the array A shown by ** in the second parfor loop) - What is causing this and how can it be avoided?
The variable appears to change size on every loop... (the array Sol shown by ** in the for loop) - Maybe I am not doing it right, but preallocating memory hasn't worked.
Edit: My initial idea was to preallocate the arrays (as done in the first parfor loop) so that it will execute the rest of the script faster (the full version of the script repeats various array operations similar to the second parfor and for loops).
Any suggestions? :)
N = 1000;
parfor i=1:N
A(:,:,i) = rand(2);
X(:,:,i) = rand(2,1);
Sol1(1,1,i) = zeros();
Sol2(1,1,i) = zeros();
Sol(2,1,i) = zeros();
end
t0 = tic;
parfor i=1:N
Sol1(1,:,i) = A(1,:,i)*X(:,1,i);
Sol2(1,:,i) = **A**(2,:,i)*X(:,1,i);
end
for i=1:N
**Sol**(:,1,i) = [Sol1(1,:,i);Sol2(1,:,i)];
end
toc(t0);
Your pre-allocation is not right - you need to do each in a single call.
A = rand(2, 2, N);
X = rand(2, 1, N);
Sol1 = zeros(1, 1, N);
Sol2 = zeros(1, 1, N);
Sol = zeros(2, 1, N); % not really needed actually.
In your PARFOR loop, you can avoid 'broadcasting' A by using a syntax that MATLAB understands as slicing
parfor i = 1:N
tmp = A(:, :, i);
Sol1(1, :, i) = tmp(1,:) * X(:, 1, i);
Sol2(1, :, i) = tmp(2,:) * X(:, 1, i);
end
Finally, I think you can do this as a vectorised concatenation like so:
Sol = [Sol1; Sol2];
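Put together, the three steps above can be checked on a small case. This is a sketch with N reduced to 5; ref and maxErr are hypothetical names used only for the comparison against a direct per-page product.

```matlab
N = 5;                          % small N for illustration; the question uses 1000
A = rand(2, 2, N);
X = rand(2, 1, N);
Sol1 = zeros(1, 1, N);          % each array preallocated in a single call
Sol2 = zeros(1, 1, N);
parfor i = 1:N                  % Octave runs parfor as an ordinary for loop
    tmp = A(:, :, i);           % take the whole slice, then index into it
    Sol1(1, :, i) = tmp(1, :) * X(:, 1, i);
    Sol2(1, :, i) = tmp(2, :) * X(:, 1, i);
end
Sol = [Sol1; Sol2];             % 2-by-1-by-N, no loop needed

ref = zeros(2, 1, N);           % direct per-page product for comparison
for i = 1:N
    ref(:, :, i) = A(:, :, i) * X(:, :, i);
end
maxErr = max(abs(Sol(:) - ref(:)));
disp(maxErr);
```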
EDIT
On the GPU, you can use pagefun to get the whole job done in a single call, like so:
Ag = gpuArray.rand(2,2,N);
Xg = gpuArray.rand(2,1,N);
Sol = pagefun(@mtimes, Ag, Xg);

How can I vectorize these nested for-loops in Matlab?

I have a piece of code here I need to streamline as it is greatly increasing the runtime of my script:
size=300;
resultLength = (size+1)^3;
freqResult=zeros(1, resultLength);
inc=1;
for i=0:size,
for j=0:size,
for k=0:size,
freqResult(inc)=(c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
inc=inc+1;
end
end
end
c, L, W, and H are all constants. As the size input gets over about 400, the runtime is too long to wait for, and I can watch my disk space draining by the gigabyte. Any advice?
Thanks!
What about this:
[kT, jT, iT] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
for indx = 1:numel(iT)
i = iT(indx) - 1;
j = jT(indx) - 1;
k = kT(indx) - 1;
freqResult1(indx) = (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
end
On my PC, for size = 400, the version with three loops takes 136 s and this one takes 19 s.
For a more "MATLAB-y" way, you could also do the following:
[kT, jT, iT] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
func = @(i, j, k) (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
freqResult2 = arrayfun(func, iT-1, jT-1, kT-1);
But for some reason, this is slower than the above version.
A faster solution can be (based on Marcin's answer):
[k, j, i] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
freqResult = (c/2)*sqrt(((i-1)/L).^2+((j-1)/W).^2+((k-1)/H).^2);
It takes about 5 seconds to run on my PC for size = 300
The following is even faster (but it doesn't look very good):
k = repmat(0:size,[1 (size+1)^2]);
j = repmat(kron(0:size, ones(1,size+1)),[1 (size+1)]);
i = kron(0:size, ones(1,(size+1)^2));
freqResult = (c/2)*sqrt((i/L).^2+(j/W).^2+(k/H).^2);
which takes ~3.5s for size = 300
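Another way to build the same index arrays is ndgrid, sketched below with hypothetical values for c, L, W and H and a small grid so the result can be compared with the original loops. The variable sz is used instead of size so that the built-in size function is not shadowed. Timings are not claimed here.

```matlab
sz = 4;                                  % small grid; the question uses 300
c = 343; L = 5; W = 4; H = 3;            % hypothetical constants
[k, j, i] = ndgrid(0:sz, 0:sz, 0:sz);    % k varies fastest, like the innermost loop
freqResult = (c/2) * sqrt((i/L).^2 + (j/W).^2 + (k/H).^2);
freqResult = freqResult(:).';            % flatten to 1-by-(sz+1)^3
disp(numel(freqResult));
```

Listing the outputs as [k, j, i] keeps the element order identical to the triple loop. At size = 300 each full-size array holds 301^3 doubles, about 218 MB, so the temporaries still dominate memory use.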
