I'm solving a PDE with an implicit scheme, which at every time step I can split into two matrices that are then coupled by a boundary condition (also at every time step). I'm trying to speed up the process by using multiprocessing to invert both matrices at the same time.
Here's what this looks like in a minimal (non-PDE-solving) example.
using Distributed
using LinearAlgebra
function backslash(N, T, b, exec)
    A = zeros(N,N)
    α = 0.1
    for i in 1:N, j in 1:N
        abs(i-j)<=1 && (A[i,j]+=-α)
        i==j && (A[i,j]+=3*α+1)
    end
    A = Tridiagonal(A)
    a = zeros(N, 4, T)
    if exec == "parallel"
        for i = 1:T
            @distributed for j = 1:2
                a[:, j, i] = A\b[:, i]
            end
        end
    elseif exec == "single"
        for i = 1:T
            for j = 1:2
                a[:, j, i] = A\b[:, i]
            end
        end
    end
    return a
end
b = rand(1000, 1000)
a_single = @time backslash(1000, 1000, b, "single");
a_parallel = @time backslash(1000, 1000, b, "parallel");
a_single == a_parallel
Here comes the problem: the last line evaluates to true, with a 6-fold speed-up; however, only a 2-fold speed-up should be possible. What am I getting wrong?
There are several problems here:
You are measuring compile time.
Your @distributed loop exits prematurely.
Your @distributed loop does not collect the results.
Hence, unsurprisingly, you have:
julia> addprocs(2);
julia> sum(backslash(1000, 1000, b, "single")), sum(backslash(1000, 1000, b, "parallel"))
(999810.3418359067, 0.0)
So, in order to make your code work, you need to collect the data from the distributed loop, which can be done as follows:
function backslash2(N, T, b, exec)
    A = zeros(N,N)
    α = 0.1
    for i in 1:N, j in 1:N
        abs(i-j)<=1 && (A[i,j]+=-α)
        i==j && (A[i,j]+=3*α+1)
    end
    A = Tridiagonal(A)
    a = zeros(N, 4, T)
    if exec == :parallel
        for i = 1:T
            aj = @distributed (append!) for j = 1:2
                [A\b[:, i]]
            end
            # you could consider using SharedArrays instead
            a[:, 1, i] .= aj[1]
            a[:, 2, i] .= aj[2]
        end
    elseif exec == :single
        for i = 1:T
            for j = 1:2
                a[:, j, i] = A\b[:, i]
            end
        end
    end
    return a
end
Now you have equal results:
julia> sum(backslash2(1000, 1000, b, :single)) == sum(backslash2(1000, 1000, b, :parallel))
true
However, distributed execution is very inefficient for loop bodies that take only a few milliseconds: dispatching a distributed job costs a few milliseconds each time, and you do it 1000 times, so in this example the @distributed code will take many times longer to execute.
Perhaps your production task takes longer, in which case it makes sense. Or you might consider Threads.@threads instead, as sketched below.
Last but not least, BLAS might be configured to be multi-threaded, and in that scenario it might make no sense to parallelize at all on a single machine (it depends on the scenario).
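For illustration, here is a minimal sketch of such a threaded variant (my own, not benchmarked; it assumes Julia was started with more than one thread and reuses the Tridiagonal A and the b from above):

using Base.Threads, LinearAlgebra

# Threaded sketch: the two solves per time step run on separate threads.
# Threads share memory, so the results can be written straight into `a`
# without any collection step.
function backslash_threaded(A, b, T)
    N = size(b, 1)
    a = zeros(N, 2, T)
    for i = 1:T
        @threads for j = 1:2
            a[:, j, i] = A \ b[:, i]
        end
    end
    return a
end

With only two independent solves per time step, at most a 2-fold speedup is possible, and there is still some thread startup overhead per iteration, though it is much smaller than the cost of dispatching distributed jobs.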
Suppose I have a probability transition matrix, say of dimensions 2000x2000, that represents a homogeneous Markov chain, and I want to compute some statistics of the probability distribution at each of the first 200 steps of the chain (the distribution given by the first row at each step). I've written the following:
using Distributions, LinearAlgebra

# This function defines our transition matrix:
function tm(N::Int, n0::Int)
    [pdf(Hypergeometric(N-l,l,n0),k-l) for l in 0:N, k in 0:N]
end

# This computes the 5-percentile of a probability vector
function percentile5(M::Vector)
    s = 0
    i = 0
    while s <= 0.05
        i += 1
        s += M[i]
    end
    return i-1
end

# This function computes a matrix with three rows: means, 5-percentiles
# and standard deviations. Each column represents a session.
function stats(N::Int, n0::Int, m::Int)
    A = tm(N,n0)
    B = I # Initializing B with the identity matrix
    sup = 0:N # The support of each distribution
    sup2 = [k^2 for k in sup]
    stats = zeros(3,m)
    for i in 1:m
        C = B[1,:]
        stats[1,i] = sum(C .* sup) # Mean
        stats[2,i] = percentile5(C) # 5-percentile
        stats[3,i] = sqrt(sum(C .* sup2) - stats[1,i]^2) # Standard deviation
        B = A*B
    end
    return stats
end
data = stats(2000,50,200)
My question is: is there a more efficient (faster) way to do the same computation? I don't see a better way to do it, but maybe there are some tricks that speed up this computation.
This is what I have running so far:
using Distributions, LinearAlgebra, SparseArrays

# This function defines our transition matrix:
function tm(N::Int, n0::Int)
    [pdf(Hypergeometric(N-l,l,n0),k-l) for l in 0:N, k in 0:N]
end

# This computes the 5-percentile of a probability vector
function percentile5(M::AbstractVector)
    s = zero(eltype(M))
    res = length(M)
    @inbounds for i = 1:length(M)
        s += M[i]
        if s > 0.05
            res = i - 1
            break
        end
    end
    return res
end

# This function computes a matrix with three rows: means, 5-percentiles
# and standard deviations. Each column represents a session.
function stats(N::Int, n0::Int, m::Int)
    A = sparse(transpose(tm(N, n0)))
    C = zeros(size(A, 1))
    C[1] = 1.0
    sup = 0:N # The support of each distribution
    sup2 = sup .^ 2
    stats = zeros(3, m)
    for i = 1:m
        stats[1, i] = sum(C .* sup) # Mean
        stats[2, i] = percentile5(C) # 5-percentile
        stats[3, i] = sqrt(sum(C .* sup2) - stats[1, i]^2) # Standard deviation
        C = A * C
    end
    return stats
end
It is around 4x faster on smaller parameters (possibly a much larger speedup on large parameters). Basically it uses the tips I made in the comments:
using sparse arrays;
avoiding a whole-matrix multiply and using a matrix-vector multiply instead.
Further improvements are possible (like the simulation/ensemble method I mentioned); a rough sketch of that idea follows below.
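To make that last remark concrete, here is one possible reading of the simulation/ensemble idea (my own illustrative sketch with a hypothetical stats_ensemble function, not code from the original comments): instead of propagating the exact distribution, simulate many independent chains and estimate the per-step statistics empirically.

using Statistics

# Illustrative only, not optimized. `tm` is the transition-matrix builder
# defined above; the 5-percentile is approximated here with an empirical
# quantile rather than the exact percentile5.
function stats_ensemble(N::Int, n0::Int, m::Int, nsim::Int = 10_000)
    A = tm(N, n0)
    cdfs = cumsum(A, dims = 2)   # row-wise CDFs for inverse-transform sampling
    states = ones(Int, nsim)     # every chain starts in state 0 (row index 1)
    est = zeros(3, m)
    for i in 1:m
        vals = states .- 1       # map row indices back to the support 0:N
        est[1, i] = mean(vals)
        est[2, i] = quantile(vals, 0.05)
        est[3, i] = std(vals, corrected = false)
        for s in 1:nsim          # advance each chain by one step
            states[s] = min(searchsortedfirst(view(cdfs, states[s], :), rand()), N + 1)
        end
    end
    return est
end

The estimates are only as accurate as the ensemble size allows, but each step costs nsim draws (each a binary search over a row CDF) instead of a full matrix-vector product.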
I have 2 functions:
ccexpan - which calculates the coefficients of the interpolating polynomial of a function f with N nodes, in the basis of Chebyshev polynomials of the first kind.
csum - which calculates values at the arguments t using the coefficients c from ccexpan (via the Clenshaw algorithm).
This is what I have written so far:
function c = ccexpan(f,N)
    z = zeros(1,N+1);
    s = zeros(1,N+1);
    for i = 1:(N+1)
        z(i) = pi*(i-1)/N;
    end
    t = f(cos(z));
    for k = 1:(N+1)
        s(k) = sum(t.*cos(z.*(k-1)));
        s(k) = s(k)-(f(1)+f(-1)*cos(pi*(k-1)))/2;
    end
    c = s.*2/N;
and:
function y = csum(t,c)
    M = length(t);
    N = length(c);
    y = t;
    b = zeros(1,N+2);
    for k = 1:M
        for i = N:-1:1
            b(i) = c(i)+2*t(k)*b(i+1)-b(i+2);
        end
        y(k) = (b(1)-b(3))/2;
    end
Unfortunately these programs are very slow and also slightly inaccurate. Please give me some tips on how to speed them up and how to improve the accuracy.
Where possible try to get away from looping structures. At first blush, I would trade out your first for loop of
for i = 1:(N+1)
    z(i) = pi*(i-1)/N;
end
and replace with
i=1:(N+1)
z = pi*(i-1)/N
I did not check the rest of your code, but the above example will definitely speed up your code. A second strategy is to combine loops when possible.
Martin,
Consider the following strategy.
% create hypothetical N and f
N = 3
f = @(x) 1./(1+15*x.*x)
% calculate z and t
i=1:(N+1)
z = pi*(i-1)/N
t = f(cos(z))
% make a column vector of k's
k = (1:(N+1))'
% do this: s(k) = sum(t.*cos(z.*(k-1)))
s1 = t.*cos(z.*(k-1)) % should be a matrix with one row for each row of k
% via implicit expansion
s2 = sum(s1,2) % row sum, i.e., one value for each row of k
% do this: s(k) = s(k)-(f(1)+f(-1)*cos(pi*(k-1)))/2
s3 = s2 - (f(1)+f(-1)*cos(pi*(k-1)))/2
% calculate c
c = s3 .* 2/N
I'd really appreciate some help on parallelizing the following pseudo code in Julia (and I do apologize in advance for the long post):
P, Q           # both K by N matrices, K = num features and N = num samples
X, Y           # K*4 by N and K*2 by N matrices
tempX, tempY   # column vectors of size K*4 and K*2
ndata          # a dict from parsing a .m file, to be used by a solver with JuMP and Ipopt

# serial version
for i = 1:N
    ndata[P] = P[:, i]   # technically requires a for loop from 1 to K since the dict has to be indexed element-wise
    ndata[Q] = Q[:, i]
    ndata_A = run_solver_A(ndata)   # with a third-party package and JuMP, Ipopt
    ndata_B = run_solver_B(ndata)
    kX = 1; kY = 1
    for j = 1:K
        tempX[kX:kX+3] = [ndata_A[j][a], ndata_A[j][b], P[j, i], Q[j, i]]
        tempY[kY:kY+1] = [ndata_B[j][a], ndata_B[j][b]]
        kX += 4
        kY += 2
    end
    X[:, i] = deepcopy(tempX)
    Y[:, i] = deepcopy(tempY)
end
So this for loop can clearly be executed in parallel: the iterations are independent as long as no column of P and Q is accessed twice and each iteration works on its own column i of P and Q. The only thing I need to be careful about is that column i of X and Y holds the matching pair of tempX and tempY; I don't care as much whether the i = 1, ..., N order is maintained (hopefully that makes sense!).
I read both the official documentation and some online tutorials, and wrote the following with @spawn and fetch, which works for the insertion part once the ndata[j][a] etc. are replaced with the placeholder numbers 1.0 and 180:
using Distributed
addprocs(2)
num_proc = nprocs()

@everywhere function insertPQ(P, Q)
    println(myid())
    data = zeros(4*length(P))
    k = 1
    for i = 1:length(P)
        data[k:k+3] = [1.0, 180., P[i], Q[i]]
        k += 4
    end
    return data
end

P = [0.99, 0.99, 0.99, 0.99]
Q = [-0.01, -0.01, -0.01, -0.01]
for i = 1:5 # should be 4 x 32
    global P = hcat(P, (P .- 0.01))
    global Q = hcat(Q, (Q .- 0.01))
end

datas = zeros(16, 32) # serial result
datap = zeros(16, 0)  # parallel result

@time for i = 1:32
    s = fetch(@spawn insertPQ(P[:, i], Q[:, i]))
    global datap = hcat(datap, s)
end

@time for i = 1:32
    k = 1
    for j = 1:4
        datas[k:k+3, i] = [1.0, 180., P[j, i], Q[j, i]]
        k += 4
    end
end

println(datap == datas)
The above code is fine, but I did notice that the output was consistently worker 2->3->4->5->2... and that it was much slower than the serial case (I'm testing this on my laptop with only 4 cores, but eventually I'll run it on a cluster). It took forever to run once I added run_solver_A/B into insertPQ(), so I had to stop it.
As for pmap(), I couldn't figure out how to pass an entire vector to the function. I probably misunderstood the documentation, but "Transform collection c by applying f to each element using available workers and tasks" sounds like I can only apply it element-wise? That can't be it. I went to a Julia intro session last week and asked the lecturer about this; he said I should use pmap, and I've been trying to make it work since.
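For reference, my best guess at what pmap over whole columns would look like is the sketch below (untested; it reuses insertPQ from above and treats each (p, q) column pair as one element of the collection):

# Collect the column pairs so that pmap's 'elements' are whole columns,
# then stitch the returned vectors back into a matrix.
cols = [(P[:, i], Q[:, i]) for i in 1:size(P, 2)]
results = pmap(pq -> insertPQ(pq[1], pq[2]), cols)
datap2 = reduce(hcat, results)   # should again be 16 x 32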
So, how can I parallelize my original pseudo code? Any help or suggestion is greatly appreciated!
I would like to calculate the sum of the elements of a large upper triangular matrix. The regular Julia code is below.
function upsum(M)
    n = size(M)[1]
    sum = 0
    for i = 1:n-1
        for j = i+1:n
            sum = sum + M[i,j]
        end
    end
    return sum
end
R = randn(10000,10000)
upsum(R)
Since the matrix is very large, I would like to know whether there is any way to improve the speed. How can I use parallel computing here?
I would use threads, not multi-process parallelism, in this case. Here is example code:
using Base.Threads

function upsum_threads(M)
    n = size(M, 1)
    chunks = nthreads()
    sums = zeros(eltype(M), chunks)
    chunkend = round.(Int, n * sqrt.((1:chunks) ./ chunks))
    @assert minimum(diff(chunkend)) > 0
    chunkstart = [2; chunkend[1:end-1] .+ 1]
    @threads for job in 1:chunks
        s = zero(eltype(M))
        for i in chunkstart[job]:chunkend[job]
            @simd for j in 1:(i-1)
                @inbounds s += M[j, i]
            end
        end
        sums[job] = s
    end
    return sum(sums)
end
R = randn(10000,10000)
upsum_threads(R)
It should give you a significant speedup (even if you remove @threads it should be much faster).
You choose the number of threads Julia uses by setting the JULIA_NUM_THREADS environment variable before starting Julia, for example as sketched below.
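A minimal session might look like this (a sketch only; the thread count has to be set before Julia starts, and nthreads() simply reflects that setting):

$ JULIA_NUM_THREADS=4 julia

julia> using Base.Threads

julia> Threads.nthreads()
4

julia> R = randn(10000, 10000);

julia> upsum_threads(R)   # the value differs from run to run since R is random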
I'm trying to determine if there is a way to parallelize the Jacobi method using sparse matrix formats (specifically Compressed Row Format)
I have a working sparse-matrix Jacobi. I don't know whether I can place
!$OMP PARALLEL DO
directives on the middle do loop, because x is being both written to and read from. I guess the inner do loop could take one, but the same t is being overwritten, so I don't know if it is possible there either. Am I overlooking something here? Thanks.
x(:) = 0
do p = 1, numIterations
    do i = 1, n
        t = b(i)
        do j = IA(i), IA(i+1) - 1
            if (j == i) then
                d = A(j)
            else
                t = t - A(j) * x(jA(j))
            end if
        end do
        x(i) = t/d
    end do
end do
It is true that you have a dependency on t in the inner loop, since it is used as an accumulator. However, that also means you can give each thread a private copy of t (since the arrays A and x are not written in the loop, each contribution to t depends only on the value of j, which is also thread-private).
The following should work:
x(:) = 0
do p = 1, numIterations
    do i = 1, n
        t = 0
        !$OMP PARALLEL DO REDUCTION(+:t)
        do j = IA(i), IA(i+1) - 1
            if (j == i) then
                d = A(j)
            else
                t = t + A(j) * x(jA(j))
            end if
        end do
        x(i) = (b(i)-t)/d
    end do
end do
Note that d can only be written by one of the threads, so the variable can be shared between the threads; there are no loop-carried dependencies on d.