How can I set Julia `distributed` - parallel-processing

I have a question about how to use parallel computing in Julia
Following codes do not work
using Distributed
addprocs(10)
#everywhere include("ADMM2.jl")
#everywhere tuning = [0.04, 0.5, 0.1]
#everywhere include("Basicsetting.jl")
#everywhere using SharedArrays
## generate samples
n_simu = 10
Z_set = SharedArray{Float64, 3}(n, r, n_simu)
X_set = SharedArray{Float64, 3}(n, p, n_simu)
Y_set = SharedArray{Float64, 3}(n, q, n_simu)
Binit_set = SharedArray{Float64, 3}(p, r, n_simu)
Ginit_set = SharedArray{Float64, 3}(p, r, n_simu)
for i in 1:n_simu
dataset = get_data(fun_list, n, p, q, B_true, G_true, snr, binary = false)
Z_set[:,:,i] = dataset[:Z_scaled]
X_set[:,:,i] = dataset[:X]
Y_set[:,:,i] = dataset[:Y]
ridge = get_B_ridge(dataset[:Z_scaled], dataset[:X], dataset[:Y], lambda=0.03)
Binit_set[:,:,i] = ridge[:B]
Ginit_set[:,:,i] = ridge[:G]
end
## optimization process
#sync #distributed for i in 1:n_simu
Z = Z_set[:,:,i]
X = X_set[:,:,i]
Y = Y_set[:,:,i]
B = copy(Binit_set[:,:,i])
G = copy(Ginit_set[:,:,i])
result2[i] = get_BG_ADMM3(Z,X,Y,B,G, lambda1=0.05, lambda2=0.2, lambda3=0.05, rho=1.0,
control1 = Dict(:max_iter => 5e1, :tol => 1e-4, :rounding => 0.0),
control2 = Dict(:elesparse_B => true, :lowrank_G => true, :elesparse_G => false, :rowsparse_G => true))
end
Without using distributed, the for loop hasn't any problem operating.

You are not collecting any results in the for loop.
Please note that each variable in a for loop will be created on a different worker process of the Julia cluster.
Normally the best strategy is to used an aggregator function to collect the results (for most scenarios I would also prefer such approach over SharedArrays):
result = #distributed (append!) for i in 1:10
res = rand()+i
[res]
end
BTW, use #views - you are now creating unnecessary copies of your matrices

Related

Using SharedArray and pmap in Julia

I am thinking about using distributed computation in a problem I face. Suppose I have an index k which increases from 1 to 800 (for instance). And for each k, I have a pool p which has large size and many numbers stored in it. I want to get kth-pool recursively. The protocol is like, if I know (k-1)-th pool, then I can randomly choose two values z1, z2 from it and get a new value through a function f like z = f(z1,z2). Then I store it into k-th pool and repeated this many times until this pool is full and then I try to get (k+1)th-pool from kth-pool.
Due to the large size of the pool, I try to use parallel computation to speed up my Julia code. I am trying to use pmap and use a SharedArray as my (k-1)th-pool within each k. So I write the following code
using Distributed
addprocs(10)
#everywhere using LinearAlgebra
#everywhere using StatsBase
#everywhere using Statistics
#everywhere using DoubleFloats
#everywhere using StaticArrays
#everywhere using SharedArrays
#everywhere using JLD
#everywhere using Dates
#everywhere using Random
#everywhere using Printf
#everywhere function rand_haar2(::Val{n}) where n
M = #SMatrix randn(ComplexDF64, n,n)
q = qr(M).Q
L = cispi.(2 .* #SVector(rand(Double64,n)))
return q*diagm(L)
end
#everywhere function pool_calc(theta,pool::SharedArray,Np)
Random.seed!(myid())
pool_store = zeros(Double64,Np)
Kup= #SMatrix[Double64(cos(theta)) 0; 0 Double64(sin(theta))]
Kdown = #SMatrix[Double64(sin(theta)) 0; 0 Double64(cos(theta))]
P2up = kron(#SMatrix[Double64(1.) 0.;0. 1.], #SMatrix[1 0; 0 0])
P2down = kron(#SMatrix[Double64(1) 0;0 1],#SMatrix[0 0;0 1])
poolcount = 0
poolsize = length(pool)
while poolcount < Np
z1 = pool[rand(1:poolsize)]
rho1 = diagm(#SVector[z1,1-z1])
z2 = pool[rand(1:poolsize)]
rho2 = diagm(#SVector[z2,1-z2])
u1 = rand_haar2_slower(Val{2}())
u2 = rand_haar2_slower(Val{2}())
K1up = u1*Kup*u1'
K1down = u1*Kdown*u1'
K2up = u2*Kup*u2'
K2down = u2*Kdown*u2'
rho1p = K1up*rho1*K1up'
rho2p = K2up*rho2*K2up'
p1 = real(tr(rho1p+rho1p'))/2
p2 = real(tr(rho2p+rho2p'))/2
if rand()<p1
rho1p = (rho1p+rho1p')/(2*p1)
else
rho1p = K1down*rho1*K1down'/((1-p1))
end
if rand()<p2
rho2p = (rho2p+rho2p')/(2*p2)
else
rho2p = K2down*rho2*K2down'/((1-p2))
end
rho = kron(rho1p,rho2p)
U = rand_haar2_slower(Val{4}())
rho_p = P2up*U*rho*U'*P2up'
p = real(tr(rho_p+rho_p'))/2
if rand()<p
temp =(rho_p+rho_p')/2
rho_f = #SMatrix[temp[1,1]+temp[2,2] temp[1,3]+temp[2,4]; temp[3,1]+temp[4,2] temp[3,3]+temp[4,4]]/(p)
else
temp = P2down*U*rho*U'*P2down'
rho_f = #SMatrix[temp[1,1]+temp[2,2] temp[1,3]+temp[2,4]; temp[3,1]+temp[4,2] temp[3,3]+temp[4,4]]/(1-p)
end
rho_f = (rho_f+rho_f')/2
t = abs(tr(rho_f*rho_f))
z = (1-t)/(1+abs(sqrt(2*t-1)))
if !iszero(abs(z))
poolcount = poolcount+1
pool_store[poolcount] = abs(z)
end
end
return pool_store
end
function main()
theta = parse(Double64,ARGS[1])
Nk = parse(Int,ARGS[2])
S_curve = zeros(Double64,Nk)
S_var = zeros(Double64,Nk)
Npool = Int(floor(10^6))
pool = SharedArray{Double64}(Npool)
pool_sample = zeros(Double64,Npool)
spool = zeros(Double64,Npool)
pool .=0.5
for k =1:800
ret = pmap(Np->pool_calc(theta = theta,pool=pool,Np=Np),fill(10^5,10))
pool_target = reduce(vcat,[ret[i][1] for i = 1:10])
spool .=-pool_target .*log.(pool_target).-(1.0 .- pool_target).*log1p.(-pool_target)
S_curve[k] = mean(spool)
S_var[k] = (std(spool)/sqrt(Npool))^2
pool = pool_target
end
label = #sprintf "%.3f" Float32(theta)
save("entropy_real_128p_$(label)_ps6.jld","s", S_curve, "t", S_var)
end
main();
But I faced an error
How to solve this problem?
Thanks

Tricks to improve the performance of a cunstom function in Julia

I am replicating using Julia a sequence of steps originally made in Matlab. In Octave, this procedure takes 1.4582 seconds and in Julia (using Jupyter) it takes approximately 10 seconds. I'll try to be brief in the scripts. My goal is to achieve or improve Octave's performance. First of all, I will describe my variables and some function:
zgrid (double 1x7 size)
kgrid (double 500x1 size)
V0 (double 500x7 size)
P (double 7x7 size) a transition matrix
delta and beta are fixed parameters.
F(z,k) and u(c) are particular functions and are specified in the Julia script.
% Octave script
% V0 is given
[K, Z, K2] = meshgrid(kgrid, zgrid, kgrid);
K = permute(K, [2, 1, 3]);
Z = permute(Z, [2, 1, 3]);
K2 = permute(K2, [2, 1, 3]);
C = max(f(Z,K) + (1-delta)*K - K2,0);
U = u(C);
EV = V0*P';% EV is a 500x7 matrix size
EV = permute(repmat(EV, 1, 1, 500), [3, 2, 1]);
H = U + beta*EV;
[TV, index] = max(H, [], 3);
In Julia, I created a function that replicates this procedure. I used loops, but it has a performance 9 times longer.
% Julia script
% V0 is the input of my T operator function
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
F = (z,k) -> exp(z)*(k^α);
u = (c) -> (c^(1-μ) - 1)/(1-μ)
% parameters
α = 1/3
β = 0.987
δ = 0.012;
μ = 2
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500);
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0 , 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
function T(V)
E=V*P'
T1 = zeros(Float64, 500, 7 )
aux = zeros(Float64, 500)
for i = 1:7
for j = 1:500
for l = 1:500
c= maximum( (F(zrid[i],kgrid[j]) +(1-δ)*kgrid[j] - kgrid[l],0))
aux[l] = u(c) + β*E[l,i]
end
T1[j,i] = maximum(aux)
end
end
return T1
end
I would very much like to improve my performance in Julia. I believe there is a way to improve, but I am new in Julia programming.
This code runs for me in 5ms. Note that I have made F and u into proper (not anonymous) functions, F_ and u_, but you could get a similar effect by making the anonymous functions const.
Your main problem is that you have a lot of non-const global variables, and also that your main function is doing unnecessary work multiple times, and creating an unnecessary array, aux.
The performance tips section in the manual is essential reading: https://docs.julialang.org/en/v1/manual/performance-tips/
F_(z,k) = exp(z) * (k^(1/3)); # you can still use α, but it must be const
u_(c) = (c^(1-2) - 1)/(1-2)
function T_(V, P, kgrid, zgrid, β, δ)
E = V * P'
T1 = similar(V)
for i in axes(T1, 2)
for j in axes(T1, 1)
temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
aux = -Inf
for l in eachindex(kgrid)
c = max(0.0, temp - kgrid[l])
aux = max(aux, u_(c) + β * E[l, i])
end
T1[j,i] = aux
end
end
return T1
end
Benchmark:
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
zgrid = sort!(rand(1, 7); dims=2)
kgrid = sort!(rand(500, 1); dims=1)
P = rand(length(zgrid), length(zgrid))
#btime T_($V0, $P, $kgrid, $zgrid, $β, $δ);
# output: 5.126 ms (4 allocations: 54.91 KiB)
The following should perform much better. The most noticeable differences are that it calculates F 500x less, and doesn't rely on global variables.
function T(V,kgrid,zgrid,β,δ)
E=V*P'
T1 = zeros(Float64, 500, 7)
for j = 1:500
for i = 1:7
x = F(zrid[i],kgrid[j]) +(1-δ)*kgrid[j]
T1[j,i] = maximum(u(max(x - kgrid[l], 0)) + β*E[l,i] for l in 1:500)
end
end
return T1
end

How to improve the precessing speed of tensorflow sess.run()

I'm trying to implement Recurrent Lest-Squares filter (RLS) implemented by tensorflow. Use about 300,000 data to train.It costs about 3min. is there any way to reduce the time consumption.
def rls_filter_graphic(x, d, length, Pint, Wint): # Projection matrix
P = tf.Variable(Pint, name='P')
# Filter weight
w = tf.Variable(Wint, name='W')
# Get output of filter
y = tf.matmul(x, w)
# y mast round,if(y>0) y=int(y) else y=int(y+0.5)
y = tf.round(y)
# get expect error which means expect value substract out put value
e = tf.subtract(d, y)
# get gain it is equal as k=P(n-1)x(n)/[lamda+x(n)P(n-1)x(n-1)]
tx = tf.transpose(x)
Px = tf.matmul(P, tx)
xPx = tf.matmul(x, Px)
xPx = tf.add(xPx, LAMDA)
k = tf.div(Px, xPx)
# update w(n) with k(n) as w(n)=w(n-1)+k(n)*err
wn = tf.add(w, tf.matmul(k, e))
# update P(n)=[P(n-1)-K(n)x(n)P(n-1)]/lamda
xP = tf.matmul(x, P)
kxP = tf.matmul(k, xP)
Pn = tf.subtract(P, kxP)
Pn = tf.divide(Pn, LAMDA)
update_P = tf.assign(P, Pn)
update_W = tf.assign(w, wn)
return y, e, w, k, P, update_P, update_W
And this will called in this func
def inter_band_predict(dataset, length, width, height):
img = np.zeros(width * height)
# use iterator get the vectors
itr = dataset.make_one_shot_iterator()
(x, d) = itr.get_next()
# build the predict graphics
ini_P, ini_W, = rls_filter_init(0.001, length)
y_p, err, weight, kn, Pn, update_P, update_W = rls_filter_graphic(x, d, length, ini_P, ini_W)
# err=tf.matmul(x,ini_W)
# init value
with tf.Session() as sess:
init = tf.global_variables_initializer()
sess.run(init)
for i in range(height*width):
[e, trainP, trainW] = sess.run([err, update_P, update_W])
img[i] = e
return img
I found that call sess.run() per every loop is so costly.Is there any way to avoid this.
by the way,the dataset is the form as this
(x, d)
(
[0.1,1.0],[1.0]
[0.0,2.0],[2.0]
[0.2,3.0],[1.0]
……
)
    

The levenberg-marquardt method for solving non-linear equations

I tried implement the levenberg-marquardt method for solving non-linear equations on Julia based on Numerical Optimization using the
Levenberg-Marquardt Algorithm presentation. This my code:
function get_J(ArrOfFunc,X,delta)
N = length(ArrOfFunc)
J = zeros(Float64,N,N)
for i = 1:N
for j=1:N
Temp = copy(X);
Temp[j]=Temp[j]+delta;
J[i,j] = (ArrOfFunc[i](Temp)-ArrOfFunc[i](X))/delta;
end
end
return J
end
function get_resudial(ArrOfFunc,Arg)
return map((x)->x(Arg),ArrOfFunc)
end
function lm_solve(Funcs,Init)
X = copy(Init)
delta = 0.01;
Lambda = 0.01;
Factor = 2;
J = get_J(Funcs,X,delta)
R = get_resudial(Funcs,X)
N = 5
for t = 1:N
G = J'*J+Lambda.*eye(length(X))
dC = J'*R
C = sum(R.*R)/2;
Xnew = X-(inv(G)\dC);
Rnew = get_resudial(Funcs,Xnew)
Cnew = sum(Rnew.*Rnew)/2;
if ( Cnew < C)
X = Xnew;
R = Rnew;
Lambda = Lambda/Factor;
J = get_J(Funcs,X,delta)
else
Lambda = Lambda*Factor;
end
if(maximum(abs(Rnew)) < 0.001)
return X
end
end
return X
end
function test()
ArrOfFunc = [
(X)->X[1]+X[2]-2;
(X)->X[1]-X[2]
];
X = lm_solve(ArrOfFunc,Float64[3;3])
println(X)
return X
end
But from any starting point the step not accepted. What's I doing wrong?
Any help would be appreciated.
I have at the moment no way to test this, but one line does not make sense mathematically:
In the computation of Xnew it should be either inv(G)*dC or G\dC, but not a mix of both. Preferably the second, since the solution of a linear system does not require the computation of the inverse matrix.
With this one wrong calculation at the center of the iteration, the trajectory of the computation is almost surely going astray.

Why I got this Error The variable in a parfor cannot be classified

I'm trying to use parfor to estimate the time it takes over 96 sec and I've more than one image to treat but I got this error:
The variable B in a parfor cannot be classified
this the code I've written:
Io=im2double(imread('C:My path\0.1s.tif'));
Io=double(Io);
In=Io;
sigma=[1.8 20];
[X,Y] = meshgrid(-3:3,-3:3);
G = exp(-(X.^2+Y.^2)/(2*1.8^2));
dim = size(In);
B = zeros(dim);
c = parcluster
matlabpool(c)
parfor i = 1:dim(1)
for j = 1:dim(2)
% Extract local region.
iMin = max(i-3,1);
iMax = min(i+3,dim(1));
jMin = max(j-3,1);
jMax = min(j+3,dim(2));
I = In(iMin:iMax,jMin:jMax);
% Compute Gaussian intensity weights.
H = exp(-(I-In(i,j)).^2/(2*20^2));
% Calculate bilateral filter response.
F = H.*G((iMin:iMax)-i+3+1,(jMin:jMax)-j+3+1);
B(i,j) = sum(F(:).*I(:))/sum(F(:));
end
end
matlabpool close
any Idea?
Unfortunately, it's actually dim that is confusing MATLAB in this case. You can fix it by doing
[n, m] = size(In);
parfor i = 1:n
for j = 1:m
B(i, j) = ...
end
end

Resources