Create sparse matrix in parallel in Julia - parallel-processing

I am trying to parallelize the creation of a sparse matrix in Julia. Inspired by this post, I am trying this:
using Distributed
addprocs(4)
@everywhere using DistributedArrays
rows = [Int[] for _ in procs()]
cols = [Int[] for _ in procs()]
vals = [Float64[] for _ in procs()]
distribute(rows)
distribute(cols)
distribute(vals)
@sync @distributed for i = 1:1000
for j = 1:1000
v = exp(-(i - j)^2)
if v > 0.1
push!(localpart(rows)[1], i)
push!(localpart(cols)[1], j)
push!(localpart(vals)[1], v)
end
end
end
ROWS = vcat(rows...)
COLS = vcat(cols...)
VALS = vcat(vals...)
K = sparse(ROWS, COLS, VALS)
# K = 0×0 SparseMatrixCSC{Float64, Int64} with 0 stored entries
This outputs an empty matrix; it never gets filled. But I found that if I call @fetchfrom 2 rows, the rows array on that worker is not empty. So it seems that the results are just not being combined.
How can I fix this?

distribute(rows) does not modify rows to be distributed; it returns a new distributed array filled with the input. You have to work with its result, something like
rows = distribute([Int[] for _ in procs()])
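The same idea, that each worker fills its own buffers and the results are combined explicitly afterwards, can be sketched outside Julia. What follows is a Python analogue and not DistributedArrays code: threads stand in for Julia workers, and the names triplets_for_rows and build_triplets are made up for this illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def triplets_for_rows(row_range, n):
    """Build (rows, cols, vals) for the given rows only; no shared state is touched."""
    rows, cols, vals = [], [], []
    for i in row_range:
        for j in range(1, n + 1):
            v = math.exp(-(i - j) ** 2)
            if v > 0.1:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals

def build_triplets(n, n_workers=4):
    """Split rows 1..n into chunks, process each chunk separately, then concatenate."""
    step = -(-n // n_workers)  # ceiling division
    chunks = [range(s, min(s + step, n + 1)) for s in range(1, n + 1, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda r: triplets_for_rows(r, n), chunks))
    # The combining step has to work on what the workers returned,
    # not on the empty lists that existed before the work started.
    rows = [i for p in parts for i in p[0]]
    cols = [j for p in parts for j in p[1]]
    vals = [v for p in parts for v in p[2]]
    return rows, cols, vals

rows, cols, vals = build_triplets(100)
print(len(vals))  # number of stored entries
```

The point that carries over is the last step: the final triplets are built from the values the workers hand back, which is what the original code was missing when it concatenated the untouched local rows, cols and vals.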

Related

Using SharedArray and pmap in Julia

I am thinking about using distributed computation for a problem I face. Suppose I have an index k which increases from 1 to 800 (for instance). For each k, I have a large pool p with many numbers stored in it. I want to build the k-th pool recursively. The protocol is: if I know the (k-1)-th pool, I randomly choose two values z1, z2 from it and get a new value through a function f, z = f(z1,z2). I store it in the k-th pool and repeat this many times until the pool is full, and then I build the (k+1)-th pool from the k-th pool.
Because the pools are large, I am trying to use parallel computation to speed up my Julia code. I am trying to use pmap with a SharedArray as my (k-1)-th pool within each k. So I wrote the following code:
using Distributed
addprocs(10)
@everywhere using LinearAlgebra
@everywhere using StatsBase
@everywhere using Statistics
@everywhere using DoubleFloats
@everywhere using StaticArrays
@everywhere using SharedArrays
@everywhere using JLD
@everywhere using Dates
@everywhere using Random
@everywhere using Printf
@everywhere function rand_haar2(::Val{n}) where n
M = @SMatrix randn(ComplexDF64, n,n)
q = qr(M).Q
L = cispi.(2 .* @SVector(rand(Double64,n)))
return q*diagm(L)
end
@everywhere function pool_calc(theta,pool::SharedArray,Np)
Random.seed!(myid())
pool_store = zeros(Double64,Np)
Kup= @SMatrix[Double64(cos(theta)) 0; 0 Double64(sin(theta))]
Kdown = @SMatrix[Double64(sin(theta)) 0; 0 Double64(cos(theta))]
P2up = kron(@SMatrix[Double64(1.) 0.;0. 1.], @SMatrix[1 0; 0 0])
P2down = kron(@SMatrix[Double64(1) 0;0 1],@SMatrix[0 0;0 1])
poolcount = 0
poolsize = length(pool)
while poolcount < Np
z1 = pool[rand(1:poolsize)]
rho1 = diagm(@SVector[z1,1-z1])
z2 = pool[rand(1:poolsize)]
rho2 = diagm(@SVector[z2,1-z2])
u1 = rand_haar2_slower(Val{2}())
u2 = rand_haar2_slower(Val{2}())
K1up = u1*Kup*u1'
K1down = u1*Kdown*u1'
K2up = u2*Kup*u2'
K2down = u2*Kdown*u2'
rho1p = K1up*rho1*K1up'
rho2p = K2up*rho2*K2up'
p1 = real(tr(rho1p+rho1p'))/2
p2 = real(tr(rho2p+rho2p'))/2
if rand()<p1
rho1p = (rho1p+rho1p')/(2*p1)
else
rho1p = K1down*rho1*K1down'/((1-p1))
end
if rand()<p2
rho2p = (rho2p+rho2p')/(2*p2)
else
rho2p = K2down*rho2*K2down'/((1-p2))
end
rho = kron(rho1p,rho2p)
U = rand_haar2_slower(Val{4}())
rho_p = P2up*U*rho*U'*P2up'
p = real(tr(rho_p+rho_p'))/2
if rand()<p
temp =(rho_p+rho_p')/2
rho_f = @SMatrix[temp[1,1]+temp[2,2] temp[1,3]+temp[2,4]; temp[3,1]+temp[4,2] temp[3,3]+temp[4,4]]/(p)
else
temp = P2down*U*rho*U'*P2down'
rho_f = @SMatrix[temp[1,1]+temp[2,2] temp[1,3]+temp[2,4]; temp[3,1]+temp[4,2] temp[3,3]+temp[4,4]]/(1-p)
end
rho_f = (rho_f+rho_f')/2
t = abs(tr(rho_f*rho_f))
z = (1-t)/(1+abs(sqrt(2*t-1)))
if !iszero(abs(z))
poolcount = poolcount+1
pool_store[poolcount] = abs(z)
end
end
return pool_store
end
function main()
theta = parse(Double64,ARGS[1])
Nk = parse(Int,ARGS[2])
S_curve = zeros(Double64,Nk)
S_var = zeros(Double64,Nk)
Npool = Int(floor(10^6))
pool = SharedArray{Double64}(Npool)
pool_sample = zeros(Double64,Npool)
spool = zeros(Double64,Npool)
pool .=0.5
for k =1:800
ret = pmap(Np->pool_calc(theta = theta,pool=pool,Np=Np),fill(10^5,10))
pool_target = reduce(vcat,[ret[i][1] for i = 1:10])
spool .=-pool_target .*log.(pool_target).-(1.0 .- pool_target).*log1p.(-pool_target)
S_curve[k] = mean(spool)
S_var[k] = (std(spool)/sqrt(Npool))^2
pool = pool_target
end
label = @sprintf "%.3f" Float32(theta)
save("entropy_real_128p_$(label)_ps6.jld","s", S_curve, "t", S_var)
end
main();
But I faced an error
How to solve this problem?
Thanks

get pairs / triple / quadruple... of elements from vector by function

I have a vector with a couple of elements and I want to write a function that returns me all combinations of x items from this vector.
The following code produces the right output for the case x=2 or x=3 or x=4.
However, I can not implement a solution for every possible x following this idea.
values = {'A','B','C','D','E'};
n = length(values);
data2 = {}; % case x=2
for i = 1:n
for j = i+1:n
data2{end+1} = {values{i}, values{j}};
fprintf('%s %s\n',values{i}, values{j})
end
end
data3 = {}; % case x=3
for i = 1:n
for j = i+1:n
for k = j+1:n
data3{end+1} = {values{i}, values{j}, values{k}};
fprintf('%s %s %s\n',values{i}, values{j}, values{k})
end
end
end
data4 = {}; % case x=4
for i = 1:n
for j = i+1:n
for k = j+1:n
for l = k+1:n
data4{end+1} = {values{i}, values{j}, values{k}, values{l}};
fprintf('%s %s %s %s\n',values{i}, values{j}, values{k}, values{l})
end
end
end
end
What would a function look like that returns my data variable?
data = getCombinations(values, x) %values is vector with elements, x is integer value
EDIT
The following code comes pretty close:
data = perms(values)
data = data(:,1:x)
data = unique(data,'rows')
but it still produces output like A,B and B,A
EDIT2
This fixed it somehow but it is not very nice to look at and it does not work for text entries in cells but only for numbers
data = perms(values)
data = data(:,1:x)
data = sort(data,2)
data = unique(data,'rows')
EDIT3
This did it but it is not very nice to look at... Maybe there is a better solution?
function [data] = getCombinations(values,x)
i = 1:length(values);
d = perms(i);
d = d(:,1:x);
d = sort(d,2);
d = unique(d,'rows');
data = values(d);
end
If you don't want repetitions (and your example suggests you don't) then try nchoosek as nchoosek(1:n, x) to give indices:
values = {'A','B','C','D','E'};
n = length(values);
x = 3;
C = nchoosek(1:n, x);
data = values(C)
In the above, each row is a unique combination of 3 of the 5 elements of values.
Alternatively pass in the values directly:
data = nchoosek(values, x);
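For comparison, the same "choose x of n without repetition" operation exists in other standard libraries. A minimal sketch in Python, where the wrapper name get_combinations is only chosen to mirror the question and is not an existing API:

```python
from itertools import combinations

def get_combinations(values, x):
    """Return every x-element combination of values, keeping the input order."""
    return [list(c) for c in combinations(values, x)]

data = get_combinations(['A', 'B', 'C', 'D', 'E'], 3)
print(len(data))  # 10, i.e. "5 choose 3"
print(data[0])    # ['A', 'B', 'C']
```

As with nchoosek, each combination appears once, so A,B and B,A are not both produced.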

Image Processing: Algorithm taking too long in MATLAB

I am working in MATLAB to process two 512x512 images, the domain image and the range image. What I am trying to accomplish is the following:
Divide both domain and range images into 8x8 pixel blocks
For each 8x8 block in the domain image, I have to apply a linear transformation to it and compare each of the 4096 transformed blocks with each of the 4096 range blocks.
Compute error in each case between the transformed block and the range image block and find the minimum error.
Finally I'll have for each 8x8 range block, the id of the 8x8 domain block for which the error was minimum (error between the range block and the transformed domain block)
To achieve this, I have written the following code:
RangeImagecolor = imread('input.png'); %input is 512x512
DomainImagecolor = imread('input.png'); %Range and Domain images are identical
RangeImagetemp = rgb2gray(RangeImagecolor);
DomainImagetemp = rgb2gray(DomainImagecolor);
RangeImage = im2double(RangeImagetemp);
DomainImage = im2double(DomainImagetemp);
%For the (k,l)th 8x8 range image block
for k = 1:64
for l = 1:64
minerror = 9999;
min_i = 0;
min_j = 0;
for i = 1:64
for j = 1:64
%here I compute for the (i,j)th domain block, the transformed domain block stored in D_trans
error = 0;
D_trans = zeros(8,8);
R = zeros(8,8); %Contains the pixel values of the (k,l)th range block
for m = 1:8
for n = 1:8
R(m,n) = RangeImage(8*k-8+m,8*l-8+n);
%ApplyTransformation can depend on (k,l) so I can't compute the transformation outside the k,l loop.
[m_dash,n_dash] = ApplyTransformation(8*i-8+m,8*j-8+n);
D_trans(m,n) = DomainImage(m_dash,n_dash);
error = error + (R(m,n)-D_trans(m,n))^2;
end
end
if(error < minerror)
minerror = error;
min_i = i;
min_j = j;
end
end
end
end
end
As an example ApplyTransformation, one can use the identity transformation:
function [x_dash,y_dash] = Iden(x,y)
x_dash = x;
y_dash = y;
end
Now the problem I am facing is the high computation time. The order of computation in the above code is 64^5, which is of the order 10^9. This computation should take at the worst minutes or an hour. It takes about 40 minutes to compute just 50 iterations. I don't know why the code is running so slow.
Thanks for reading my question.
You can use im2col* to convert the image to column format so each block forms a column of a [64 * 4096] matrix. Then apply transformation to each column and use bsxfun to vectorize computation of error.
DomainImage=rand(512);
RangeImage=rand(512);
DomainImage_col = im2col(DomainImage,[8 8],'distinct');
R = im2col(RangeImage,[8 8],'distinct');
[x y]=ndgrid(1:8);
function [x_dash, y_dash] = ApplyTransformation(x,y)
x_dash = x;
y_dash = y;
end
[x_dash, y_dash] = ApplyTransformation(x,y);
idx = sub2ind([8 8],x_dash, y_dash);
D_trans = DomainImage_col(idx,:); %transformation is reduced to matrix indexing
Error = 0;
for mn = 1:64
Error = Error + bsxfun(@minus,R(mn,:).',D_trans(mn,:)).^2; % Error(kl,ij): rows are range blocks, columns are domain blocks
end
[minerror ,min_ij]= min(Error,[],2); % linear index of minimum of each block;
[min_i min_j]=ind2sub([64 64],min_ij); % convert linear index to subscript
Explanation:
Our goal is to reduce the number of loops as much as possible. To do that, we should avoid element-wise matrix indexing and use vectorization instead. Nested loops should be converted to one loop. As a first step we can create a more optimized loop:
min_ij = zeros(4096,1);
for kl = 1:4096 %%% => 1:size(D_trans,2)
minerror = 9999;
min_ij(kl) = 0;
for ij = 1:4096 %%% => 1:size(R,2)
Error = 0;
for mn = 1:64
Error = Error + (R(mn,kl) - D_trans(mn,ij)).^2;
end
if(Error < minerror)
minerror = Error;
min_ij(kl) = ij;
end
end
end
We can re-arrange the loops, making the innermost loop the outermost one, and separate the computation of the minimum from the computation of the error.
% Computation of the error
Error = zeros(4096,4096);
for mn = 1:64
for kl = 1:4096
for ij = 1:4096
Error(kl,ij) = Error(kl,ij) + (R(mn,kl) - D_trans(mn,ij)).^2;
end
end
end
% Computation of the min
min_ij = zeros(4096,1);
for kl = 1:4096
minerror = 9999;
min_ij(kl) = 0;
for ij = 1:4096
if(Error(kl,ij) < minerror)
minerror = Error(kl,ij);
min_ij(kl) = ij;
end
end
end
Now the code is arranged in a way that can best be vectorized:
Error = 0;
for mn = 1:64
Error = Error + bsxfun(@minus,R(mn,:).',D_trans(mn,:)).^2; % Error(kl,ij): rows are range blocks, columns are domain blocks
end
[minerror ,min_ij] = min(Error, [], 2);
[min_i ,min_j] = ind2sub([64 64], min_ij);
*If you don't have the Image Processing Toolbox a more efficient implementation of im2col can be found here.
*The whole computation takes less than a minute.
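To make the restructuring concrete, here is a small sketch of the two steps (flatten every block into one vector, then search for the minimum squared error per range block) in plain Python. It is meant to show the data layout only, not the speed of the vectorized MATLAB version; the helper names to_blocks and best_matches and the tiny 4x4 test image are assumptions made for this example, and the block order is row by row rather than MATLAB's column-major order.

```python
def to_blocks(image, b):
    """Split a square image (list of rows) into non-overlapping b x b blocks.
    Each block is flattened to a list; blocks are ordered row by row."""
    n = len(image)
    blocks = []
    for r in range(0, n, b):
        for c in range(0, n, b):
            blocks.append([image[r + m][c + k] for m in range(b) for k in range(b)])
    return blocks

def best_matches(range_blocks, domain_blocks):
    """For each range block, return (index, error) of the closest domain block."""
    result = []
    for rb in range_blocks:
        # Squared error against every domain block, computed on flat vectors.
        errors = [sum((p - q) ** 2 for p, q in zip(rb, db)) for db in domain_blocks]
        best = min(range(len(errors)), key=errors.__getitem__)
        result.append((best, errors[best]))
    return result

# Tiny stand-in image; with identical range and domain images and the
# identity transformation, every block should match itself with error 0.
image = [[(3 * r + 5 * c) % 7 for c in range(4)] for r in range(4)]
blocks = to_blocks(image, 2)
matches = best_matches(blocks, blocks)
print(matches)
```

Once every block is a flat vector, the transformation becomes a reordering of vector entries and the error becomes a distance between vectors, which is what makes the vectorized form possible.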
First things first - your code doesn't do anything. But you likely do something with this minimum error stuff and only forgot to paste this here, or still need to code that bit. Never mind for now.
One big issue with your code is that you calculate transformation for 64x64 blocks of resulting image AND source image. 64^5 iterations of a complex operation are bound to be slow. Rather, you should calculate all transformations at once and save them.
allTransMats = cell(64);
for i = 1 : 64
for j = 1 : 64
allTransMats{i,j} = getTransformation(DomainImage, i, j)
end
end
function D_trans = getTransformation(DomainImage, i,j)
D_trans = zeros(8);
for m = 1 : 8
for n = 1 : 8
[m_dash,n_dash] = ApplyTransformation(8*i-8+m,8*j-8+n);
D_trans(m,n) = DomainImage(m_dash,n_dash);
end
end
end
This serves to get allTransMats and is OUTSIDE the k, l loop. Preferably as a simple function.
Now, you make your big k, l, i, j loop, where you compare all the elements as needed. The comparison can also be done block-wise, instead of filling a small 8x8 matrix and comparing element by element.
m = 1 : 8;
n = m;
for ...
R = RangeImage(...); % This will give 8x8 output as n and m are vectors.
D = allTransMats{i,j};
difference = sum(sum((R-D).^2));
if (difference < minDifference) ...
end
Even though this is a simple no transformations case, this speeds up code a lot.
Finally, are you sure you need to compare each block of transformed output with each block in the source? Typically you compare block1(a,b) with block2(a,b) - blocks (or pixels) on the same position.
EDIT: allTransMats requires k and l too. Ouch. There is NO WAY to make this fast for a single iteration, as you require 64^5 calls to ApplyTransformation (or a vectorization of that function, but even then it might not be fast - we would have to see the function to help here).
Therefore, I will re-iterate my advice to generate all transformations and then perform lookup: this upper part of the answer with allTransMats generation should be changed to have all 4 loops and generate allTransMats{i,j,k,l};. It WILL be slow, there is no way around that as I mentioned in the upper part of edit. But, it is a cost you pay once, as after saving the allTransMats, all further image analyses will be able to simply load it instead of generating it again.
But ... what do you even do? Transformation that depends on source and destination block indices plus pixel indices (= 6 values total) sounds like a mistake somewhere, or a prime candidate to optimize instead of all the rest.

use different arrays for each workers instead of SharedArrays in Julia

I have a function like this:
@everywhere function bellman_operator!(rbc::RBC)
...
@sync @parallel for i = 1:m
....
for j = 1:n
v_max = -1000.0
...
for l = Next : n
......
if v > vmax
vmax = v
Next = l
else
break
end
end
f_v[j, i] = vmax
f_p[j, i] = k
end
end
end
f_v and f_p are SharedArrays. I want each worker to write its results into its own array. I saw some samples but couldn't get them to work. How can I use a separate array for each worker's results and combine them at the end, instead of using SharedArrays?
Is this what you want?
Example 1. Combining results using +:
a = @parallel (+) for i in 1:1000
rand(10, 10)
end
Example 2. Just collecting the results without combining them:
x = Future[]
for i in 1:1000
push!(x, @spawn rand(10,10))
end
y = fetch.(x)
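The difference between the two examples, combining results with an operator versus just collecting them, is not specific to Julia. A rough Python sketch of both, where threads stand in for workers and the functions work and add_arrays are hypothetical placeholders for the real per-iteration computation:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def work(i):
    """Stand-in for one loop iteration: returns its own small result array."""
    return [i, i * i]

def add_arrays(a, b):
    """Element-wise sum of two equally long lists."""
    return [x + y for x, y in zip(a, b)]

# Collecting: one private result per iteration, nothing shared between tasks.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(1, 11)))

# Combining: fold the collected results with +, like the (+) reduction above.
combined = reduce(add_arrays, results)
print(combined)  # [55, 385]
```

In both variants each task owns the array it returns, so no shared array is needed; only the final step differs.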

Faster concatenation of cell arrays of different sizes

I have a cell array of size m x 1 and each cell is again s x t cell array (size varies). I would like to concatenate vertically. The code is as follows:
function cell_out = vert_cat(cell_in)
[row,col] = cellfun(@size,cell_in,'Uni',0);
fcn_vert = @(x)([x,repmat({''},size(x,1),max(cell2mat(col))-size(x,2))]);
cell_out = cellfun(fcn_vert,cell_in,'Uni',0); % Taking up lot of time
cell_out = vertcat(cell_out{:});
end
Step 3 (the second cellfun call) takes a lot of time. Is this the right way to do it, or is there a faster way to achieve this?
cellfun has been found to be slower than loops (kind of old, but agrees with what I have seen).
In addition, repmat has also been a performance hit in the past (though that may be different now).
Try this two-loop code that aims to accomplish your task:
function cellOut = vert_cat(c)
nElem = length(c);
colPad = zeros(nElem,1);
nRow = zeros(nElem,1);
for k = 1:nElem
[nRow(k),colPad(k)] = size(c{k});
end
colMax = max(colPad);
colPad = colMax - colPad;
cellOut = cell(sum(nRow),colMax);
bottom = cumsum(nRow) - nRow + 1;
top = bottom + nRow - 1;
for k = 1:nElem
cellOut(bottom(k):top(k),:) = [c{k},cell(nRow(k),colPad(k))];
end
end
My test for this code was
A = rand(20,20);
A = mat2cell(A,ones(20,1),ones(20,1));
C = arrayfun(@(c) A(1:c,1:c),randi([1,15],1,5),'UniformOutput',false);
ccat = vert_cat(C);
I used this piece of code to generate data:
%generating some dummy data
m=1000;
s=100;
t=100;
cell_in=cell(m,1);
for idx=1:m
cell_in{idx}=cell(randi(s),randi(t));
end
Applying some minor modifications, I was able to speed up the code by a factor of 5
%Minor modifications of the original code
%use arrays instead of cells for row and col
[row,col] = cellfun(@size,cell_in);
%calculate max(col) once
tcol=max(col);
%use cell instead of repmat to generate an empty cell
fcn_vert = @(x)([x,cell(size(x,1),tcol-size(x,2))]);
cell_out = cellfun(fcn_vert,cell_in,'Uni',0); % Taking up lot of time
cell_out = vertcat(cell_out{:});
Using simply a for loop is even faster, because the data is only moved once
%new approach. Basic idea: move every data only once
[row,col] = cellfun(@size,cell_in);
trow=sum(row);
tcol=max(col);
r=1;
cell_out2 = cell(trow,tcol);
for idx=1:numel(cell_in)
cell_out2(r:r+row(idx)-1,1:col(idx))=cell_in{idx};
r=r+row(idx);
end
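The "find the maximum width once, then write each row once" idea from the loop above can be shown in a few lines of Python as well. This is only an illustration of the padding logic with nested lists standing in for cell arrays; the function name vert_cat is reused from the question and the fill argument is an assumption.

```python
def vert_cat(tables, fill=''):
    """Stack tables (lists of rows) vertically, padding short rows on the right."""
    # Compute the target width once, up front.
    width = max((len(row) for t in tables for row in t), default=0)
    out = []
    for t in tables:
        for row in t:
            # Each row is copied exactly once into the output.
            out.append(list(row) + [fill] * (width - len(row)))
    return out

tables = [[['a', 'b']], [['c'], ['d']], [['e', 'f', 'g']]]
result = vert_cat(tables)
print(result)
```

Here the output has four rows of width three, with empty strings filling the positions that the narrower inputs did not have.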
