Huge memory allocation running a julia function? - performance

I am trying to run the following function at the Julia prompt, but when I time it I see far more memory allocation than I expected, and I can't figure out why.
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
for m = 1:length(Pf)
i = 0
for k = 1:iters
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
energy_fin = (y'*y) / L
@inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
@inbounds Pd[m] = i/iters
end
#thresh = erfcinv(2Pf) * sqrt(2/L) + 1
#Pd_the = 0.5 * erfc(((thresh - (snr + 1)) * sqrt(L)) / (2*(snr + 1)))
end
Running that function at the Julia prompt on my laptop, I get the following shocking numbers:
julia> @time pdpf(1000, 10000)
17.621551 seconds (9.00 M allocations: 30.294 GB, 7.10% gc time)
What is wrong with my code? Any help is appreciated.

I don't think this memory allocation is so surprising. For instance, consider all of the times that the inner loop gets executed:
for m = 1:length(Pf) this gives you 100 executions
for k = 1:iters this gives you 10,000 executions based on the arguments you supply to the function.
randn(L) this gives you a random vector of length 1,000, based on the arguments you supply to the function.
Thus, just considering these, you've got 100*10,000*1000 = 1 billion Float64 random numbers being generated. Each one of them takes 64 bits = 8 bytes. I.e. 8GB right there. And, you've got two calls to randn(L) which means that you're at 16GB allocations already.
You then have y = s + n, which means another 8GB of allocations, taking you up to 24GB. I haven't looked in detail at the remaining code to get you from 24GB to 30GB of allocations, but this should show you that it's not hard for the GB allocations to start adding up in your code.
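As a rough cross-check of the arithmetic above, here is a small Python sketch that tallies the bytes for the vectors created in each inner iteration. It is only an estimate under stated assumptions: the helper name estimate_bytes is made up for illustration, per-array header overhead is ignored, and the four-vector case assumes that sqrt(snr) * randn(L) creates a temporary vector of its own in addition to the one returned by randn(L), which is one possible source of the gap between 24GB and 30GB.

```python
def estimate_bytes(L, iters, n_thresholds, vectors_per_iter, bytes_per_float=8):
    """Total bytes allocated for Float64 vectors across all inner iterations."""
    inner_iterations = n_thresholds * iters
    return inner_iterations * vectors_per_iter * L * bytes_per_float

# Vectors counted in the answer: n, the randn(L) used for s, and y = s + n.
three = estimate_bytes(L=1000, iters=10_000, n_thresholds=100, vectors_per_iter=3)
# Assumption: the scaled product sqrt(snr) * randn(L) is a fourth vector.
four = estimate_bytes(L=1000, iters=10_000, n_thresholds=100, vectors_per_iter=4)

print(three / 1e9)    # decimal gigabytes
print(four / 2**30)   # binary gigabytes (GiB)
```

The three-vector count gives 24 decimal GB, matching the answer. The four-vector count gives roughly 29.8 GiB, which would be close to the reported 30.294 GB if that figure is in binary units.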
If you're looking at places to improve, I'll give you a hint that these lines can be improved by using the properties of normal random variables:
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
You should easily be able to cut down the allocations here from 24GB to 8GB in this way. Note that y will be a normal random variable here as you've defined it, and think up a way to generate a normal random variable with an identical distribution to what y has now.
Another small thing, snr is a constant inside your function. Yet, you keep taking its sqrt 1 million separate times. In some settings, 'checking your work' can be helpful, but I think that you can be confident the computer will get it right the first time and thus you don't need to make it keep re-doing this calculation ; ). There are other similar places you can improve your code to avoid duplicate computations here that I'll leave to you to locate.

aireties gives a good answer for why you have so many allocations. You can do more to reduce the number of allocations. Using the property that the sum of independent normal random variables is itself normal, with variances that add, we know that y = s+n is really y = sqrt(snr) * randn(L) + randn(L), and so we can instead do y = rvvar*randn(L), where rvvar = sqrt(1+sqrt(snr)^2) is defined outside the loop (thanks for the fix!). This will halve the number of random variables needed.
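The substitution can be sanity-checked numerically. The Python sketch below is only an illustration (the helper variance, the seed and the sample count are arbitrary choices, not part of the original code): it draws samples both ways and compares their variances with 1 + snr.

```python
import math
import random

def variance(samples):
    """Population variance of a list of floats."""
    mean = sum(samples) / len(samples)
    return sum((x - mean) ** 2 for x in samples) / len(samples)

snr = 10 ** (-10 / 10)      # same snr as in the question (-10 dB)
rng = random.Random(0)      # fixed seed so the run is repeatable
num_samples = 200_000

# Two draws per sample: s = sqrt(snr) * N(0,1) plus n = N(0,1).
two_draws = [math.sqrt(snr) * rng.gauss(0, 1) + rng.gauss(0, 1)
             for _ in range(num_samples)]

# One draw per sample, scaled by the combined standard deviation.
combined_std = math.sqrt(1 + snr)
one_draw = [combined_std * rng.gauss(0, 1) for _ in range(num_samples)]

var_two = variance(two_draws)
var_one = variance(one_draw)
print(var_two, var_one, 1 + snr)  # the three values should be close
```

Both sample variances should land near 1 + snr = 1.1, which is why one scaled draw can stand in for the sum of two.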
Outside the loop you can save sqrt(2/L) to cut down a little bit of time.
I don't think transpose is special-cased yet, so try using dot(y,y) instead of y'*y. I know dot for sure is just a loop without having to transpose, while the other may transpose depending on the version of Julia.
Something that would help performance (but not allocations) would be to use one big randn(L,iters) and loop through that. The reason is that generating all of your random numbers at once is faster, since it can use SIMD and a bunch of other goodies. If you want to implicitly do that without changing your code much, you can use ChunkedArrays.jl where you can use rands = ChunkedArray(randn,L) to initialize it and then every time you want a randn(L), you instead use next(rands). Inside the ChunkedArray it actually makes bigger vectors and replenishes them as needed, but like this you can just get your randn(L) without having to keep track of all of that.
Edit:
ChunkedArrays probably only save time when L is smaller. This gives the code:
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
rvvar= sqrt(1+sqrt(snr)^2)
for m = 1:length(Pf)
i = 0
for k = 1:iters
y = rvvar*randn(L)
energy_fin = (y'*y) / L
@inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
@inbounds Pd[m] = i/iters
end
end
which runs in half the time of the version with two randn calls. Indeed, from ProfileView we get:
@profile pdpf(1000, 10000)
using ProfileView
ProfileView.view()
I circled the two parts for the line y = rvvar*randn(L), so the vast majority of the time is random number generation. Last time I checked you could still get a decent speedup on random number generation by changing to the VSL.jl library, but you need MKL linked to your Julia build. Note that from the Google Summer of Code page you can see that there is a project to make a repo RNG.jl with faster pseudo-RNGs. It looks like it already has a few new ones implemented. You may want to check them out and see if they give speedups (or help out with that project!)

Related

How to increase stack size for Julia in Windows?

I wrote a recursive function (basically a flood fill), it works fine on smaller datasets, but for slightly larger input it throws StackOverflowError.
How to increase the stack size for Julia under Windows 10? Ideally the solution should be applicable to JupyterLab.
It's a single-use program, so there is no point in optimizing or rewriting it; I just need to peek at the result and forget about the code.
Update: As a test case, I provide the following MWE. This is just a simple algorithm that recursively visits each cell of n by n array:
n = 120
visited = fill(false, (n,n))
function visit_single_neighbour(i,j,Δi,Δj)
if 1 ≤ i + Δi ≤ n && 1 ≤ j + Δj ≤ n
if !visited[i+Δi, j+Δj]
visited[i+Δi, j+Δj] = true
visit_four_neighbours(i+Δi, j+Δj)
end
end
end
function visit_four_neighbours(i,j)
visit_single_neighbour(i,j,1,0)
visit_single_neighbour(i,j,0,1)
visit_single_neighbour(i,j,-1,0)
visit_single_neighbour(i,j,0,-1)
end
@time visit_four_neighbours(1,1)
For n = 120 the output is 0.003341 seconds, but for n = 121 it throws StackOverflowError.
On a Linux machine with ulimit -s unlimited the code runs no problem for n = 2000 and takes about 2.4 seconds.
I've mirrored the question to Julia Discourse: https://discourse.julialang.org/t/ow-to-increase-stack-size-for-julia-in-windows/79932
As you are no doubt aware, Julia is not very optimized for recursion and the recommendation will probably always be to rewrite the code in some way.
With that said there are of course ways to increase the stack limit. One undocumented way to achieve it from inside julia is to reserve stack space when creating a Task:
Core.Task(f, reserved_stack::Int=0)
Let's create a function wrapping such a task:
with_stack(f, n) = fetch(schedule(Task(f,n)))
For n = 2000 the following works on both Windows and Linux (as long as enough memory is available):
julia> with_stack(2_000_000_000) do
visit_four_neighbours(1,1)
end
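For completeness, the "rewrite the code in some way" route usually means replacing the call stack with an explicit one. The following Python sketch shows the idea for the same n by n grid; it is an illustration rather than a translation of the Julia code (the name visit_all is made up, indices are 0-based, and cells are visited in a different order than in the recursive version, although the same set of cells is reached).

```python
def visit_all(n, start=(0, 0)):
    """Visit every cell of an n-by-n grid reachable from `start` without recursion."""
    visited = [[False] * n for _ in range(n)]
    visited[start[0]][start[1]] = True
    stack = [start]                      # explicit stack replaces the call stack
    count = 0
    while stack:
        i, j = stack.pop()
        count += 1
        for di, dj in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n and not visited[ni][nj]:
                visited[ni][nj] = True   # mark on push so each cell is pushed once
                stack.append((ni, nj))
    return count, visited

count, visited = visit_all(500)
print(count)
```

Because the pending cells live in an ordinary list on the heap, the grid size is limited by available memory rather than by the thread's stack size.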

Vectorization or alternative to speed up MATLAB loop

I am using MATLAB to run a for loop in which variable-length portions of a large vector are updated at each iteration with the content of another vector; something like:
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) +...
a(k)*vec2(idx_start2(k):idx_end2(k));
end
The selected portions of vec1 and vec2 are not so small and N can be quite large; moreover, if this can be useful, idx_end(k)<idx_start(k+1) does not necessarily hold (i.e. vec1's edited portions may be partially re-updated in subsequent iterations). As a consequence, the above is by far the slowest portion of code in my script and I would like to speed it up, if possible.
Is there any way to vectorize the above for loop in order to make it run faster? Or, are there any alternative approaches to improve its execution speed?
EDIT:
As requested in the comments, here are some example values: Using the profiler to check execution times, the loop above runs in about 3.3 s with N=5e4, length(vec1)=3e6, length(vec2)=1.7e3 and the portions indexed by idx_start/end are slightly shorter on average than the latter, although not significantly.
Of course, 3.3 s is not particularly worrying in itself, but I would like to be able to increase N and the length of vec1 in particular by 1 or 2 orders of magnitude, and with such a loop it will take considerably longer to run.
Sorry, I couldn't find a way to speed up your code. This is the code I created to try to speed it up:
N = 5e4;
vec1 = 1:3e6;
vec2 = 1:1.7e3;
rng(0)
a = randn(N, 1);
idx_start1 = randi([1, 2.9e6], N, 1);
idx_end1 = idx_start1 + 1000;
idx_start2 = randi([1, 0.6e3], N, 1);
idx_end2 = idx_start2 + 1000;
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) + a(k) * vec2(idx_start2(k):idx_end2(k));
% use = idx_start1(k):idx_end1(k);
% vec1(use) = vec1(use) + a(k) * vec2(idx_start2(k):idx_end2(k));
end
The two commented-out lines of code in the for loop were my attempt to speed it up, but it actually made it slower, much to my surprise. Generally, I would create a variable for an array that is used more than once thinking that is faster, but it is not. The code that is not commented out runs in 0.24 s versus 0.67 seconds for the code that is commented out.

Julia: use of pmap with Arrays vs SharedArrays

I have been working in Julia for a few months now and I am interested in writing some of my code in parallel. I am working on a problem where I use 1 model to generate data for several different receivers (the data for each receiver is a vector). The data for each receiver can be computed independently, which leads me to believe I should be able to use the pmap function. My plan is to initialize the data as a 2D SharedArray (each column represents the data for 1 receiver) and then have pmap loop over each of the columns. However, I am finding that using SharedArrays with pmap is no faster than working in serial using map. I wrote the following dummy code to illustrate this point.
@everywhere function Dummy(icol,model,data,A,B)
nx = 250
nz = 250
nh = 50
for ih = 1:nh
for ix = 1:nx
for iz = 1:nz
data[iz,icol] += A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih]
end
end
end
end
function main()
nx = 250
nz = 250
nh = 50
nt = 500
ncol = 100
model1 = rand(nz,nx,nh)
model2 = copy(model1)
model3 = convert(SharedArray,model1)
data1 = zeros(Float64,nt,ncol)
data2 = SharedArray(Float64,nt,ncol)
data3 = SharedArray(Float64,nt,ncol)
A1 = rand(nz,nx,nh)
A2 = copy(A1)
A3 = convert(SharedArray,A1)
B1 = rand(nz,nx,nh)
B2 = copy(B1)
B3 = convert(SharedArray,B1)
@time map((arg)->Dummy(arg,model1,data1,A1,B1),[icol for icol = 1:ncol])
@time pmap((arg)->Dummy(arg,model2,data2,A2,B2),[icol for icol = 1:ncol])
@time pmap((arg)->Dummy(arg,model3,data3,A3,B3),[icol for icol = 1:ncol])
println(data1==data2)
println(data1==data3)
end
main()
I start the Julia session with julia -p 3 and run the script. The times for the 3 tests are 1.4s, 4.7s, and 1.6s respectively. Using SharedArrays with pmap (1.6s runtime) hasn't provided any improvement in speed compared with regular Arrays with map (1.4s). I am also confused as to why the 2nd case (data as a SharedArray, all other inputs as regular Arrays, with pmap) is so slow. What do I need to change in order to benefit from working in parallel?
Preface: yes, there actually is a solution to your issue. See code at bottom for that. But, before I get there, I'll go into some explanation.
I think the root of the problem here is memory access. First off, although I haven't rigorously investigated it, I suspect that there are a moderate number of improvements that could be made to Julia's underlying code in order to improve the way that it handles memory access in parallel processing. Nevertheless, in this case, I suspect that any underlying issues with the base code, if such actually exist, aren't so much at fault. Instead, I think it is useful to think carefully about what exactly is going on in your code and what it means vis-a-vis memory access.
A key thing to keep in mind when working in Julia is that it stores Arrays in column-major order. That is, it stores them as stacks of columns on top of each other. This generalizes to dimensions > 2 as well. See this very helpful segment of the Julia performance tips for more info. The implication of this is that it is fast to access one row after another after another in a single column. But, if you need to be jumping around columns, then you get into trouble. Yes, accessing RAM might be relatively fast, but accessing cache memory is much, much faster, and so if your code allows for a single column or so to be loaded from RAM into cache and then worked on, then you'll do much better than if you need to be doing lots of swapping between RAM and cache.
Here in your code, you're switching from column to column between your computations like nobody's business. For instance, in your pmap each process gets a different column of the shared array to work on. Then, each goes down the rows of that column and modifies the values in it. But, since they are trying to work in parallel with one another, and the whole array is too big to fit into your cache, there is lots of swapping between RAM and cache that happens and that really slows you down. In theory, perhaps a clever enough under-the-hood memory management system could be devised to address this, but I don't really know - that goes beyond my pay grade. The same thing of course is happening to your accesses to your other objects.
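To make the column-major point concrete, here is a tiny Python sketch of how a (row, col) index maps to a linear memory position under that layout. The function name column_major_offset and the 4 by 3 shape are made up for illustration, and indices are 0-based here, unlike Julia.

```python
def column_major_offset(row, col, n_rows):
    """Linear position of element (row, col), 0-based, in a column-major array."""
    return col * n_rows + row

n_rows, n_cols = 4, 3

# Walking down one column touches consecutive memory positions.
down_column = [column_major_offset(r, 0, n_rows) for r in range(n_rows)]
# Walking along one row jumps by n_rows positions each step.
along_row = [column_major_offset(0, c, n_rows) for c in range(n_cols)]

print(down_column)  # [0, 1, 2, 3]
print(along_row)    # [0, 4, 8]
```

The stride of n_rows between neighbouring columns is why loops are fastest when the first index varies in the innermost loop.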
Another thing to keep in mind in general when parallelizing is your ratio of flops (i.e. computer calculations) to read/write operations. Flops tend to parallelize well: you can have different cores, processes, etc. doing their own little computations on their own bits of data that they hold in their tiny caches. But read/write operations don't parallelize so well. There are some things that can be done to engineer hardware systems to improve on this. But in general, if you have a given computer system with, say, two cores, and you add four more cores to it, your ability to perform flops will increase threefold, but your ability to read/write data to/from RAM won't really improve so much. (Note: this is an oversimplification, a lot depends on your system.) Nevertheless, in general, the higher your ratio of flops to read/writes, the more benefit you can expect from parallelism. In your case, your code involves a decent number of read/writes (all of those accesses to your different arrays) for a relatively small number of flops (a few multiplications and an addition). It's just something to keep in mind.
Fortunately, your case is amenable to some good speedups from parallelism if written correctly. In my experience with Julia, all of my most successful parallelism comes when I can break data up and have workers process chunks of it separately. Your case happens to be amenable to that. Below is an example of some code I wrote that does that. You can see that it gets a bit more than a 2x increase in speed going from one processor to three (2.12 s versus 0.92 s in the timings below). The code is a bit crude in places, but it at least demonstrates the general idea of how something like this could be approached. I give a few comments on the code afterwards.
addprocs(3)
nx = 250;
nz = 250;
nh = 50;
nt = 250;
@everywhere ncol = 100;
model = rand(nz,nx,nh);
data = SharedArray(Float64,nt,ncol);
A = rand(nz,nx,nh);
B = rand(nz,nx,nh);
function distribute_data(X, obj_name_on_worker::Symbol, dim)
size_per_worker = floor(Int,size(X,1) / nworkers())
StartIdx = 1
EndIdx = size_per_worker
for (idx, pid) in enumerate(workers())
if idx == nworkers()
EndIdx = size(X,1)
end
println(StartIdx:EndIdx)
if dim == 3
@spawnat(pid, eval(Main, Expr(:(=), obj_name_on_worker, X[StartIdx:EndIdx,:,:])))
elseif dim == 2
@spawnat(pid, eval(Main, Expr(:(=), obj_name_on_worker, X[StartIdx:EndIdx,:])))
end
StartIdx = EndIdx + 1
EndIdx = EndIdx + size_per_worker - 1
end
end
distribute_data(model, :model, 3)
distribute_data(A, :A, 3)
distribute_data(B, :B, 3)
distribute_data(data, :data, 2)
@everywhere function Dummy(icol,model,data,A,B)
nx = size(model, 2)
nz = size(A,1)
nh = size(model, 3)
for ih = 1:nh
for ix = 1:nx
for iz = 1:nz
data[iz,icol] += A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih]
end
end
end
end
regular_test() = map((arg)->Dummy(arg,model,data,A,B),[icol for icol = 1:ncol])
function parallel_test()
@everywhere begin
if myid() != 1
map((arg)->Dummy(arg,model,data,A,B),[icol for icol = 1:ncol])
end
end
end
@time regular_test(); # 2.120631 seconds (307 allocations: 11.313 KB)
@time parallel_test(); # 0.918850 seconds (5.70 k allocations: 337.250 KB)
getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm)))
function recombine_data(Data::Symbol)
Results = cell(nworkers())
for (idx, pid) in enumerate(workers())
Results[idx] = getfrom(pid, Data)
end
return vcat(Results...)
end
@time P_Data = recombine_data(:data); # 0.003132 seconds
P_Data == data ## true
Comments
The use of the SharedArray is quite superfluous here. I just used it since it lends itself easily to being modified in place, which is how your code is originally written. This let me work more directly based on what you wrote without modifying it as much.
I didn't include the step to bring the data back in the time trial, but as you can see, it's quite a trivial amount of time in this case. In other situations, it might be less trivial, but data movement is just one of those issues that you face with parallelism.
When doing time trials in general, it is considered best practice to run the function once (in order to compile the code) and then run it again to get the times. That's what I did here.
See this SO post for where I got inspiration for some of these functions that I used here.

Parallelising gradient calculation in Julia

I was persuaded some time ago to drop my comfortable MATLAB programming and start programming in Julia. I have been working with neural networks for a long time and I thought that, now with Julia, I could get things done faster by parallelising the calculation of the gradient.
The gradient need not be calculated on the entire dataset in one go; instead one can split the calculation. For instance, by splitting the dataset in parts, we can calculate a partial gradient on each part. The total gradient is then calculated by adding up the partial gradients.
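The splitting idea can be checked on a toy example. The Python sketch below is an illustration only: the function gradient and the small data set are made up, and it uses the same per-item least-squares gradient term, x_n' * (y_n - x_n * w), as the linear regression example further down. It computes the gradient once over all rows and once as the sum of two partial gradients.

```python
def gradient(X, Y, w, first, last):
    """Gradient contribution of rows first..last-1 (0-based, end exclusive)."""
    grad = [0.0] * len(w)
    for n in range(first, last):
        residual = Y[n] - sum(x * wi for x, wi in zip(X[n], w))
        for d in range(len(w)):
            grad[d] += X[n][d] * residual
    return grad

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
Y = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -0.5]

full = gradient(X, Y, w, 0, 4)       # whole data set in one go
part_a = gradient(X, Y, w, 0, 2)     # first half
part_b = gradient(X, Y, w, 2, 4)     # second half
combined = [a + b for a, b in zip(part_a, part_b)]
print(full, combined)
```

Because the gradient is a sum over data items, the two results agree, and each part could in principle be computed by a different worker.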
Though the principle is simple, when I parallelise with Julia I get a performance degradation, i.e. one process is faster than two processes! I am obviously doing something wrong... I have consulted other questions asked in the forum but I still could not piece together an answer. I think my problem is that there is a lot of unnecessary data movement going on, but I can't fix it properly.
In order to avoid posting messy neural network code, I am posting below a simpler example that replicates my problem in the setting of linear regression.
The code-block below creates some data for a linear regression problem. The code explains the constants, but X is the matrix containing the data inputs. We randomly create a weight vector w which when multiplied with X creates some targets Y.
######################################
## CREATE LINEAR REGRESSION PROBLEM ##
######################################
# This code implements a simple linear regression problem
MAXITER = 100 # number of iterations for simple gradient descent
N = 10000 # number of data items
D = 50 # dimension of data items
X = randn(N, D) # create random matrix of data, data items appear row-wise
Wtrue = randn(D,1) # create arbitrary weight matrix to generate targets
Y = X*Wtrue # generate targets
The next code-block below defines functions for measuring the fitness of our regression (i.e. the negative log-likelihood) and the gradient of the weight vector w:
####################################
## DEFINE FUNCTIONS ##
####################################
@everywhere begin
#-------------------------------------------------------------------
function negative_loglikelihood(Y,X,W)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here log-likelihood
ll = 0
for nn=1:N
ll = ll - 0.5*sum((Y[nn,:] - X[nn,:]*W).^2)
end
return ll
end
#-------------------------------------------------------------------
function negative_loglikelihood_grad(Y,X,W, first_index,last_index)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here gradient contributions by each data item
grad = zeros(similar(W))
for nn=first_index:last_index
grad = grad + X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
end
return grad
end
end
Note that the above functions are on purpose not vectorised! I choose not to vectorise, as the final code (the neural network case) will also not admit any vectorisation (let us not get into more details regarding this).
Finally, the code-block below shows a very simple gradient descent that tries to recover the parameter weight vector w from the given data Y and X:
####################################
## SOLVE LINEAR REGRESSION ##
####################################
# start from random initial solution
W = randn(D,1)
# learning rate, set here to some arbitrary small constant
eta = 0.000001
# the following for-loop implements simple gradient descent
for iter=1:MAXITER
# get gradient
ref_array = Array(RemoteRef, nworkers())
# let each worker process part of matrix X
for index=1:length(workers())
# first index of subset of X that worker should work on
first_index = (index-1)*int(ceil(N/nworkers())) + 1
# last index of subset of X that worker should work on
last_index = min((index)*(int(ceil(N/nworkers()))), N)
ref_array[index] = @spawn negative_loglikelihood_grad(Y,X,W, first_index,last_index)
end
# gather the gradients calculated on parts of matrix X
grad = zeros(similar(W))
for index=1:length(workers())
grad = grad + fetch(ref_array[index])
end
# now that we have the gradient we can update parameters W
W = W + eta*grad;
# report progress, monitor optimisation
@printf("Iter %d neg_loglikel=%.4f\n",iter, negative_loglikelihood(Y,X,W))
end
As is hopefully visible, I tried to parallelise the calculation of the gradient in the easiest possible way here. My strategy is to break the calculation of the gradient in as many parts as available workers. Each worker is required to work only on part of matrix X, which part is specified by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For instance, for 4 workers and N = 10000, the work should be divided as follows:
worker 1 => first_index = 1, last_index = 2500
worker 2 => first_index = 2501, last_index = 5000
worker 3 => first_index = 5001, last_index = 7500
worker 4 => first_index = 7501, last_index = 10000
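The index arithmetic behind this split can be written out on its own. The Python sketch below mirrors the first_index and last_index formulas from the loop above, keeping the 1-based, inclusive convention of the Julia code; the function name worker_range is made up for illustration.

```python
import math

def worker_range(index, n_items, n_workers):
    """1-based inclusive (first_index, last_index) for worker `index` (1-based)."""
    chunk = math.ceil(n_items / n_workers)
    first_index = (index - 1) * chunk + 1
    last_index = min(index * chunk, n_items)
    return first_index, last_index

ranges = [worker_range(i, 10000, 4) for i in range(1, 5)]
print(ranges)  # [(1, 2500), (2501, 5000), (5001, 7500), (7501, 10000)]
```

When N is not divisible by the number of workers, the min keeps the last range from running past N, so the last worker simply gets a shorter chunk.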
Unfortunately, this entire code runs faster if I have only one worker. If I add more workers via addprocs(), the code runs slower. One can aggravate this issue by creating more data items, for instance by using N=20000 instead.
With more data items, the degradation is even more pronounced.
In my particular computing environment with N=20000 and one core, the code runs in ~9 secs. With N=20000 and 4 cores it takes ~18 secs!
I tried many many different things inspired by the questions and answers in this forum but unfortunately to no avail. I realise that the parallelisation is naive and that data movement must be the problem, but I have no idea how to do it properly. It seems that the documentation is also a bit scarce on this issue (as is the nice book by Ivo Balbaert).
I would appreciate your help as I have been stuck for quite some while with this and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copying-pasting you can get the code here.
Thanks for taking the time to read this very lengthy question! Help me turn this into a model answer that anyone new in Julia can then consult!
I would say that GD is not a good candidate for parallelization using any of the proposed methods: SharedArray, DistributedArray, or your own implementation of distributing chunks of data.
The problem does not lie in Julia, but in the GD algorithm.
Consider the code:
Main process:
for iter = 1:iterations #iterations: "the more the better"
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
The problem is in the above for-loop which is a must. No matter how good _gradient_descent_shared is, the total number of iterations kills the noble concept of the parallelization.
After reading the question and the above suggestion I've started implementing GD using SharedArray. Please note, I'm not an expert in the field of SharedArrays.
The main process parts (simple implementation without regularization):
run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
N = length(y)
for iter = 1:iterations
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
θ
end
_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin
if size(X,1) <= length(procs(X))
return _gradient_descent_serial(X, y, θ)
else
rrefs = map(p -> (@spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
return mapreduce(r -> fetch(r), op, rrefs)
end
end
The code common to all workers:
#= Returns the range of indices of a chunk for every worker on which it can work.
The function splits data examples (N rows into chunks),
not the parts of the particular example (features dimensionality remains intact).=#
@everywhere function _worker_range(S::SharedArray)
idx = indexpids(S)
if idx == 0
return 1:size(S,1), 1:size(S,2)
end
nchunks = length(procs(S))
splits = [round(Int, s) for s in linspace(0,size(S,1),nchunks+1)]
splits[idx]+1:splits[idx+1], 1:size(S,2)
end
#Computations on the chunk of the all data.
@everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
prange = _worker_range(X)
pX = sdata(X[prange[1], prange[2]])
py = sdata(y[prange[1],:])
tempδ = pX' * (pX * sdata(θ) .- py)
end
The data loading and training. Let me assume that we have:
features in X::Array of the size (N,D), where N - number of examples, D-dimensionality of the features
labels in y::Array of the size (N,1)
The main code might look like this:
X=[ones(size(X,1)) X] #adding the artificial coordinate
N, D = size(X)
MAXITER = 500
α = 0.01
initialθ = SharedArray(Float64, (D,1))
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)
X = nothing
y = nothing
gc()
finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);
After implementing and running this (on 8 cores of my Intel Core i7), I got only a very slight speedup over serial GD (1 core) on my multiclass (19 classes) training data (715 sec for serial GD / 665 sec for shared GD).
If my implementation is correct (please check this out - I'm counting on that), then parallelizing the GD algorithm is not worth it. You might well get a better speedup using stochastic GD on 1 core.
If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector, and pass it as an argument to each worker. Each worker sets a chunk of it, just as you suggested.

Most efficient way to weight and sum a number of matrices in Fortran

I am trying to write a function in Fortran that multiplies a number of matrices with different weights and then adds them together to form a single matrix. I have identified that this process is the bottleneck in my program (this weighting will be made many times for a single run of the program, with different weights). Right now I'm trying to make it run faster by switching from Matlab to Fortran. I am a newbie at Fortran so I appreciate all help.
In Matlab the fastest way I have found to make such a computation looks like this:
function B = weight_matrices()
n = 46;
m = 1800;
A = rand(n,m,m);
w = rand(n,1);
tic;
B = squeeze(sum(bsxfun(@times,w,A),1));
toc;
The line where B is assigned runs in about 0.9 seconds on my machine (Matlab R2012b, MacBook Pro 13" retina, 2.5 GHz Intel Core i5, 8 GB 1600 MHz DDR3). It should be noted that for my problem, the tensor A will be the same (constant) for the whole run of the program (after initialization), but w can take any values. Also, typical values of n and m are used here, meaning that the tensor A will have a size of about 1 GB in memory.
The clearest way I can think of writing this in Fortran is something like this:
pure function weight_matrices(w,A) result(B)
implicit none
integer, parameter :: n = 46
integer, parameter :: m = 1800
double precision, dimension(n), intent(in) :: w
double precision, dimension(n,m,m), intent(in) :: A
double precision, dimension(m,m) :: B
integer :: i
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
end function weight_matrices
This function runs in about 1.4 seconds when compiled with gfortran 4.7.2, using -O3 (function call timed with "call cpu_time(t)"). If I manually unwrap the loop into
B = w(1)*A(1,:,:)+w(2)*A(2,:,:)+ ... + w(46)*A(46,:,:)
the function takes about 0.11 seconds to run instead. This is great and means that I get a speedup of about 8 times compared to the Matlab version. However, I still have some questions on readability and performance.
First, I wonder if there is an even faster way to perform this weighting and summing of matrices. I have looked through BLAS and LAPACK, but can't find any function that seems to fit. I have also tried to put the dimension in A that enumerates the matrices as the last dimension (i.e. switching from (i,j,k) to (k,i,j) for the elements), but this resulted in slower code.
Second, this fast version is not very flexible, and actually looks quite ugly, since it is so much text for such a simple computation. For the tests I am running I would like to try to use different numbers of weights, so that the length of w will vary, to see how it affects the rest of my algorithm. However, that means a quite tedious rewrite of the assignment of B every time. Is there any way to make this more flexible, while keeping the performance the same (or better)?
Third, the tensor A will, as mentioned before, be constant during the run of the program. I have set constant scalar values in my program using the "parameter" attribute in their own module, importing them with the "use" expression into the functions/subroutines that need them. What is the best way to do the equivalent thing for the tensor A? I want to tell the compiler that this tensor will be constant, after init., so that any corresponding optimizations can be done. Note that A is typically ~1 GB in size, so it is not practical to enter it directly in the source file.
Thank you in advance for any input! :)
Perhaps you could try something like
do k=1,m
do j=1,m
B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
enddo
enddo
The square brace is a newer form of (/ /), the 1d matrix (vector). The term in sum is a matrix of dimension (n) and sum sums all of those elements. This is precisely what your unwrapped code does (and is not exactly equal to the do loop you have).
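As a plain illustration of what that expression computes, B(j,k) = sum over i of w(i)*A(i,j,k), here is a small Python sketch with nested lists standing in for the arrays. The data values are made up, indices are 0-based, and it says nothing about the performance of the Fortran version.

```python
def weight_matrices(w, A):
    """Return B with B[j][k] = sum over i of w[i] * A[i][j][k]."""
    n = len(w)
    rows, cols = len(A[0]), len(A[0][0])
    return [[sum(w[i] * A[i][j][k] for i in range(n)) for k in range(cols)]
            for j in range(rows)]

# Two 2x2 matrices stacked along the first index, as in A(n,m,m).
A = [[[1.0, 2.0], [3.0, 4.0]],
     [[10.0, 20.0], [30.0, 40.0]]]
w = [2.0, 0.5]

B = weight_matrices(w, A)
print(B)  # [[7.0, 14.0], [21.0, 28.0]]
```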
I tried to refine Kyle Vanos' solution.
Therefore I decided to use sum and Fortran's vector capabilities.
I don't know, if the results are correct, because I only looked for the timings!
Version 1: (for comparison)
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
Version 2: (from Kyle Vanos)
do k=1,m
do j=1,m
B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
enddo
enddo
Version 3: (mixed-up indices, work on one row/column at a time)
do j = 1, m
B(:,j)=sum( [ ( (w(i)*A(:,i,j)), i=1,n) ], dim=1)
enddo
Version 4: (complete matrices)
B=sum( [ ( (w(i)*A(:,:,i)), i=1,n) ], dim=1)
Timing
As you can see, I had to mix up the indices to get faster execution times. The third solution is really strange because the number of the matrix is the middle index, but this is necessary for memory-order reasons.
V1: 1.30s
V2: 0.16s
V3: 0.02s
V4: 0.03s
In conclusion, I would say that you can get a massive speedup if you are able to change the order of the matrix indices freely.
I would not hide any looping as this is usually slower. You can write it explicitly, then you'll see that the inner loop access is over the last index, making it inefficient. So, you should make sure your n dimension is the last one by storing A as A(m,m,n):
B = 0
do i = 1,n
w_tmp = w(i)
do j = 1,m
do k = 1,m
B(k,j) = B(k,j) + w_tmp*A(k,j,i)
end do
end do
end do
this should be much more efficient as you are now accessing consecutive elements in memory in the inner loop.
Another solution is to use the level 1 BLAS subroutines _AXPY (y = a*x + y):
B = 0
do i = 1,n
CALL DAXPY(m*m, w(i), A(1,1,i), 1, B(1,1), 1)
end do
With Intel MKL this should be more efficient, but again you should make sure the last index is the one which changes in the outer loop (in this case the loop you're writing). You can find the necessary arguments for this call here: MKL
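To spell out what the repeated AXPY calls accumulate, here is a short Python sketch of the same update, y = a*x + y, applied once per weight. The function axpy below is a hand-written stand-in for illustration, not a binding to BLAS, and each matrix is stored as a flat list to mimic the contiguous m*m block that A(1,1,i) points at.

```python
def axpy(a, x, y):
    """In-place y <- a*x + y on flat lists, mirroring the level 1 BLAS AXPY operation."""
    for idx in range(len(x)):
        y[idx] += a * x[idx]

# Two 2x2 matrices stored as flat lists (one contiguous block per matrix).
matrices = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
weights = [2.0, 0.5]

B = [0.0] * 4
for w_i, A_i in zip(weights, matrices):
    axpy(w_i, A_i, B)
print(B)  # [7.0, 14.0, 21.0, 28.0]
```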
EDIT: you might also want to use some parallelization? (I don't know if Matlab takes advantage of that)
EDIT2: In the answer of Kyle, the inner loop is over different values of w, which is more efficient than n times reloading B as w can be kept in cache (using A(n,m,m)):
B = 0
do i = 1,m
do j = 1,m
B(j,i)=0.0d0
do k = 1,n
B(j,i) = B(j,i) + w(k)*A(k,j,i)
end do
end do
end do
This explicit looping performs about 10% better than Kyle's code, which uses whole-array operations. Bandwidth with ifort -O3 -xHost is ~6600 MB/s, with gfortran -O3 it's ~6000 MB/s, and the whole-array version with either compiler is also around 6000 MB/s.
I know this is an old post, however I will be glad to bring my contribution as I played with most of the posted solutions.
Adding a local unroll for the weights loop (from Steabert's answer) gives me a little speed-up compared to the complete unroll version (from 10% to 80% with different sizes of the matrices). The partial unrolling may help the compiler to vectorize the 4 operations in one SSE call.
pure function weight_matrices_partial_unroll_4(w,A) result(B)
implicit none
integer, parameter :: n = 46
integer, parameter :: m = 1800
real(8), intent(in) :: w(n)
real(8), intent(in) :: A(n,m,m)
real(8) :: B(m,m)
real(8) :: Btemp(4)
integer :: i, j, k, l, ndiv, nmod, roll
!==================================================
roll = 4
ndiv = n / roll
nmod = mod( n, roll )
do i = 1,m
do j = 1,m
B(j,i)=0.0d0
k = 1
do l = 1,ndiv
Btemp(1) = w(k )*A(k ,j,i)
Btemp(2) = w(k+1)*A(k+1,j,i)
Btemp(3) = w(k+2)*A(k+2,j,i)
Btemp(4) = w(k+3)*A(k+3,j,i)
k = k + roll
B(j,i) = B(j,i) + sum( Btemp )
end do
do l = 1,nmod !---- process the rest of the loop
B(j,i) = B(j,i) + w(k)*A(k,j,i)
k = k + 1
enddo
end do
end do
end function
