Julia: FAST way of calculating the smallest distances between two sets of points - performance

I have 5000 3D points in a matrix A and another 5000 3D points in a matrix B.
For each point in A I want to find the smallest distance to a point in B. These distances should be stored in an array with 5000 entries.
So far I have this solution, running in about 0.145342 seconds (23 allocations: 191.079 MiB). How can I improve this further?
using Distances
A = rand(5000, 3)
B = rand(5000, 3)
mis = @time minimum(Distances.pairwise(SqEuclidean(), A, B, dims=1), dims=2)

This is a standard way to do it as it will have a better time complexity (especially for larger data):
using NearestNeighbors
nn(KDTree(B'; leafsize = 10), A')[2] .^ 2
Two comments:
by default the Euclidean distance is computed (so I square it)
by default NearestNeighbors.jl assumes observations are stored in columns (so I need B' and A' in the solution; if your original data were transposed this would not be needed; it is designed this way because Julia uses column-major matrix storage)

Generating a big distance matrix using Distances.pairwise(SqEuclidean(), A, B, dims=1) is not efficient because main memory is slow compared to CPU caches and to the computing power of modern CPUs, and this gap is not going to close any time soon (see "memory wall"). It is faster to compute the minimum on the fly using two basic nested for loops. Additionally, one can use multiple threads to spread the computation over several cores.
function computeMinDist(A, B)
    n, m = size(A, 1), size(B, 1)
    result = zeros(n)
    Threads.@threads for i = 1:n
        minSqDist = Inf
        @inbounds for j = 1:m
            dx = A[i,1] - B[j,1]
            dy = A[i,2] - B[j,2]
            dz = A[i,3] - B[j,3]
            sqDist = dx*dx + dy*dy + dz*dz
            if sqDist < minSqDist
                minSqDist = sqDist
            end
        end
        result[i] = minSqDist
    end
    return result
end
mis = @time computeMinDist(A, B)
Note that Julia uses 1 thread by default, but this can be changed by setting the environment variable JULIA_NUM_THREADS=auto or by starting Julia with the flag --threads=auto. See the multi-threading documentation for more information.
Performance results
Here are performance results on my i5-9600KF machine with 6 cores (with two 5000x3 matrices):
Initial implementation: 93.4 ms
This implementation: 4.4 ms
This implementation is thus 21 times faster.
The results agree to within a few ULP.
Note that the code can certainly be optimized further using loop tiling, and possibly by transposing A and B so the JIT can generate a more efficient implementation using SIMD instructions.
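To illustrate the idea of computing the minimum on the fly, here is a minimal single-threaded sketch in Python (used only for illustration; the answer's code is Julia). The names min_sq_dists and min_sq_dists_via_matrix are made up for this example. The first function keeps one running minimum per point of A, the second builds the whole distance matrix first; both should give the same values.

```python
import random

def min_sq_dists(A, B):
    """For each point in A, smallest squared distance to any point in B."""
    result = []
    for ax, ay, az in A:
        best = float("inf")
        for bx, by, bz in B:
            dx, dy, dz = ax - bx, ay - by, az - bz
            d = dx * dx + dy * dy + dz * dz
            if d < best:
                best = d  # keep only the running minimum, no matrix is stored
        result.append(best)
    return result

def min_sq_dists_via_matrix(A, B):
    """Reference version that materialises the full distance matrix first."""
    full = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in B] for p in A]
    return [min(row) for row in full]

random.seed(0)
A = [[random.random() for _ in range(3)] for _ in range(50)]
B = [[random.random() for _ in range(3)] for _ in range(60)]
fast = min_sq_dists(A, B)
ref = min_sq_dists_via_matrix(A, B)
```

The on-the-fly version needs memory proportional to the number of points in A only, whereas the matrix version needs memory proportional to the product of both sizes.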


Parallelising gradient calculation in Julia

I was persuaded some time ago to drop my comfortable MATLAB programming and start programming in Julia. I have been working with neural networks for a long time and I thought that, now with Julia, I could get things done faster by parallelising the calculation of the gradient.
The gradient need not be calculated on the entire dataset in one go; instead one can split the calculation. For instance, by splitting the dataset in parts, we can calculate a partial gradient on each part. The total gradient is then calculated by adding up the partial gradients.
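A small sketch of this splitting idea, written in plain Python for illustration only (the function name partial_gradient and the toy sizes are assumptions of this example): the gradient over all rows equals the sum of the gradients over disjoint row ranges.

```python
import random

def partial_gradient(X, Y, w, first, last):
    """Gradient contribution sum_n x_n * (y_n - x_n . w) over rows first..last-1."""
    D = len(w)
    grad = [0.0] * D
    for n in range(first, last):
        residual = Y[n] - sum(X[n][d] * w[d] for d in range(D))
        for d in range(D):
            grad[d] += X[n][d] * residual
    return grad

random.seed(1)
N, D = 200, 5  # toy sizes, much smaller than in a real problem
X = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
w_true = [random.gauss(0, 1) for _ in range(D)]
Y = [sum(x[d] * w_true[d] for d in range(D)) for x in X]
w = [0.0] * D

# Gradient on the whole dataset in one go
full = partial_gradient(X, Y, w, 0, N)

# Gradient computed on four parts and then added up
chunks = [(0, 50), (50, 100), (100, 150), (150, 200)]
parts = [partial_gradient(X, Y, w, a, b) for a, b in chunks]
summed = [sum(p[d] for p in parts) for d in range(D)]
```

Up to floating-point rounding, full and summed hold the same values, which is what makes the per-part calculation a candidate for parallel execution.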
Though the principle is simple, when I parallelise with Julia I get a performance degradation, i.e. one process is faster than two processes! I am obviously doing something wrong... I have consulted other questions asked in the forum but I could still not piece together an answer. I think my problem is that there is a lot of unnecessary data movement going on, but I can't fix it properly.
In order to avoid posting messy neural network code, I am posting below a simpler example that replicates my problem in the setting of linear regression.
The code-block below creates some data for a linear regression problem. The code explains the constants, but X is the matrix containing the data inputs. We randomly create a weight vector w which when multiplied with X creates some targets Y.
# This code implements a simple linear regression problem
MAXITER = 100 # number of iterations for simple gradient descent
N = 10000 # number of data items
D = 50 # dimension of data items
X = randn(N, D) # create random matrix of data, data items appear row-wise
Wtrue = randn(D,1) # create arbitrary weight matrix to generate targets
Y = X*Wtrue # generate targets
The next code-block below defines functions for measuring the fitness of our regression (i.e. the negative log-likelihood) and the gradient of the weight vector w:
@everywhere begin
    function negative_loglikelihood(Y,X,W)
        # number of data items
        N = size(X,1)
        # accumulate here log-likelihood
        ll = 0
        for nn=1:N
            ll = ll - 0.5*sum((Y[nn,:] - X[nn,:]*W).^2)
        end
        return ll
    end
    function negative_loglikelihood_grad(Y,X,W, first_index,last_index)
        # number of data items
        N = size(X,1)
        # accumulate here gradient contributions by each data item
        grad = zeros(similar(W))
        for nn=first_index:last_index
            grad = grad + X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
        end
        return grad
    end
end
Note that the above functions are on purpose not vectorised! I choose not to vectorise, as the final code (the neural network case) will also not admit any vectorisation (let us not get into more details regarding this).
Finally, the code-block below shows a very simple gradient descent that tries to recover the parameter weight vector w from the given data Y and X:
# start from random initial solution
W = randn(D,1)
# learning rate, set here to some arbitrary small constant
eta = 0.000001
# the following for-loop implements simple gradient descent
for iter=1:MAXITER
    # get gradient
    ref_array = Array(RemoteRef, nworkers())
    # let each worker process part of matrix X
    for index=1:length(workers())
        # first index of subset of X that worker should work on
        first_index = (index-1)*int(ceil(N/nworkers())) + 1
        # last index of subset of X that worker should work on
        last_index = min((index)*(int(ceil(N/nworkers()))), N)
        ref_array[index] = @spawn negative_loglikelihood_grad(Y,X,W, first_index,last_index)
    end
    # gather the gradients calculated on parts of matrix X
    grad = zeros(similar(W))
    for index=1:length(workers())
        grad = grad + fetch(ref_array[index])
    end
    # now that we have the gradient we can update parameters W
    W = W + eta*grad;
    # report progress, monitor optimisation
    @printf("Iter %d neg_loglikel=%.4f\n",iter, negative_loglikelihood(Y,X,W))
end
As is hopefully visible, I tried to parallelise the calculation of the gradient in the easiest possible way here. My strategy is to break the calculation of the gradient in as many parts as available workers. Each worker is required to work only on part of matrix X, which part is specified by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For instance, for 4 workers and N = 10000, the work should be divided as follows:
worker 1 => first_index = 1, last_index = 2500
worker 2 => first_index = 2501, last_index = 5000
worker 3 => first_index = 5001, last_index = 7500
worker 4 => first_index = 7501, last_index = 10000
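The index arithmetic behind this split can be checked in isolation. The following Python sketch (illustration only; chunk_bounds is a made-up helper name) reproduces the first_index / last_index formulas from the loop above with 1-based indices:

```python
import math

def chunk_bounds(N, nworkers):
    """1-based (first_index, last_index) pairs, one per worker."""
    size = math.ceil(N / nworkers)  # rows per worker, rounded up
    bounds = []
    for index in range(1, nworkers + 1):
        first_index = (index - 1) * size + 1
        last_index = min(index * size, N)  # the last chunk may be shorter
        bounds.append((first_index, last_index))
    return bounds

print(chunk_bounds(10000, 4))
# [(1, 2500), (2501, 5000), (5001, 7500), (7501, 10000)]
```

When N is not a multiple of the number of workers, the min(...) caps the last chunk at N, e.g. chunk_bounds(10, 4) gives chunks of 3, 3, 3 and 1 rows.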
Unfortunately, this entire code works faster if I have only one worker. If I add more workers via addprocs(), the code runs slower. One can aggravate this issue by creating more data items, for instance by using N=20000 instead.
With more data items, the degradation is even more pronounced.
In my particular computing environment with N=20000 and one core, the code runs in ~9 secs. With N=20000 and 4 cores it takes ~18 secs!
I tried many many different things inspired by the questions and answers in this forum but unfortunately to no avail. I realise that the parallelisation is naive and that data movement must be the problem, but I have no idea how to do it properly. It seems that the documentation is also a bit scarce on this issue (as is the nice book by Ivo Balbaert).
I would appreciate your help as I have been stuck for quite some while with this and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copying-pasting you can get the code here.
Thanks for taking the time to read this very lengthy question! Help me turn this into a model answer that anyone new in Julia can then consult!
I would say that GD is not a good candidate for parallelizing it using any of the proposed methods: either SharedArray or DistributedArray, or own implementation of distribution of chunks of data.
The problem does not lie in Julia, but in the GD algorithm.
Consider the code:
Main process:
for iter = 1:iterations #iterations: "the more the better"
    δ = _gradient_descent_shared(X, y, θ)
    θ = θ - α * (δ/N)
end
The problem is in the above for-loop which is a must. No matter how good _gradient_descent_shared is, the total number of iterations kills the noble concept of the parallelization.
After reading the question and the above suggestion I've started implementing GD using SharedArray. Please note, I'm not an expert in the field of SharedArrays.
The main process parts (simple implementation without regularization):
run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
    N = length(y)
    for iter = 1:iterations
        δ = _gradient_descent_shared(X, y, θ)
        θ = θ - α * (δ/N)
    end
end
_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin
    if size(X,1) <= length(procs(X))
        return _gradient_descent_serial(X, y, θ)
    end
    rrefs = map(p -> (@spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
    return mapreduce(r -> fetch(r), op, rrefs)
end
The code common to all workers:
#= Returns the range of indices of a chunk for every worker on which it can work.
The function splits data examples (N rows into chunks),
not the parts of the particular example (features dimensionality remains intact).=#
@everywhere function _worker_range(S::SharedArray)
    idx = indexpids(S)
    if idx == 0
        return 1:size(S,1), 1:size(S,2)
    end
    nchunks = length(procs(S))
    splits = [round(Int, s) for s in linspace(0,size(S,1),nchunks+1)]
    splits[idx]+1:splits[idx+1], 1:size(S,2)
end
# Computations on one chunk of the data.
@everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
    prange = _worker_range(X)
    pX = sdata(X[prange[1], prange[2]])
    py = sdata(y[prange[1],:])
    tempδ = pX' * (pX * sdata(θ) .- py)
end
The data loading and training. Let me assume that we have:
features in X::Array of the size (N,D), where N - number of examples, D-dimensionality of the features
labels in y::Array of the size (N,1)
The main code might look like this:
X=[ones(size(X,1)) X] #adding the artificial coordinate
N, D = size(X)
α = 0.01
initialθ = SharedArray(Float64, (D,1))
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)
X = nothing
y = nothing
finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);
After implementing this and running it (on 8 cores of my Intel Core i7) I got a very slight acceleration over serial GD (1 core) on my multiclass (19 classes) training data (715 sec for serial GD / 665 sec for shared GD).
If my implementation is correct (please check this out - I'm counting on that) then parallelization of the GD algorithm is not worth it. You would most likely get a better acceleration using stochastic GD on 1 core.
If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector, and pass it as an argument to each worker. Each worker sets a chunk of it, just as you suggested.

Looking for efficient way to perform a computation - Matlab

I have a scalar function f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2) which receives two 2-dimensional vectors as input (norm here implements the Euclidean norm). The values of x,i range in 1:w and the values y,j range in 1:h. I want to create a cell array X such that X{x,y} will contain a w x h matrix such that X{x,y}(i,j) = f([x,y],[i,j]). This can obviously be done using 4 nested loops like so:
for x=1:w
    for y=1:h
        for i=1:w
            for j=1:h
                X{x,y}(i,j) = f([x,y],[i,j]);
            end
        end
    end
end
This is however extremely inefficient. I would very much appreciate an efficient way to create X.
One way to do this is to remove the 2 innermost loops and replace them with a vectorised version. By the look of your f function this shouldn't be too bad.
First we need to construct two matrices containing the 1 to w on every row and 1 to h on every column like so
This is going to represent the inner two loops, and the transpose will allow us to get all combinations. Now we can vectorise the calculation (f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2)):
for x=1:w;
for y=1:h;
Where we have computed the Euclidean norm for all pairs of nodes in the inner loops at once.
Some discussion and code
The trick here is to perform the norm-calculations with numeric arrays and save the results into a cell array version as late as possible. For performing the norm-calculations you can take help of ndgrid, bsxfun and some permute + reshape to give it the "shape" as needed for the final cell array version. So, here's the vectorized approach to perform these tasks -
%// Create x-y/i-j values to be used for calculation of function values
[xi,yi] = ndgrid(1:w,1:h);
%// Get the norm values
normvals = sqrt(bsxfun(@minus,xi(:),xi(:).').^2 + ...
    bsxfun(@minus,yi(:),yi(:).').^2);
%// Get the actual function values
vals = exp(-normvals.^2/sigma^2);
%// Get the values into blocks of a 4D array and then re-arrange to match
%// with the shape of numeric array version of X
blks = reshape(permute(reshape(vals, w*h, h, []), [2 1 3]), h, w, h, w);
arranged_blks = reshape(permute(blks,[2 3 1 4]),w,h,w,h);
%// Finally get the cell array version
X = squeeze(mat2cell(arranged_blks,w,h,ones(1,w),ones(1,h)));
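The bookkeeping behind this approach (compute every pairwise value once in a flat table, then address it as X{x,y}(i,j)) can be sketched in plain Python. This is only an illustration of the indexing, not a port of the MATLAB code, and it gains no speed in Python; the helper names f, build_blocks and lin are assumptions of this example.

```python
import math

def f(p, q, sigma):
    """exp(-||p - q||^2 / sigma^2) for two 2D points."""
    return math.exp(-((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) / sigma ** 2)

def build_blocks(w, h, sigma):
    # Grid points in column-major order (x varies fastest), like xi(:), yi(:)
    pts = [(x, y) for y in range(1, h + 1) for x in range(1, w + 1)]
    # All pairwise function values in one (w*h) x (w*h) table
    vals = [[f(p, q, sigma) for q in pts] for p in pts]

    def lin(x, y):
        # zero-based linear index of the 1-based grid point (x, y)
        return (x - 1) + w * (y - 1)

    # X[x-1][y-1][i-1][j-1] corresponds to X{x,y}(i,j)
    return [[[[vals[lin(x, y)][lin(i, j)] for j in range(1, h + 1)]
              for i in range(1, w + 1)]
             for y in range(1, h + 1)]
            for x in range(1, w + 1)]

X = build_blocks(3, 2, 1.5)
```

Each block X[x-1][y-1] is a w-by-h table, and its entry for (i, j) equals f((x, y), (i, j), sigma) computed directly.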
Benchmarking and runtimes
After improving the original loopy code with pre-allocation for X and function-inlining of f, runtime benchmarks were performed against the proposed vectorized approach with datasizes w, h = 60, and the runtime results thus obtained were -
----------- With Improved loopy code
Elapsed time is 41.227797 seconds.
----------- With Vectorized code
Elapsed time is 2.116782 seconds.
This suggests a whopping close to 20x speedup with the proposed solution!
For extremely huge datasizes
If you are dealing with huge datasizes, there may not be enough memory for bsxfun to work with, since bsxfun is known to use a lot of memory in exchange for a performance-efficient vectorized solution. For such cases, you can use the following loopy approach to replace the normvals calculation listed in the earlier bsxfun-based solution -
%// Get the norm values
nx = numel(xi);
normvals = zeros(nx,nx);
for ii = 1:nx
    normvals(:,ii) = sqrt( (xi(:) - xi(ii)).^2 + (yi(:) - yi(ii)).^2 );
end
It seems to me that when you run through the cycle for x=w, y=h, you are calculating all the values you need at once. So you don't need recalculate them. Once you have this:
for i=1:w
for j=1:h
Then, e.g. X{1,1} is just temp(1,1), X{2,2} is just temp(1:2,1:2), and so on. If you can vectorise the calculation of f (norm here is just the Euclidean norm of that vector?) then it will get even simpler.

matlab code optimization - clustering algorithm KFCG

I have a large set of vectors (orientation data in an axis-angle representation... the axis is the vector) that I want to apply a clustering algorithm to. I tried kmeans but the computational time was too long (it never finished). So instead I am trying to implement the KFCG algorithm, which is faster (Kirke 2010):
Initially we have one cluster with the entire training vectors and the codevector C1 which is centroid. In the first iteration of the algorithm, the clusters are formed by comparing first element of training vector Xi with first element of code vector C1. The vector Xi is grouped into the cluster 1 if xi1< c11 otherwise vector Xi is grouped into cluster2 as shown in Figure 2(a) where codevector dimension space is 2. In second iteration, the cluster 1 is split into two by comparing second element Xi2 of vector Xi belonging to cluster 1 with that of the second element of the codevector. Cluster 2 is split into two by comparing the second element Xi2 of vector Xi belonging to cluster 2 with that of the second element of the codevector as shown in Figure 2(b). This procedure is repeated till the codebook size is reached to the size specified by user.
I'm unsure what ratio is appropriate for the codebook, but it shouldn't matter for the code optimization. Also note mine is 3-D so the same process is done for the 3rd dimension.
My code attempts
I've tried implementing the above algorithm in Matlab 2013 (Student Version). Here are some different structures I've tried, but they take way too long (I have never seen one complete):
%training vectors:
Atgood = Nx4 vector (see test data below if want to test);
vecA = Atgood(:,1:3);
roA = size(vecA,1);
%Codebook size, Nsel, is ratio of data
Nseltemp = remainFrac2*roA; %codebook size
%Ensure selected size after nearest power of 2 is NOT greater than roA
if 2^round(log2(Nseltemp)) < roA
    NselIter = round(log2(Nseltemp));
else
    NselIter = ceil(log2(Nseltemp)-1);
end
Nsel = 2^NselIter; %power of 2 - for LGB and other algorithms
%%cluster = cell(1,Nsel); %Unsure #rows - Don't know how to initialize if need mean...
codevec(1,1:3) = mean(vecA,1);
for kk = 1:NselIter
hh2 = 1:2:size(codevec,1)*2;
for hh1 = 1:length(hh2)
% for ii = 1:roA
% if vecA(ii,ind) < codevec(hh1,ind)
% cluster{1,hh}(count1,1:4) = Atgood(ii,:); %want all 4 elements
% count1=count1+1;
% else
% cluster{1,hh+1}(count2,1:4) = Atgood(ii,:); %want all 4
% count2=count2+1;
% end
% end
%EDIT: My ATTEMPT at optimizing above for loop:
splitind = vecA(:,ind) >= repcv;
splitind2 = vecA(:,ind) < repcv;
clear codevec
%Only mean the 1x3 vector portion of the cluster - for centroid
codevec = cell2mat((cellfun(@(x) mean(x(:,1:3),1),cluster,'UniformOutput',false))');
if ind < 3
ind = ind+1;
if length(codevec) ~= Nsel
warning('codevec ~= Nsel');
Alternatively, instead of cells I thought 3D Matrices would be faster? I tried but it was slower using my method of appending the next row each iteration (temp=[]; for...temp=[temp;new];)
Also, I wasn't sure what was best to loop with, for or while:
%If initialize cell to full length
while length(find(~cellfun('isempty',cluster))) < Nsel
Well, anyways, the first method was fastest for me.
Is the logic standard? Not in the sense that it matches with the algorithm described, but from a coding perspective, any weird methods I employed (especially with those multiple inner loops) that slows it down? Where can I speed up (you can just point me to resources or previous questions)?
My array size, Atgood, is 1,000,000x4 making NselIter=19; - do I just need to find a way to decrease this size or can the code be optimized?
Should this be asked on CodeReview? If so, I'll move it.
Testing Data
Here's some random vectors you can use to test:
for ii=1:1000 %My size is ~ 1,000,000
    omega = 2*rand(3,1)-1;
    omega = (omega/norm(omega))';
    Atgood(ii,1:4) = [omega,57];
end
Your biggest issue is re-iterating through all of vecA FOR EACH CODEVECTOR, rather than just the ones that are part of the corresponding cluster. You're supposed to split each cluster on its codevector. As it is, your cluster structure grows and grows, and each iteration is processing more and more samples.
Your second issue is the loop around the comparisons, and the appending of samples to build up the clusters. Both of those can be solved by vectorizing the comparison operation. Oh, I just saw your edit, where this was optimized. Much better. But codevec(hh1,ind) is just a scalar, so you don't even need the repmat.
Try this version:
% (preallocs added in edit)
cluster = cell(1,Nsel);
codevec = zeros(Nsel, 3);
codevec(1,:) = mean(Atgood(:,1:3),1);
cluster{1} = Atgood;
nClusters = 1;
ind = 1;
while nClusters < Nsel
    for c = 1:nClusters
        lower_cluster_logical = cluster{c}(:,ind) < codevec(c,ind);
        cluster{nClusters+c} = cluster{c}(~lower_cluster_logical,:);
        cluster{c} = cluster{c}(lower_cluster_logical,:);
        codevec(c,:) = mean(cluster{c}(:,1:3), 1);
        codevec(nClusters+c,:) = mean(cluster{nClusters+c}(:,1:3), 1);
    end
    ind = rem(ind,3) + 1;
    nClusters = nClusters*2;
end
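The splitting procedure described in the question can also be sketched in a few lines of Python, which may help when checking the logic on small inputs. This is an illustrative sketch, not the reference KFCG implementation: the names kfcg and centroid are made up, n_clusters is assumed to be a power of two, and an empty cluster simply keeps its parent's codevector (an assumption of this sketch).

```python
import random

def centroid(rows):
    """Mean of the first three entries (the vector part) of each row."""
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(3)]

def kfcg(data, n_clusters):
    clusters = [list(data)]
    codevecs = [centroid(data)]
    ind = 0  # component used for the comparison in this iteration
    while len(clusters) < n_clusters:
        new_clusters, new_codevecs = [], []
        for rows, code in zip(clusters, codevecs):
            # Each cluster is split only on its own codevector
            lower = [r for r in rows if r[ind] < code[ind]]
            upper = [r for r in rows if r[ind] >= code[ind]]
            for part in (lower, upper):
                new_clusters.append(part)
                # Assumption: an empty part inherits the parent's codevector
                new_codevecs.append(centroid(part) if part else code)
        clusters, codevecs = new_clusters, new_codevecs
        ind = (ind + 1) % 3  # cycle through the three components
    return clusters, codevecs

# Test data in the same spirit as the question: random unit vectors plus a 4th column
random.seed(2)
data = []
for _ in range(1000):
    v = [2 * random.random() - 1 for _ in range(3)]
    norm = sum(c * c for c in v) ** 0.5
    data.append([c / norm for c in v] + [57])

clusters, codevecs = kfcg(data, 8)
```

Every sample ends up in exactly one cluster, and each iteration touches each sample once, so the total work per iteration stays proportional to the number of samples.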

Most efficient way to weight and sum a number of matrices in Fortran

I am trying to write a function in Fortran that multiplies a number of matrices with different weights and then adds them together to form a single matrix. I have identified that this process is the bottleneck in my program (this weighting will be made many times for a single run of the program, with different weights). Right now I'm trying to make it run faster by switching from Matlab to Fortran. I am a newbie at Fortran so I appreciate all help.
In Matlab the fastest way I have found to make such a computation looks like this:
function B = weight_matrices()
n = 46;
m = 1800;
A = rand(n,m,m);
w = rand(n,1);
B = squeeze(sum(bsxfun(@times,w,A),1));
The line where B is assigned runs in about 0.9 seconds on my machine (Matlab R2012b, MacBook Pro 13" retina, 2.5 GHz Intel Core i5, 8 GB 1600 MHz DDR3). It should be noted that for my problem, the tensor A will be the same (constant) for the whole run of the program (after initialization), but w can take any values. Also, typical values of n and m are used here, meaning that the tensor A will have a size of about 1 GB in memory.
The clearest way I can think of writing this in Fortran is something like this:
pure function weight_matrices(w,A) result(B)
implicit none
integer, parameter :: n = 46
integer, parameter :: m = 1800
double precision, dimension(n), intent(in) :: w
double precision, dimension(n,m,m), intent(in) :: A
double precision, dimension(m,m) :: B
integer :: i
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
end function weight_matrices
This function runs in about 1.4 seconds when compiled with gfortran 4.7.2, using -O3 (function call timed with "call cpu_time(t)"). If I manually unroll the loop into
B = w(1)*A(1,:,:)+w(2)*A(2,:,:)+ ... + w(46)*A(46,:,:)
the function takes about 0.11 seconds to run instead. This is great and means that I get a speedup of about 8 times compared to the Matlab version. However, I still have some questions on readability and performance.
First, I wonder if there is an even faster way to perform this weighting and summing of matrices. I have looked through BLAS and LAPACK, but can't find any function that seems to fit. I have also tried to put the dimension in A that enumerates the matrices as the last dimension (i.e. switching from (i,j,k) to (k,i,j) for the elements), but this resulted in slower code.
Second, this fast version is not very flexible, and actually looks quite ugly, since it is so much text for such a simple computation. For the tests I am running I would like to try to use different numbers of weights, so that the length of w will vary, to see how it affects the rest of my algorithm. However, that means a quite tedious rewrite of the assignment of B every time. Is there any way to make this more flexible, while keeping the performance the same (or better)?
Third, the tensor A will, as mentioned before, be constant during the run of the program. I have set constant scalar values in my program using the "parameter" attribute in their own module, importing them with the "use" expression into the functions/subroutines that need them. What is the best way to do the equivalent thing for the tensor A? I want to tell the compiler that this tensor will be constant, after init., so that any corresponding optimizations can be done. Note that A is typically ~1 GB in size, so it is not practical to enter it directly in the source file.
Thank you in advance for any input! :)
Perhaps you could try something like
do k=1,m
    do j=1,m
        B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
    end do
end do
The square brace is a newer form of (/ /), the 1d matrix (vector). The term in sum is a matrix of dimension (n) and sum sums all of those elements. This is precisely what your unrolled code does (and is not exactly equal to the do loop you have).
I tried to refine Kyle Vanos' solution.
Therefore I decided to use sum and Fortran's vector capabilities.
I don't know if the results are correct, because I only looked at the timings!
Version 1: (for comparison)
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
Version 2: (from Kyle Vanos)
do k=1,m
    do j=1,m
        B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
    end do
end do
Version 3: (mixed-up indices, work on one row/column at a time)
do j = 1, m
    B(:,j)=sum( [ ( (w(i)*A(:,i,j)), i=1,n) ], dim=1)
end do
Version 4: (complete matrices)
B=sum( [ ( (w(i)*A(:,:,i)), i=1,n) ], dim=1)
As you can see, I had to mix up the indices to get faster execution times. The third solution is really strange because the number of the matrix is the middle index, but this is necessary for memory-order reasons.
V1: 1.30s
V2: 0.16s
V3: 0.02s
V4: 0.03s
In conclusion, I would say that you can get a massive speedup if you are able to reorder the matrix indices freely.
I would not hide any looping as this is usually slower. You can write it explicitly, then you'll see that the inner loop access is over the last index, making it inefficient. So, you should make sure your n dimension is the last one by storing A as A(m,m,n):
B = 0
do i = 1,n
w_tmp = w(i)
do j = 1,m
do k = 1,m
B(k,j) = B(k,j) + w_tmp*A(k,j,i)
end do
end do
end do
this should be much more efficient as you are now accessing consecutive elements in memory in the inner loop.
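The memory-order argument can be made concrete by computing element offsets. Fortran stores arrays in column-major order, so the first index varies fastest. The Python sketch below (illustration only; offset is a made-up helper) shows how far apart in memory two neighbouring elements are for each choice of index:

```python
def offset(idx, shape):
    """Zero-based memory offset of a 1-based multi-index in column-major order."""
    off, stride = 0, 1
    for i, n in zip(idx, shape):
        off += (i - 1) * stride
        stride *= n
    return off

n, m = 46, 1800

# Layout A(m,m,n), accessed as A(k,j,i):
shape = (m, m, n)
step_k = offset((2, 1, 1), shape) - offset((1, 1, 1), shape)  # 1 element
step_j = offset((1, 2, 1), shape) - offset((1, 1, 1), shape)  # m elements
step_i = offset((1, 1, 2), shape) - offset((1, 1, 1), shape)  # m*m elements

# Layout A(n,m,m), accessed as A(i,j,k): here the matrix index i is the fast one
shape_q = (n, m, m)
step_i_q = offset((2, 1, 1), shape_q) - offset((1, 1, 1), shape_q)  # 1 element
step_j_q = offset((1, 2, 1), shape_q) - offset((1, 1, 1), shape_q)  # n elements
```

With A(m,m,n), an inner loop over k walks through memory with stride 1, while with A(n,m,m) a whole-matrix slice A(i,:,:) touches elements that are n apart.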
Another solution is to use the level 1 BLAS subroutines _AXPY (y = a*x + y):
B = 0
do i = 1,n
CALL DAXPY(m*m, w(i), A(1,1,i), 1, B(1,1), 1)
end do
With Intel MKL this should be more efficient, but again you should make sure the last index is the one which changes in the outer loop (in this case the loop you're writing). You can find the necessary arguments for this call here: MKL
EDIT: you might also want to use some parallelization? (I don't know if Matlab takes advantage of that)
EDIT2: In the answer of Kyle, the inner loop is over different values of w, which is more efficient than n times reloading B as w can be kept in cache (using A(n,m,m)):
B = 0
do i = 1,m
do j = 1,m
do k = 1,n
B(j,i) = B(j,i) + w(k)*A(k,j,i)
end do
end do
end do
This explicit looping performs about 10% better than the code of Kyle, which uses whole-array operations. Bandwidth with ifort -O3 -xHost is ~6600 MB/s, with gfortran -O3 it's ~6000 MB/s, and the whole-array version with either compiler is also around 6000 MB/s.
I know this is an old post, however I will be glad to bring my contribution as I played with most of the posted solutions.
Adding a local unroll for the weights loop (from Steabert's answer) gives me a little speed-up compared to the complete unroll version (from 10% to 80%, depending on the size of the matrices). The partial unrolling may help the compiler to vectorize the 4 operations in one SSE call.
pure function weight_matrices_partial_unroll_4(w,A) result(B)
    implicit none
    integer, parameter :: n = 46
    integer, parameter :: m = 1800
    real(8), intent(in) :: w(n)
    real(8), intent(in) :: A(n,m,m)
    real(8) :: B(m,m)
    real(8) :: Btemp(4)
    integer :: i, j, k, l, ndiv, nmod, roll
    roll = 4
    ndiv = n / roll
    nmod = mod( n, roll )
    B = 0 !---- initialise the accumulator before summing into it
    do i = 1,m
        do j = 1,m
            k = 1
            do l = 1,ndiv
                Btemp(1) = w(k )*A(k ,j,i)
                Btemp(2) = w(k+1)*A(k+1,j,i)
                Btemp(3) = w(k+2)*A(k+2,j,i)
                Btemp(4) = w(k+3)*A(k+3,j,i)
                k = k + roll
                B(j,i) = B(j,i) + sum( Btemp )
            end do
            do l = 1,nmod !---- process the rest of the loop
                B(j,i) = B(j,i) + w(k)*A(k,j,i)
                k = k + 1
            end do
        end do
    end do
end function
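The structure of the partial unroll (blocks of 4, then a remainder loop) is easy to check on a single B(j,i) entry. The Python sketch below is for illustration only and says nothing about speed; weighted_sum_unroll4 is a made-up name, and w and a stand for the weights and for the n values A(:,j,i) of one entry.

```python
def weighted_sum_unroll4(w, a):
    """Sum of w[k]*a[k], processed 4 terms at a time plus a remainder loop."""
    n = len(w)
    ndiv, nmod = divmod(n, 4)  # full blocks of 4, leftover terms
    total = 0.0
    k = 0
    for _ in range(ndiv):
        total += (w[k] * a[k] + w[k + 1] * a[k + 1]
                  + w[k + 2] * a[k + 2] + w[k + 3] * a[k + 3])
        k += 4
    for _ in range(nmod):  # process the rest of the loop
        total += w[k] * a[k]
        k += 1
    return total

# n = 46 as in the question: 11 full blocks of 4 and 2 leftover terms
w = list(range(1, 47))
a = list(range(46, 0, -1))
unrolled = weighted_sum_unroll4(w, a)
plain = sum(x * y for x, y in zip(w, a))
```

With integer-valued inputs both results are exactly equal; with general floating-point data they can differ in the last bits because the terms are added in a different order.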

Performance of swapping two elements in MATLAB

Purely as an experiment, I'm writing sort functions in MATLAB then running these through the MATLAB profiler. The aspect I find most perplexing is to do with swapping elements.
I've found that the "official" way of swapping two elements in a matrix
self.Data([i1, i2]) = self.Data([i2, i1])
runs much slower than doing it in four lines of code:
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
The total length of time taken up by the second example is 12 times less than the single line of code in the first example.
Would somebody have an explanation as to why?
Based on suggestions posted, I've run some more tests.
It appears the performance hit comes when the same matrix is referenced in both the LHS and RHS of the assignment.
My theory is that MATLAB uses an internal reference-counting / copy-on-write mechanism, and this is causing the entire matrix to be copied internally when it's referenced on both sides. (This is a guess because I don't know the MATLAB internals).
Here are the results from calling the function 885548 times. (The difference here is times four, not times twelve as I originally posted. Each of the functions have the additional function-wrapping overhead, while in my initial post I just summed up the individual lines).
swap1: 12.547 s
swap2: 14.301 s
swap3: 51.739 s
Here's the code:
methods (Access = public)
    function swap(self, i1, i2)
        swap1(self, i1, i2);
        swap2(self, i1, i2);
        swap3(self, i1, i2);
        self.SwapCount = self.SwapCount + 1;
    end
end
methods (Access = private)
    % swap1: stores values in temporary doubles
    % This has the best performance
    function swap1(self, i1, i2)
        e1 = self.Data(i1);
        e2 = self.Data(i2);
        self.Data(i1) = e2;
        self.Data(i2) = e1;
    end
    % swap2: stores values in a temporary matrix
    % Marginally slower than swap1
    function swap2(self, i1, i2)
        m = self.Data([i1, i2]);
        self.Data([i2, i1]) = m;
    end
    % swap3: does not use variables for storage.
    % This has the worst performance
    function swap3(self, i1, i2)
        self.Data([i1, i2]) = self.Data([i2, i1]);
    end
end
In the first (slow) approach, the RHS value is a matrix, so I think MATLAB incurs a performance penalty in creating a new matrix to store the two elements. The second (fast) approach avoids this by working directly with the elements.
Check out the "Techniques for Improving Performance" article on MathWorks for ways to improve your MATLAB code.
you could also do:
tmp = self.Data(i1);
self.Data(i1) = self.Data(i2);
self.Data(i2) = tmp;
Zach is potentially right in that a temporary copy of the matrix may be made to perform the first operation, although I would hazard a guess that there is some internal optimization within MATLAB that attempts to avoid this. It may be a function of the version of MATLAB you are using. I tried both of your cases in version (a couple years old) and only saw a speed difference of about 2-2.5.
It's possible that this may be an example of speed improvement by what's called "loop unrolling". When doing vector operations, at some level within the internal code there is likely a FOR loop which loops over the indices you are swapping. By performing the scalar operations in the second example, you are avoiding any overhead from loops. Note these two (somewhat silly) examples:
vec = [1 2 3 4];
%Example 1:
for i = 1:4,
    vec(i) = vec(i)+1;
end
%Example 2:
vec(1) = vec(1)+1;
vec(2) = vec(2)+1;
vec(3) = vec(3)+1;
vec(4) = vec(4)+1;
Admittedly, it would be much easier to simply use vector operations like:
vec = vec+1;
but the examples above are for the purpose of illustration. When I repeat each example multiple times over and time them, Example 2 is actually somewhat faster than Example 1. For a small loop with a known number (in the example, just 4), it can actually be more efficient to forgo the loop. Of course, in this particular example, the vector operation given above is actually the fastest.
I usually follow this rule: Try a few different things, and pick the fastest for your specific problem.
This post deserves an update, since the JIT compiler is now a thing (since R2015b) and so is timeit (since R2013b) for more reliable function timing.
Below is a short benchmarking function for element swapping within a large array.
I have used the terms "directly swapping" and "using a temporary variable" to describe the two methods in the question respectively.
The results are pretty staggering: the performance of directly swapping 2 elements is increasingly poor compared to using a temporary variable.
function benchie()
    % Variables for plotting, loop to increase size of the arrays
    M = 15; D = zeros(1,M); W = zeros(1,M);
    for n = 1:M;
        N = 2^n;
        % Create some random array of length N, and random indices to swap
        v = rand(N,1);
        x = randi([1, N], N, 1);
        y = randi([1, N], N, 1);
        % Time the functions
        D(n) = timeit(@()direct);
        W(n) = timeit(@()withtemp);
    end
    % Plotting
    plot(2.^(1:M), D, 2.^(1:M), W);
    legend('direct', 'with temp')
    xlabel('number of elements'); ylabel('time (s)')
    function direct()
        % Direct swapping of two elements
        for k = 1:N
            v([x(k) y(k)]) = v([y(k) x(k)]);
        end
    end
    function withtemp()
        % Using an intermediate temporary variable
        for k = 1:N
            tmp = v(y(k));
            v(y(k)) = v(x(k));
            v(x(k)) = tmp;
        end
    end
end
