Related
My problem is roughly as follows. Given a numerical matrix X, where each row is an item. I want to find each row's nearest neighbor in terms of L2 distance in all rows except itself. I tried reading the official documentation but was still a little confused about how to achieve this. Could someone give me some hint?
My code is as follows
function l2_dist(v1, v2)
return sqrt(sum((v1 - v2) .^ 2))
end
function main(Mat, dist_fun)
n = size(Mat, 1)
Dist = SharedArray{Float64}(n) #[Inf for i in 1:n]
Id = SharedArray{Int64}(n) #[-1 for i in 1:n]
#parallel for i = 1:n
Dist[i] = Inf
Id[i] = 0
end
Threads.#threads for i in 1:n
for j in 1:n
if i != j
println(i, j)
dist_temp = dist_fun(Mat[i, :], Mat[j, :])
if dist_temp < Dist[i]
println("Dist updated!")
Dist[i] = dist_temp
Id[i] = j
end
end
end
end
return Dict("Dist" => Dist, "Id" => Id)
end
n = 4000
p = 30
X = [rand() for i in 1:n, j in 1:p];
main(X[1:30, :], l2_dist)
#time N = main(X, l2_dist)
I'm trying to distributed all the i's (i.e. calculating each row minimum) over different cores. But the version above apparently isn't working correctly. It is even slower than the sequential version. Can someone point me to the right direction? Thanks.
Maybe you're doing something in addition to what you have written down, but, at this point from what I can see, you aren't actually doing any computations in parallel. Julia requires you to tell it how many processors (or threads) you would like it to have access to. You can do this through either
Starting Julia with multiple processors julia -p # (where # is the number of processors you want Julia to have access to)
Once you have started a Julia "session" you can call the addprocs function to add additional processors.
To have more than 1 thread, you need to run command export JULIA_NUM_THREADS = #. I don't know very much about threading, so I will be sticking with the #parallel macro. I suggest reading documentation for more details on threading -- Maybe #Chris Rackauckas could expand a little more on the difference.
A few comments below about my code and on your code:
I'm on version 0.6.1-pre.0. I don't think I'm doing anything 0.6 specific, but this is a heads up just in case.
I'm going to use the Distances.jl package when computing the distances between vectors. I think it is a good habit to farm out as many of my computations to well-written and well-maintained packages as possible.
Rather than compute the distance between rows, I'm going to compute the distance between columns. This is because Julia is a column-major language, so this will increase the number of cache hits and give a little extra speed. You can obviously get the row-wise results you want by just transposing the input.
Unless you expect to have that many memory allocations then that many allocations are a sign that something in your code is inefficient. It is often a type stability problem. I don't know if that was the case in your code before, but that doesn't seem to be an issue in the current version (it wasn't immediately clear to me why you were having so many allocations).
Code is below
# Make sure all processors have access to Distances package
#everywhere using Distances
# Create a random matrix
nrow = 30
ncol = 4000
# Seed creation of random matrix so it is always same matrix
srand(42)
X = rand(nrow, ncol)
function main(X::AbstractMatrix{Float64}, M::Distances.Metric)
# Get size of the matrix
nrow, ncol = size(X)
# Create `SharedArray` to store output
ind_vec = SharedArray{Int}(ncol)
dist_vec = SharedArray{Float64}(ncol)
# Compute the distance between columns
#sync #parallel for i in 1:ncol
# Initialize various temporary variables
min_dist_i = Inf
min_ind_i = -1
X_i = view(X, :, i)
# Check distance against all other columns
for j in 1:ncol
# Skip comparison with itself
if i==j
continue
end
# Tell us who is doing the work
# (can uncomment if you want to verify stuff)
# println("Column $i compared with Column $j by worker $(myid())")
# Evaluate the new distance...
# If it is less then replace it, otherwise proceed
dist_temp = evaluate(M, X_i, view(X, :, j))
if dist_temp < min_dist_i
min_dist_i = dist_temp
min_ind_i = j
end
end
# Which column is minimum distance from column i
dist_vec[i] = min_dist_i
ind_vec[i] = min_ind_i
end
return dist_vec, ind_vec
end
# Using Euclidean metric
metric = Euclidean()
inds, dist = main(X, metric)
#time main(X, metric);
#show dist[[1, 5, 25]], inds[[1, 5, 25]]
You can run the code with
1 processor julia testfile.jl
% julia testfile.jl
0.640365 seconds (16.00 M allocations: 732.495 MiB, 3.70% gc time)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])
n processors (in this case 4) julia -p n testfile.jl
% julia -p 4 testfile.jl
0.201523 seconds (2.10 k allocations: 99.107 KiB)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])
Motivation:
In writing out a matrix operation that was to be performed over tens of thousands of vectors I kept coming across the warning:
Requested 200000x200000 (298.0GB) array exceeds maximum array size
preference. Creation of arrays greater than this limit may take a long
time and cause MATLAB to become unresponsive. See array size limit or
preference panel for more information.
The reason for this was my use of diag() to get the values down the diagonal of an matrix inner product. Because MATLAB is generally optimized for vector/matrix operations, when I first write code, I usually go for the vectorized form. In this case, however, MATLAB has to build the entire matrix in order to get the diagonal which causes the memory and speed issues.
Experiment:
I decided to test the use of diag() vs a for loop to see if at any point it was more efficient to use diag():
num = 200000; % Matrix dimension
x = ones(num, 1);
y = 2 * ones(num, 1);
% z = diag(x*y'); % Expression to solve
% Loop approach
tic
z = zeros(num,1);
for i = 1 : num
z(i) = x(i)*y(i);
end
toc
% Dividing the too-large matrix into process-able chunks
fraction = [10, 20, 50, 100, 500, 1000, 5000, 10000, 20000];
time = zeros(size(fraction));
for k = 1 : length(fraction)
f = fraction(k);
% Operation to time
tic
z = zeros(num,1);
for i = 1 : k
first = (i-1) * (num / f);
last = first + (num / f);
z(first + 1 : last) = diag(x(first + 1: last) * y(first + 1 : last)');
end
time(k) = toc;
end
% Plot results
figure;
hold on
plot(log10(fraction), log10(chunkTime));
plot(log10(fraction), repmat(log10(loopTime), 1, length(fraction)));
plot(log10(fraction), log10(chunkTime), 'g*'); % Plot points along time
legend('Partioned Running Time', 'Loop Running Time');
xlabel('Log_{10}(Fractional Size)'), ylabel('Log_{10}(Running Time)'), title('Running Time Comparison');
This is the result of the test:
(NOTE: The red line represents the loop time as a threshold--it's not to say that the total loop time is constant regardless of the number of loops)
From the graph it is clear that it takes breaking the operations down into roughly 200x200 square matrices to be faster to use diag than to perform the same operation using loops.
Question:
Can someone explain why I'm seeing these results? Also, I would think that with MATLAB's ever-more optimized design, there would be built-in handling of these massive matrices within a diag() function call. For example, it could just perform the i = j indexed operations. Is there a particular reason why this might be prohibitive?
I also haven't really thought of memory implications for diag using the partition method, although it's clear that as the partition size decreases, memory requirements drop.
Test of speed of diag vs. a loop.
Initialization:
n = 10000;
M = randn(n, n); %create a random matrix.
Test speed of diag:
tic;
d = diag(M);
toc;
Test speed of loop:
tic;
d = zeros(n, 1);
for i=1:n
d(i) = M(i,i);
end;
toc;
This would test diag. Your code is not a clean test of diag...
Comment on where there might be confusion
Diag only extracts the diagonal of a matrix. If x and y are vectors, and you do d = diag(x * y'), MATLAB first constructs the n by n matrix x*y' and calls diag on that. This is why, you get the error, "cannot construct 290GB matrix..." Matlab interpreter does not optimize in a crazy way, realize you only want the diagonal and construct just a vector (rather than full matrix with x*y', that does not happen.
Not sure if you're asking this, but the fastest way to calculate d = diag(x*y') where x and y are n by 1 vectors would simply be: d = x.*y
I've stumbled upon the weird way (in my view) that Matlab is dealing with empty matrices. For example, if two empty matrices are multiplied the result is:
zeros(3,0)*zeros(0,3)
ans =
0 0 0
0 0 0
0 0 0
Now, this already took me by surprise, however, a quick search got me to the link above, and I got an explanation of the somewhat twisted logic of why this is happening.
However, nothing prepared me for the following observation. I asked myself, how efficient is this type of multiplication vs just using zeros(n) function, say for the purpose of initialization? I've used timeit to answer this:
f=#() zeros(1000)
timeit(f)
ans =
0.0033
vs:
g=#() zeros(1000,0)*zeros(0,1000)
timeit(g)
ans =
9.2048e-06
Both have the same outcome of 1000x1000 matrix of zeros of class double, but the empty matrix multiplication one is ~350 times faster! (a similar result happens using tic and toc and a loop)
How can this be? are timeit or tic,toc bluffing or have I found a faster way to initialize matrices?
(this was done with matlab 2012a, on a win7-64 machine, intel-i5 650 3.2Ghz...)
EDIT:
After reading your feedback, I have looked more carefully into this peculiarity, and tested on 2 different computers (same matlab ver though 2012a) a code that examine the run time vs the size of matrix n. This is what I get:
The code to generate this used timeit as before, but a loop with tic and toc will look the same. So, for small sizes, zeros(n) is comparable. However, around n=400 there is a jump in performance for the empty matrix multiplication. The code I've used to generate that plot was:
n=unique(round(logspace(0,4,200)));
for k=1:length(n)
f=#() zeros(n(k));
t1(k)=timeit(f);
g=#() zeros(n(k),0)*zeros(0,n(k));
t2(k)=timeit(g);
end
loglog(n,t1,'b',n,t2,'r');
legend('zeros(n)','zeros(n,0)*zeros(0,n)',2);
xlabel('matrix size (n)'); ylabel('time [sec]');
Are any of you experience this too?
EDIT #2:
Incidentally, empty matrix multiplication is not needed to get this effect. One can simply do:
z(n,n)=0;
where n> some threshold matrix size seen in the previous graph, and get the exact efficiency profile as with empty matrix multiplication (again using timeit).
Here's an example where it improves efficiency of a code:
n = 1e4;
clear z1
tic
z1 = zeros( n );
for cc = 1 : n
z1(:,cc)=cc;
end
toc % Elapsed time is 0.445780 seconds.
%%
clear z0
tic
z0 = zeros(n,0)*zeros(0,n);
for cc = 1 : n
z0(:,cc)=cc;
end
toc % Elapsed time is 0.297953 seconds.
However, using z(n,n)=0; instead yields similar results to the zeros(n) case.
This is strange, I am seeing f being faster while g being slower than what you are seeing. But both of them are identical for me. Perhaps a different version of MATLAB ?
>> g = #() zeros(1000, 0) * zeros(0, 1000);
>> f = #() zeros(1000)
f =
#()zeros(1000)
>> timeit(f)
ans =
8.5019e-04
>> timeit(f)
ans =
8.4627e-04
>> timeit(g)
ans =
8.4627e-04
EDIT can you add + 1 for the end of f and g, and see what times you are getting.
EDIT Jan 6, 2013 7:42 EST
I am using a machine remotely, so sorry about the low quality graphs (had to generate them blind).
Machine config:
i7 920. 2.653 GHz. Linux. 12 GB RAM. 8MB cache.
It looks like even the machine I have access to shows this behavior, except at a larger size (somewhere between 1979 and 2073). There is no reason I can think of right now for the empty matrix multiplication to be faster at larger sizes.
I will be investigating a little bit more before coming back.
EDIT Jan 11, 2013
After #EitanT's post, I wanted to do a little bit more of digging. I wrote some C code to see how matlab may be creating a zeros matrix. Here is the c++ code that I used.
int main(int argc, char **argv)
{
for (int i = 1975; i <= 2100; i+=25) {
timer::start();
double *foo = (double *)malloc(i * i * sizeof(double));
for (int k = 0; k < i * i; k++) foo[k] = 0;
double mftime = timer::stop();
free(foo);
timer::start();
double *bar = (double *)malloc(i * i * sizeof(double));
memset(bar, 0, i * i * sizeof(double));
double mmtime = timer::stop();
free(bar);
timer::start();
double *baz = (double *)calloc(i * i, sizeof(double));
double catime = timer::stop();
free(baz);
printf("%d, %lf, %lf, %lf\n", i, mftime, mmtime, catime);
}
}
Here are the results.
$ ./test
1975, 0.013812, 0.013578, 0.003321
2000, 0.014144, 0.013879, 0.003408
2025, 0.014396, 0.014219, 0.003490
2050, 0.014732, 0.013784, 0.000043
2075, 0.015022, 0.014122, 0.000045
2100, 0.014606, 0.014480, 0.000045
As you can see calloc (4th column) seems to be the fastest method. It is also getting significantly faster between 2025 and 2050 (I'd assume it would at around 2048 ?).
Now I went back to matlab to check for the same. Here are the results.
>> test
1975, 0.003296, 0.003297
2000, 0.003377, 0.003385
2025, 0.003465, 0.003464
2050, 0.015987, 0.000019
2075, 0.016373, 0.000019
2100, 0.016762, 0.000020
It looks like both f() and g() are using calloc at smaller sizes (<2048 ?). But at larger sizes f() (zeros(m, n)) starts to use malloc + memset, while g() (zeros(m, 0) * zeros(0, n)) keeps using calloc.
So the divergence is explained by the following
zeros(..) begins to use a different (slower ?) scheme at larger sizes.
calloc also behaves somewhat unexpectedly, leading to an improvement in performance.
This is the behavior on Linux. Can someone do the same experiment on a different machine (and perhaps a different OS) and see if the experiment holds ?
The results might be a bit misleading. When you multiply two empty matrices, the resulting matrix is not immediately "allocated" and "initialized", rather this is postponed until you first use it (sort of like a lazy evaluation).
The same applies when indexing out of bounds to grow a variable, which in the case of numeric arrays fills out any missing entries with zeros (I discuss afterwards the non-numeric case). Of course growing the matrix this way does not overwrite existing elements.
So while it may seem faster, you are just delaying the allocation time until you actually first use the matrix. In the end you'll have similar timings as if you did the allocation from the start.
Example to show this behavior, compared to a few other alternatives:
N = 1000;
clear z
tic, z = zeros(N,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = zeros(N,0)*zeros(0,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z(N,N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = full(spalloc(N,N,0)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z(1:N,1:N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
val = 0;
tic, z = val(ones(N)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = repmat(0, [N N]); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
The result shows that if you sum the elapsed time for both instructions in each case, you end up with similar total timings:
// zeros(N,N)
Elapsed time is 0.004525 seconds.
Elapsed time is 0.000792 seconds.
// zeros(N,0)*zeros(0,N)
Elapsed time is 0.000052 seconds.
Elapsed time is 0.004365 seconds.
// z(N,N) = 0
Elapsed time is 0.000053 seconds.
Elapsed time is 0.004119 seconds.
The other timings were:
// full(spalloc(N,N,0))
Elapsed time is 0.001463 seconds.
Elapsed time is 0.003751 seconds.
// z(1:N,1:N) = 0
Elapsed time is 0.006820 seconds.
Elapsed time is 0.000647 seconds.
// val(ones(N))
Elapsed time is 0.034880 seconds.
Elapsed time is 0.000911 seconds.
// repmat(0, [N N])
Elapsed time is 0.001320 seconds.
Elapsed time is 0.003749 seconds.
These measurements are too small in the milliseconds and might not be very accurate, so you might wanna run these commands in a loop a few thousand times and take the average. Also sometimes running saved M-functions is faster than running scripts or on the command prompt, as certain optimizations only happen that way...
Either way allocation is usually done once, so who cares if it takes an extra 30ms :)
A similar behavior can be seen with cell arrays or arrays of structures. Consider the following example:
N = 1000;
tic, a = cell(N,N); toc
tic, b = repmat({[]}, [N,N]); toc
tic, c{N,N} = []; toc
which gives:
Elapsed time is 0.001245 seconds.
Elapsed time is 0.040698 seconds.
Elapsed time is 0.004846 seconds.
Note that even if they are all equal, they occupy different amount of memory:
>> assert(isequal(a,b,c))
>> whos a b c
Name Size Bytes Class Attributes
a 1000x1000 8000000 cell
b 1000x1000 112000000 cell
c 1000x1000 8000104 cell
In fact the situation is a bit more complicated here, since MATLAB is probably sharing the same empty matrix for all the cells, rather than creating multiple copies.
The cell array a is in fact an array of uninitialized cells (an array of NULL pointers), while b is a cell array where each cell is an empty array [] (internally and because of data sharing, only the first cell b{1} points to [] while all the rest have a reference to the first cell). The final array c is similar to a (uninitialized cells), but with the last one containing an empty numeric matrix [].
I looked around the list of exported C functions from the libmx.dll (using Dependency Walker tool), and I found a few interesting things.
there are undocumented functions for creating uninitialized arrays: mxCreateUninitDoubleMatrix, mxCreateUninitNumericArray, and mxCreateUninitNumericMatrix. In fact there is a submission on the File Exchange makes use of these functions to provide a faster alternative to zeros function.
there exist an undocumented function called mxFastZeros. Googling online, I can see you cross-posted this question on MATLAB Answers as well, with some excellent answers over there. James Tursa (same author of UNINIT from before) gave an example on how to use this undocumented function.
libmx.dll is linked against tbbmalloc.dll shared library. This is Intel TBB scalable memory allocator. This library provides equivalent memory allocation functions (malloc, calloc, free) optimized for parallel applications. Remember that many MATLAB functions are automatically multithreaded, so I wouldn't be surprised if zeros(..) is multithreaded and is using Intel's memory allocator once the matrix size is large enough (here is recent comment by Loren Shure that confirms this fact).
Regarding the last point about the memory allocator, you could write a similar benchmark in C/C++ similar to what #PavanYalamanchili did, and compare the various allocators available. Something like this. Remember that MEX-files have a slightly higher memory management overhead, since MATLAB automatically frees any memory that was allocated in MEX-files using the mxCalloc, mxMalloc, or mxRealloc functions. For what it's worth, it used to be possible to change the internal memory manager in older versions.
EDIT:
Here is a more thorough benchmark to compare the discussed alternatives. It specifically shows that once you stress the use of the entire allocated matrix, all three methods are on equal footing, and the difference is negligible.
function compare_zeros_init()
iter = 100;
for N = 512.*(1:8)
% ZEROS(N,N)
t = zeros(iter,3);
for i=1:iter
clear z
tic, z = zeros(N,N); t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, ZEROS = %.9f\n', N, mean(sum(t,2)))
% z(N,N)=0
t = zeros(iter,3);
for i=1:iter
clear z
tic, z(N,N) = 0; t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, GROW = %.9f\n', N, mean(sum(t,2)))
% ZEROS(N,0)*ZEROS(0,N)
t = zeros(iter,3);
for i=1:iter
clear z
tic, z = zeros(N,0)*zeros(0,N); t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, MULT = %.9f\n\n', N, mean(sum(t,2)))
end
end
Below are the timings averaged over 100 iterations in terms of increasing matrix size. I performed the tests in R2013a.
>> compare_zeros_init
N = 512, ZEROS = 0.001560168
N = 512, GROW = 0.001479991
N = 512, MULT = 0.001457031
N = 1024, ZEROS = 0.005744873
N = 1024, GROW = 0.005352638
N = 1024, MULT = 0.005359236
N = 1536, ZEROS = 0.011950846
N = 1536, GROW = 0.009051589
N = 1536, MULT = 0.008418878
N = 2048, ZEROS = 0.012154002
N = 2048, GROW = 0.010996315
N = 2048, MULT = 0.011002169
N = 2560, ZEROS = 0.017940950
N = 2560, GROW = 0.017641046
N = 2560, MULT = 0.017640323
N = 3072, ZEROS = 0.025657999
N = 3072, GROW = 0.025836506
N = 3072, MULT = 0.051533432
N = 3584, ZEROS = 0.074739924
N = 3584, GROW = 0.070486857
N = 3584, MULT = 0.072822335
N = 4096, ZEROS = 0.098791732
N = 4096, GROW = 0.095849788
N = 4096, MULT = 0.102148452
After doing some research, I've found this article in "Undocumented Matlab", in which Mr. Yair Altman had already come to the conclusion that MathWork's way of preallocating matrices using zeros(M, N) is indeed not the most efficient way.
He timed x = zeros(M,N) vs. clear x, x(M,N) = 0 and found that the latter is ~500 times faster. According to his explanation, the second method simply creates an M-by-N matrix, the elements of which being automatically initialized to 0. The first method however, creates x (with x having automatic zero elements) and then assigns a zero to every element in x again, and that is a redundant operation that takes more time.
In the case of empty matrix multiplication, such as what you've shown in your question, MATLAB expects the product to be an M×N matrix, and therefore it allocates an M×N matrix. Consequently, the output matrix is automatically initialized to zeroes. Since the original matrices are empty, no further calculations are performed, and hence the elements in the output matrix remain unchanged and equal to zero.
Interesting question, apparently there are several ways to 'beat' the built-in zeros function. My only guess as to why this is happening would be that it could be more memory efficient (after all, zeros(LargeNumer) will sooner cause Matlab to hit the memory limit than form a devestating speed bottleneck in most code), or more robust somehow.
Here is another fast allocation method using a sparse matrix, i have added the regular zeros function as a benchmark:
tic; x=zeros(1000,1000); toc
Elapsed time is 0.002863 seconds.
tic; clear x; x(1000,1000)=0; toc
Elapsed time is 0.000282 seconds.
tic; x=full(spalloc(1000,1000,0)); toc
Elapsed time is 0.000273 seconds.
tic; x=spalloc(1000,1000,1000000); toc %Is this the same for practical purposes?
Elapsed time is 0.000281 seconds.
I'm quite new to Matlab and I need help in speeding up some part of my code. I am writing a Matlab application that performs 3D matrix convolution but unlike in standard convolution, the kernel is not constant, it needs to be calculated for each pixel of an image.
So far, I have ended up with a working code, but incredibly slow:
function result = calculateFilteredImages(images, T)
% images - matrix [480,360,10] of 10 grayscale images of height=480 and width=360
% reprezented as a value in a range [0..1]
% i.e. images(10,20,5) = 0.1231;
% T - some matrix [480,360,10, 3,3] of double values, calculated earlier
kerN = 5; %kernel size
mid=floor(kerN/2); %half the kernel size
offset=mid+1; %kernel offset
[h,w,n] = size(images);
%add padding so as not to get IndexOutOfBoundsEx during summation:
%[i.e. changes [1 2 3...10] to [0 0 1 2 ... 10 0 0]]
images = padarray(images,[mid, mid, mid]);
result(h,w,n)=0; %preallocate, faster than zeros(h,w,n)
kernel(kerN,kerN,kerN)=0; %preallocate
% the three parameters below are not important in this problem
% (are used to calculate sigma in x,y,z direction inside the loop)
sigMin=0.5;
sigMax=3;
d = 3;
for a=1:n;
tic;
for b=1:w;
for c=1:h;
M(:,:)=T(c,b,a,:,:); % M is now a 3x3 matrix
[R D] = eig(M); %get eigenvectors and eigenvalues - R and D are now 3x3 matrices
% eigenvalues
l1 = D(1,1);
l2 = D(2,2);
l3 = D(3,3);
sig1=sig( l1 , sigMin, sigMax, d);
sig2=sig( l2 , sigMin, sigMax, d);
sig3=sig( l3 , sigMin, sigMax, d);
% calculate kernel
for i=-mid:mid
for j=-mid:mid
for k=-mid:mid
x_new = [i,j,k] * R; %calculate new [i,j,k]
kernel(offset+i, offset+j, offset+k) = exp(- (((x_new(1))^2 )/(sig1^2) + ((x_new(2))^2)/(sig2^2) + ((x_new(3))^2)/(sig3^2)) /2);
end
end
end
% normalize
kernel=kernel/sum(kernel(:));
%perform summation
xm_sum=0;
for i=-mid:mid
for j=-mid:mid
for k=-mid:mid
xm_sum = xm_sum + kernel(offset+i, offset+j, offset+k) * images(c+mid+i, b+mid+j, a+mid+k);
end
end
end
result(c,b,a)=xm_sum;
end
end
toc;
end
end
I tried replacing the "calculating kernel" part with
sigma=[sig1 sig2 sig3]
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
k2 = arrayfun(#(x, y, z) exp(-(norm([x,y,z]*R./sigma)^2)/2), x,y,z);
but it turned out to be even slower than the loop. I went through several articles and tutorials on vectorization but I'm quite stuck with this one.
Can it be vectorized or somehow speeded up using something else?
I'm new to Matlab, maybe there are some build-in functions that could help in this case?
Update
The profiling result:
Sample data which was used during profiling:
T.mat
grayImages.mat
As Dennis noted, this is a lot of code, cutting it down to the minimum that's slow given by the profiler will help. I'm not sure if my code is equivalent to yours, can you try it and profile it? The 'trick' to Matlab vectorization is using .* and .^, which operate element-by-element instead of having to use loops. http://www.mathworks.com/help/matlab/ref/power.html
Take your rewritten part:
sigma=[sig1 sig2 sig3]
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
k2 = arrayfun(#(x, y, z) exp(-(norm([x,y,z]*R./sigma)^2)/2), x,y,z);
And just pick one sigma for now. Looping over 3 different sigmas isn't a performance problem if you can vectorize the underlying k2 formula.
EDIT: Changed the matrix_to_norm code to be x(:), and no commas. See Generate all possible combinations of the elements of some vectors (Cartesian product)
Then try:
% R & mid my test variables
R = [1 2 3; 4 5 6; 7 8 9];
mid = 5;
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
% meshgrid is also a possibility, check that you are getting the order you want
% Going to break the equation apart for now for clarity
% Matrix operation, should already be fast.
matrix_to_norm = [x(:) y(:) z(:)]*R/sig1
% Ditto
matrix_normed = norm(matrix_to_norm)
% Note the .^ - I believe you want element-by-element exponentiation, this will
% vectorize it.
k2 = exp(-0.5*(matrix_normed.^2))
I've stumbled upon the weird way (in my view) that Matlab is dealing with empty matrices. For example, if two empty matrices are multiplied the result is:
zeros(3,0)*zeros(0,3)
ans =
0 0 0
0 0 0
0 0 0
Now, this already took me by surprise, however, a quick search got me to the link above, and I got an explanation of the somewhat twisted logic of why this is happening.
However, nothing prepared me for the following observation. I asked myself, how efficient is this type of multiplication vs just using zeros(n) function, say for the purpose of initialization? I've used timeit to answer this:
f=#() zeros(1000)
timeit(f)
ans =
0.0033
vs:
g=#() zeros(1000,0)*zeros(0,1000)
timeit(g)
ans =
9.2048e-06
Both have the same outcome of 1000x1000 matrix of zeros of class double, but the empty matrix multiplication one is ~350 times faster! (a similar result happens using tic and toc and a loop)
How can this be? are timeit or tic,toc bluffing or have I found a faster way to initialize matrices?
(this was done with matlab 2012a, on a win7-64 machine, intel-i5 650 3.2Ghz...)
EDIT:
After reading your feedback, I have looked more carefully into this peculiarity, and tested on 2 different computers (same matlab ver though 2012a) a code that examine the run time vs the size of matrix n. This is what I get:
The code to generate this used timeit as before, but a loop with tic and toc will look the same. So, for small sizes, zeros(n) is comparable. However, around n=400 there is a jump in performance for the empty matrix multiplication. The code I've used to generate that plot was:
n=unique(round(logspace(0,4,200)));
for k=1:length(n)
f=#() zeros(n(k));
t1(k)=timeit(f);
g=#() zeros(n(k),0)*zeros(0,n(k));
t2(k)=timeit(g);
end
loglog(n,t1,'b',n,t2,'r');
legend('zeros(n)','zeros(n,0)*zeros(0,n)',2);
xlabel('matrix size (n)'); ylabel('time [sec]');
Are any of you experience this too?
EDIT #2:
Incidentally, empty matrix multiplication is not needed to get this effect. One can simply do:
z(n,n)=0;
where n> some threshold matrix size seen in the previous graph, and get the exact efficiency profile as with empty matrix multiplication (again using timeit).
Here's an example where it improves efficiency of a code:
n = 1e4;
clear z1
tic
z1 = zeros( n );
for cc = 1 : n
z1(:,cc)=cc;
end
toc % Elapsed time is 0.445780 seconds.
%%
clear z0
tic
z0 = zeros(n,0)*zeros(0,n);
for cc = 1 : n
z0(:,cc)=cc;
end
toc % Elapsed time is 0.297953 seconds.
However, using z(n,n)=0; instead yields similar results to the zeros(n) case.
This is strange, I am seeing f being faster while g being slower than what you are seeing. But both of them are identical for me. Perhaps a different version of MATLAB ?
>> g = #() zeros(1000, 0) * zeros(0, 1000);
>> f = #() zeros(1000)
f =
#()zeros(1000)
>> timeit(f)
ans =
8.5019e-04
>> timeit(f)
ans =
8.4627e-04
>> timeit(g)
ans =
8.4627e-04
EDIT can you add + 1 for the end of f and g, and see what times you are getting.
EDIT Jan 6, 2013 7:42 EST
I am using a machine remotely, so sorry about the low quality graphs (had to generate them blind).
Machine config:
i7 920. 2.653 GHz. Linux. 12 GB RAM. 8MB cache.
It looks like even the machine I have access to shows this behavior, except at a larger size (somewhere between 1979 and 2073). There is no reason I can think of right now for the empty matrix multiplication to be faster at larger sizes.
I will be investigating a little bit more before coming back.
EDIT Jan 11, 2013
After #EitanT's post, I wanted to do a little bit more of digging. I wrote some C code to see how matlab may be creating a zeros matrix. Here is the c++ code that I used.
int main(int argc, char **argv)
{
for (int i = 1975; i <= 2100; i+=25) {
timer::start();
double *foo = (double *)malloc(i * i * sizeof(double));
for (int k = 0; k < i * i; k++) foo[k] = 0;
double mftime = timer::stop();
free(foo);
timer::start();
double *bar = (double *)malloc(i * i * sizeof(double));
memset(bar, 0, i * i * sizeof(double));
double mmtime = timer::stop();
free(bar);
timer::start();
double *baz = (double *)calloc(i * i, sizeof(double));
double catime = timer::stop();
free(baz);
printf("%d, %lf, %lf, %lf\n", i, mftime, mmtime, catime);
}
}
Here are the results.
$ ./test
1975, 0.013812, 0.013578, 0.003321
2000, 0.014144, 0.013879, 0.003408
2025, 0.014396, 0.014219, 0.003490
2050, 0.014732, 0.013784, 0.000043
2075, 0.015022, 0.014122, 0.000045
2100, 0.014606, 0.014480, 0.000045
As you can see calloc (4th column) seems to be the fastest method. It is also getting significantly faster between 2025 and 2050 (I'd assume it would at around 2048 ?).
Now I went back to matlab to check for the same. Here are the results.
>> test
1975, 0.003296, 0.003297
2000, 0.003377, 0.003385
2025, 0.003465, 0.003464
2050, 0.015987, 0.000019
2075, 0.016373, 0.000019
2100, 0.016762, 0.000020
It looks like both f() and g() are using calloc at smaller sizes (<2048 ?). But at larger sizes f() (zeros(m, n)) starts to use malloc + memset, while g() (zeros(m, 0) * zeros(0, n)) keeps using calloc.
So the divergence is explained by the following
zeros(..) begins to use a different (slower ?) scheme at larger sizes.
calloc also behaves somewhat unexpectedly, leading to an improvement in performance.
This is the behavior on Linux. Can someone do the same experiment on a different machine (and perhaps a different OS) and see if the experiment holds ?
The results might be a bit misleading. When you multiply two empty matrices, the resulting matrix is not immediately "allocated" and "initialized", rather this is postponed until you first use it (sort of like a lazy evaluation).
The same applies when indexing out of bounds to grow a variable, which in the case of numeric arrays fills out any missing entries with zeros (I discuss afterwards the non-numeric case). Of course growing the matrix this way does not overwrite existing elements.
So while it may seem faster, you are just delaying the allocation time until you actually first use the matrix. In the end you'll have similar timings as if you did the allocation from the start.
Example to show this behavior, compared to a few other alternatives:
N = 1000;
clear z
tic, z = zeros(N,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = zeros(N,0)*zeros(0,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z(N,N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = full(spalloc(N,N,0)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z(1:N,1:N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
val = 0;
tic, z = val(ones(N)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = repmat(0, [N N]); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
The result shows that if you sum the elapsed time for both instructions in each case, you end up with similar total timings:
// zeros(N,N)
Elapsed time is 0.004525 seconds.
Elapsed time is 0.000792 seconds.
// zeros(N,0)*zeros(0,N)
Elapsed time is 0.000052 seconds.
Elapsed time is 0.004365 seconds.
// z(N,N) = 0
Elapsed time is 0.000053 seconds.
Elapsed time is 0.004119 seconds.
The other timings were:
// full(spalloc(N,N,0))
Elapsed time is 0.001463 seconds.
Elapsed time is 0.003751 seconds.
// z(1:N,1:N) = 0
Elapsed time is 0.006820 seconds.
Elapsed time is 0.000647 seconds.
// val(ones(N))
Elapsed time is 0.034880 seconds.
Elapsed time is 0.000911 seconds.
// repmat(0, [N N])
Elapsed time is 0.001320 seconds.
Elapsed time is 0.003749 seconds.
These measurements are too small in the milliseconds and might not be very accurate, so you might wanna run these commands in a loop a few thousand times and take the average. Also sometimes running saved M-functions is faster than running scripts or on the command prompt, as certain optimizations only happen that way...
Either way allocation is usually done once, so who cares if it takes an extra 30ms :)
A similar behavior can be seen with cell arrays or arrays of structures. Consider the following example:
N = 1000;
tic, a = cell(N,N); toc
tic, b = repmat({[]}, [N,N]); toc
tic, c{N,N} = []; toc
which gives:
Elapsed time is 0.001245 seconds.
Elapsed time is 0.040698 seconds.
Elapsed time is 0.004846 seconds.
Note that even if they are all equal, they occupy different amount of memory:
>> assert(isequal(a,b,c))
>> whos a b c
Name Size Bytes Class Attributes
a 1000x1000 8000000 cell
b 1000x1000 112000000 cell
c 1000x1000 8000104 cell
In fact the situation is a bit more complicated here, since MATLAB is probably sharing the same empty matrix for all the cells, rather than creating multiple copies.
The cell array a is in fact an array of uninitialized cells (an array of NULL pointers), while b is a cell array where each cell is an empty array [] (internally and because of data sharing, only the first cell b{1} points to [] while all the rest have a reference to the first cell). The final array c is similar to a (uninitialized cells), but with the last one containing an empty numeric matrix [].
I looked around the list of exported C functions from the libmx.dll (using Dependency Walker tool), and I found a few interesting things.
there are undocumented functions for creating uninitialized arrays: mxCreateUninitDoubleMatrix, mxCreateUninitNumericArray, and mxCreateUninitNumericMatrix. In fact there is a submission on the File Exchange makes use of these functions to provide a faster alternative to zeros function.
there exist an undocumented function called mxFastZeros. Googling online, I can see you cross-posted this question on MATLAB Answers as well, with some excellent answers over there. James Tursa (same author of UNINIT from before) gave an example on how to use this undocumented function.
libmx.dll is linked against tbbmalloc.dll shared library. This is Intel TBB scalable memory allocator. This library provides equivalent memory allocation functions (malloc, calloc, free) optimized for parallel applications. Remember that many MATLAB functions are automatically multithreaded, so I wouldn't be surprised if zeros(..) is multithreaded and is using Intel's memory allocator once the matrix size is large enough (here is recent comment by Loren Shure that confirms this fact).
Regarding the last point about the memory allocator, you could write a similar benchmark in C/C++ similar to what #PavanYalamanchili did, and compare the various allocators available. Something like this. Remember that MEX-files have a slightly higher memory management overhead, since MATLAB automatically frees any memory that was allocated in MEX-files using the mxCalloc, mxMalloc, or mxRealloc functions. For what it's worth, it used to be possible to change the internal memory manager in older versions.
EDIT:
Here is a more thorough benchmark to compare the discussed alternatives. It specifically shows that once you stress the use of the entire allocated matrix, all three methods are on equal footing, and the difference is negligible.
function compare_zeros_init()
iter = 100;
for N = 512.*(1:8)
% ZEROS(N,N)
t = zeros(iter,3);
for i=1:iter
clear z
tic, z = zeros(N,N); t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, ZEROS = %.9f\n', N, mean(sum(t,2)))
% z(N,N)=0
t = zeros(iter,3);
for i=1:iter
clear z
tic, z(N,N) = 0; t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, GROW = %.9f\n', N, mean(sum(t,2)))
% ZEROS(N,0)*ZEROS(0,N)
t = zeros(iter,3);
for i=1:iter
clear z
tic, z = zeros(N,0)*zeros(0,N); t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, MULT = %.9f\n\n', N, mean(sum(t,2)))
end
end
Below are the timings averaged over 100 iterations in terms of increasing matrix size. I performed the tests in R2013a.
>> compare_zeros_init
N = 512, ZEROS = 0.001560168
N = 512, GROW = 0.001479991
N = 512, MULT = 0.001457031
N = 1024, ZEROS = 0.005744873
N = 1024, GROW = 0.005352638
N = 1024, MULT = 0.005359236
N = 1536, ZEROS = 0.011950846
N = 1536, GROW = 0.009051589
N = 1536, MULT = 0.008418878
N = 2048, ZEROS = 0.012154002
N = 2048, GROW = 0.010996315
N = 2048, MULT = 0.011002169
N = 2560, ZEROS = 0.017940950
N = 2560, GROW = 0.017641046
N = 2560, MULT = 0.017640323
N = 3072, ZEROS = 0.025657999
N = 3072, GROW = 0.025836506
N = 3072, MULT = 0.051533432
N = 3584, ZEROS = 0.074739924
N = 3584, GROW = 0.070486857
N = 3584, MULT = 0.072822335
N = 4096, ZEROS = 0.098791732
N = 4096, GROW = 0.095849788
N = 4096, MULT = 0.102148452
After doing some research, I've found this article in "Undocumented Matlab", in which Mr. Yair Altman had already come to the conclusion that MathWork's way of preallocating matrices using zeros(M, N) is indeed not the most efficient way.
He timed x = zeros(M,N) vs. clear x, x(M,N) = 0 and found that the latter is ~500 times faster. According to his explanation, the second method simply creates an M-by-N matrix, the elements of which being automatically initialized to 0. The first method however, creates x (with x having automatic zero elements) and then assigns a zero to every element in x again, and that is a redundant operation that takes more time.
In the case of empty matrix multiplication, such as what you've shown in your question, MATLAB expects the product to be an M×N matrix, and therefore it allocates an M×N matrix. Consequently, the output matrix is automatically initialized to zeroes. Since the original matrices are empty, no further calculations are performed, and hence the elements in the output matrix remain unchanged and equal to zero.
Interesting question, apparently there are several ways to 'beat' the built-in zeros function. My only guess as to why this is happening would be that it could be more memory efficient (after all, zeros(LargeNumer) will sooner cause Matlab to hit the memory limit than form a devestating speed bottleneck in most code), or more robust somehow.
Here is another fast allocation method using a sparse matrix, i have added the regular zeros function as a benchmark:
tic; x=zeros(1000,1000); toc
Elapsed time is 0.002863 seconds.
tic; clear x; x(1000,1000)=0; toc
Elapsed time is 0.000282 seconds.
tic; x=full(spalloc(1000,1000,0)); toc
Elapsed time is 0.000273 seconds.
tic; x=spalloc(1000,1000,1000000); toc %Is this the same for practical purposes?
Elapsed time is 0.000281 seconds.