Related
Given a nucleotide sequence, I'm writing some Julia code to generate a sparse vector of (masked) kmer counts, and I would like it to run as fast as possible.
Here is my current implementation,
using Distributions
using SparseArrays
function kmer_profile(seq, k, mask)
basis = [4^i for i in (k - 1):-1:0]
d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)
kmer_dict = Dict{Int, Int32}(4^k=>0)
for n in 1:(length(seq) - length(mask) + 1)
kmer_hash = 1
j = 1
for i in 1:length(mask)
if mask[i]
kmer_hash += d[seq[n+i-1]] * basis[j]
j += 1
end
end
haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
end
return sparsevec(kmer_dict)
end
seq = join(sample(['A','C','G','T'], 1000000))
mask_str = "111111011111001111111111111110"
mask = BitArray([parse(Bool, string(m)) for m in split(mask_str, "")])
k = sum(mask)
#time kmer_profile(seq, k, mask)
This code runs in about 0.3 seconds on my M1 MacBook Pro, is there any way to make it run significantly faster?
The function kmer_profile uses a sliding window of size length(mask) to count the number of times each masked kmer appears in the nucleotide sequence. A mask is a binary sequence, and a masked kmer is a kmer with nucleotides dropped at positions at which the mask is zero. E.g. the kmer ACGT and mask 1001 will produce the masked kmer AT.
To produce the kmer hash, the function treats each kmer as a base 4 number and then converts it to a (base 10) 64-bit integer, for indexing into the kmer vector.
The size of k is equal to the number of ones in the mask string, and is implicitly limited to 31 so that kmer hashes can fit into a 64-bit integer type.
There are several possible optimizations to make this code faster.
First of all, one can convert the Dict to an array since array-based indexing is faster than dictionary-based indexing one and this is possible here since the key is an ASCII character.
Moreover, the extraction of the sequence codes can be done once instead of length(mask) times by pre-computing code and putting the result in a temporary array.
Additionally, the mask-based conditional and the loop carried dependency make things slow. Indeed, the condition cannot be (easily) predicted by the processor causing it to stall for several cycles. The loop carried dependency make things even worse since the processor can hardly execute other instructions during this stall. This problem can be solved by pre-computing the factors based on both mask and basis. The result is a faster branch-less loop.
Once the above optimizations are done, the biggest bottleneck is sparsevec. In fact, it was also taking nearly half the time of the initial implementation! Optimizing this step is difficult but not impossible. It is slow because of random accesses in the Julia implementation. One can speed this up by sorting the keys-values pairs in the first place. It is faster due to a more cache-friendly execution and it can also help the prediction unit of the processor. This is a complex topic. For more details about how this works, please read Why is processing a sorted array faster than processing an unsorted array?.
Here is the final optimized code:
function kmer_profile_opt(seq, k, mask)
basis = [4^i for i in (k - 1):-1:0]
d = zeros(Int8, 128)
d[Int64('A')] = 0
d[Int64('C')] = 1
d[Int64('G')] = 2
d[Int64('T')] = 3
seq_codes = [d[Int8(e)] for e in seq]
j = 1
premult = zeros(Int64, length(mask))
for i in 1:length(mask)
if mask[i]
premult[i] = basis[j]
j += 1
end
end
kmer_dict = Dict{Int, Int32}(4^k=>0)
for n in 1:(length(seq) - length(mask) + 1)
kmer_hash = 1
j = 1
for i in 1:length(mask)
kmer_hash += seq_codes[n+i-1] * premult[i]
end
haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
end
sorted_kmer_pairs = sort(collect(kmer_dict))
sorted_kmer_keys = [e[1] for e in sorted_kmer_pairs]
sorted_kmer_values = [e[2] for e in sorted_kmer_pairs]
return sparsevec(sorted_kmer_keys, sorted_kmer_values)
end
This code is a bit more than twice faster than the initial implementation on my machine. A significant fraction of the time is still spent in the sorting algorithm.
The code can still be optimized further. One way is to use a parallel sort algorithm. Another way is to replace the premult[i] multiplication by a shift which is faster assuming premult[i] is modified so to contain exponents. I expect the code to be about 4 times faster than the original code. The main bottleneck should be the big dictionary creation. Improving further the performance of this is very hard (though it is still possible).
Inspired by Jérôme's answer, and squeezing some more by avoiding Dicts altogether:
function kmer_profile_opt3a(seq, k, mask)
d = zeros(Int8, 128)
d[Int64('A')] = 0
d[Int64('C')] = 1
d[Int64('G')] = 2
d[Int64('T')] = 3
seq_codes = [d[Int8(e)] for e in seq]
basis = [4^i for i in (k-1):-1:0]
j = 1
premult = zeros(Int64, length(mask))
for i in 1:length(mask)
if mask[i]
premult[i] = basis[j]
j += 1
end
end
kmer_vec = Vector{Int}(undef, length(seq)-length(mask)+1)
#inbounds for n in 1:(length(seq) - length(mask) + 1)
kmer_hash = 1
for i in 1:length(mask)
kmer_hash += seq_codes[n+i-1] * premult[i]
end
kmer_vec[n] = kmer_hash
end
sort!(kmer_vec)
return sparsevec(kmer_vec, ones(length(kmer_vec)), 4^k, +)
end
This achieved another 2x over Jérôme's answer on my machine.
The auto-combining feature of sparsevec makes the code a bit more compact.
Trying to slim the code further, and avoid unnecessary allocations in sparse vector creation, the following can be used:
using SparseArrays, LinearAlgebra
function specialsparsevec(nzs, n)
vals = Vector{Int}(undef, length(nzs))
j, k, count, last = (1, 1, 0, nzs[1])
while k <= length(nzs)
if nzs[k] == last
count += 1
else
vals[j], nzs[j] = (count, last)
count, last = (1, nzs[k])
j += 1
end
k += 1
end
vals[j], nzs[j] = (count, last)
resize!(nzs, j)
resize!(vals, j)
return SparseVector(n, nzs, vals)
end
function kmer_profile_opt3(seq, k, mask)
d = zeros(Int8, 128)
foreach(((i,c),) -> d[Int(c)]=i-1, enumerate(collect("ACGT")))
seq_codes = getindex.(Ref(d), Int8.(collect(seq)))
premult = foldr(
(i,(p,j))->(mask[i] && (p[i]=j ; j<<=2) ; (p,j)),
1:length(mask); init=(zeros(Int64,length(mask)),1)) |> first
kmer_vec = sort(
[ dot(#view(seq_codes[n:n+length(mask)-1]),premult) + 1 for
n in 1:(length(seq)-length(mask)+1)
])
return specialsparsevec(kmer_vec, 4^k)
end
This last version gets another 10% speedup (but is a little cryptic):
julia> #btime kmer_profile_opt($seq, $k, $mask);
367.584 ms (81 allocations: 134.71 MiB) # other answer
julia> #btime kmer_profile_opt3a($seq, $k, $mask);
140.882 ms (22 allocations: 54.36 MiB) # 1st this answer
julia> #btime kmer_profile_opt3($seq, $k, $mask);
127.016 ms (14 allocations: 27.66 MiB) # 2nd this answer
I've stumbled upon the weird way (in my view) that Matlab is dealing with empty matrices. For example, if two empty matrices are multiplied the result is:
zeros(3,0)*zeros(0,3)
ans =
0 0 0
0 0 0
0 0 0
Now, this already took me by surprise, however, a quick search got me to the link above, and I got an explanation of the somewhat twisted logic of why this is happening.
However, nothing prepared me for the following observation. I asked myself, how efficient is this type of multiplication vs just using zeros(n) function, say for the purpose of initialization? I've used timeit to answer this:
f=#() zeros(1000)
timeit(f)
ans =
0.0033
vs:
g=#() zeros(1000,0)*zeros(0,1000)
timeit(g)
ans =
9.2048e-06
Both have the same outcome of 1000x1000 matrix of zeros of class double, but the empty matrix multiplication one is ~350 times faster! (a similar result happens using tic and toc and a loop)
How can this be? are timeit or tic,toc bluffing or have I found a faster way to initialize matrices?
(this was done with matlab 2012a, on a win7-64 machine, intel-i5 650 3.2Ghz...)
EDIT:
After reading your feedback, I have looked more carefully into this peculiarity, and tested on 2 different computers (same matlab ver though 2012a) a code that examine the run time vs the size of matrix n. This is what I get:
The code to generate this used timeit as before, but a loop with tic and toc will look the same. So, for small sizes, zeros(n) is comparable. However, around n=400 there is a jump in performance for the empty matrix multiplication. The code I've used to generate that plot was:
n=unique(round(logspace(0,4,200)));
for k=1:length(n)
f=#() zeros(n(k));
t1(k)=timeit(f);
g=#() zeros(n(k),0)*zeros(0,n(k));
t2(k)=timeit(g);
end
loglog(n,t1,'b',n,t2,'r');
legend('zeros(n)','zeros(n,0)*zeros(0,n)',2);
xlabel('matrix size (n)'); ylabel('time [sec]');
Are any of you experience this too?
EDIT #2:
Incidentally, empty matrix multiplication is not needed to get this effect. One can simply do:
z(n,n)=0;
where n> some threshold matrix size seen in the previous graph, and get the exact efficiency profile as with empty matrix multiplication (again using timeit).
Here's an example where it improves efficiency of a code:
n = 1e4;
clear z1
tic
z1 = zeros( n );
for cc = 1 : n
z1(:,cc)=cc;
end
toc % Elapsed time is 0.445780 seconds.
%%
clear z0
tic
z0 = zeros(n,0)*zeros(0,n);
for cc = 1 : n
z0(:,cc)=cc;
end
toc % Elapsed time is 0.297953 seconds.
However, using z(n,n)=0; instead yields similar results to the zeros(n) case.
This is strange, I am seeing f being faster while g being slower than what you are seeing. But both of them are identical for me. Perhaps a different version of MATLAB ?
>> g = #() zeros(1000, 0) * zeros(0, 1000);
>> f = #() zeros(1000)
f =
#()zeros(1000)
>> timeit(f)
ans =
8.5019e-04
>> timeit(f)
ans =
8.4627e-04
>> timeit(g)
ans =
8.4627e-04
EDIT can you add + 1 for the end of f and g, and see what times you are getting.
EDIT Jan 6, 2013 7:42 EST
I am using a machine remotely, so sorry about the low quality graphs (had to generate them blind).
Machine config:
i7 920. 2.653 GHz. Linux. 12 GB RAM. 8MB cache.
It looks like even the machine I have access to shows this behavior, except at a larger size (somewhere between 1979 and 2073). There is no reason I can think of right now for the empty matrix multiplication to be faster at larger sizes.
I will be investigating a little bit more before coming back.
EDIT Jan 11, 2013
After #EitanT's post, I wanted to do a little bit more of digging. I wrote some C code to see how matlab may be creating a zeros matrix. Here is the c++ code that I used.
int main(int argc, char **argv)
{
for (int i = 1975; i <= 2100; i+=25) {
timer::start();
double *foo = (double *)malloc(i * i * sizeof(double));
for (int k = 0; k < i * i; k++) foo[k] = 0;
double mftime = timer::stop();
free(foo);
timer::start();
double *bar = (double *)malloc(i * i * sizeof(double));
memset(bar, 0, i * i * sizeof(double));
double mmtime = timer::stop();
free(bar);
timer::start();
double *baz = (double *)calloc(i * i, sizeof(double));
double catime = timer::stop();
free(baz);
printf("%d, %lf, %lf, %lf\n", i, mftime, mmtime, catime);
}
}
Here are the results.
$ ./test
1975, 0.013812, 0.013578, 0.003321
2000, 0.014144, 0.013879, 0.003408
2025, 0.014396, 0.014219, 0.003490
2050, 0.014732, 0.013784, 0.000043
2075, 0.015022, 0.014122, 0.000045
2100, 0.014606, 0.014480, 0.000045
As you can see calloc (4th column) seems to be the fastest method. It is also getting significantly faster between 2025 and 2050 (I'd assume it would at around 2048 ?).
Now I went back to matlab to check for the same. Here are the results.
>> test
1975, 0.003296, 0.003297
2000, 0.003377, 0.003385
2025, 0.003465, 0.003464
2050, 0.015987, 0.000019
2075, 0.016373, 0.000019
2100, 0.016762, 0.000020
It looks like both f() and g() are using calloc at smaller sizes (<2048 ?). But at larger sizes f() (zeros(m, n)) starts to use malloc + memset, while g() (zeros(m, 0) * zeros(0, n)) keeps using calloc.
So the divergence is explained by the following
zeros(..) begins to use a different (slower ?) scheme at larger sizes.
calloc also behaves somewhat unexpectedly, leading to an improvement in performance.
This is the behavior on Linux. Can someone do the same experiment on a different machine (and perhaps a different OS) and see if the experiment holds ?
The results might be a bit misleading. When you multiply two empty matrices, the resulting matrix is not immediately "allocated" and "initialized", rather this is postponed until you first use it (sort of like a lazy evaluation).
The same applies when indexing out of bounds to grow a variable, which in the case of numeric arrays fills out any missing entries with zeros (I discuss afterwards the non-numeric case). Of course growing the matrix this way does not overwrite existing elements.
So while it may seem faster, you are just delaying the allocation time until you actually first use the matrix. In the end you'll have similar timings as if you did the allocation from the start.
Example to show this behavior, compared to a few other alternatives:
N = 1000;
clear z
tic, z = zeros(N,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = zeros(N,0)*zeros(0,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z(N,N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = full(spalloc(N,N,0)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z(1:N,1:N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
val = 0;
tic, z = val(ones(N)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = repmat(0, [N N]); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
The result shows that if you sum the elapsed time for both instructions in each case, you end up with similar total timings:
// zeros(N,N)
Elapsed time is 0.004525 seconds.
Elapsed time is 0.000792 seconds.
// zeros(N,0)*zeros(0,N)
Elapsed time is 0.000052 seconds.
Elapsed time is 0.004365 seconds.
// z(N,N) = 0
Elapsed time is 0.000053 seconds.
Elapsed time is 0.004119 seconds.
The other timings were:
// full(spalloc(N,N,0))
Elapsed time is 0.001463 seconds.
Elapsed time is 0.003751 seconds.
// z(1:N,1:N) = 0
Elapsed time is 0.006820 seconds.
Elapsed time is 0.000647 seconds.
// val(ones(N))
Elapsed time is 0.034880 seconds.
Elapsed time is 0.000911 seconds.
// repmat(0, [N N])
Elapsed time is 0.001320 seconds.
Elapsed time is 0.003749 seconds.
These measurements are too small in the milliseconds and might not be very accurate, so you might wanna run these commands in a loop a few thousand times and take the average. Also sometimes running saved M-functions is faster than running scripts or on the command prompt, as certain optimizations only happen that way...
Either way allocation is usually done once, so who cares if it takes an extra 30ms :)
A similar behavior can be seen with cell arrays or arrays of structures. Consider the following example:
N = 1000;
tic, a = cell(N,N); toc
tic, b = repmat({[]}, [N,N]); toc
tic, c{N,N} = []; toc
which gives:
Elapsed time is 0.001245 seconds.
Elapsed time is 0.040698 seconds.
Elapsed time is 0.004846 seconds.
Note that even if they are all equal, they occupy different amount of memory:
>> assert(isequal(a,b,c))
>> whos a b c
Name Size Bytes Class Attributes
a 1000x1000 8000000 cell
b 1000x1000 112000000 cell
c 1000x1000 8000104 cell
In fact the situation is a bit more complicated here, since MATLAB is probably sharing the same empty matrix for all the cells, rather than creating multiple copies.
The cell array a is in fact an array of uninitialized cells (an array of NULL pointers), while b is a cell array where each cell is an empty array [] (internally and because of data sharing, only the first cell b{1} points to [] while all the rest have a reference to the first cell). The final array c is similar to a (uninitialized cells), but with the last one containing an empty numeric matrix [].
I looked around the list of exported C functions from the libmx.dll (using Dependency Walker tool), and I found a few interesting things.
there are undocumented functions for creating uninitialized arrays: mxCreateUninitDoubleMatrix, mxCreateUninitNumericArray, and mxCreateUninitNumericMatrix. In fact there is a submission on the File Exchange makes use of these functions to provide a faster alternative to zeros function.
there exist an undocumented function called mxFastZeros. Googling online, I can see you cross-posted this question on MATLAB Answers as well, with some excellent answers over there. James Tursa (same author of UNINIT from before) gave an example on how to use this undocumented function.
libmx.dll is linked against tbbmalloc.dll shared library. This is Intel TBB scalable memory allocator. This library provides equivalent memory allocation functions (malloc, calloc, free) optimized for parallel applications. Remember that many MATLAB functions are automatically multithreaded, so I wouldn't be surprised if zeros(..) is multithreaded and is using Intel's memory allocator once the matrix size is large enough (here is recent comment by Loren Shure that confirms this fact).
Regarding the last point about the memory allocator, you could write a similar benchmark in C/C++ similar to what #PavanYalamanchili did, and compare the various allocators available. Something like this. Remember that MEX-files have a slightly higher memory management overhead, since MATLAB automatically frees any memory that was allocated in MEX-files using the mxCalloc, mxMalloc, or mxRealloc functions. For what it's worth, it used to be possible to change the internal memory manager in older versions.
EDIT:
Here is a more thorough benchmark to compare the discussed alternatives. It specifically shows that once you stress the use of the entire allocated matrix, all three methods are on equal footing, and the difference is negligible.
function compare_zeros_init()
iter = 100;
for N = 512.*(1:8)
% ZEROS(N,N)
t = zeros(iter,3);
for i=1:iter
clear z
tic, z = zeros(N,N); t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, ZEROS = %.9f\n', N, mean(sum(t,2)))
% z(N,N)=0
t = zeros(iter,3);
for i=1:iter
clear z
tic, z(N,N) = 0; t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, GROW = %.9f\n', N, mean(sum(t,2)))
% ZEROS(N,0)*ZEROS(0,N)
t = zeros(iter,3);
for i=1:iter
clear z
tic, z = zeros(N,0)*zeros(0,N); t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, MULT = %.9f\n\n', N, mean(sum(t,2)))
end
end
Below are the timings averaged over 100 iterations in terms of increasing matrix size. I performed the tests in R2013a.
>> compare_zeros_init
N = 512, ZEROS = 0.001560168
N = 512, GROW = 0.001479991
N = 512, MULT = 0.001457031
N = 1024, ZEROS = 0.005744873
N = 1024, GROW = 0.005352638
N = 1024, MULT = 0.005359236
N = 1536, ZEROS = 0.011950846
N = 1536, GROW = 0.009051589
N = 1536, MULT = 0.008418878
N = 2048, ZEROS = 0.012154002
N = 2048, GROW = 0.010996315
N = 2048, MULT = 0.011002169
N = 2560, ZEROS = 0.017940950
N = 2560, GROW = 0.017641046
N = 2560, MULT = 0.017640323
N = 3072, ZEROS = 0.025657999
N = 3072, GROW = 0.025836506
N = 3072, MULT = 0.051533432
N = 3584, ZEROS = 0.074739924
N = 3584, GROW = 0.070486857
N = 3584, MULT = 0.072822335
N = 4096, ZEROS = 0.098791732
N = 4096, GROW = 0.095849788
N = 4096, MULT = 0.102148452
After doing some research, I've found this article in "Undocumented Matlab", in which Mr. Yair Altman had already come to the conclusion that MathWork's way of preallocating matrices using zeros(M, N) is indeed not the most efficient way.
He timed x = zeros(M,N) vs. clear x, x(M,N) = 0 and found that the latter is ~500 times faster. According to his explanation, the second method simply creates an M-by-N matrix, the elements of which being automatically initialized to 0. The first method however, creates x (with x having automatic zero elements) and then assigns a zero to every element in x again, and that is a redundant operation that takes more time.
In the case of empty matrix multiplication, such as what you've shown in your question, MATLAB expects the product to be an M×N matrix, and therefore it allocates an M×N matrix. Consequently, the output matrix is automatically initialized to zeroes. Since the original matrices are empty, no further calculations are performed, and hence the elements in the output matrix remain unchanged and equal to zero.
Interesting question, apparently there are several ways to 'beat' the built-in zeros function. My only guess as to why this is happening would be that it could be more memory efficient (after all, zeros(LargeNumer) will sooner cause Matlab to hit the memory limit than form a devestating speed bottleneck in most code), or more robust somehow.
Here is another fast allocation method using a sparse matrix, i have added the regular zeros function as a benchmark:
tic; x=zeros(1000,1000); toc
Elapsed time is 0.002863 seconds.
tic; clear x; x(1000,1000)=0; toc
Elapsed time is 0.000282 seconds.
tic; x=full(spalloc(1000,1000,0)); toc
Elapsed time is 0.000273 seconds.
tic; x=spalloc(1000,1000,1000000); toc %Is this the same for practical purposes?
Elapsed time is 0.000281 seconds.
I wrote a function to calculate the joint entropy of each column pair in a matrix. But I would like to increase the performance regarding time and memory.
The function looks like this:
function jointentropy(aln)
mat = Array(Float64,size(aln,2),size(aln,2))
for i in combinations(1:size(aln,2),2)
a = i[1]
b = i[2]
mina, maxa = extrema(aln[:,a])
minb, maxb = extrema(aln[:,b])
h = Array(Float64,(maxa-mina+1,maxb-minb+1))
h = hist2d([aln[:,a] aln[:,b]],mina-1:1:maxa,minb-1:1:maxb)[3]
h = h/size(aln[:,1],1)
I,J,V = findnz(h)
l = sparse(I,J,log2(V),maxa-mina+1,maxb-minb+1)
mat[b,a] = - sum(l.*h)
end
return mat
end
Matrices that go into this function look like this:
rand(45:122,rand(1:2000),rand(1:2000))
An example with a 500x500 matrix resulted in the following #time output:
elapsed time: 33.692081413 seconds (33938843192 bytes allocated, 36.42% gc time)
...which seems to be a whole lot of memory...
Any suggestions on how to speed up this function and reduce memory allocation?
Thanks in advance for any help!
Here are a few ideas to speed up your function.
If the range of all the columns is roughly the same, you can move the extrema computations outside the loop and reuse the same h array.
hist2d creates a new array: you can use hist2d! to reuse the previous one.
The assignment h = h/size(aln[:,1],1) creates a new array.
The division in h = h/size(aln[:,1],1) is done for all the elements of the array, including the zeroes.
You can use a loop instead of findnz and a sparse matrix (findnz already contains a loop).
.
function jointentropy2(aln)
n1 = size(aln,1)
n2 = size(aln,2)
mat = Array(Float64,n2,n2)
lower, upper = extrema(aln)
m = upper-lower+1
h = Array(Float64,(m,m))
for a in 1:n2
for b in (a+1):n2
Base.hist2d!(h,[aln[:,a] aln[:,b]],lower-1:1:upper,lower-1:1:upper)[3]
s = 0
for i in 1:m
for j in 1:m
if h[i,j] != 0
p = h[i,j] / n1
s += p * log2(p)
end
end
end
mat[b,a] = - s
end
end
return mat
end
This is twice as fast as the initial function,
and the memory allocations were divided by 4.
aln = rand(45:122,500,400)
#time x = jointentropy(aln)
# elapsed time: 26.946314168 seconds (21697858752 bytes allocated, 29.97% gc time)
#time y = jointentropy2(aln)
# elapsed time: 13.626282821 seconds (5087119968 bytes allocated, 16.21% gc time)
x - y # approximately zero (at least below the diagonal --
# the matrix was not initialized above it)
The next candidate for optimization is hist2d (here, you could use a loop and a sparse matrix).
#profile jointentropy2(aln)
Profile.print()
I've stumbled upon the weird way (in my view) that Matlab is dealing with empty matrices. For example, if two empty matrices are multiplied the result is:
zeros(3,0)*zeros(0,3)
ans =
0 0 0
0 0 0
0 0 0
Now, this already took me by surprise, however, a quick search got me to the link above, and I got an explanation of the somewhat twisted logic of why this is happening.
However, nothing prepared me for the following observation. I asked myself, how efficient is this type of multiplication vs just using zeros(n) function, say for the purpose of initialization? I've used timeit to answer this:
f=#() zeros(1000)
timeit(f)
ans =
0.0033
vs:
g=#() zeros(1000,0)*zeros(0,1000)
timeit(g)
ans =
9.2048e-06
Both have the same outcome of 1000x1000 matrix of zeros of class double, but the empty matrix multiplication one is ~350 times faster! (a similar result happens using tic and toc and a loop)
How can this be? are timeit or tic,toc bluffing or have I found a faster way to initialize matrices?
(this was done with matlab 2012a, on a win7-64 machine, intel-i5 650 3.2Ghz...)
EDIT:
After reading your feedback, I have looked more carefully into this peculiarity, and tested on 2 different computers (same matlab ver though 2012a) a code that examine the run time vs the size of matrix n. This is what I get:
The code to generate this used timeit as before, but a loop with tic and toc will look the same. So, for small sizes, zeros(n) is comparable. However, around n=400 there is a jump in performance for the empty matrix multiplication. The code I've used to generate that plot was:
n=unique(round(logspace(0,4,200)));
for k=1:length(n)
f=#() zeros(n(k));
t1(k)=timeit(f);
g=#() zeros(n(k),0)*zeros(0,n(k));
t2(k)=timeit(g);
end
loglog(n,t1,'b',n,t2,'r');
legend('zeros(n)','zeros(n,0)*zeros(0,n)',2);
xlabel('matrix size (n)'); ylabel('time [sec]');
Are any of you experience this too?
EDIT #2:
Incidentally, empty matrix multiplication is not needed to get this effect. One can simply do:
z(n,n)=0;
where n> some threshold matrix size seen in the previous graph, and get the exact efficiency profile as with empty matrix multiplication (again using timeit).
Here's an example where it improves efficiency of a code:
n = 1e4;
clear z1
tic
z1 = zeros( n );
for cc = 1 : n
z1(:,cc)=cc;
end
toc % Elapsed time is 0.445780 seconds.
%%
clear z0
tic
z0 = zeros(n,0)*zeros(0,n);
for cc = 1 : n
z0(:,cc)=cc;
end
toc % Elapsed time is 0.297953 seconds.
However, using z(n,n)=0; instead yields similar results to the zeros(n) case.
This is strange, I am seeing f being faster while g being slower than what you are seeing. But both of them are identical for me. Perhaps a different version of MATLAB ?
>> g = #() zeros(1000, 0) * zeros(0, 1000);
>> f = #() zeros(1000)
f =
#()zeros(1000)
>> timeit(f)
ans =
8.5019e-04
>> timeit(f)
ans =
8.4627e-04
>> timeit(g)
ans =
8.4627e-04
EDIT can you add + 1 for the end of f and g, and see what times you are getting.
EDIT Jan 6, 2013 7:42 EST
I am using a machine remotely, so sorry about the low quality graphs (had to generate them blind).
Machine config:
i7 920. 2.653 GHz. Linux. 12 GB RAM. 8MB cache.
It looks like even the machine I have access to shows this behavior, except at a larger size (somewhere between 1979 and 2073). There is no reason I can think of right now for the empty matrix multiplication to be faster at larger sizes.
I will be investigating a little bit more before coming back.
EDIT Jan 11, 2013
After #EitanT's post, I wanted to do a little bit more of digging. I wrote some C code to see how matlab may be creating a zeros matrix. Here is the c++ code that I used.
int main(int argc, char **argv)
{
for (int i = 1975; i <= 2100; i+=25) {
timer::start();
double *foo = (double *)malloc(i * i * sizeof(double));
for (int k = 0; k < i * i; k++) foo[k] = 0;
double mftime = timer::stop();
free(foo);
timer::start();
double *bar = (double *)malloc(i * i * sizeof(double));
memset(bar, 0, i * i * sizeof(double));
double mmtime = timer::stop();
free(bar);
timer::start();
double *baz = (double *)calloc(i * i, sizeof(double));
double catime = timer::stop();
free(baz);
printf("%d, %lf, %lf, %lf\n", i, mftime, mmtime, catime);
}
}
Here are the results.
$ ./test
1975, 0.013812, 0.013578, 0.003321
2000, 0.014144, 0.013879, 0.003408
2025, 0.014396, 0.014219, 0.003490
2050, 0.014732, 0.013784, 0.000043
2075, 0.015022, 0.014122, 0.000045
2100, 0.014606, 0.014480, 0.000045
As you can see calloc (4th column) seems to be the fastest method. It is also getting significantly faster between 2025 and 2050 (I'd assume it would at around 2048 ?).
Now I went back to matlab to check for the same. Here are the results.
>> test
1975, 0.003296, 0.003297
2000, 0.003377, 0.003385
2025, 0.003465, 0.003464
2050, 0.015987, 0.000019
2075, 0.016373, 0.000019
2100, 0.016762, 0.000020
It looks like both f() and g() are using calloc at smaller sizes (<2048 ?). But at larger sizes f() (zeros(m, n)) starts to use malloc + memset, while g() (zeros(m, 0) * zeros(0, n)) keeps using calloc.
So the divergence is explained by the following
zeros(..) begins to use a different (slower ?) scheme at larger sizes.
calloc also behaves somewhat unexpectedly, leading to an improvement in performance.
This is the behavior on Linux. Can someone do the same experiment on a different machine (and perhaps a different OS) and see if the experiment holds ?
The results might be a bit misleading. When you multiply two empty matrices, the resulting matrix is not immediately "allocated" and "initialized", rather this is postponed until you first use it (sort of like a lazy evaluation).
The same applies when indexing out of bounds to grow a variable, which in the case of numeric arrays fills out any missing entries with zeros (I discuss afterwards the non-numeric case). Of course growing the matrix this way does not overwrite existing elements.
So while it may seem faster, you are just delaying the allocation time until you actually first use the matrix. In the end you'll have similar timings as if you did the allocation from the start.
Example to show this behavior, compared to a few other alternatives:
N = 1000;
clear z
tic, z = zeros(N,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = zeros(N,0)*zeros(0,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z(N,N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = full(spalloc(N,N,0)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z(1:N,1:N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
val = 0;
tic, z = val(ones(N)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
clear z
tic, z = repmat(0, [N N]); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))
The result shows that if you sum the elapsed time for both instructions in each case, you end up with similar total timings:
// zeros(N,N)
Elapsed time is 0.004525 seconds.
Elapsed time is 0.000792 seconds.
// zeros(N,0)*zeros(0,N)
Elapsed time is 0.000052 seconds.
Elapsed time is 0.004365 seconds.
// z(N,N) = 0
Elapsed time is 0.000053 seconds.
Elapsed time is 0.004119 seconds.
The other timings were:
// full(spalloc(N,N,0))
Elapsed time is 0.001463 seconds.
Elapsed time is 0.003751 seconds.
// z(1:N,1:N) = 0
Elapsed time is 0.006820 seconds.
Elapsed time is 0.000647 seconds.
// val(ones(N))
Elapsed time is 0.034880 seconds.
Elapsed time is 0.000911 seconds.
// repmat(0, [N N])
Elapsed time is 0.001320 seconds.
Elapsed time is 0.003749 seconds.
These measurements are too small in the milliseconds and might not be very accurate, so you might wanna run these commands in a loop a few thousand times and take the average. Also sometimes running saved M-functions is faster than running scripts or on the command prompt, as certain optimizations only happen that way...
Either way allocation is usually done once, so who cares if it takes an extra 30ms :)
A similar behavior can be seen with cell arrays or arrays of structures. Consider the following example:
N = 1000;
tic, a = cell(N,N); toc
tic, b = repmat({[]}, [N,N]); toc
tic, c{N,N} = []; toc
which gives:
Elapsed time is 0.001245 seconds.
Elapsed time is 0.040698 seconds.
Elapsed time is 0.004846 seconds.
Note that even if they are all equal, they occupy different amount of memory:
>> assert(isequal(a,b,c))
>> whos a b c
Name Size Bytes Class Attributes
a 1000x1000 8000000 cell
b 1000x1000 112000000 cell
c 1000x1000 8000104 cell
In fact the situation is a bit more complicated here, since MATLAB is probably sharing the same empty matrix for all the cells, rather than creating multiple copies.
The cell array a is in fact an array of uninitialized cells (an array of NULL pointers), while b is a cell array where each cell is an empty array [] (internally and because of data sharing, only the first cell b{1} points to [] while all the rest have a reference to the first cell). The final array c is similar to a (uninitialized cells), but with the last one containing an empty numeric matrix [].
I looked around the list of exported C functions from the libmx.dll (using Dependency Walker tool), and I found a few interesting things.
there are undocumented functions for creating uninitialized arrays: mxCreateUninitDoubleMatrix, mxCreateUninitNumericArray, and mxCreateUninitNumericMatrix. In fact there is a submission on the File Exchange makes use of these functions to provide a faster alternative to zeros function.
there exist an undocumented function called mxFastZeros. Googling online, I can see you cross-posted this question on MATLAB Answers as well, with some excellent answers over there. James Tursa (same author of UNINIT from before) gave an example on how to use this undocumented function.
libmx.dll is linked against tbbmalloc.dll shared library. This is Intel TBB scalable memory allocator. This library provides equivalent memory allocation functions (malloc, calloc, free) optimized for parallel applications. Remember that many MATLAB functions are automatically multithreaded, so I wouldn't be surprised if zeros(..) is multithreaded and is using Intel's memory allocator once the matrix size is large enough (here is recent comment by Loren Shure that confirms this fact).
Regarding the last point about the memory allocator, you could write a similar benchmark in C/C++ similar to what #PavanYalamanchili did, and compare the various allocators available. Something like this. Remember that MEX-files have a slightly higher memory management overhead, since MATLAB automatically frees any memory that was allocated in MEX-files using the mxCalloc, mxMalloc, or mxRealloc functions. For what it's worth, it used to be possible to change the internal memory manager in older versions.
EDIT:
Here is a more thorough benchmark to compare the discussed alternatives. It specifically shows that once you stress the use of the entire allocated matrix, all three methods are on equal footing, and the difference is negligible.
function compare_zeros_init()
iter = 100;
for N = 512.*(1:8)
% ZEROS(N,N)
t = zeros(iter,3);
for i=1:iter
clear z
tic, z = zeros(N,N); t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, ZEROS = %.9f\n', N, mean(sum(t,2)))
% z(N,N)=0
t = zeros(iter,3);
for i=1:iter
clear z
tic, z(N,N) = 0; t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, GROW = %.9f\n', N, mean(sum(t,2)))
% ZEROS(N,0)*ZEROS(0,N)
t = zeros(iter,3);
for i=1:iter
clear z
tic, z = zeros(N,0)*zeros(0,N); t(i,1) = toc;
tic, z(:) = 9; t(i,2) = toc;
tic, z = z + 1; t(i,3) = toc;
end
fprintf('N = %4d, MULT = %.9f\n\n', N, mean(sum(t,2)))
end
end
Below are the timings averaged over 100 iterations in terms of increasing matrix size. I performed the tests in R2013a.
>> compare_zeros_init
N = 512, ZEROS = 0.001560168
N = 512, GROW = 0.001479991
N = 512, MULT = 0.001457031
N = 1024, ZEROS = 0.005744873
N = 1024, GROW = 0.005352638
N = 1024, MULT = 0.005359236
N = 1536, ZEROS = 0.011950846
N = 1536, GROW = 0.009051589
N = 1536, MULT = 0.008418878
N = 2048, ZEROS = 0.012154002
N = 2048, GROW = 0.010996315
N = 2048, MULT = 0.011002169
N = 2560, ZEROS = 0.017940950
N = 2560, GROW = 0.017641046
N = 2560, MULT = 0.017640323
N = 3072, ZEROS = 0.025657999
N = 3072, GROW = 0.025836506
N = 3072, MULT = 0.051533432
N = 3584, ZEROS = 0.074739924
N = 3584, GROW = 0.070486857
N = 3584, MULT = 0.072822335
N = 4096, ZEROS = 0.098791732
N = 4096, GROW = 0.095849788
N = 4096, MULT = 0.102148452
After doing some research, I've found this article in "Undocumented Matlab", in which Mr. Yair Altman had already come to the conclusion that MathWork's way of preallocating matrices using zeros(M, N) is indeed not the most efficient way.
He timed x = zeros(M,N) vs. clear x, x(M,N) = 0 and found that the latter is ~500 times faster. According to his explanation, the second method simply creates an M-by-N matrix, the elements of which being automatically initialized to 0. The first method however, creates x (with x having automatic zero elements) and then assigns a zero to every element in x again, and that is a redundant operation that takes more time.
In the case of empty matrix multiplication, such as what you've shown in your question, MATLAB expects the product to be an M×N matrix, and therefore it allocates an M×N matrix. Consequently, the output matrix is automatically initialized to zeroes. Since the original matrices are empty, no further calculations are performed, and hence the elements in the output matrix remain unchanged and equal to zero.
Interesting question, apparently there are several ways to 'beat' the built-in zeros function. My only guess as to why this is happening would be that it could be more memory efficient (after all, zeros(LargeNumer) will sooner cause Matlab to hit the memory limit than form a devestating speed bottleneck in most code), or more robust somehow.
Here is another fast allocation method using a sparse matrix, i have added the regular zeros function as a benchmark:
tic; x=zeros(1000,1000); toc
Elapsed time is 0.002863 seconds.
tic; clear x; x(1000,1000)=0; toc
Elapsed time is 0.000282 seconds.
tic; x=full(spalloc(1000,1000,0)); toc
Elapsed time is 0.000273 seconds.
tic; x=spalloc(1000,1000,1000000); toc %Is this the same for practical purposes?
Elapsed time is 0.000281 seconds.
I have 2 input variables:
a vector of p-values (p) with N elements (unsorted)
and N x M matrix with p-values obtained by random permutations (pr) with M iterations. N is quite large, 10K to 100K or more. M let's say 100.
I'm estimating the False Discovery Rate (FDR) for each element of p representing how many p-values from random permutations will pass if the current p-value (from p) will be the threshold.
I wrote the function with ARRAYFUN, but it takes lot of time for large N (2 min for N=20K), comparable to for-loop.
function pfdr = fdr_from_random_permutations(p, pr)
%# ... skipping arguments checks
pfdr = arrayfun( #(x) mean(sum(pr<=x))./sum(p<=x), p);
Any ideas how to make it faster?
Comments about statistical issues here are also welcome.
The test data can be generated as p = rand(N,1); pr = rand(N,M);.
Well, the trick was indeed sorting the vectors. I give credit to #EgonGeerardyn for that. Also, there is no need to use mean. You can just divide everything afterwards by M. When p is sorted, finding the amount of values that are less than current x, is just a running index. pr is a more interesting case - I used a running index called place to discover how many elements are less than x.
Edit(2): Here is the fastest version I come up with:
function Speedup2()
N = 10000/4 ;
M = 100/4 ;
p = rand(N,1); pr = rand(N,M);
tic
pfdr = arrayfun( #(x) mean(sum(pr<=x))./sum(p<=x), p);
toc
tic
out = zeros(numel(p),1);
[p,sortIndex] = sort(p);
pr = sort(pr(:));
pr(end+1) = Inf;
place = 1;
N = numel(pr);
for i=1:numel(p)
x = p(i);
while pr(place)<=x
place = place+1;
end
exp1a = place-1;
exp2 = i;
out(i) = exp1a/exp2;
end
out(sortIndex) = out/ M;
toc
disp(max(abs(pfdr-out)));
end
And the benchmark results for N = 10000/4 ; M = 100/4 :
Elapsed time is 0.898689 seconds.
Elapsed time is 0.007697 seconds.
2.220446049250313e-016
and for N = 10000 ; M = 100 ;
Elapsed time is 39.730695 seconds.
Elapsed time is 0.088870 seconds.
2.220446049250313e-016
First of all, tr to analyze this using the profiler. Profiling should ALWAYS be the first step when trying to improve performance. We can all guess at what is causing your performance drop, but the only way to be sure and focus on the right part is to inspect the profiler report.
I didn't run the profiler on your code, as I don't want to generate test data to do so; but I have some ideas about what work is being carried out in vain. In your function mean(sum(pr<=x))./sum(p<=x), you are repeatedly summing over p<=x. All in all, one call includes N comparisons and N-1 summations. So for both, you have behavior that is quadratic in N when all N values of p are calculated.
If you step through a sorted version of p, you need less calculations and comparisons, as you can keep track of a running sum (i.e. behavior that is linear in N). I guess a similar method could be applied to the other part of the calculation.
edit:
The implementation of my idea as expressed above:
function pfdr = fdr(p,pr)
[N, M] = size(pr);
[p, idxP] = sort(p);
[pr] = sort(pr(:));
pfdr = NaN(N,1);
parfor iP = 1:N
x = p(iP);
m = sum(pr<=x)/M;
pfdr(iP) = m/iP;
end
pfdr(idxP) = pfdr;
If you have access to the parallel computing toolbox, the parfor loop will allow you to gain some performance. I used two basic ideas: mean(sum(pr<=x)) is actually equal to sum(pr(:)<=x)/M. On the other hand, since p is sorted, this allows you to just take the index as the number of elements (in the assumption that every element is unique, otherwise you'll have to work with unique to do the full rigorous analysis).
As you should already know very well by running the profiler yourself, the line m = sum(pr<=x)/M; is the main resource hog. This can be tackled similarly to p by making use of the sorted nature of pr.
I tested my code (both for identical results and for time consumption) against yours. For N=20e3; M=100, I get about 63 seconds to run your code and 43 seconds to run mine on my main computer (MATLAB 2011a on 64 bit Arch Linux, 8 GiB RAM, Core i7 860). For smaller values of M the gain is larger. But this gain is in part due to parallelization.
edit2: Apparently, I came to very similar results as Andrey, my result would have been very similar had I pursued the same approach.
However, I realised that there are some built-in functions that do more or less what you need, i.e. quite similar to determining the empirical cumulative density function. And this can be done by constructing the histogram:
function pfdr = fdr(p,pr)
[N, M] = size(pr);
[p, idxP] = sort(p);
count = histc(pr(:), [0; p]);
count = cumsum(count(1:N));
pfdr = count./(1:N).';
pfdr(idxP) = pfdr/M;
For the same M and N as above, this code takes 228 milliseconds on my computer. It takes 104 milliseconds for Andrey's parameters, so on my computer it turns out a bit slower, but I think this code is far more readable than intricate for loops (as was the case in both our examples).
Following the discussion between me and Andrey in this question, this very late answer is just to prove to Andrey that vectorized solutions are still faster than JIT'ed loops, they sometimes just aren't as easy to find.
I am more than willing to remove this answer if it is deemed inappropriate by the OP.
Now, on to business, here's the original arrayfun, looped version by Andrey, and vectorized version by Egon:
function test
clc
N = 10000/4 ;
M = 100/4 ;
p = rand(N,1);
pr = rand(N,M);
%% first option
tic
pfdr = arrayfun( #(x) mean(sum(pr<=x))./sum(p<=x), p);
toc
%% second option
tic
out = zeros(numel(p),1);
[p2,sortIndex] = sort(p);
pr2 = sort(pr(:));
pr2(end+1) = Inf;
place = 1;
for i=1:numel(p2)
x = p2(i);
while pr2(place)<=x
place = place+1;
end
exp1a = place-1;
exp2 = i;
out(i) = exp1a/exp2;
end
out(sortIndex) = out/ M;
toc
%% third option
tic
[p2,sortIndex] = sort(p);
count = histc(pr2(:), [0; p2]);
count = cumsum(count(1:N));
out = count./(1:N).';
out(sortIndex) = out/M;
toc
end
Results on my laptop:
Elapsed time is 0.916196 seconds.
Elapsed time is 0.011429 seconds.
Elapsed time is 0.007328 seconds.
and for N=1000; M = 100; :
Elapsed time is 38.082718 seconds.
Elapsed time is 0.127052 seconds.
Elapsed time is 0.042686 seconds.
So: vectorized is 2-3 times faster.