I have a big matrix A holding 1 GB of double values. When I reshape it to different dimensions, it is incredibly fast:
A=rand(128,1024,1024);
tic;B=reshape(A,1024,128,1024);toc
Elapsed time is 0.000011 seconds.
How can it be that fast? Another observation: after running that code, MATLAB reports less memory than two 1 GB matrices should require: Memory used by MATLAB: 1878 MB (1.969e+09 bytes)
Explanation of the good performance
MATLAB uses copy-on-write whenever possible. If you write an expression like B=A, MATLAB does not copy A; instead, both variables A and B are references to the same data structure. Only when one of the two variables is modified does MATLAB create a copy.
Now to the special case of reshape. Here A and B look different, but in memory they are the same: the underlying array which holds the data is unaffected by the reshape operation, and nothing has to be moved: all(A(:)==B(:)). All MATLAB has to do when calling reshape is to create a new reference to the input data and annotate it with the new dimensions of the matrix. The runtime of reshape is less than 1 µs, or roughly the time two simple assignments like B=A require. For all practical purposes it is a zero-time operation.
>> tic;for i=1:1000;B=reshape(A,1024,128,1024);end;toc
Elapsed time is 0.000724 seconds.
>> tic;for i=1:1000;B=A;end;toc
Elapsed time is 0.000307 seconds.
It is unknown how large such a reference really is, but we can assume it to be within a few bytes.
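One rough way to see the copy-on-write mechanism directly: the reshape itself stays in the microsecond range, but the first write into B forces MATLAB to materialize a real copy of the full 1 GB array. A sketch (absolute timings will vary by machine):
A = rand(128,1024,1024);                  % ~1 GB of doubles
tic; B = reshape(A,1024,128,1024); toc    % microseconds: B still shares A's data
tic; B(1) = 0; toc                        % the actual 1 GB copy happens here, taking a noticeable fraction of a second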
Other zero-cost operations
Functions known to have practically zero cost in both runtime and memory (a quick timing sketch follows the footnotes below):
B=reshape(A,sz)
B=A(:)
B=A.' - only for vectors
B=A' - only for vectors of real numbers, without the complex attribute; use .' instead
B=permute(A,p) - only for the cases where all(A(:)==B(:)) [1]
B=ipermute(A,p) - only for the cases where all(A(:)==B(:)) [1]
B=squeeze(A) [1]
shiftdim - only for the cases where all(A(:)==B(:)) [1], which are:
used to remove leading singleton dimensions,
used with a negative second input, or
used without a second input argument.
Functions which are "expensive", even though they do not change the representation in memory (all(A(:)==B(:)) is true):
Left-sided indexing: B(1:numel(A))=A; [2]
Right-sided indexing other than (:), including B=A(1:end); and B=A(:,:,:); [2]
[1] Significantly slower runtime than reshape: between 1 µs and 1 ms, probably because of some constant computational overhead. Memory consumption is practically zero, and the runtime is independent of the input size. Operations without this annotation have a runtime below 1 µs, roughly equivalent to reshape.
[2] Zero cost in Octave.
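A quick way to check a few of the entries above (a sketch only; absolute numbers depend on the machine and the MATLAB release, and newer releases may optimize some of the "expensive" cases):
A = rand(128,1024,1024);                                    % ~1 GB of doubles
tic; for i=1:1000; B = reshape(A,1024,128,1024); end; toc   % zero cost
tic; for i=1:1000; B = A(:); end; toc                       % zero cost
tic; for i=1:1000; B = permute(A,[1 2 3]); end; toc         % cheap: all(A(:)==B(:)) holds
tic; B = A(:,:,:); toc                                      % "expensive": may copy the full array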
I originally used MATLAB 2013b when writing this post and later confirmed the numbers with MATLAB 2019b.
Related
Question from Julia Discourse
I’m using Julia 1.2. This is my test:
using BenchmarkTools  # for @btime
a = rand(1000, 1000)
b = adjoint(a)
c = copy(b)
@btime a * x setup=(x=rand(1000)) # 114.757 μs
@btime b * x setup=(x=rand(1000)) # 94.179 μs
@btime c * x setup=(x=rand(1000)) # 110.325 μs
I was expecting a and c to be, at the very least, not slower.
After inspecting stdlib/LinearAlgebra/src/matmul.jl, it turns out that Julia passes b.parent (i.e. a) to BLAS.gemv, not b, and instead switches BLAS's dgemv_ into a different and apparently faster mode.
Am I correct in assuming that the speedup comes from the fact that the memory is aligned in a more favorable way for whatever dgemv_ does, when it’s in a trans = T mode? If so, then I’m guessing this isn’t actionable, besides possibly mentioning the gotcha in the docs somehow. If my assumption is wrong though, is there something to be done about this?
Answer from @stevengj in the same Discourse thread:
Am I correct in assuming that the speedup comes from the fact that the memory is aligned in a more favorable way for whatever dgemv_ does, when it’s in a trans = T mode?
Close. It does have to do with memory, but it’s about locality, not alignment. The basic thing to understand is that it is more efficient to access consecutive (or at least nearby) data from memory than data that is separated, due to the existence of cache lines. (Consecutive access also has some advantages in utilizing SIMD instructions.)
Julia stores matrices in column-major order, so that the columns are contiguous in memory. When you multiply a transposed matrix (that has not been copied) by a vector, therefore, it can compute it as the dot product of the contiguous column (= transposed row) with the contiguous vector, which has good spatial locality and therefore utilizes cache lines efficiently.
For multiplying a non-transposed matrix by a vector, in contrast, you are taking the dot products of non-contiguous rows of the matrix with the vector, and it is harder to efficiently utilize cache lines. To improve spatial locality in this case, an optimized BLAS like OpenBLAS actually computes the dot products of several rows at a time (a "block") with the vector, I believe; that's why it's only 10% slower and not much worse. (In fact, even the transposed case may do some blocking to keep the vector in cache.)
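The same locality effect is easy to reproduce in MATLAB (the language used elsewhere in this post), which also stores arrays column-major. A rough sketch with an arbitrary size; on most machines the strided row accesses are noticeably slower:
A = rand(5000);                                                   % 5000 x 5000 doubles
tic; s = 0; for j = 1:size(A,2); s = s + sum(A(:,j)); end; toc    % contiguous columns
tic; s = 0; for i = 1:size(A,1); s = s + sum(A(i,:)); end; toc    % strided rows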
I have to multiply two very large (~2000 x 2000) dense matrices whose entries are floats with arbitrary precision (I am using GMP and the precision is currently set to 600). I was wondering if there is any CUDA library that supports arbitrary-precision arithmetic. The only library I have found is called CAMPARY; however, it seems to be missing references to some of the functions it uses.
The other solution that I was thinking about was implementing a version of the Karatsuba algorithm for multiplying matrices with arbitrary precision entries. The end step of the algorithm would just be multiplying matrices of doubles, which could be done very efficiently using cuBLAS. Is there any similar implementation already out there?
Since nobody has suggested such a library so far, let's assume that one doesn't exist.
You could always implement the naive approach:
One grid thread for each pair of coordinates in the output matrix.
Each thread performs an inner product of a row and a column in the input matrices.
Individual element operations will use the code taken from the GMP (hopefully not much more than copy-and-paste).
But you can also do better than this - just like you can do better for regular-float matrix multiplication. Here's my idea (likely not the best of course):
Consider the worked example of matrix multiplication using shared memory in the CUDA C Programming Guide. It suggests putting small submatrices in shared memory. You can still do this - but you need to be careful with shared memory sizes (they're small...):
A typical GPU today has 64 KB of shared memory usable per grid block (or more).
The worked example uses 16 x 16 submatrices, i.e. 256 elements each.
Times 2 (for the two multiplicand tiles): 512 elements.
Times ceil(801/8) = 101 bytes per element (assuming the GMP representation uses 600 bits for the mantissa, one bit for the sign and 200 bits for the exponent).
So 512 * 101 = 51,712 bytes < 64 KB!
That means you can probably just use the code in their worked example as-is, again replacing the float multiplication and addition with code from GMP.
You may then want to consider something like parallelizing the GMP code itself, i.e. using multiple threads to work together on single pairs of 600-bit-precision numbers. That would likely help your shared memory reading pattern. Alternatively, you could interleave the placement of 4-byte sequences from the representation of your elements, in shared memory, for the same effect.
I realize this is a bit hand-wavy, but I'm pretty certain I've waved my hands correctly and it would be a "simple matter of coding".
I'm a J newbie, and am trying to import one of my large datasets for further experimentation. It is a 2D matrix of doubles, approximately 80000x50000. So far, I have found two different methods to load data into J.
The first is to convert the data into J format (replacing negatives with underscores, putting exponential notation numbers into J format, etc) and then load with (adapted from J: Handy method to enter a matrix?):
(".;._2) fread 'path/to/file'
The second method is to use tables/dsv.
I am experiencing the same problem with both methods: namely, that these methods work with small matrices, but fail at approximately 10M values. It seems the input just gets truncated to some arbitrary limit. How can I load matrices of arbitrary size? If I have to convert to some binary format, that's OK, as long as there is a description of the format somewhere.
I should add that this is a 64-bit system and build of J, and I can successfully create a matrix of random numbers of the appropriate size, so it doesn't seem to be a limitation on matrix size per se but only during I/O.
Thanks!
EDIT: I did not find exactly what was causing this, but thanks to Dane I found a workaround using JMF (the 'data/jmf' addon). It turns out that JMF is just straight binary data with no header, and native (?) or little-endian data can be mapped directly with JFL map_jmf_ 'x';'whatever.bin'
You're running out of memory. A quick test to see how much space integers take up yields the following:
7!:2 'i. 80000 5000'
8589936256
That is, an 80,000 by 5,000 matrix of integers requires 8 GB of memory. Your 80,000 by 50,000 matrix, if it were of integers, would require approximately 80 GB of memory.
Your next question should be about performing array or matrix operations on a matrix too big to load into memory.
I would like to introduce an interesting MATLAB programming problem I’ve encountered in my research. The solution may be of use to people doing computations on very large data sets. It involves striking a balance between RAM and CPU usage using parfor. Because my data is so large, files must be read in over and over again to be processed. The other issue it introduces is finding an optimal algorithm for multiplication, summation and averaging of very large vectors and matrices.
I have found a solution, but it's time intensive and I would like to see if the community sees any room for improvement. Here's the general form of the problem.
Suppose we have about 30,000 functions that we’ve taken the Fourier transforms of. Each transform has the form e(k)=a(k)+b(k)*i where k is the magnitude of a wavevector, a is the real component and b is the imaginary component. Each of these transforms is saved to file as a 2-column table with the structure below. The number of elements in each vector is about N=1e6. This means that each of these files is 1/64 GB in size. Note that the values of k_i are not necessarily in order.
k | Re(e) Im(e)
k_1 | a(1) b(1)
k_2 | a(2) b(2)
... |
k_N | a(N) b(N)
The goal is to cross-multiply each pair of modes and average the results over a set of about 50 fixed k-bands. So for example, if we let the elements of vectors 5 and 7 be represented respectively as e5=a5+b5*i and e7=a7+b7*i we need
a5(1)*a7(1) + b5(1)*b7(1)
a5(2)*a7(2) + b5(2)*b7(2)
...
a5(N)*a7(N) + b5(N)*b7(N)
Each element of the above N-dimensional vector belongs within a single k-bin. All the elements in each bin must be averaged and returned. So at the end of one mode-mode comparison we end up with just 50 numbers.
I have at my disposal a computer with 512GB of RAM and 48 processors. My version of MATLAB 2013a limits me to only opening 12 processors with parfor simultaneously on each instance of MATLAB that I run. So what I’ve been doing is opening 4 versions of MATLAB, allocating 10 processors each, and sending the maximum amount of data to each processor without spilling over my self-imposed limit of 450 GB.
Naturally this involves my breaking the problem into blocks. If I have 30000 vectors e, then I will have 30000^2 sets of these 50 cross-coefficients. (It’s actually about half this since the problem is symmetric). I decide to break my problem into blocks of size m=300. This means I’d have 100 rows and columns of these blocks. The code I’m running looks like this (I’m simplifying a bit to just include the relevant bits):
for ii=1:100 % for each of the block-rows
[a,b] = f_ReadModes(ii); % this function reads modes 1 through 300 if ii=1,
% modes 301 through 600 if ii=2 and so on
% “a” and “b” are matrices of size Nx300
% “a” contains the real vectors stored in columns,
% “b” contains the imaginary vectors stored in columns
for jj=ii:100 % for each of the block-columns in the upper triangle
[c,d] = f_ReadModes(jj); % same as above except this reads in a different
% set of 300 modes
block = zeros(300,300,50); % allocates space for the results. The first
% dimension corresponds to the "ii modes".
% The 2nd dimension corresponds to the “jj”
% modes. The 3rd dimension is for each k-bin.
parfor rr=1:300
A = zeros(1,300,50); % temporary storage to keep parfor happy
ModeMult = bsxfun(@times,a(:,rr),c)+bsxfun(@times,b(:,rr),d);
% My strategy is to cross multiply one mode by all the others before
% moving on. So when rr=6, I’m multiplying the 6th mode in a (and b)
% by all the modes in c and d. This would help fill in the 6th row
% of block, i.e. block(6,:,:).
for kk=1:50 % Now I average the results in each of the k-bins
ind_dk = f_kbins(kk); % this returns the rows of a,b,c,d and
% ModeMult that lie within the kk^th bin
A(1,:,kk) = mean(ModeMult(ind_dk,:)); % average results in each bin
end
block(rr,:,:) = A; % place the results in more permanent storage.
end
f_WriteBlock(block); % writes the resulting cross-coefficient block to disk
end
end
There are three bottlenecks in this problem:
1) Read-in time
2) Computing the products ac and bd then summing them (this is ModeMult)
3) Averaging the results of step 2 in each k-bin
Bigger blocks are preferable since they necessitate fewer read-ins. However, the computations in steps 2 and 3 don't automatically parallelize, so they have to be sent to individual processors using parfor. Because the computational costs are high, utilizing all the processors seems necessary.
The way my code is written, each processor needs enough memory to hold 4*N*m elements. When m=300, this means each processor is using about 10 GB of RAM. It would be nice if the memory requirement for each processor could be lowered somehow. It would also be great if the computations in steps 2 and 3 could be rewritten to run more efficiently. Any thoughts?
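One possible direction for steps 2 and 3 (a sketch only, assuming f_kbins returns the row indices of each bin exactly as in the code above): because the bin average of the cross-products is linear, a whole 300 x 300 block can be assembled from one matrix product per k-bin, which lets MATLAB's multithreaded BLAS do the work instead of parfor and removes the temporary A array:
block = zeros(300,300,50);
for kk = 1:50
ind_dk = f_kbins(kk); % rows of a,b,c,d that lie within the kk^th bin
% mean over the bin of a(:,rr).*c + b(:,rr).*d, for all rr and jj at once
block(:,:,kk) = (a(ind_dk,:).' * c(ind_dk,:) + b(ind_dk,:).' * d(ind_dk,:)) / numel(ind_dk);
end
f_WriteBlock(block);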
I have many large (1 GB+) matrices of doubles (floats), many of whose entries are 0.0, that need to be stored efficiently. I intend to keep the double type since some of the elements do need to be doubles (but I can consider changing this if it could lead to a significant space saving). A string header is optional. The matrices have no missing elements, NaNs, NAs, nulls, etc.: they are all doubles.
Some columns will be sparse, others will not be. The proportion of columns that are sparse will vary from file to file.
What is a space-efficient alternative to CSV? For my use, I need to parse this matrix quickly into R, Python and Java, so a file format specific to a single language is not appropriate. Access may need to be by row or column.
I am also not looking for a commercial solution.
My main objective is to save HDD space without blowing out I/O times. RAM usage once imported is not the primary consideration.
The most important question is whether you always expand the whole matrix into memory or whether you need random access to the compacted form (and how). Expanding is far simpler, so I'm concentrating on that.
You could use a bitmap indicating whether each number is present or zero. This costs 1 bit per entry and thus increases the file size by 1/64 in the case of no zeros, or shrinks it to 1/64 in the case of all zeros. If there are runs of zeros, you may instead store the number of following zeros and the number of non-zeros, e.g., by packing two 4-bit counts into one byte.
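As a rough illustration of that layout (sketched in MATLAB, the language used elsewhere in this post, only to make the byte layout concrete; the header fields and file name are made up, and the resulting bytes are straightforward to read back from R, Python or Java):
M = full(sprand(1000, 1000, 0.1));         % example matrix with ~10% nonzeros
fid = fopen('matrix.bin', 'w', 'ieee-le'); % fix a little-endian layout
fwrite(fid, size(M), 'uint64');            % header: rows, cols
mask = (M(:) ~= 0);
fwrite(fid, mask, 'ubit1');                % presence bitmap, 1 bit per entry, column-major
fwrite(fid, M(mask), 'double');            % only the values that are actually nonzero
fclose(fid);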
As the double representation is standard, you can use a binary representation in all of these languages. If many of your numbers are actually ints, you may consider something like what I did.
If consecutive numbers are related, you could consider storing their differences.
I intend to keep the double type since some of the elements do need to be doubles (but I can consider changing this if it could lead to a significant space saving).
Obviously, switching to float would halve the memory at the cost of precision. This is probably too imprecise, so you could instead omit a few bits from the mantissa and get, e.g., 6 bytes per entry. Alternatively, you could reduce the exponent to a single byte, as the range 1e-38 to 3e38 should suffice.
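For example, dropping the two least-significant mantissa bytes of each IEEE double leaves roughly 10-11 significant digits at 6 bytes per entry. A minimal MATLAB sketch of the idea (assuming a little-endian machine):
x = rand(1, 4);
raw = reshape(typecast(x, 'uint8'), 8, []);        % 8 bytes per double, low-order bytes first
packed = raw(3:8, :);                              % keep sign, exponent and high mantissa bits: 6 bytes per entry
restored = [zeros(2, size(packed, 2), 'uint8'); packed];
y = typecast(reshape(restored, 1, []), 'double');  % zero-fill the dropped bytes on read-back
max(abs(x - y) ./ abs(x))                          % relative error on the order of 1e-11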