Performance of sum with mask vs. do-loop

Is the intrinsic function sum with masks slower than an old-school do-loop?
I am writing a program for a financial application in Fortran 90.
There is a user-defined-type, which lumps together some 50 properties of a
single bond, like book-value, time-value, interest-rates, which are of atomic type integer, real and character.
type portfolioElementType
    real(8) :: timeValue
    real(8) :: bookValue
    real(8) :: maturityInYears
    ! ... many, many more
end type portfolioElementType
The whole portfolio consists of a 2-dimensional matrix with 3000 times 10 elements of type portfolioElementType
type(portfolioElementType), dimension(:,:), pointer :: portfolio
allocate(portfolio(1:3000,1:10))
I want to aggregate the portfolio by maturityInYears. In SQL you would call it a group-by-statement.
Currently it is done by
do bond = 1, numberOfBonds
    do part = 1, numberOfParts
        if (portfolio(bond,part)%maturityInYears .eq. 10) then
            sum10Years = sum10Years + portfolio(bond,part)%bookValue
        end if
    end do
end do
Purely for aesthetic reasons, I would prefer to write:
sum10Years = sum(portfolio%bookValue, mask = portfolio%maturityInYears .eq. 10)
Both options give the same result, but the sum approach seems to be slower than the do-loop. This surprises me: shouldn't summing some 30 000 elements be almost effortless?
My hypothesis is that the "large" user-defined type is slowing things down.
I don't know how Fortran lays out its data structures in memory, but I guess that the
"bookValues" are located "far" from each other in memory?
Would it be advisable to replace the matrix of a "large" user-defined type with a user-defined type made of "big" matrices, that is:
portfolio(a,b)%bookValue versus
bigPortfolio%bookValue(a,b)
with
type bigPortfolioType
    real(8), dimension(3000,10) :: bookValue
    ! ... many more matrices
end type bigPortfolioType
type(bigPortfolioType) :: bigPortfolio

Related

How to output 2 or more arrays in a fortran's function?

I am writing a program which computes the LU decomposition of a matrix, with partial pivoting, and I would like the function to output several (2 or 3) matrices at once. Running the program several times to output each one individually is a waste of time, since a single run already computes everything I want. Is there a way of doing this? For example, here is my function using Doolittle's algorithm, for square matrices which don't need pivoting. I want the output to be both matrices l and u, but I know no means of doing that.
function lu_d(aa) result(l)
real, dimension (:,:) :: aa !input matrix
real, dimension (size(aa,1), size(aa,2)) :: a !keeping input variable intact
real, dimension (size(a,1), size(a,2)) :: l , u !lower and upper matrices
integer :: i,j,k !index
real :: s !auxiliar variable
a=aa
do j=1 , size(a,2)
u(1,j)=a(1,j)
end do
l(1,1)=1
do j=2, size(a,2)
l(1,j)=0
end do
do i=2, size(a,1)
l(i,1)=a(i,1)/u(1,1)
u(i,1)=0
do j=2, i-1
s=0
u(i,j)=0
do k=1, j-1
s=s+l(i,k)*u(k,j)
end do
l(i,j)=(a(i,j)-s)/u(j,j)
end do
l(i,i)=1
do j=i, size(a,2)
s=0
if (j > i) l(i,j)=0 ! leave the unit diagonal l(i,i)=1 set above untouched
do k=1, i-1
s=s+l(i,k)*u(k,j)
end do
u(i,j)=a(i,j)-s
end do
end do
end function
You could switch from a function to a subroutine. That way you can return values for multiple arrays through the argument list. It also helps to use the INTENT attribute when declaring the dummy arguments of the subroutine, e.g.:
REAL,INTENT(IN)::a declares a and does not allow its values to be altered inside the subroutine/function
REAL,INTENT(OUT)::b declares b and disregards any values it has coming into the subroutine/function
REAL,INTENT(INOUT)::c declares c as both input and output; this is similar to, though not identical with, what you get if you don't write any intent at all.
I will assume you need the output to be l and u (rather than m), in which case the structure would look something like the one below. Note that l and m should either be declared in the main program and their size defined with respect to aa (as in the first case shown below) OR declared with an allocatable size in the main program, passed to the subroutine without being allocated and allocated within the subroutine (second example). The latter may require you to put the subroutine in a module so that the interfaces are handled properly.
First example:
SUBROUTINE lu_d(aa,l,m)
    implicit none
    real,intent(in):: aa(:,:)
    real,intent(out):: l(:,:), m(:,:)
    integer:: i,j,k
    real:: s
    <operations>
    RETURN
END SUBROUTINE lu_d
Second example:
SUBROUTINE lu_d(aa,l,m)
    implicit none
    real,intent(in):: aa(:,:)
    real,allocatable,intent(out):: l(:,:), m(:,:)
    integer:: i,j,k,size_a1,size_a2
    real:: s
    size_a1=size(aa,1)
    size_a2=size(aa,2)
    allocate( l(size_a1,size_a2), m(size_a1,size_a2))
    <operations>
    RETURN
END SUBROUTINE lu_d

Generating different sequence of random numbers in Fortran 95 for every run [duplicate]

I am trying to compile the following simple code using Gfortran 4.7 from mac-ports (OS-X):
program main
implicit none
integer :: n = 1, clock, i
integer, dimension(1) :: iseed
! initialize the random number generator
call random_seed(size = n)
call system_clock(COUNT=clock)
iseed = clock + 37 * (/ (i - 1, i = 1, n) /)
! iseed = clock
! iseed = abs( mod((clock*181)*((1-83)*359), 104729) )
call random_seed(PUT = iseed)
end program main
and have this error:
gfortran-mp-4.7 tmp.f90
tmp.f90:17.23:
call random_seed(PUT = iseed)
1
Error: Size of 'put' argument of 'random_seed' intrinsic at (1) too small (1/12)
I don't use Fortran at all (I am a C++ guy), so I would really appreciate it if someone could help me get this working.
P.S. I found a couple of forum posts on a similar issue; the currently uncommented solution is similar to the one mentioned in this GCC bug report.
The one with abs is mentioned in this Stack Overflow post (I added it without the PID since I don't run in parallel anyway).
UPDATE:
the following works:
program main
implicit none
integer :: n = 12, clock, i
integer, dimension(:), allocatable :: iseed
! initialize the random number generator
allocate(iseed(n))
call random_seed(size = n)
call system_clock(COUNT=clock)
iseed = clock + 37 * [(i, i = 0,n-1)]
call random_seed(PUT = iseed)
end program main
To amplify somewhat on @Yossarian's comment, this
call random_seed(size = n)
returns, in n, the size of the rank 1 integer array that you have to use if you want to initialise the RNG. I'd suggest making iseed allocatable by changing its declaration to:
integer, dimension(:), allocatable :: iseed
then, after getting a value for n, allocate it:
allocate(iseed(n))
populate it with your favourite values, then put it.
You might be able to allocate and populate it in one statement like this:
allocate(iseed(n), source = clock + 37 * [(i, i = 0,n-1)])
I write might because this depends on how up to date your compiler is.
EDIT, after OP comment
No, you have not quite understood what I suggested.
Get a value for n by executing
call random_seed(size = n)
don't initialise n to 12.
Then allocate the array and populate it, either in one statement (using sourced allocation) or an allocate statement followed by an assignment.
In
allocate(iseed(n))
call random_seed(size = n)
the sequence of operations is incorrect. This sets iseed to have 12 elements (which is the value of n when the first statement is executed), and then sets n to the size of the array required by the RNG. So long as that is 12 you won't see any problems, but as soon as you port your code to another compiler, possibly even another version of the same compiler, you risk running into an RNG which requires an integer array of a different size. There is no need to hardwire a value into your code, so don't.

calculating without using BigInt

Is there any way to print the value of n^1000 without using BigInt? I have been thinking along the lines of some sort of shift logic, but haven't been able to come up with anything good yet.
You can certainly do this, and I recommend it as an exercise. Beyond that there's little reason to implement this in a language with an existing BigInteger implementation.
In case you're looking for an exercise, it's really helpful to do it in a language that supports BigIntegers out of the box. That way you can gradually replace BigInteger operations with your own until there's nothing left to replace.
BigInteger libraries typically represent values larger than the largest primitive by using an array of the same primitive type, such as byte or int. Here's some Python I wrote that models unsigned bytes (UByte) and lists of unsigned bytes (BigUInt). Any BigUInt with multiple UBytes treats index 0 as the most-significant byte, making it a big-endian representation. Doing the opposite is fine too.
class UByte:
    def __init__(self, n=0):
        n = int(n)
        if (n < 0) or (n > 255):
            raise ValueError("Expecting integer in range [0, 255].")
        self.__n = n
    def value(self):
        return self.__n
class BigUInt:
    def __init__(self, b=None):
        # avoid sharing one mutable default list between instances
        self.__b = [] if b is None else b
    def value(self):
        # treating index 0 as most-significant byte (big endian)
        byte_count = len(self.__b)
        if byte_count == 0:
            return 0
        result = 0
        for i in range(byte_count):
            place_value = 8 * (byte_count - i - 1)
            byte_value = self.__b[i].value() << place_value
            result += byte_value
        return result
    def __str__(self):
        # base 10 representation
        return "%s" % self.value()
The code above doesn't quite do what you want. Several parts of BigUInt.value() depend on Python's built-in big integers. For instance, the left shift that computes byte_value doesn't overflow, even when place_value is really large. In lower-level machine code, each value has a fixed number of bits, and left shifting without care can result in lost information. Similarly, the += operation that updates result would eventually overflow for the same reason in lower-level code, but Python handles that for you.
Notice that __str__ is implemented by calling value(). One way to bypass Python's magic is to reimplement __str__ so that it doesn't call value(). Figure out how to translate a binary number into a string of base-10 digits. Once that's done, you can implement value() in terms of __str__ simply by returning int(self.__str__()).
Here are some sample tests for the code above. They may help as a sanity check while you rework the code.
ten_as_byte = UByte(10)
ten_as_big_uint = BigUInt([UByte(10)])
print("ten as byte ?= ten as ubyte: %s" % (ten_as_byte.value() == ten_as_big_uint.value()))
three_hundred = 300
three_hundred_as_big_uint = BigUInt([UByte(0x01), UByte(0x2c)])
print("three hundred ?= three hundred as big uint: %s" % (three_hundred == three_hundred_as_big_uint.value()))
two_to_1000th_power = 2**1000
two_to_1000th_power_as_big_uint = BigUInt([UByte(0x01)] + [UByte() for x in range(125)])
print("2^1000 ?= 2^1000 as big uint: %s" % (two_to_1000th_power == two_to_1000th_power_as_big_uint.value()))
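As one possible take on the exercise described above (turning the stored bytes into base-10 digits without leaning on Python's big integers), here is a minimal sketch. It assumes the bytes are given as a plain big-endian list of ints in [0, 255] rather than as UByte objects, and the name bytes_to_decimal is made up for illustration. It does long division by 10 in base 256, so no intermediate value ever exceeds 9*256 + 255.

```python
def bytes_to_decimal(byte_values):
    """Convert a big-endian list of ints in [0, 255] to a base-10 string.

    Hypothetical helper: every intermediate value stays below 2560, so
    nothing here relies on arbitrary-precision integers.
    """
    limbs = list(byte_values)          # work on a copy
    decimal_digits = []
    # Repeatedly divide the whole number by 10 (long division in base 256).
    # Each remainder is the next least-significant decimal digit.
    while any(limbs):
        remainder = 0
        for i in range(len(limbs)):
            current = remainder * 256 + limbs[i]   # at most 9*256 + 255
            limbs[i] = current // 10
            remainder = current % 10
        decimal_digits.append(str(remainder))
    # Digits were produced least-significant first; an empty result means zero.
    return "".join(reversed(decimal_digits)) or "0"


print(bytes_to_decimal([0x01, 0x2c]))  # the bytes of 300
```

A __str__ built this way could loop over the UByte values in the same manner; the quadratic running time is fine for an exercise but not how production libraries do it.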
EDIT: For a better low-level description of what's required, refer to chapter 2 of the From NAND to Tetris curriculum. The project in that chapter is to implement a 16-bit ALU (Arithmetic Logic Unit). If you then extend the ALU to output an overflow bit, an arbitrary number of these ALUs can be chained together to handle fundamental computations over arbitrarily large input numbers.
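To make the chaining idea a bit more concrete, here is a rough sketch of addition with a carry passed from one fixed-width adder to the next. It assumes 8-bit limbs stored big-endian (to match the UByte model above) instead of the 16-bit ALU of the course, and add_bytes is a hypothetical name.

```python
def add_bytes(a, b):
    """Add two big-endian byte lists, propagating a carry between limbs."""
    length = max(len(a), len(b))
    # Left-pad the shorter operand with zero bytes.
    a = [0] * (length - len(a)) + list(a)
    b = [0] * (length - len(b)) + list(b)
    result = [0] * length
    carry = 0
    for i in range(length - 1, -1, -1):   # start at the least-significant byte
        total = a[i] + b[i] + carry       # at most 255 + 255 + 1
        result[i] = total & 0xFF          # what an 8-bit adder would output
        carry = total >> 8                # its overflow bit, fed to the next adder
    if carry:
        result.insert(0, carry)           # the number grew by one byte
    return result


print(add_bytes([0xFF, 0xFF], [0x01]))  # 65535 + 1
```

Each loop iteration plays the role of one adder in the chain; the only state passed along is the single carry bit.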
Raising a small number - one that fits into a normal integer variable - to a high power is one of the easiest big integer operations to implement, and it is often used as a task to let people discover some big integer math implementation principles.
A good discussion - including lots of different examples in C - is in the topic Sum of digits in a^b over on Code Review. My contribution there shows how to do fast exponentiation via repeated squaring, using a std::vector<uint32_t> as a sort of 'fake' big integer. But there are even simpler solutions in that topic, just take your pick.
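For comparison, a very small sketch of the simplest variant in Python: plain repeated multiplication by the small base on a list of decimal digits (not the repeated-squaring version from the Code Review topic). The function name small_power_digits is hypothetical, and the base is assumed to be a small non-negative integer, so every intermediate product stays below 10 times the base.

```python
def small_power_digits(base, exponent):
    """Return base**exponent as a decimal string using a list of digits.

    Digits are stored least-significant first. `base` is assumed to be a
    small non-negative integer that fits comfortably in a machine word.
    """
    digits = [1]
    for _ in range(exponent):
        carry = 0
        # Schoolbook multiplication of the digit list by the small base.
        for i in range(len(digits)):
            product = digits[i] * base + carry
            digits[i] = product % 10
            carry = product // 10
        # Whatever carry is left over becomes new leading digits.
        while carry:
            digits.append(carry % 10)
            carry //= 10
    return "".join(str(d) for d in reversed(digits))


print(small_power_digits(2, 10))
```

Calling small_power_digits(n, 1000) then prints n^1000 for a small n using only single-word arithmetic on each step.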
An easy way of testing C/C++ big integer code without having to go hunt for a big integer library is to compile the code as managed C++ in Visual C++ (Express), which gives you access to the .NET BigInteger class:
#using "System.Numerics.dll"
using System::Numerics::BigInteger;
BigInteger n = BigInteger::Parse("123456789");
System::Console::WriteLine(BigInteger::Pow(n, 1000));

Most efficient way to weight and sum a number of matrices in Fortran

I am trying to write a function in Fortran that multiplies a number of matrices with different weights and then adds them together to form a single matrix. I have identified that this process is the bottleneck in my program (this weighting will be made many times for a single run of the program, with different weights). Right now I'm trying to make it run faster by switching from Matlab to Fortran. I am a newbie at Fortran so I appreciate all help.
In Matlab the fastest way I have found to make such a computation looks like this:
function B = weight_matrices()
n = 46;
m = 1800;
A = rand(n,m,m);
w = rand(n,1);
tic;
B = squeeze(sum(bsxfun(@times,w,A),1));
toc;
The line where B is assigned runs in about 0.9 seconds on my machine (Matlab R2012b, MacBook Pro 13" retina, 2.5 GHz Intel Core i5, 8 GB 1600 MHz DDR3). It should be noted that for my problem, the tensor A will be the same (constant) for the whole run of the program (after initialization), but w can take any values. Also, typical values of n and m are used here, meaning that the tensor A will have a size of about 1 GB in memory.
The clearest way I can think of writing this in Fortran is something like this:
pure function weight_matrices(w,A) result(B)
    implicit none
    integer, parameter :: n = 46
    integer, parameter :: m = 1800
    double precision, dimension(n), intent(in) :: w
    double precision, dimension(n,m,m), intent(in) :: A
    double precision, dimension(m,m) :: B
    integer :: i
    B = 0
    do i = 1,n
        B = B + w(i)*A(i,:,:)
    end do
end function weight_matrices
This function runs in about 1.4 seconds when compiled with gfortran 4.7.2, using -O3 (function call timed with "call cpu_time(t)"). If I manually unwrap the loop into
B = w(1)*A(1,:,:)+w(2)*A(2,:,:)+ ... + w(46)*A(46,:,:)
the function takes about 0.11 seconds to run instead. This is great and means that I get a speedup of about 8 times compared to the Matlab version. However, I still have some questions on readability and performance.
First, I wonder if there is an even faster way to perform this weighting and summing of matrices. I have looked through BLAS and LAPACK, but can't find any function that seems to fit. I have also tried to put the dimension in A that enumerates the matrices as the last dimension (i.e. switching from (i,j,k) to (k,i,j) for the elements), but this resulted in slower code.
Second, this fast version is not very flexible, and actually looks quite ugly, since it is so much text for such a simple computation. For the tests I am running I would like to try different numbers of weights, so that the length of w will vary, to see how it affects the rest of my algorithm. However, that means a quite tedious rewrite of the assignment of B every time. Is there any way to make this more flexible, while keeping the performance the same (or better)?
Third, the tensor A will, as mentioned before, be constant during the run of the program. I have set constant scalar values in my program using the "parameter" attribute in their own module, importing them with the "use" expression into the functions/subroutines that need them. What is the best way to do the equivalent thing for the tensor A? I want to tell the compiler that this tensor will be constant, after init., so that any corresponding optimizations can be done. Note that A is typically ~1 GB in size, so it is not practical to enter it directly in the source file.
Thank you in advance for any input! :)
Perhaps you could try something like
do k=1,m
do j=1,m
B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
enddo
enddo
The square brace is a newer form of (/ /), the 1d matrix (vector). The term in sum is a matrix of dimension (n) and sum sums all of those elements. This is precisely what your unwrapped code does (and is not exactly equal to the do loop you have).
I tried to refine Kyle Vanos' solution.
Therefore I decided to use sum and Fortran's vector capabilities.
I don't know whether the results are correct, because I only looked at the timings!
Version 1: (for comparison)
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
Version 2: (from Kyle Vanos)
do k=1,m
do j=1,m
B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
enddo
enddo
Version 3: (mixed-up indices, work on one row/column at a time)
do j = 1, m
B(:,j)=sum( [ ( (w(i)*A(:,i,j)), i=1,n) ], dim=1)
enddo
Version 4: (complete matrices)
B=sum( [ ( (w(i)*A(:,:,i)), i=1,n) ], dim=1)
Timing
As you can see, I had to mix up the indices to get faster execution times. The third solution looks strange because the matrix number is the middle index, but this is necessary for memory-ordering reasons.
V1: 1.30s
V2: 0.16s
V3: 0.02s
V4: 0.03s
In conclusion, I would say that you can get a massive speedup if you are able to reorder the matrix indices.
I would not hide any looping, as this is usually slower. You can write it explicitly; then you'll see that the inner loop access is over the last index, making it inefficient. So you should make sure your n dimension is the last one by storing A as A(m,m,n):
B = 0
do i = 1,n
w_tmp = w(i)
do j = 1,m
do k = 1,m
B(k,j) = B(k,j) + w_tmp*A(k,j,i)
end do
end do
end do
this should be much more efficient as you are now accessing consecutive elements in memory in the inner loop.
Another solution is to use the level 1 BLAS subroutines _AXPY (y = a*x + y):
B = 0
do i = 1,n
CALL DAXPY(m*m, w(i), A(1,1,i), 1, B(1,1), 1)
end do
With Intel MKL this should be more efficient, but again you should make sure the last index is the one which changes in the outer loop (in this case the loop you're writing). You can find the necessary arguments for this call here: MKL
EDIT: you might also want to use some parallelization? (I don't know if Matlab takes advantage of that)
EDIT2: In the answer of Kyle, the inner loop is over different values of w, which is more efficient than n times reloading B as w can be kept in cache (using A(n,m,m)):
B = 0
do i = 1,m
do j = 1,m
B(j,i)=0.0d0
do k = 1,n
B(j,i) = B(j,i) + w(k)*A(k,j,i)
end do
end do
end do
This explicit looping performs about 10% better than Kyle's code, which uses whole-array operations. Bandwidth with ifort -O3 -xHost is ~6600 MB/s, with gfortran -O3 it's ~6000 MB/s, and the whole-array version with either compiler is also around 6000 MB/s.
I know this is an old post, but I am glad to contribute, as I have played with most of the posted solutions.
Adding a local unroll of the weights loop (from Steabert's answer) gives me a small speed-up compared to the complete unroll version (from 10% to 80%, depending on the size of the matrices). The partial unrolling may help the compiler to vectorize the 4 operations in one SSE call.
pure function weight_matrices_partial_unroll_4(w,A) result(B)
implicit none
integer, parameter :: n = 46
integer, parameter :: m = 1800
real(8), intent(in) :: w(n)
real(8), intent(in) :: A(n,m,m)
real(8) :: B(m,m)
real(8) :: Btemp(4)
integer :: i, j, k, l, ndiv, nmod, roll
!==================================================
roll = 4
ndiv = n / roll
nmod = mod( n, roll )
do i = 1,m
do j = 1,m
B(j,i)=0.0d0
k = 1
do l = 1,ndiv
Btemp(1) = w(k )*A(k ,j,i)
Btemp(2) = w(k+1)*A(k+1,j,i)
Btemp(3) = w(k+2)*A(k+2,j,i)
Btemp(4) = w(k+3)*A(k+3,j,i)
k = k + roll
B(j,i) = B(j,i) + sum( Btemp )
end do
do l = 1,nmod !---- process the rest of the loop
B(j,i) = B(j,i) + w(k)*A(k,j,i)
k = k + 1
enddo
end do
end do
end function

Fortran matrix multiplication performance in different optimization

I'm reading the book "Scientific Software Development with Fortran", and there is an exercise in it that I find very interesting:
"Create a Fortran module called MatrixMultiplyModule. Add three subroutines to it called LoopMatrixMultiply, IntrinsicMatrixMultiply, and MixMatrixMultiply. Each routine should take two real matrices as argument, perform a matrix multiplication, and return the result via a third argument. LoopMatrixMultiply should be written entirely with do loops, and no array operations or intrinsic procedures; IntrinsicMatrixMultiply should be written utilizing the matmul intrinsic function; and MixMatrixMultiply should be written using some do loops and the intrinsic function dot_product. Write a small program to test the performance of these three different ways of performing the matrix multiplication for different sizes of matrices."
I tested the multiplication of two rank-2 matrices, and here are the results under different optimization flags:
compiler:ifort version 13.0.0 on Mac
Here is my question:
Why do they have about the same performance under -O0, while matmul gets a huge performance boost with -O3 and the explicit loop and dot product get a smaller boost? Also, why does dot_product seem to have the same performance as the explicit do loops?
The code I use is the following:
module MatrixMultiplyModule
contains
subroutine LoopMatrixMultiply(mtx1,mtx2,mtx3)
real,intent(in) :: mtx1(:,:),mtx2(:,:)
real,intent(out),allocatable :: mtx3(:,:)
integer :: m,n
integer :: i,j
if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
m=size(mtx1,dim=1)
n=size(mtx2,dim=2)
allocate(mtx3(m,n))
mtx3=0.
do i=1,m
do j=1,n
do k=1,size(mtx1,dim=2)
mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
end do
end do
end do
end subroutine
subroutine IntrinsicMatrixMultiply(mtx1,mtx2,mtx3)
real,intent(in) :: mtx1(:,:),mtx2(:,:)
real,intent(out),allocatable :: mtx3(:,:)
integer :: m,n
integer :: i,j
if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
m=size(mtx1,dim=1)
n=size(mtx2,dim=2)
allocate(mtx3(m,n))
mtx3=matmul(mtx1,mtx2)
end subroutine
subroutine MixMatrixMultiply(mtx1,mtx2,mtx3)
real,intent(in) :: mtx1(:,:),mtx2(:,:)
real,intent(out),allocatable :: mtx3(:,:)
integer :: m,n
integer :: i,j
if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
m=size(mtx1,dim=1)
n=size(mtx2,dim=2)
allocate(mtx3(m,n))
do i=1,m
do j=1,n
mtx3(i,j)=dot_product(mtx1(i,:),mtx2(:,j))
end do
end do
end subroutine
end module
program main
use MatrixMultiplyModule
implicit none
real,allocatable :: a(:,:),b(:,:)
real,allocatable :: c1(:,:),c2(:,:),c3(:,:)
integer :: n
integer :: count, rate
real :: timeAtStart, timeAtEnd
real :: time(3,10)
do n=100,1000,100
allocate(a(n,n),b(n,n))
call random_number(a)
call random_number(b)
call system_clock(count = count, count_rate = rate)
timeAtStart = count / real(rate)
call LoopMatrixMultiply(a,b,c1)
call system_clock(count = count, count_rate = rate)
timeAtEnd = count / real(rate)
time(1,n/100)=timeAtEnd-timeAtStart
call system_clock(count = count, count_rate = rate)
timeAtStart = count / real(rate)
call IntrinsicMatrixMultiply(a,b,c2)
call system_clock(count = count, count_rate = rate)
timeAtEnd = count / real(rate)
time(2,n/100)=timeAtEnd-timeAtStart
call system_clock(count = count, count_rate = rate)
timeAtStart = count / real(rate)
call MixMatrixMultiply(a,b,c3)
call system_clock(count = count, count_rate = rate)
timeAtEnd = count / real(rate)
time(3,n/100)=timeAtEnd-timeAtStart
deallocate(a,b)
end do
open(1,file="time.txt")
do n=1,10
write(1,*) time(:,n)
end do
close(1)
deallocate(c1,c2,c3)
end program
There are several things one should be aware of when looping over array elements:
Make sure the inner loop is over consecutive elements in memory. In your current 'loop' algorithm, the inner loop is over index k. Since matrices are laid out in memory as columns (first index varying most rapidly when going through the memory), accessing a new value of k might need to load a new page into cache. In this case, you could optimize your algorithm by reordering the loops as:
do j=1,n
do k=1,size(mtx1,dim=2)
do i=1,m
mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
end do
end do
end do
Now the inner loop is over consecutive elements in memory (the mtx2(k,j) value will probably be fetched only once before the inner loop by the compiler; if not, you can store it in a temporary variable before the loop).
Make sure the data the loops work on fits into the cache, in order to avoid too many cache misses. This can be done by blocking the algorithm. In this case, a solution could be e.g.:
l=size(mtx1,dim=2)
ichunk=512 ! I have a 3MB cache size (real*4)
do jj=1,n,ichunk
do kk=1,l,ichunk
do j=jj,min(jj+ichunk-1,n)
do k=kk,min(kk+ichunk-1,l)
do i=1,m
mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
end do
end do
end do
end do
end do
in which case performance will depend on the size of ichunk, especially for large enough matrices (you could even block the inner loop, this is just an example).
Make sure the work needed to perform the loop is much smaller than the work inside the loop. This can be solved by 'loop unrolling', i.e. combining several statements in one iteration of the loop. Usually the compiler can do this by supplying the flag -funroll-loops.
If I use the above code and compile with the flags -O3 -funroll-loops, I get a slightly better performance than with matmul.
The important thing to remember of those three is the first point about loop ordering, since this is something that will affect performance in other use cases, and the compiler cannot usually fix that. You can leave the loop unrolling to the compiler (but test it, as this does not always increase performance). As for the second point, since this is dependent on the hardware, you shouldn't (generally) try to implement a very efficient matrix multiplication yourself and should instead consider using a library such as ATLAS, which can optimize for cache size, or a vendor library such as MKL or ACML.

Resources