How to make matrix calculations as fast as possible - algorithm

Purely for my own knowledge and understanding of code and computers, I am trying to create an array/matrix class with multiple matrix functions, which I will then use in any projects I need a matrix or array class for. Most significantly, I would like to make a neural network library using this matrix/array class, and therefore require it to be as fast as possible.
The function I require to be fastest is the matrix product of two matrices; however, I have had little luck making this calculation fast for larger matrices.
My current method for calculating the dot product is:
Note: this code is in Python; however, if Python is not the optimal language, I can use any other.
a = [[1, 2, 3], [4, 5, 6]]
b = [[1], [2], [3]]
def dot(a, b):
    # c has len(a) rows and len(b[0]) columns
    c = [[0 for j in range(len(b[0]))] for i in range(len(a))]
    for i in range(len(c)):
        for j in range(len(c[i])):
            t = 0
            for k in range(len(b)):
                t += a[i][k] * b[k][j]
            c[i][j] = t
    return c
print(dot(a, b))
# [[14], [32]]
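For comparison, a pure-Python variant of the same function that transposes b once (so each column becomes a row) and then uses zip and sum is often noticeably faster on larger inputs, since it avoids repeated indexing; a sketch, not a definitive speedup:

def dot_fast(a, b):
    # transpose b once so columns become rows, then each output
    # cell is a sum over two aligned tuples
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt]
            for row in a]

print(dot_fast(a, b))
# [[14], [32]]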
I have looked into the Intel MKL (I have an Intel Core i7) and other BLAS implementations like OpenBLAS; however, I have not been able to get any results that worked, and no amount of googling has made them work. So my question is: what is the fastest way to calculate the dot product of two matrices? (CPU and memory usage do not matter much to me currently; however, being more efficient would be nice.)
PS:
I am trying to do all of this using no external libraries (numpy, for example, in python)
***** UPDATE *****
I am using a mac
***** UPDATE 2 *****
Thank you everyone for all of your help; however, I am unsure how to implement these methods of calculating the dot product, as my maths skills are not yet advanced enough to understand what any of it means (I have yet to start my GCSEs). I will keep these ideas in mind and experiment with them further.
Thank you again for everyone's help.

If possible, you can use CUDA to utilize the GPU for very fast calculations.

You can use the GPU, as AbdelAziz AbdelLatef suggested in his answer. However, this limits the usage of your lib to computers with a GPU.
Parallelize the dot products for big matrices.
Use SIMD instructions.
Use state-of-the-art algorithms:
Some operations on big data sets can be done much faster using more advanced techniques that are too slow for small matrices, usually involving FFT or NTT. Matrix multiplication is a set of dot products, and a dot product is a form of convolution, so an FFT approach should be applicable, although I have never done that for matrices/vectors.
There are also special algorithms solely for matrices, like the Strassen algorithm.
For powers you can use exponentiation by squaring (see the sketch after this list); for squaring I think you can simplify even further, as some of the multiplications would be the same.
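A minimal sketch of exponentiation by squaring, reusing the dot function from the question (assumes a square matrix and a non-negative integer power):

def identity(n):
    # n x n identity matrix as a list of lists
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def mat_pow(m, p):
    # exponentiation by squaring: O(log p) matrix products instead
    # of the p - 1 products of naive repeated multiplication
    result = identity(len(m))
    while p > 0:
        if p & 1:              # this bit is set: fold current square in
            result = dot(result, m)
        m = dot(m, m)          # square for the next bit
        p >>= 1
    return result

print(mat_pow([[1, 1], [1, 0]], 10))
# [[89, 55], [55, 34]] (Fibonacci numbers, as expected)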
Python is far from optimal, as it is slow; I would do such a thing in C++, or even combine it with assembly if there is a need for extreme speed (like using SIMD instructions). IIRC you can still use C++-built libs in Python (linked as a DLL, obj, ...).
However, if you need a fast neural network, then use dedicated hardware; there are neural-network processing ICs out there too.

Related

Proper way to implement biases in Neural Networks

I can make a neural network; I just need clarification on bias implementation. Which way is better: implement the bias matrices B1, B2, ..., Bn for each layer in their own separate matrix from the weight matrix, or include the biases in the weight matrix by adding a 1 to the previous layer's output (the input for this layer)? (The original post illustrated the two implementations with images.) Thank you.
I think the best way is to have two separate matrices, one for the weights and one for the bias. Why?
I don't believe there is any increase in computational load, since computing W*x + b with a separate bias should be equivalent to the folded form running on a GPU; mathematically and computationally the two formulations are equivalent (a small sketch of the equivalence follows this list).
Greater modularity. Let's say you want to initialize the weights and the bias using different initializers (ones, zeros, glorot, ...). With two separate matrices this is straightforward.
Easier to read and maintain.
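A small pure-Python sketch of that equivalence, using hypothetical toy sizes; the folded form appends the bias as an extra weight column and a constant 1 to the input:

W = [[0.1, 0.2], [0.3, 0.4]]  # 2x2 weight matrix
b = [0.5, -0.5]               # one bias per output neuron
x = [1.0, 2.0]                # input vector

# separate bias: y = W x + b
y1 = [sum(w * xi for w, xi in zip(row, x)) + bi
      for row, bi in zip(W, b)]

# folded bias: y = [W | b] (x, 1)
Wb = [row + [bi] for row, bi in zip(W, b)]
xb = x + [1.0]
y2 = [sum(w * xi for w, xi in zip(row, xb)) for row in Wb]

print(y1 == y2)  # True: both formulations produce the same output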
"include the biases in the weight matrix by adding a 1 to the previous layer output (input for this layer)"
This seems to be what is implemented here: Machine Learning with Python: Training and Testing the Neural Network with MNIST data set in the paragraph "Networks with multiple hidden layers".
I don't know if it's the best way to do it though. (Maybe not related but still: in the mentioned example code, it worked with sigmoid, but failed when I replaced it with ReLU).
In my opinion, implementing the bias matrices separately for each layer is the way to go. This adds parameters that your model will have to learn, but it gives your model more freedom to converge.
For more information read this.

Sparse matrix design in modern C++

I want to implement a sparse matrix class using modern C++, i.e. C++14 or C++17. I know there must be some trade-offs between storage and runtime efficiency; right now I'd prefer to optimize more for storage efficiency. If possible, I'd also prefer more work to be done at compile time rather than at runtime. For example, vector does a lot of runtime checks, so it may not be optimal. Can someone suggest a container for this? I plan to support the following operations:
Matrix Multiplication, Addition, Subtraction, Inversion and Transpose
Matrix iterators, i.e. column/row
Efficient constructors
etc
Thanks!
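No container is suggested in this thread, but a common storage-efficient layout worth knowing is compressed sparse row (CSR): three flat arrays instead of a 2-D grid. A minimal sketch of the layout, shown in Python for brevity even though the question targets C++ (in C++ the three arrays would map naturally onto std::vector, or std::array when sizes are known at compile time):

# CSR stores only the non-zero entries of a matrix:
#   values    - non-zero entries, row by row
#   col_index - column of each stored value
#   row_start - offset into values where each row begins
dense = [[5, 0, 0],
         [0, 8, 0],
         [0, 0, 3],
         [0, 6, 0]]

values, col_index, row_start = [], [], [0]
for row in dense:
    for j, v in enumerate(row):
        if v != 0:
            values.append(v)
            col_index.append(j)
    row_start.append(len(values))

print(values)     # [5, 8, 3, 6]
print(col_index)  # [0, 1, 2, 1]
print(row_start)  # [0, 1, 2, 3, 4]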

small Matrix Inversion on CUDA

I need a bit of advice from you, and I hope it won't take a lot of your time.
So here is my question:
I have a small square dense matrix, with possible sizes 4x4, 8x8, or 16x16,
and I want to invert it using CUDA.
The special part of the question is that I have 1024 idle CUDA threads to perform this task.
So I suspect that the most widespread inversion methods, like Gauss-Jordan, won't work well here, because they are only slightly parallel and would use only about 4-16 threads out of the huge pool of 1024.
But how else can I invert these matrices using all available threads?
Thank you for your attention!
There are at least two possible ready made options for this sort of problem:
Use the batched solvers shipping in recent versions of the CUBLAS library
Use the BSD-licensed Gauss-Jordan elimination device code functions which NVIDIA distributes to registered developers. These were intended to invert small matrices using one thread per matrix (a serial sketch of the algorithm follows below).
[This answer was assembled from comments and added as a community wiki entry to get the question off the unanswered queue]
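For reference, a plain-Python sketch of serial Gauss-Jordan elimination on an augmented identity matrix, the algorithm the device functions above implement one thread per matrix (naive pivoting; not the CUDA code itself):

def invert(m):
    # Gauss-Jordan on [M | I]: when the left half becomes I,
    # the right half is M^-1
    n = len(m)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        # swap a row with a non-zero pivot into position
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]  # normalize the pivot row
        for r in range(n):                    # clear the column elsewhere
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

print(invert([[4.0, 7.0], [2.0, 6.0]]))
# approximately [[0.6, -0.7], [-0.2, 0.4]]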

Calculate eigenvalues/eigenvectors of hundreds of small matrices using CUDA

I have a question on the eigen-decomposition of hundreds of small matrices using CUDA.
I need to calculate the eigenvalues and eigenvectors of hundreds (e.g. 500) of small (64-by-64) real symmetric matrices concurrently. I tried to implement it by the Jacobi method using chess tournament ordering (see this paper (PDF) for more information).
In this algorithm, 32 threads are defined in each block, each block handles one small matrix, and the 32 threads work together to annihilate 32 off-diagonal elements until convergence. However, I am not very satisfied with its performance.
I am wondering whether there is any better algorithm for my problem, i.e. the eigen-decomposition of many 64-by-64 real symmetric matrices. I guess Householder's method may be a better choice, but I am not sure whether it can be efficiently implemented in CUDA. There is not a lot of useful information online, since most other programmers are more interested in using CUDA/OpenCL to decompose one large matrix rather than many small matrices.
At least for the eigenvalues, a sample can be found in the CUDA SDK:
http://www.nvidia.de/content/cudazone/cuda_sdk/Linear_Algebra.html
The images seem broken, but the download of the samples still works. I would suggest downloading the full SDK and having a look at that example. Also, this paper could be helpful:
http://docs.nvidia.com/cuda/samples/6_Advanced/eigenvalues/doc/eigenvalues.pdf
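As a reference point for the algorithm itself, here is a serial plain-Python sketch of the classical Jacobi iteration for one small symmetric matrix (the CUDA variants distribute the rotations of each sweep across the threads of a block):

import math

def jacobi_eigenvalues(a, tol=1e-12, max_rotations=1000):
    # repeatedly apply a Givens rotation that annihilates the
    # largest off-diagonal element of a symmetric matrix
    n = len(a)
    a = [row[:] for row in a]
    for _ in range(max_rotations):
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        if abs(a[p][q]) < tol:
            break
        theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):  # column rotation (A J)
            akp, akq = a[k][p], a[k][q]
            a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
        for k in range(n):  # row rotation (J^T A)
            apk, aqk = a[p][k], a[q][k]
            a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]  # eigenvalues on the diagonal

print(jacobi_eigenvalues([[2.0, 1.0], [1.0, 2.0]]))
# approximately [1.0, 3.0]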

Performance Testing for Calculation-Heavy Programs

What are some good tips and/or techniques for optimizing and improving the performance of calculation-heavy programs? I'm talking about things like complicated graphics calculations or mathematical and simulation types of programming, where every second saved is useful, as opposed to IO-heavy programs where only a certain amount of speedup is helpful.
While changing the algorithm is frequently mentioned as the most effective method, I'm trying to find out how effective different algorithms are in the first place, so I want to get as much efficiency out of each algorithm as possible. The "problem" I'm solving isn't well known, so there are few if any algorithms on the web, but I'm looking for any good advice on how to proceed and what to look for.
I am exploring the differences in effectiveness between evolutionary algorithms and more straightforward approaches for a particular group of related problems. I have already written three evolutionary algorithms for the problem, and now I have written a brute-force technique that I am trying to make as fast as possible.
Edit: To specify a bit more: I am using C#, and my algorithms all revolve around calculating and solving constraint-type problems for expressions (using expression trees). By expressions I mean things like x^2 + 4, or anything else like that which would be parsed into an expression tree. My algorithms all create and manipulate these trees to try to find better approximations. But I wanted to put the question out there in a general way in case it would help anyone else.
I am trying to find out whether it is possible to write a useful evolutionary algorithm for finding expressions that are a good approximation for various properties, both because I want to know what a good approximation would be and because I want to see how the evolutionary approach compares to traditional methods.
It's pretty much the same process as any other optimization: profile, experiment, benchmark, repeat.
First you have to figure out what sections of your code are taking up the time. Then try different methods to speed them up (trying methods based on merit would be a better idea than trying things at random). Benchmark to find out if you actually did speed them up. If you did, replace the old method with the new one. Profile again.
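For the benchmarking step, even something as simple as Python's timeit keeps the comparison honest by averaging over many runs (a toy sketch; slow_sum and fast_sum are hypothetical stand-ins for the old and new methods):

import timeit

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

def fast_sum(n):
    # closed form for 0 + 1 + ... + (n - 1)
    return n * (n - 1) // 2

for fn in (slow_sum, fast_sum):
    t = timeit.timeit(lambda: fn(10000), number=1000)
    print(fn.__name__, "took", t, "seconds for 1000 calls")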
I would recommend against a brute force approach if it's at all possible to do it some other way. But, here are some guidelines that should help you speed your code up either way.
There are many, many different optimizations you could apply to your code, but before you do anything, you should profile to figure out where the bottleneck is. Here are some profilers that should give you a good idea about where the hot spots are in your code:
GProf
PerfMon2
OProfile
HPCToolkit
These all use sampling to get their data, so the overhead of running them with your code should be minimal. Only GProf requires that you recompile your code. Also, the last three let you do both time and hardware performance counter profiles, so once you do a time (or CPU cycle) profile, you can zoom in on the hotter regions and find out why they might be running slow (cache misses, FP instruction counts, etc.).
Beyond that, it's a matter of thinking about how best to restructure your code, and this depends on what the problem is. It may be that you've just got a loop that the compiler doesn't optimize well, and you can inline or move things in/out of the loop to help the compiler out. Or, if you're running as fast as you can with basic arithmetic ops, you may want to try to exploit vector instructions (SSE, etc.) If your code is parallel, you might have load balance problems, and you may need to restructure your code so that data is better distributed across cores.
These are just a few examples. Performance optimization is complex, and it might not help you nearly enough if you're doing a brute force approach to begin with.
For more information on ways people have optimized things, there were some pretty good examples in the recent Why do you program in assembly? question.
If your optimization problem is (quasi-)convex or can be transformed into such a form, there are far more efficient algorithms than evolutionary search.
If you have large matrices, pay attention to your linear algebra routines. The right algorithm can shave an order of magnitude off the computation time, especially if your matrices are sparse.
Think about how data is loaded into memory. Even when you think you're spending most of your time on pure arithmetic, you're actually spending a lot of time moving things between levels of cache, etc. Do as much as you can with the data while it's in the fastest memory (a toy demonstration follows below).
Try to avoid unnecessary memory allocation and de-allocation. Here's where it can make sense to back away from a purely OO approach.
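The memory point above is easy to demonstrate even from Python: traversing a matrix in the order it is stored is usually measurably faster than striding across it (a toy sketch; the effect is far larger in compiled languages):

import timeit

n = 1000
m = [[1.0] * n for _ in range(n)]

def row_major():
    # visit elements in storage order: good locality
    return sum(m[i][j] for i in range(n) for j in range(n))

def col_major():
    # stride across the rows: worse locality
    return sum(m[i][j] for j in range(n) for i in range(n))

for fn in (row_major, col_major):
    print(fn.__name__, timeit.timeit(fn, number=10), "seconds")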
This is more of a tip to find holes in the algorithm itself...
To realize maximum performance, simplify everything inside the most inner loop at the expense of everything else.
One example of keeping things simple is the classic bouncing ball animation. You can implement gravity by looking up the definition in your physics book and plugging in the numbers, or you can do it like this and save precious clock cycles:
initialize:
    float y = 0;  // y coordinate
    float yi = 0; // incremental variable
loop:
    y += yi;
    yi += 0.001;
    if (y > 10)
        yi = -yi;
But now let's say you're having to do this with nested loops in an N-body simulation where every particle is attracted to every other particle. This can be an enormously processor intensive task when you're dealing with thousands of particles.
You should of course take the same approach as to simplify everything inside the most inner loop. But more than that, at the very simplest level you should also use data types wisely. For example, math operations are faster when working with integers than floating point variables. Also, addition is faster than multiplication, and multiplication is faster than division.
So with all of that in mind, you should be able to simplify the most inner loop using primarily addition and multiplication of integers. And then any scaling down you might need to do can be done afterwards. To take the y and yi example, if yi is an integer that you modify inside the inner loop then you could scale it down after the loop like this:
y += yi * 0.01;
These are very basic low-level performance tips, but they're all things I try to keep in mind whenever I'm working with processor intensive algorithms. Of course, if you then take these ideas and apply them to parallel processing on a GPU then you can take your algorithm to a whole new level. =)
How you do this depends most on which language you are using. Still, the key in any language is the profiler. Profile your code. See which functions/operations are taking the most time, and then determine whether you can make these costly operations more efficient.
Standard bottlenecks in numerical algorithms are memory usage (do you access matrices in the order in which the elements are stored in memory?), communication overhead, etc. These can be a little different from those of non-numerical programs.
Moreover, many other factors, such as preconditioning, can lead to drastically different performance behavior of the SAME algorithm on the same problem. Make sure you determine the optimal parameters for your implementations.
As for comparing different algorithms, I recommend reading the paper "Benchmarking optimization software with performance profiles," Elizabeth D. Dolan and Jorge J. Moré, Mathematical Programming 91 (2002), 201-213. It provides a nice, uniform way to compare different algorithms applied to the same problem set. It really should be better known outside of the optimization community (in my not-so-humble opinion, at least).
Good luck!
