Why is this code ten times slower on the GPU than CPU? - windows

I have a problem that boils down to performing some arithmetic on each element of a set of matrices. I thought this sounded like the kind of computation that could benefit greatly from being shifted onto the GPU. However, I've only succeeded in slowing down the computation by a factor of 10!
Here are the specifics of my test system:
OS: Windows 10
CPU: Core i7-4700MQ # 2.40 GHz
GPU: GeForce GT 750M (compute capability 3.0)
CUDA SDK: v7.5
The code below performs equivalent calcs to my production code, on the CPU and on the GPU. The latter is consistently ten times slower on my machine (CPU approx. 650ms; GPU approx. 7s).
I've tried changing the grid and block sizes; I've increased and decreased the size of the array passed to the GPU; I've run it through the visual profiler; I've tried integer data rather than doubles, but whatever I do, the GPU version is always significantly slower than the CPU equivalent.
So why is the GPU version so much slower and what changes, that I've not mentioned above, could I try to improve its performance?
Here's my command line: nvcc source.cu -o CPUSpeedTest.exe -arch=sm_30
And here's the contents of source.cu:
#include <iostream>
#include <windows.h>
#include <cuda_runtime_api.h>
void AdjustArrayOnCPU(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
for (size_t i = 0; i < arrayLength; i++)
{
double adjustmentFactor = factor1 * factor2 * factor3 * (curve[i] / denominator);
array[i] = array[i] * adjustmentFactor;
}
}
__global__ void CudaKernel(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < arrayLength)
{
double adjustmentFactor = factor1 * factor2 * factor3 * (curve[idx] / denominator);
array[idx] = array[idx] * adjustmentFactor;
}
}
void AdjustArrayOnGPU(double array[], int arrayLength, double factor1, double factor2, double factor3, double denominator, double curve[], int curveLength)
{
double *dev_row, *dev_curve;
cudaMalloc((void**)&dev_row, sizeof(double) * arrayLength);
cudaMalloc((void**)&dev_curve, sizeof(double) * curveLength);
cudaMemcpy(dev_row, array, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
cudaMemcpy(dev_curve, curve, sizeof(double) * curveLength, cudaMemcpyHostToDevice);
CudaKernel<<<100, 1000>>>(factor1, factor2, factor3, denominator, dev_row, arrayLength, dev_curve, curveLength);
cudaMemcpy(array, dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);
cudaFree(dev_curve);
cudaFree(dev_row);
}
void FillArray(int length, double row[])
{
for (size_t i = 0; i < length; i++) row[i] = 0.1 + i;
}
int main(void)
{
const int arrayLength = 10000;
double arrayForCPU[arrayLength], curve1[arrayLength], arrayForGPU[arrayLength], curve2[arrayLength];;
FillArray(arrayLength, curve1);
FillArray(arrayLength, curve2);
///////////////////////////////////// CPU Version ////////////////////////////////////////
LARGE_INTEGER StartingTime, EndingTime, ElapsedMilliseconds, Frequency;
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
FillArray(arrayLength, arrayForCPU);
AdjustArrayOnCPU(1.0, 1.0, 1.0, 1.0, arrayForCPU, 10000, curve1, 10000);
}
QueryPerformanceCounter(&EndingTime);
ElapsedMilliseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMilliseconds.QuadPart *= 1000;
ElapsedMilliseconds.QuadPart /= Frequency.QuadPart;
std::cout << "Elapsed Milliseconds: " << ElapsedMilliseconds.QuadPart << std::endl;
///////////////////////////////////// GPU Version ////////////////////////////////////////
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
FillArray(arrayLength, arrayForGPU);
AdjustArrayOnGPU(arrayForGPU, 10000, 1.0, 1.0, 1.0, 1.0, curve2, 10000);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
std::cout << "CUDA Elapsed Milliseconds: " << elapsedTime << std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);
return 0;
}
And here is an example of the output of CUDASpeedTest.exe
Elapsed Milliseconds: 565
CUDA Elapsed Milliseconds: 7156.76

What follows is likely to be embarrassingly obvious to most developers working with CUDA, but may be of value to others - like myself - who are new to the technology.
The GPU code is ten times slower than the CPU equivalent because the GPU code exhibits a perfect storm of performance-wrecking characteristics.
The GPU code spends most of its time allocating memory on the GPU, copying data to the device, performing a very, very simple calculation (that is supremely fast irrespective of the type of processor it's running on) and then copying data back from the device to the host.
As noted in the comments, if an upper bound exists on the size of the data structures being processed, then a buffer on the GPU can be allocated exactly once and reused. In the code above, this takes the GPU to CPU runtime down from 10:1 to 4:1.
The remaining performance disparity is down to the fact that the CPU is able to perform the required calculations, in serial, millions of times in a very short time span due to its simplicity. In the code above, the calculation involves reading a value from an array, some multiplication, and finally an assignment
to an array element. Something this simple must be performed millions of times
before the benefits of doing so in parallel outweigh the necessary time penalty of transferring the data to the GPU and back. On my test system, a million array elements is the break even point, where GPU and CPU perform in (approximately) the same amount of time.

Related

Why cublas copy algorithm is so fast in cuda?

I wrote kernel that copied input vector to output vector.
But the performance is not enough compared to cublascopy API.
The cublasScopy is almost 100 times faster than my kernel in case of 1M elements.
Anyone knows about algorithm of cublascopy?
__global__ void copy_kernel(const float *rv1, int inc1, float *rvo, int inco, int n)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while (tid < n)
{
rvo[tid*inco] = rv1[tid*inc1];
tid += (blockDim.x * gridDim.x);
}
}
Thanks for Robert's help.
I found that measurement code had a bug.
I have to add cudaDeviceSynchronize() only for measuring performance.
And then above my kernel is a little bit slower than cublasScopy.
I think it is reasonable.
// cuBLAS Algorithm
timer.onTimer(4);
cublasScopy(handle, num_data, d_i_vals, inc1, d_o_vals, inco);
cudaDeviceSynchronize(); // this dummy line is needed only for measurement purpose
timer.offTimer(4);
timer.onTimer(5);
checkCudaErrors(cudaMemcpy(o_vals, d_o_vals, sizeof(float) * o_buf_size,
cudaMemcpyDeviceToHost));
cudaDeviceSynchronize();
timer.offTimer(5);

How can I increase the limit of generated prime numbers with Sieve of Eratosthenes?

What do I need to change in my program to be able to compute a higher limit of prime numbers?
Currently my algorithm works only with numbers up to 85 million. Should work with numbers up to 3 billion in my opinion.
I'm writing my own implementation of the Sieve of Eratosthenes in CUDA and I've hit a wall.
So far the algorithm seems to work fine for small numbers (below 85 million).
However, when I try to compute prime numbers up to 100 million, 2 billion, 3 billion, the system freezes (while it's computing stuff in the CUDA device), then after a few seconds, my linux machine goes back to normal (unfrozen), but the CUDA program crashes with the following error message:
CUDA error at prime.cu:129 code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()"
I have a GTX 780 (3 GB) and I'm allocating the sieves in a char array, so if I were to compute prime numbers up to 100,000, it would allocate 100,000 bytes in the device.
I assumed that the GPU would allow up to 3 billion numbers since it has 3 GB of memory, however, it only lets me do 85 million tops (85 million bytes = 0.08 GB)
this is my prime.cu code:
#include <stdio.h>
#include <helper_cuda.h> // checkCudaErrors() - NVIDIA_CUDA-6.0_Samples/common/inc
// #include <cuda.h>
// #include <cuda_runtime_api.h>
// #include <cuda_runtime.h>
typedef unsigned long long int uint64_t;
/******************************************************************************
* kernel that initializes the 1st couple of values in the primes array.
******************************************************************************/
__global__ static void sieveInitCUDA(char* primes)
{
primes[0] = 1; // value of 1 means the number is NOT prime
primes[1] = 1; // numbers "0" and "1" are not prime numbers
}
/******************************************************************************
* kernel for sieving the even numbers starting at 4.
******************************************************************************/
__global__ static void sieveEvenNumbersCUDA(char* primes, uint64_t max)
{
uint64_t index = blockIdx.x * blockDim.x + threadIdx.x + threadIdx.x + 4;
if (index < max)
primes[index] = 1;
}
/******************************************************************************
* kernel for finding prime numbers using the sieve of eratosthenes
* - primes: an array of bools. initially all numbers are set to "0".
* A "0" value means that the number at that index is prime.
* - max: the max size of the primes array
* - maxRoot: the sqrt of max (the other input). we don't wanna make all threads
* compute this over and over again, so it's being passed in
******************************************************************************/
__global__ static void sieveOfEratosthenesCUDA(char *primes, uint64_t max,
const uint64_t maxRoot)
{
// get the starting index, sieve only odds starting at 3
// 3,5,7,9,11,13...
/* int index = blockIdx.x * blockDim.x + threadIdx.x + threadIdx.x + 3; */
// apparently the following indexing usage is faster than the one above. Hmm
int index = blockIdx.x * blockDim.x + threadIdx.x + 3;
// make sure index won't go out of bounds, also don't start the execution
// on numbers that are already composite
if (index < maxRoot && primes[index] == 0)
{
// mark off the composite numbers
for (int j = index * index; j < max; j += index)
{
primes[j] = 1;
}
}
}
/******************************************************************************
* checkDevice()
******************************************************************************/
__host__ int checkDevice()
{
// query the Device and decide on the block size
int devID = 0; // the default device ID
cudaError_t error;
cudaDeviceProp deviceProp;
error = cudaGetDevice(&devID);
if (error != cudaSuccess)
{
printf("CUDA Device not ready or not supported\n");
printf("%s: cudaGetDevice returned error code %d, line(%d)\n", __FILE__, error, __LINE__);
exit(EXIT_FAILURE);
}
error = cudaGetDeviceProperties(&deviceProp, devID);
if (deviceProp.computeMode == cudaComputeModeProhibited || error != cudaSuccess)
{
printf("CUDA device ComputeMode is prohibited or failed to getDeviceProperties\n");
return EXIT_FAILURE;
}
// Use a larger block size for Fermi and above (see compute capability)
return (deviceProp.major < 2) ? 16 : 32;
}
/******************************************************************************
* genPrimesOnDevice
* - inputs: limit - the largest prime that should be computed
* primes - an array of size [limit], initialized to 0
******************************************************************************/
__host__ void genPrimesOnDevice(char* primes, uint64_t max)
{
int blockSize = checkDevice();
if (blockSize == EXIT_FAILURE)
return;
char* d_Primes = NULL;
int sizePrimes = sizeof(char) * max;
uint64_t maxRoot = sqrt(max);
// allocate the primes on the device and set them to 0
checkCudaErrors(cudaMalloc(&d_Primes, sizePrimes));
checkCudaErrors(cudaMemset(d_Primes, 0, sizePrimes));
// make sure that there are no errors...
checkCudaErrors(cudaPeekAtLastError());
// setup the execution configuration
dim3 dimBlock(blockSize);
dim3 dimGrid((maxRoot + dimBlock.x) / dimBlock.x);
dim3 dimGridEvens(((max + dimBlock.x) / dimBlock.x) / 2);
//////// debug
#ifdef DEBUG
printf("dimBlock(%d, %d, %d)\n", dimBlock.x, dimBlock.y, dimBlock.z);
printf("dimGrid(%d, %d, %d)\n", dimGrid.x, dimGrid.y, dimGrid.z);
printf("dimGridEvens(%d, %d, %d)\n", dimGridEvens.x, dimGridEvens.y, dimGridEvens.z);
#endif
// call the kernel
// NOTE: no need to synchronize after each kernel
// http://stackoverflow.com/a/11889641/2261947
sieveInitCUDA<<<1, 1>>>(d_Primes); // launch a single thread to initialize
sieveEvenNumbersCUDA<<<dimGridEvens, dimBlock>>>(d_Primes, max);
sieveOfEratosthenesCUDA<<<dimGrid, dimBlock>>>(d_Primes, max, maxRoot);
// check for kernel errors
checkCudaErrors(cudaPeekAtLastError());
checkCudaErrors(cudaDeviceSynchronize());
// copy the results back
checkCudaErrors(cudaMemcpy(primes, d_Primes, sizePrimes, cudaMemcpyDeviceToHost));
// no memory leaks
checkCudaErrors(cudaFree(d_Primes));
}
to test this code:
int main()
{
int max = 85000000; // 85 million
char* primes = malloc(max);
// check that it allocated correctly...
memset(primes, 0, max);
genPrimesOnDevice(primes, max);
// if you wish to display results:
for (uint64_t i = 0; i < size; i++)
{
if (primes[i] == 0) // if the value is '0', then the number is prime
{
std::cout << i; // use printf if you are using c
if ((i + 1) != size)
std::cout << ", ";
}
}
free(primes);
}
This error:
CUDA error at prime.cu:129 code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()"
doesn't necessarily mean anything other than that your kernel is taking too long. It's not necessarily a numerical limit, or computational error, but a system-imposed limit on the amount of time your kernel is allowed to run. Both Linux and windows can have such watchdog timers.
If you want to work around it in the linux case, review this document.
You don't mention it, but I assume your GTX780 is also hosting a (the) display. In that case, there is a time limit on kernels by default. If you can use another device as the display, then reconfigure your machine to have X not use the GTX780, as described in the link. If you do not have another GPU to use for the display, then the only option is to modify the interactivity setting indicated in the linked document, if you want to run long-running kernels. And in this situation, the keyboard/mouse/display will become non-responsive while the kernel is running. If your kernel should happen to run too long, it can be difficult to recover the machine, and may require a hard reboot. (You could also SSH into the machine, and kill the process that is using the GPU for CUDA.)

CUDA performance of atomic operation on different address in warp

To my knowledge, if atomic operations are performed on same memory address location in a warp, the performance of the warp could be 32 times slower.
But what if atomic operations of threads in a warp are on 32 different memory locations? Is there any performance penalty at all? Or it will be as fast as normal operation?
My use case is that I have 32 different positions, each thread in a warp needs one of these position but which position is data dependent. So each thread could use atomicCAS to scan if the location desired is empty or not. If it is not empty, scan the next position.
If I am lucky, 32 threads could atomicCAS to 32 different memory locations, is there any performance penalty is this case?
I assume Kepler architecture is used
In the code below, I'm adding a constant value to the elements of an array (dev_input). I'm comparing two kernels, one using atomicAdd and one using regular addition. This is an example taken to the extreme in which atomicAdd operates on completely different addresses, so there will be no need for serialization of the operations.
#include <stdio.h>
#define BLOCK_SIZE 1024
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void regular_addition(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) dev_input[i] = dev_input[i] + val;
}
__global__ void atomic_operations(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) atomicAdd(&dev_input[i],val);
}
int main(){
int N = 8192*32;
float* output = (float*)malloc(N*sizeof(float));
float* dev_input; gpuErrchk(cudaMalloc((void**)&dev_input, N*sizeof(float)));
gpuErrchk(cudaMemset(dev_input, 0, N*sizeof(float)));
int NumBlocks = iDivUp(N,BLOCK_SIZE);
float time, timing1 = 0.f, timing2 = 0.f;
cudaEvent_t start, stop;
int niter = 32;
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
atomic_operations<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing1 = timing1 + time;
}
printf("Time for atomic operations: %3.5f ms \n", timing1/(float)niter);
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
regular_addition<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing2 = timing2 + time;
}
printf("Time for regular addition: %3.5f ms \n", timing2/(float)niter);
}
Testing this code on my NVIDIA GeForce GT540M, CUDA 5.5, Windows 7, I obtain approximately the same results for the two kernels, i.e., about 0.7ms.
Now change the instruction
if (i < N) atomicAdd(&dev_input[i],val);
to
if (i < N) atomicAdd(&dev_input[i%32],val);
which is closer to the case of your interest, namely, each atomicAdd operates on different addresses within a warp. The result I obtain is that no performance penalty is observed.
Finally, change the above instruction to
if (i < N) atomicAdd(&dev_input[0],val);
This is the other extreme in which atomicAdd always operates on the same address. In this case, the execution time raises to 5.1ms.
The above tests have been performed on a Fermi architecture. You can try to run the above code on your Kepler card.

OpenCL slow -- not sure why

I'm teaching myself OpenCL by trying to optimize the mpeg4dst reference audio encoder. I achieved a 3x speedup by using vector instructions on CPU but I figured the GPU could probably do better.
I'm focusing on computing auto-correlation vectors in OpenCL as my first area of improvement. The CPU code is:
for (int i = 0; i < NrOfChannels; i++) {
for (int shift = 0; shift <= PredOrder[ChannelFilter[i]]; shift++)
vDSP_dotpr(Signal[i] + shift, 1, Signal[i], 1, &out, NrOfChannelBits - shift);
}
NrOfChannels = 6
PredOrder = 129
NrOfChannelBits = 150528.
On my test file, this function take approximately 188ms to complete.
Here's my OpenCL method:
kernel void calculateAutocorrelation(size_t offset,
global const float *input,
global float *output,
size_t size) {
size_t index = get_global_id(0);
size_t end = size - index;
float sum = 0.0;
for (size_t i = 0; i < end; i++)
sum += input[i + offset] * input[i + offset + index];
output[index] = sum;
}
This is how it is called:
gcl_memcpy(gpu_signal_in, Signal, sizeof(float) * NrOfChannels * MAXCHBITS);
for (int i = 0; i < NrOfChannels; i++) {
size_t sz = PredOrder[ChannelFilter[i]] + 1;
cl_ndrange range = { 1, { 0, 0, 0 }, { sz, 0, 0}, { 0, 0, 0 } };
calculateAutocorrelation_kernel(&range, i * MAXCHBITS, (cl_float *)gpu_signal_in, (cl_float *)gpu_out, NrOfChannelBits);
gcl_memcpy(out, gpu_out, sizeof(float) * sz);
}
According to Instruments, my OpenCL implementation seems to take about 13ms, with about 54ms of memory copy overhead (gcl_memcpy).
When I use a much larger test file, 1 minute of 2-channel music vs, 1 second of 6-channel, while the measured performance of the OpenCL code seems to be the same, the CPU usage falls to about 50% and the whole program takes about 2x longer to run.
I can't find a cause for this in Instruments and I haven't read anything yet that suggests that I should expect very heavy overhead switching in and out of OpenCL.
If I'm reading your kernel code correctly, each work item is iterating over all of the data from it's location to the end. This isn't going to be efficient. For one (and the primary performance concern), the memory accesses won't be coalesced and so won't be at full memory bandwidth. Secondly, because each work item has a different amount of work, there will be branch divergence within a work group, which will leave some threads idle waiting for others.
This seems like it has a lot in common with a reduction problem and I'd suggest reading up on "parallel reduction" to get some hints about doing an operation like this in parallel.
To see how memory is being read, work out how 16 work items (say, global_id 0 to 15) will be reading data for each step.
Note that if every work item in a work group access the same memory, there is a "broadcast" optimization the hardware can make. So just reversing the order of your loop could improve things.

max FLOPS for matrix multiplication Intel/AMD CPU

My formula for estimating the maximum FLOPs/s of an Intel CPU is
Max SP FLOPs/s = frequencey * 4 SSE(8AVX) * 2 (MAC) * number of cores (not HW threads)
Max DP FLOPs/s = 0.5 * Max SP FLOPs/s
By MAC I mean the CPU can do one SSE(AVX) multiplication and addition at the same time. On the system I'm using the maximum frequency under load is 2.66 GHz. It only has SSE and the number of cores (not Hardware threads) is 4. That gives: Max SP FLOPs/s = 85.12 GFLOPs/s.
The number of FLOPs for matrix multiplication is approxiamelty 2*n*m*k. For a square matrix with n=1000 that's 2*10E9 FLOPs (2 billion FLOPs). Once I know the time I can estimate the FLOPs/s.
However, the best I can get for my own code is about 40 SP GFLOPs/s for example with n=1000. I get about the same result with Eigen. That's about a 45% efficiency. Is my calculation for the maximum wrong? What's the best efficiency for a Intel CPU for large dense matrix multiplication? Does anyone have a paper describing this?
I know that on the GPU the efficiency can be more than 60%.
http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3
Edit:
I get similar results for n=500 which easily fits in the 12MB L3 cache of my system so the cache does not seem to be the limiting factor (though maybe I can use it more efficiently).
Edit2:
Eigen Benchmarks show it as good as MKL (for SSE). They use a Intel(R) Core(TM)2 Quad CPU Q9400 # 2.66GHz. So 2.66* 2(DP SSE) *2 MAC * 4 cores = 42.25 DP GFLOPs/s. You can see on the plot they are all getting less that 20. Something on order of 45% like me.
http://eigen.tuxfamily.org/index.php?title=Benchmark
http://ark.intel.com/products/35365/Intel-Core2-Quad-Processor-Q9400-6M-Cache-2_66-GHz-1333-MHz-FSB
Edit3:
Here is my code for anyone that cares. I can get slight better results than this but not much better. I'm using Agner Fog's vectorclass for SEE/AVX. and setting Vec8f to float8 and Vec4d to double4
//SGEMM and AVX call MM_tile<float, float8>(nthreads, a, b, c, n, m, k);
template <typename ftype, typename floatn>
void GEMM_tile(const int nthreads, const ftype*A , const ftype* B, ftype* C, const int N, const int M, const int K) {
for(int i=0; i<N; i++) {
for(int j=0; j<K; j++) {
C[K*i + j] = 0;
}
}
const int nc = 32;
const int kc = 32;
const int mc = 32;
omp_set_num_threads(nthreads);
#pragma omp parallel for if(nthreads>1)
for(int ii=0; ii<N; ii+=nc) {
for(int jj=0; jj<K; jj+=kc)
for(int ll=0; ll<M; ll+=mc) {
const int nb = min(N-ii, nc);
const int kb = min(K-jj, kc);
const int mb = min(M-ll, mc);
MM_block<ftype, floatn>(nb, mb, kb, &A[M*ii+ll], N, &B[K*ll+jj], K, &C[K*ii+jj], K );
}
}
}
template <typename ftype, typename floatn>
void MM_block(int n, int m, int k, const ftype *a, const int stridea,
const ftype *b, const int strideb,
ftype *c, const int stridec ) {
const int vec_size = sizeof(floatn)/sizeof(ftype);
for(int i=0; i<n; i+=4) {
for(int j=0; j<k; j+=vec_size) {
Dot4x4_vec_block<ftype, floatn>(m, &a[strideb*i], &b[j], &c[stridec*i + j], stridea, strideb, stridec);
}
}
template <typename ftype, typename floatn>
inline void Dot4x4_vec_block(const int n, const ftype *a, const ftype *b, ftype *c, const int stridea, const int strideb, const int stridec) {
floatn tmp0, tmp1, tmp2, tmp3;
load(tmp0, &c[stridec*0]);
load(tmp1, &c[stridec*1]);
load(tmp2, &c[stridec*2]);
load(tmp3, &c[stridec*3]);
ftype *a0_ptr = (ftype*)&a[stridea*0];
ftype *a1_ptr = (ftype*)&a[stridea*1];
ftype *a2_ptr = (ftype*)&a[stridea*2];
ftype *a3_ptr = (ftype*)&a[stridea*3];
for(int i=0; i<n; i++) {
floatn breg = floatn().load(&b[i*strideb + 0]);
floatn areg0 = *a0_ptr++;
floatn areg1 = *a1_ptr++;
floatn areg2 = *a2_ptr++;
floatn areg3 = *a3_ptr++;
tmp0 += areg0 * breg;
tmp1 += areg1 * breg;
tmp2 += areg2 * breg;
tmp3 += areg3 * breg;
}
tmp0.store(&c[stridec*0]);
tmp1.store(&c[stridec*1]);
tmp2.store(&c[stridec*2]);
tmp3.store(&c[stridec*3]);
}
Often, the limiting factor for processing throughput is memory bandwidth, especially in cases where your working set doesn't fit into the CPU cache (your 1000-by-1000 matrix of float will take up ~4 MB, whereas your CPU probably has a 2 MB L3 cache). This is a situation where the structure of your algorithm can make a big difference in how it performs, but you will usually hit a wall at some point where you just can't get any faster because you're waiting on values to come from some higher level in the memory hierarchy.
In addition, your theoretical numbers assume that you have sufficient instructions without data dependencies to keep all of the execution units tasked on every cycle. This can be very difficult to do in practice. I'm not sure what the optimum throughput for a general matrix multiply would be, but check out this previous question for information on what you can do to maximize the instruction throughput.

Resources