OpenMP Code Not Scaling due to Overheads and Cache Issues - Windows

#include <vector>
#include <iostream>
#include <omp.h>

struct xnode
{
    float *mat;
};

// GEMV below is the single-precision BLAS matrix-vector product (Intel MKL's sgemv in this setup)
void testScaling( )
{
    int N = 1000000;   // total number of matrices
    int dim = 10;
    // memory for matrices
    std::vector<xnode> nodeArray(N);
    for( int k = 0; k < N; ++k )
        nodeArray[k].mat = new float [dim*dim];
    // memory for Y
    std::vector<float*> Y(N, 0);
    for( int k = 0; k < N; ++k )
        Y[k] = new float [dim];
    // shared X
    float* X = new float [dim];
    for( int i = 0; i < dim; ++i ) X[i] = 1.0;
    // init mats
    for( int k = 0; k < N; ++k )
    {
        for( int i = 0; i < dim*dim; ++i )
            nodeArray[k].mat[i] = 0.25 + ((float)i)/3;
    }
    int NTIMES = 500;
    // gemv args
    char trans = 'N';
    int lda = dim;
    int incx = 1;
    float alpha = 1, beta = 0;
    // threads
    int thr[4];
    thr[0] = 1; thr[1] = 2; thr[2] = 4; thr[3] = 8;
    for( int t = 0; t < 4; ++t ) // test for nthreads
    {
        int nthreads = thr[t];
        double t_1 = omp_get_wtime();
        for( int ii = 0; ii < NTIMES; ++ii ) // do matvec NTIMES
        {
            #pragma omp parallel for num_threads(nthreads)
            for( int k = 0; k < N; ++k )
            {
                // compute Y[k] = mat[k] * X;
                GEMV(&trans, &dim, &dim, &alpha, nodeArray[k].mat, &lda, X, &incx, &beta, Y[k], &incx);
                //GEMV(&trans, &dim, &dim, &alpha, nodeArray[0].mat, &lda, X, &incx, &beta, Y[k], &incx);
            }
        }
        double t_2 = omp_get_wtime();
        std::cout << "Threads " << nthreads << " time " << (t_2-t_1)/NTIMES << std::endl;
    }
    // clear memory
    for( int k = 0; k < N; ++k )
    {
        delete [] nodeArray[k].mat;
        delete [] Y[k];
    }
    delete [] X;
}
The above code parallelizes the matrix-vector product of N matrices of size dim and stores the results in N output vectors. The average over 500 products is taken as the time per matrix-vector product. The matrix-vector products in this example are all the same size, so the threads should be perfectly balanced and we should achieve scaling close to the ideal 8x. The following are the observations (machine: Intel Xeon 3.1 GHz, 2 processors with 8 cores each, Hyper-Threading enabled, Windows, VS2012, Intel MKL, Intel OpenMP library).
OBSERVATION 1: dim=10, N=1000000
Threads 1   time 0.138068 s
Threads 2   time 0.0729147 s
Threads 4   time 0.0360527 s
Threads 8   time 0.0224268 s   (6.1x on 8 threads)
OBSERVATION 2: dim=20, N=1000000
Threads 1   time 0.326617 s
Threads 2   time 0.185706 s
Threads 4   time 0.0886508 s
Threads 8   time 0.0733666 s   (4.5x on 8 threads)
Note: I ran VTune on this case. It showed 267.8 s CPU time, 43 s overhead time, and 8 s spin time. The overhead time is all spent in a libiomp function (the Intel OpenMP runtime). The 8-thread/1-thread scaling is poor in such cases.
Next, in the GEMV loop we change nodeArray[k].mat to nodeArray[0].mat (see the commented-out call), so that only the first matrix is used for all the matrix-vector products.
OBSERVATION 3: dim=20, N=1000000
Threads 1   time 0.152298 s   (the serial time is halved)
Threads 2   time 0.0769173 s
Threads 4   time 0.0384086 s
Threads 8   time 0.019336 s   (7.87x on 8 threads)
Thus I get almost ideal scaling. Why this behavior? VTune says that a significant portion of CPU time is spent in synchronization and thread overhead, yet there seems to be no connection here between load balancing and thread synchronization: as the matrix size increases, the granularity should increase and the thread overhead should become proportionally smaller, but going from size 10 to 20 the scaling actually gets worse. When we use nodeArray[0].mat (only the first matrix) for all the matrix-vector products, that matrix is brought into cache only once and we get near-ideal scaling. So the synchronization overhead seems to be related to some cache issue. I have tried a number of other things, such as setting KMP_AFFINITY and varying the load distribution, but that did not buy me anything.
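For scale, a back-of-the-envelope estimate of the data each timed pass touches (rough numbers, not measurements):

dim = 20: one matrix = 20 * 20 * 4 B = 1.6 KB, but all N matrices per pass = 1,000,000 * 1.6 KB ≈ 1.6 GB (far larger than any cache, so every product streams its matrix from main memory)
dim = 20 using only nodeArray[0].mat: 1.6 KB in total, which stays resident in L1 cache after the first product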
My questions are:
1. I don't have a clear idea of how cache performance affects OpenMP thread synchronization. Can someone explain this?
2. Can anything be done to improve the scaling and reduce the overhead?
Thanks

Related

Matrix multiplication via std::vector is 10 times slower than numpy

Although it is known that using nested std::vector to represent matrices is a bad idea, let's use it for now since it is flexible and many existing functions can handle std::vector.
I thought that for small cases the speed difference could be ignored, but it turned out that vector<vector<double>> is 10+ times slower than numpy.dot().
Let A and B be matrices of size size×size. The square-matrix assumption is just for simplicity (we don't intend to limit the discussion to square matrices). We initialize each matrix in a deterministic way and finally calculate C = A * B.
We define "calculation time" as the time elapsed just to calculate C = A * B. In other words, various overheads are not included.
Python3 code
import numpy as np
import time
import sys

if (len(sys.argv) != 2):
    print("Pass `size` as an argument.", file = sys.stderr);
    sys.exit(1);
size = int(sys.argv[1]);

A = np.ndarray((size, size));
B = np.ndarray((size, size));
for i in range(size):
    for j in range(size):
        A[i][j] = i * 3.14 + j
        B[i][j] = i * 3.14 - j

start = time.time()
C = np.dot(A, B);
print("{:.3e}".format(time.time() - start), file = sys.stderr);
C++ code
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>
using namespace std;

int main(int argc, char **argv) {
    if (argc != 2) {
        cerr << "Pass `size` as an argument.\n";
        return 1;
    }
    const unsigned size = atoi(argv[1]);

    vector<vector<double>> A(size, vector<double>(size));
    vector<vector<double>> B(size, vector<double>(size));
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            A[i][j] = i * 3.14 + j;
            B[i][j] = i * 3.14 - j;
        }
    }

    auto start = chrono::system_clock::now();
    vector<vector<double>> C(size, vector<double>(size, /* initial_value = */ 0));
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    cerr << scientific;
    cerr.precision(3);
    cerr << chrono::duration<double>(chrono::system_clock::now() - start).count() << "\n";
}
C++ code (multithreaded)
We also wrote a multithreaded version of the C++ code, since numpy.dot() is automatically computed in parallel.
You can get all the codes from GitHub.
Result
The C++ version is 10+ times slower than the Python 3 (numpy) version.
matrix_size: 200x200
--------------- Time in seconds ---------------
C++ (not multithreaded): 8.45e-03
C++ (1 thread): 8.66e-03
C++ (2 threads): 4.68e-03
C++ (3 threads): 3.14e-03
C++ (4 threads): 2.43e-03
Python 3: 4.07e-04
-----------------------------------------------
matrix_size: 400x400
--------------- Time in seconds ---------------
C++ (not multithreaded): 7.011e-02
C++ (1 thread): 6.985e-02
C++ (2 threads): 3.647e-02
C++ (3 threads): 2.462e-02
C++ (4 threads): 1.915e-02
Python 3: 1.466e-03
-----------------------------------------------
Question
Is there any way to make the C++ implementation faster?
Optimizations I Tried
1. Swap the calculation order -> at most 3.5 times faster (than the original C++ code, not than the numpy code). A sketch of this reordering follows after this list.
2. Optimization 1 plus partial unrolling -> at most 4.5 times faster. (Originally we thought this could only be done when size is known in advance, but as pointed out in this comment, size does not need to be known: we can just cap the loop variables of the unrolled loops and process the remaining elements with normal loops. See my implementation for an example.)
3. Optimization 2 plus minimizing calls to C[i][j] by introducing a simple variable sum -> at most 5.2 times faster. The implementation is here. This result implies std::vector::operator[] is non-negligibly slow.
4. Optimization 3 plus the g++ -march=native flag -> at most 6.2 times faster. (By the way, we use -O3 of course.)
5. Optimization 3 plus reducing calls to operator[] by introducing a pointer to an element of A, since A's elements are accessed sequentially in the unrolled loop -> at most 6.2 times faster, and a tiny bit faster than Optimization 4. The code is shown below.
6. The g++ -funroll-loops flag to unroll for loops -> no change.
7. The g++-specific #pragma GCC unroll n -> no change.
8. The g++ -flto flag to turn on link-time optimization -> no change.
9. Block algorithm -> no change.
10. Transposing B to avoid cache misses -> no change.
11. A long linear std::vector instead of nested std::vector<std::vector>, plus swapped calculation order, block algorithm, and partial unrolling -> at most 2.2 times faster.
12. Optimization 1 plus PGO (profile-guided optimization) -> 4.7 times faster.
13. Optimization 3 plus PGO -> same as Optimization 3.
14. Optimization 3 plus the g++-specific __builtin_prefetch() -> same as Optimization 3.
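To make Optimization 1 concrete, here is roughly what the reordering looks like on the nested-vector representation (a simplified sketch with an illustrative function name, not the exact code in our repository): moving the k loop outside the j loop makes the innermost accesses to B[k][j] and C[i][j] sequential, which is much friendlier to the cache.

#include <vector>
using std::vector;

void multiply_reordered(const vector<vector<double>> &A,
                        const vector<vector<double>> &B,
                        vector<vector<double>> &C) {
    const std::size_t n = A.size();
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i][k];        // load A[i][k] once per (i, k)
            for (std::size_t j = 0; j < n; ++j) {
                C[i][j] += a * B[k][j];      // B's row and C's row are traversed sequentially
            }
        }
    }
}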
Current Status
(originally) 13.06 times slower -> (currently) 2.10 times slower
Again, you can get all the codes on GitHub. But let us cite some codes, all of which are functions called from the multithreaded version of C++ code.
Original Code (GitHub)
void f(const vector<vector<double>> &A, const vector<vector<double>> &B,
       vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    const unsigned j_max = B[0].size();
    const unsigned k_max = B.size();
    for (int i = row_start; i < row_end; ++i) {
        for (int j = 0; j < j_max; ++j) {
            for (int k = 0; k < k_max; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
Current Best Code (GitHub)
This is the implementation of the Optimization 5 above.
void f(const vector<vector<double>> &A, const vector<vector<double>> &B,
       vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    static const unsigned num_unroll = 5;
    const unsigned j_max = B[0].size();
    const unsigned k_max_for_unrolled_loop = B.size() / num_unroll * num_unroll;
    const unsigned k_max = B.size();
    for (int i = row_start; i < row_end; ++i) {
        for (int k = 0; k < k_max_for_unrolled_loop; k += num_unroll) {
            for (int j = 0; j < j_max; ++j) {
                const double *p = A[i].data() + k;
                double sum;
                sum  = *p++ * B[k][j];
                sum += *p++ * B[k+1][j];
                sum += *p++ * B[k+2][j];
                sum += *p++ * B[k+3][j];
                sum += *p++ * B[k+4][j];
                C[i][j] += sum;
            }
        }
        for (int k = k_max_for_unrolled_loop; k < k_max; ++k) {
            const double a = A[i][k];
            for (int j = 0; j < j_max; ++j) {
                C[i][j] += a * B[k][j];
            }
        }
    }
}
We've tried many optimizations since we first posted this question. We spent two whole days struggling with this problem and have finally reached the point where we have no more ideas for optimizing the current best code. We doubt that more complex algorithms like Strassen's will do better, since the cases we handle are not large and each operation on std::vector is so expensive that, as we've seen, just reducing the calls to [] improved performance considerably.
We (want to) believe we can make it better, though.
Matrix multiplication is relatively easy to optimize. However, if you want to reach decent CPU utilization it becomes tricky, because you need deep knowledge of the hardware you are using. The steps to implement a fast matmul kernel are the following:
Use SIMD instructions
Use register blocking and fetch multiple data at once
Optimize for your cache lines (mainly L2 and L3)
Parallelize your code to use multiple threads
This link is a very good resource that explains all the nasty details:
https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
If you want more in-depth advice, leave a comment. A minimal register-blocking sketch follows below.
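As a rough illustration of the register-blocking step, here is a minimal sketch on flat row-major arrays (an illustrative function name and layout, not tuned code and not taken from the gist above): a 2x2 tile of C is accumulated in local variables so it can live in registers, and every element loaded from A and B is reused twice.

// Multiply n x n row-major matrices with a 2x2 register tile (n assumed even for brevity).
void matmul_2x2_tiled(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i += 2) {
        for (int j = 0; j < n; j += 2) {
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0;   // accumulators stay in registers
            for (int k = 0; k < n; ++k) {
                const double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                const double b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j]           = c00;
            C[i * n + j + 1]       = c01;
            C[(i + 1) * n + j]     = c10;
            C[(i + 1) * n + j + 1] = c11;
        }
    }
}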

omp parallel for loop (reduction to find max) ran slower than serial codes

I am new to using OpenMP.
I thought that using the max reduction clause to find the maximum element of an array would not be such a bad idea, but in fact the parallel for loop ran much slower than the serial one.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main() {
    double sta, end, elapse_t;
    int bsize = 46000;
    int q = bsize;
    int max_val = 0;
    double *buffer;
    buffer = (double*)malloc(bsize*sizeof(double));
    srand(time(NULL));
    for(int i=0;i<q;i++)
        buffer[i] = rand()%10000;

    sta = omp_get_wtime();
    //int i;
    #pragma omp parallel for reduction(max : max_val)
    for(int i=0;i<q; i++)
    {
        max_val = max_val > buffer[i] ? max_val : buffer[i];
    }
    end = omp_get_wtime();
    printf("parallel maximum time %f\n", end-sta);

    sta = omp_get_wtime();
    for(int i=0;i<q; i++)
    {
        max_val = max_val > buffer[i] ? max_val : buffer[i];
    }
    end = omp_get_wtime();
    printf("serial maximum time %f\n", end-sta);

    free(buffer);
    return 0;
}
Compile command
gcc-7 kp_omp.cpp -o kp_omp -fopenmp
Execution results
./kp_omp
parallel maximum time 0.000505
serial maximum time 0.000266
As for the CPU, it is an Intel Core i7-6700 (4 cores, 8 hardware threads).
Whenever you parallelise a loop, OpenMP needs to perform some operations, for example creating the threads. These operations result in some overhead, and this in turn implies that, for each loop, there is a minimum number of iterations below which it is not convenient to parallelise.
If I execute your code I obtain the same results you have:
./kp_omp
parallel maximum time 0.000570
serial maximum time 0.000253
However, if I modify bsize such that
int bsize = 100000;
I obtain
./kp_omp
parallel maximum time 0.000323
serial maximum time 0.000552
So the parallelised version became faster than the sequential one. Part of the challenge you face when trying to speed up code is understanding when it is convenient to parallelise and when the overhead of parallelisation would kill your expected gain in performance.
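If you want to locate that break-even point on your own machine, you can simply sweep the array size and time both versions; a minimal sketch along the lines of your code (sizes and variable names are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    for (int bsize = 1000; bsize <= 1024000; bsize *= 2) {
        double *buffer = (double*)malloc(bsize * sizeof(double));
        for (int i = 0; i < bsize; i++) buffer[i] = rand() % 10000;

        int max_par = 0, max_ser = 0;   // int accumulators, mirroring the question's code
        double t0 = omp_get_wtime();
        #pragma omp parallel for reduction(max : max_par)
        for (int i = 0; i < bsize; i++)
            max_par = max_par > buffer[i] ? max_par : buffer[i];
        double t1 = omp_get_wtime();
        for (int i = 0; i < bsize; i++)
            max_ser = max_ser > buffer[i] ? max_ser : buffer[i];
        double t2 = omp_get_wtime();

        printf("n=%8d  parallel %f s  serial %f s\n", bsize, t1 - t0, t2 - t1);
        free(buffer);
    }
    return 0;
}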

MPI Latency measuring

I am trying to understand some aspects of MPI.
While writing a program that measures the latency of a send/recv between two processes, I ran into some strange effects.
I measured the result over many iterations and obtained a value that matches other benchmarks. Then I decided to print the value after each iteration and was surprised: the values ranged over four numbers that never changed. I also noticed some very high values.
The code that computes the latency, and some sample values, is below:
#include <mpi.h>
#include <cstdio>

int Proc_Rank;   // rank of this process
void latency_test(int Proc_Rank, int Iterations_Num, int Size);

int main()
{
    MPI::Init();
    Proc_Rank = MPI::COMM_WORLD.Get_rank();
    for(int i = 0; i < 100; ++i)
        latency_test(Proc_Rank, 1, 0);
    MPI::Finalize();
    return 0;
}

void latency_test(int Proc_Rank, int Iterations_Num, int Size)
{
    double Total_Time, Latency;
    double t1, t2;
    char *Send_Buffer = new char[Size];
    char *Recv_Buffer = new char[Size];
    for(int i = 0; i < Size; i++){
        Send_Buffer[i] = 'a';
    }
    for(int i = 0; i < Size; i++){
        Recv_Buffer[i] = 'b';
    }
    MPI::COMM_WORLD.Barrier();
    t1 = MPI::Wtime();
    for(int i = 0; i < Iterations_Num; i++){
        if (Proc_Rank == 0){
            MPI::COMM_WORLD.Send(Send_Buffer, Size, MPI::CHAR, 1, 0);
            MPI::COMM_WORLD.Recv(Recv_Buffer, Size, MPI::CHAR, 1, MPI::ANY_TAG);
        }
        else if (Proc_Rank == 1){
            MPI::COMM_WORLD.Recv(Recv_Buffer, Size, MPI::CHAR, 0, MPI::ANY_TAG);
            MPI::COMM_WORLD.Send(Send_Buffer, Size, MPI::CHAR, 0, 0);
        }
    }
    t2 = MPI::Wtime();
    delete [] Send_Buffer;
    delete [] Recv_Buffer;
    Total_Time = t2 - t1;
    if(Proc_Rank == 0){
        Latency = (Total_Time / (Iterations_Num * 2.0)) * 1000000.0;
        printf("%10.10f\n", Latency);
    }
}
Part of the result:
5.4836273193
1.0728836060
0.9536743164
1.0728836060
0.4768371582
0.9536743164
0.5960464478
6.5565109253
0.9536743164
0.9536743164
1.0728836060
0.5960464478
0.4768371582
0.4768371582
Why do these four fixed values repeat randomly? And why are there occasional very large values?
As pointed out by Zulan, the resolution of the timer used by MPI_Wtime is not infinite. You can query the timer resolution by calling MPI_Wtick (MPI::Wtick in the C++ bindings). Measuring a single ping-pong round that lasts less than a microsecond is prone to very high statistical uncertainty, especially since the OS jitter, which is the random delay of the process execution due to other OS activities or processes being scheduled on the same CPU, could be several microseconds. No respectable MPI benchmark would do a single ping-pong round with empty messages.
As a side note, you are using a wildcard receive (MPI_ANY_TAG) in one of the processes. Those tend to be slower than fully-specified receives, especially when it comes to network equipment.
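For reference, here is a minimal sketch of that approach (an illustration using the plain C MPI API rather than the deprecated C++ bindings; run with at least 2 ranks): report the timer resolution with MPI_Wtick and time many ping-pong rounds inside a single MPI_Wtime interval, then divide.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100000;   // many rounds inside one timed interval
    char byte = 'a';

    if (rank == 0)
        printf("timer resolution (MPI_Wtick): %g s\n", MPI_Wtick());

    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t2 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %f us\n", (t2 - t1) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}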

Improving the Efficiency of Compact/Scatter in CUDA

Summary:
Any ideas about how to further improve upon the basic scatter operation in CUDA, especially if one knows it will only be used to compact a larger array into a smaller one? Or any idea why the methods below of vectorizing memory ops and using shared memory didn't work? I feel like there may be something fundamental I am missing, and any help would be appreciated.
EDIT 03/09/15: So I found this Parallel for All blog post, "Optimized Filtering with Warp-Aggregated Atomics". I had assumed atomics would be intrinsically slower for this purpose, but I was wrong - especially since I don't think I care about maintaining element order in the array during my simulation. I'll have to think about it some more and then implement it to see what happens!
EDIT 01/04/16: I realized I never wrote about my results. Unfortunately, in that Parallel for All blog post they compared the global-atomic compact method to the Thrust prefix-sum compact method, which is actually quite slow. CUB's Device::If is much faster than Thrust's - as is the prefix-sum version I wrote using CUB's Device::Scan plus custom code. The warp-aggregated global-atomic method is still faster by about 5-10%, but nowhere near the 3-4x faster I had been hoping for based on the results in the blog. I'm still using the prefix-sum method: while maintaining element order is not necessary, I prefer the consistency of the prefix-sum results, and the advantage from the atomics is not very big. I still try various methods to improve compact, but so far only marginal improvements (2% at best) for dramatically increased code complexity.
Details:
I am writing a simulation in CUDA where I compact out elements I am no longer interested in simulating every 40-60 time steps. From profiling, it seems that the scatter op takes the most time when compacting - more so than the filter kernel or the prefix sum. Right now I use a pretty basic scatter function:
__global__ void scatter_arrays(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
    int myID = blockIdx.x*blockDim.x + threadIdx.x;
    for(int id = myID; id < freq_Index; id += blockDim.x*gridDim.x){
        if(flag[id]){
            new_freq[scan_Index[id]] = freq[id];
        }
    }
}
freq_Index is the number of elements in the old array. The flag array is the result of the filter. scan_Index is the result of the prefix sum over the flag array.
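For intuition, here is a tiny host-side illustration (not part of my CUDA code, and assuming scan_Index is an exclusive scan of flag) of how the three arrays relate: flag marks survivors, the scan gives each survivor its destination index, and the scatter writes survivors to those destinations.

#include <vector>
#include <cstdio>

int main() {
    std::vector<float> freq = {5.f, 7.f, 2.f, 9.f, 4.f, 8.f};
    std::vector<int>   flag = {1,   0,   1,   1,   0,   1  };   // filter result: keep or drop

    // exclusive prefix sum of flag -> destination index for each kept element
    std::vector<int> scan_Index(flag.size());
    int running = 0;
    for (size_t i = 0; i < flag.size(); ++i) { scan_Index[i] = running; running += flag[i]; }

    // scatter: compact the kept elements into a smaller array
    std::vector<float> new_freq(running);
    for (size_t i = 0; i < freq.size(); ++i)
        if (flag[i]) new_freq[scan_Index[i]] = freq[i];

    for (float x : new_freq) printf("%g ", x);   // prints: 5 2 9 8
    printf("\n");
}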
The attempts I've made to improve it are to read the flagged frequencies into shared memory first and then write from shared memory to global memory - the idea being that the writes to global memory would be better coalesced across the warp (e.g., instead of thread 0 writing to position 0 and thread 128 writing to position 1, thread 0 would write to 0 and thread 1 would write to 1). I also tried vectorizing the reads and writes: instead of reading and writing single floats/ints, I read/wrote float4/int4 from the global arrays where possible, i.e., four numbers at a time. I thought this might speed up the scatter by having fewer memory operations transfer larger amounts of data. The "kitchen sink" code with both vectorized memory loads/stores and shared memory is below:
const int compact_threads = 256;

__global__ void scatter_arrays2(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
    int gID = blockIdx.x*blockDim.x + threadIdx.x; //global ID
    int tID = threadIdx.x; //thread ID within block
    __shared__ float row[4*compact_threads];
    __shared__ int start_index[1];
    __shared__ int end_index[1];
    float4 myResult;
    int st_index;
    int4 myFlag;
    int4 index;

    for(int id = gID; id < freq_Index/4; id += blockDim.x*gridDim.x){
        if(tID == 0){
            index = reinterpret_cast<const int4*>(scan_Index)[id];
            myFlag = reinterpret_cast<const int4*>(flag)[id];
            start_index[0] = index.x;
            st_index = index.x;
            myResult = reinterpret_cast<const float4*>(freq)[id];
            if(myFlag.x){ row[0] = myResult.x; }
            if(myFlag.y){ row[index.y-st_index] = myResult.y; }
            if(myFlag.z){ row[index.z-st_index] = myResult.z; }
            if(myFlag.w){ row[index.w-st_index] = myResult.w; }
        }
        __syncthreads();
        if(tID > 0){
            myFlag = reinterpret_cast<const int4*>(flag)[id];
            st_index = start_index[0];
            index = reinterpret_cast<const int4*>(scan_Index)[id];
            myResult = reinterpret_cast<const float4*>(freq)[id];
            if(myFlag.x){ row[index.x-st_index] = myResult.x; }
            if(myFlag.y){ row[index.y-st_index] = myResult.y; }
            if(myFlag.z){ row[index.z-st_index] = myResult.z; }
            if(myFlag.w){ row[index.w-st_index] = myResult.w; }
            if(tID == blockDim.x - 1 || gID == freq_Index/4 - 1){ end_index[0] = index.w + myFlag.w; }
        }
        __syncthreads();
        int count = end_index[0] - st_index;
        int rem = st_index & 0x3; //equivalent to modulo 4
        int offset = 0;
        if(rem){ offset = 4 - rem; }
        if(tID < offset && tID < count){
            new_freq[st_index+tID] = row[tID];
        }
        int tempID = 4*tID+offset;
        if((tempID+3) < count){
            reinterpret_cast<float4*>(new_freq)[tID] = make_float4(row[tempID],row[tempID+1],row[tempID+2],row[tempID+3]);
        }
        tempID = tID + offset + (count-offset)/4*4;
        if(tempID < count){ new_freq[st_index+tempID] = row[tempID]; }
    }

    int id = gID + freq_Index/4 * 4;
    if(id < freq_Index){
        if(flag[id]){
            new_freq[scan_Index[id]] = freq[id];
        }
    }
}
Obviously it gets a bit more complicated. :) While the above kernel seems stable when there are hundreds of thousands of elements in the array, I've noticed a race condition when the array numbers in the tens of millions. I'm still trying to track the bug down.
But regardless, neither method (shared memory or vectorization), together or alone, improved performance. I was especially surprised by the lack of benefit from vectorizing the memory ops. It had helped in other functions I had written, though now I am wondering whether it helped there because it increased instruction-level parallelism in the calculation steps of those functions rather than because of the fewer memory ops.
I found that the algorithm mentioned in this poster (a similar algorithm is also discussed in this paper) works pretty well, especially for compacting large arrays. It uses less memory to do it and is slightly faster (5-10%) than my previous method. I put in a few tweaks to the poster's algorithm: 1) eliminating the final warp-shuffle reduction in phase 1, since the elements can simply be summed as they are calculated; 2) giving the function the ability to work on arrays that are not a multiple of 1024 in size, plus adding grid-strided loops; and 3) allowing each thread to load its registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for the inclusive sum, for faster scans. There may be more tweaks I can make, but for now this is good.
//kernel phase 1
int myID = blockIdx.x*blockDim.x + threadIdx.x;
//padded_length is the nearest multiple of 1024 > true_length
for(int id = myID; id < (padded_length >> 5); id += blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int mask;
    unsigned int cnt = 0;

    for(int j = 0; j < 32; j++){
        int index = (warpID<<10)+(j<<5)+lnID;
        bool pred;
        if(index > true_length) pred = false;
        else pred = predicate(input[index]);
        mask = __ballot(pred);

        if(lnID == 0) {
            flag[(warpID<<5)+j] = mask;
            cnt += __popc(mask);
        }
    }

    if(lnID == 0) counter[warpID] = cnt; //store sum
}

//kernel phase 2 -> CUB inclusive sum transforms the counter array into the scan_Index array

//kernel phase 3
int myID = blockIdx.x*blockDim.x + threadIdx.x;
for(int id = myID; id < (padded_length >> 5); id += blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int predmask;
    unsigned int cnt;

    predmask = flag[(warpID<<5)+lnID];
    cnt = __popc(predmask);

    //parallel prefix sum
    #pragma unroll
    for(int offset = 1; offset < 32; offset <<= 1){
        unsigned int n = __shfl_up(cnt, offset);
        if(lnID >= offset) cnt += n;
    }

    unsigned int global_index = 0;
    if(warpID > 0) global_index = scan_Index[warpID - 1];

    for(int i = 0; i < 32; i++){
        unsigned int mask = __shfl(predmask, i); //broadcast from thread i
        unsigned int sub_group_index = 0;
        if(i > 0) sub_group_index = __shfl(cnt, i-1);
        if(mask & (1 << lnID)){
            compacted_array[global_index + sub_group_index + __popc(mask & ((1 << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID];
        }
    }
}
EDIT: There is a newer article by a subset of the poster's authors in which they examine a faster variation of compact than what is written above. However, their new version is not order-preserving, so it is not useful for me, and I haven't implemented it to test it out. That said, if your project doesn't rely on object order, their newer compact version can probably speed up your algorithm.

max FLOPS for matrix multiplication Intel/AMD CPU

My formula for estimating the maximum FLOPs/s of an Intel CPU is
Max SP FLOPs/s = frequency * 4 (SSE; 8 with AVX) * 2 (MAC) * number of cores (not HW threads)
Max DP FLOPs/s = 0.5 * Max SP FLOPs/s
By MAC I mean that the CPU can do one SSE (AVX) multiplication and one addition at the same time. On the system I'm using, the maximum frequency under load is 2.66 GHz. It only has SSE and the number of cores (not hardware threads) is 4. That gives: Max SP FLOPs/s = 85.12 GFLOPs/s.
The number of FLOPs for a matrix multiplication is approximately 2*n*m*k. For a square matrix with n = 1000 that's 2*10^9 FLOPs (2 billion FLOPs). Once I know the time, I can estimate the FLOPs/s.
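For example, if the n = 1000 single-precision multiply takes, say, 50 ms:

2*10^9 FLOPs / 0.050 s = 40 GFLOPs/s, and 40 / 85.12 ≈ 47% of the peak estimated above.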
However, the best I can get with my own code is about 40 SP GFLOPs/s, for example with n = 1000. I get about the same result with Eigen. That's about 45% efficiency. Is my calculation of the maximum wrong? What's the best efficiency for an Intel CPU for large dense matrix multiplication? Does anyone have a paper describing this?
I know that on a GPU the efficiency can be more than 60%.
http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3
Edit:
I get similar results for n=500, which easily fits in the 12 MB L3 cache of my system, so the cache does not seem to be the limiting factor (though maybe I can use it more efficiently).
Edit2:
The Eigen benchmarks show it as good as MKL (for SSE). They use an Intel Core 2 Quad CPU Q9400 @ 2.66 GHz, so 2.66 * 2 (DP SSE) * 2 (MAC) * 4 cores = 42.56 DP GFLOPs/s. You can see on the plot that they are all getting less than 20 - something on the order of 45%, like me.
http://eigen.tuxfamily.org/index.php?title=Benchmark
http://ark.intel.com/products/35365/Intel-Core2-Quad-Processor-Q9400-6M-Cache-2_66-GHz-1333-MHz-FSB
Edit3:
Here is my code, for anyone who cares. I can get slightly better results than this, but not much better. I'm using Agner Fog's vectorclass for SSE/AVX, with Vec8f aliased to float8 and Vec4d to double4.
// SGEMM and AVX call GEMM_tile<float, float8>(nthreads, a, b, c, n, m, k);
template <typename ftype, typename floatn>
void GEMM_tile(const int nthreads, const ftype* A, const ftype* B, ftype* C, const int N, const int M, const int K) {
    for(int i=0; i<N; i++) {
        for(int j=0; j<K; j++) {
            C[K*i + j] = 0;
        }
    }
    const int nc = 32;
    const int kc = 32;
    const int mc = 32;
    omp_set_num_threads(nthreads);
    #pragma omp parallel for if(nthreads>1)
    for(int ii=0; ii<N; ii+=nc) {
        for(int jj=0; jj<K; jj+=kc)
            for(int ll=0; ll<M; ll+=mc) {
                const int nb = min(N-ii, nc);
                const int kb = min(K-jj, kc);
                const int mb = min(M-ll, mc);
                MM_block<ftype, floatn>(nb, mb, kb, &A[M*ii+ll], N, &B[K*ll+jj], K, &C[K*ii+jj], K);
            }
    }
}
template <typename ftype, typename floatn>
void MM_block(int n, int m, int k, const ftype *a, const int stridea,
              const ftype *b, const int strideb,
              ftype *c, const int stridec) {
    const int vec_size = sizeof(floatn)/sizeof(ftype);
    for(int i=0; i<n; i+=4) {
        for(int j=0; j<k; j+=vec_size) {
            Dot4x4_vec_block<ftype, floatn>(m, &a[strideb*i], &b[j], &c[stridec*i + j], stridea, strideb, stridec);
        }
    }
}
template <typename ftype, typename floatn>
inline void Dot4x4_vec_block(const int n, const ftype *a, const ftype *b, ftype *c, const int stridea, const int strideb, const int stridec) {
    floatn tmp0, tmp1, tmp2, tmp3;
    load(tmp0, &c[stridec*0]);
    load(tmp1, &c[stridec*1]);
    load(tmp2, &c[stridec*2]);
    load(tmp3, &c[stridec*3]);

    ftype *a0_ptr = (ftype*)&a[stridea*0];
    ftype *a1_ptr = (ftype*)&a[stridea*1];
    ftype *a2_ptr = (ftype*)&a[stridea*2];
    ftype *a3_ptr = (ftype*)&a[stridea*3];

    for(int i=0; i<n; i++) {
        floatn breg = floatn().load(&b[i*strideb + 0]);
        floatn areg0 = *a0_ptr++;
        floatn areg1 = *a1_ptr++;
        floatn areg2 = *a2_ptr++;
        floatn areg3 = *a3_ptr++;
        tmp0 += areg0 * breg;
        tmp1 += areg1 * breg;
        tmp2 += areg2 * breg;
        tmp3 += areg3 * breg;
    }
    tmp0.store(&c[stridec*0]);
    tmp1.store(&c[stridec*1]);
    tmp2.store(&c[stridec*2]);
    tmp3.store(&c[stridec*3]);
}
Often, the limiting factor for processing throughput is memory bandwidth, especially in cases where your working set doesn't fit into the CPU cache (your 1000-by-1000 matrix of float will take up ~4 MB, whereas your CPU probably has a 2 MB L3 cache). This is a situation where the structure of your algorithm can make a big difference in how it performs, but you will usually hit a wall at some point where you just can't get any faster because you're waiting on values to come from some higher level in the memory hierarchy.
In addition, your theoretical numbers assume that you have sufficient instructions without data dependencies to keep all of the execution units tasked on every cycle. This can be very difficult to do in practice. I'm not sure what the optimum throughput for a general matrix multiply would be, but check out this previous question for information on what you can do to maximize the instruction throughput.
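To illustrate the data-dependency point, consider a dot product (a simplified sketch with illustrative function names, not matrix-multiply code): a single accumulator serializes on the add latency, while several independent accumulators keep more of the floating-point units busy. Note that the reassociation changes the floating-point summation order, which is why compilers typically only do this automatically under flags like -ffast-math.

// Single accumulator: each += must wait for the previous result.
float dot_naive(const float *a, const float *b, int n) {
    float s = 0.f;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

// Four independent accumulators: four dependency chains can be in flight at once
// (n assumed to be a multiple of 4 for brevity).
float dot_ilp(const float *a, const float *b, int n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}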
