Why is BLAS SGEMM slow? - openmp

I'm measuring the performance of three approaches to matrix multiplication: a naive blocked OpenMP implementation, Eigen, and SGEMM from MKL 2021.4.0. For simplicity, all matrices are square, of type float, of size n x n, and aligned to 64 bytes. The compiler is GCC 8.3.1 with the flags -msse4.2 -O3 -fopenmp; the OS is CentOS 7.
I don't understand why MKL SGEMM is the slowest. Why is a naive OpenMP implementation faster than a heavily optimized library?
Blocked OpenMP (BS = n / 64):
#pragma omp for collapse(2)
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        C[i*n+j] *= beta;

#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < n; i += BS)
    for (int k = 0; k < n; k += BS)
        for (int j = 0; j < n; j += BS)
            for (int ii = i; ii < i+BS; ii++)
                for (int kk = k; kk < k+BS; kk++)
                    for (int jj = j; jj < j+BS; jj++)
                        C[ii*n+jj] += alpha*A[ii*n+kk]*B[kk*n+jj];
Eigen
Eigen::Map<const Eigen::MatrixXf> AM(A, n, n);
Eigen::Map<const Eigen::MatrixXf> BM(B, n, n);
Eigen::Map<Eigen::MatrixXf> CM(C, n, n);
CM.noalias() = beta*CM + alpha*(BM * AM); // fortran order!
MKL SGEMM
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            n, n, n, alpha, A, n, B, n, beta, C, n);
The Google Benchmark results on an Intel Xeon Silver 4114 (2 sockets, 2 NUMA nodes):
Benchmark                              Time           CPU   Iterations
----------------------------------------------------------------------
MatMul/OmpBlk/4096/64/real_time     1132 ms       1038 ms            1
MatMul/OmpBlk/16384/64/real_time   83668 ms      80612 ms            1
MatMul/OmpBlk/32768/64/real_time 1562980 ms    1492184 ms            1
MatMul/Eigen/4096/real_time          878 ms        867 ms            1
MatMul/Eigen/16384/real_time       36140 ms      31629 ms            1
MatMul/Eigen/32768/real_time      259762 ms     246788 ms            1
MatMul/Blas/4096/real_time          4091 ms       3719 ms            1
MatMul/Blas/16384/real_time       219940 ms     219581 ms            1
MatMul/Blas/32768/real_time      1773874 ms    1750015 ms            1
Simple average of three runs, timed without Google Benchmark (one warm-up run not included):
OmpBlk/4096: 1452 ms
OmpBlk/16384: 87494 ms
Eigen/4096: 818 ms
Eigen/16384: 34719 ms
Blas/4096: 4060 ms
Blas/16384: 225647 ms
ldd snippet:
libmkl_intel_ilp64.so.1 => /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_ilp64.so.1
libmkl_core.so.1 => /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_core.so.1
libmkl_intel_thread.so.1 => /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.so.1
libiomp5.so => /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64/libiomp5.so

Related

How to optimize my DP solution to an algorithmic task

We have a computer that we need to power for n days. Each day the shop offers m batteries, each of which lasts for only one day. Additionally, when you buy k items on a given day, you have to pay a tax of k^2. Print the minimum cost to run the computer for n days.
For example, for the input
5 5
100 1 1 1 1
100 2 2 2 2
100 3 3 3 3
100 4 4 4 4
100 5 5 5 5
the output will be 18.
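(One way to reach 18: buy two of the price-1 batteries on day 1, paying 1 + 1 plus a tax of 2^2 = 6 in total, then one battery on each of days 2, 3 and 4 for 2 + 1, 3 + 1 and 4 + 1; the spare battery bought on day 1 covers day 5. So batteries bought on earlier days can be saved for later days.)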
And for the input
10 1
1000000000
1000000000
1000000000
1000000000
1000000000
1000000000
1000000000
1000000000
1000000000
1000000000
the output will be 10000000010.
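(Here only one battery is offered per day, so you have to buy exactly one battery each day: 10 * (10^9 + 1^2 tax) = 10000000010.)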
I cannot do better than checking all possibilities. Can you point me to a better solution? Limits: 1 <= n * m <= 10^6; each price is between 1 and 10^9.
For each day you can simply sort that day's prices (you want to take the cheapest batteries first) and push each item into a priority queue as its value plus its marginal tax, where the marginal tax of the j-th cheapest item of that day is 2*j - 1 (this works because j^2 - (j-1)^2 = 2*j - 1). Then each day you remove the first item from the queue (the best battery you can currently buy).
#include <iostream>
#include <utility>
#include <algorithm>
#include <queue>
#include <cmath>
#include <vector>
using namespace std;

int n, m;
vector<long long> pom;            // prices offered on the current day
int x;
priority_queue<long long> q;      // negated adjusted prices, so the max-heap acts as a min-heap
long long score;

int main() {
    cin.tie(0);
    ios_base::sync_with_stdio(0);
    cin >> n >> m;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            cin >> x;
            pom.push_back(x);
        }
        sort(pom.begin(), pom.end());
        // charge the j-th cheapest offer of the day its marginal tax 2*(j+1) - 1
        for (int j = 0; j < (int)pom.size(); j++) {
            pom[j] += 1 + 2 * j;
            q.push(-pom[j]);
        }
        pom.clear();
        // buy the cheapest battery currently available for this day
        score += q.top();
        q.pop();
    }
    cout << -score;
}
The complexity of this solution is O(n*m*log(n*m)): sorting each day's offers costs O(m log m), and each of the priority-queue operations costs O(log(n*m)).

Random memory write is slower than random memory read?

I'm trying to figure out the memory access time of sequential/random memory reads and writes. Here's the code:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define PRINT_EXCECUTION_TIME(msg, code)              \
  do {                                                \
    struct timeval t1, t2;                            \
    double elapsed;                                   \
    gettimeofday(&t1, NULL);                          \
    do {                                              \
      code;                                           \
    } while (0);                                      \
    gettimeofday(&t2, NULL);                          \
    elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0;       \
    elapsed += (t2.tv_usec - t1.tv_usec) / 1000.0;    \
    printf(msg " time: %f ms\n", elapsed);            \
  } while (0);

const int RUNS = 20;
const int N = (1 << 27) - 1;
int *data;

int seqR() {
  register int res = 0;
  register int *data_p = data;
  register int pos = 0;
  for (register int j = 0; j < RUNS; j++) {
    for (register int i = 0; i < N; i++) {
      pos = (pos + 1) & N;
      res = data_p[pos];
    }
  }
  return res;
}

int seqW() {
  register int res = 0;
  register int *data_p = data;
  register int pos = 0;
  for (register int j = 0; j < RUNS; j++) {
    for (register int i = 0; i < N; i++) {
      pos = (pos + 1) & N;
      data_p[pos] = res;
    }
  }
  return res;
}

int rndR() {
  register int res = 0;
  register int *data_p = data;
  register int pos = 0;
  for (register int j = 0; j < RUNS; j++) {
    for (register int i = 0; i < N; i++) {
      pos = (pos + i) & N;
      res = data_p[pos];
    }
  }
  return res;
}

int rndW() {
  register int res = 0;
  register int *data_p = data;
  register int pos = 0;
  for (register int j = 0; j < RUNS; j++) {
    for (register int i = 0; i < N; i++) {
      pos = (pos + i) & N;
      data_p[pos] = res;
    }
  }
  return res;
}

int main() {
  data = (int *)malloc(sizeof(int) * (N + 1));
  assert(data);
  for (int i = 0; i < N; i++) {
    data[i] = i;
  }
  for (int i = 0; i < 10; i++) {
    PRINT_EXCECUTION_TIME("seqR", seqR());
    PRINT_EXCECUTION_TIME("seqW", seqW());
    PRINT_EXCECUTION_TIME("rndR", rndR());
    PRINT_EXCECUTION_TIME("rndW", rndW());
  }
  return 0;
}
I used GCC 6.5.0 with -O0 to prevent optimization, but got results like this:
seqR time: 2538.010000 ms
seqW time: 2394.991000 ms
rndR time: 40625.169000 ms
rndW time: 46184.652000 ms
seqR time: 2411.038000 ms
seqW time: 2309.115000 ms
rndR time: 41575.063000 ms
rndW time: 46206.275000 ms
It's easy to understand that sequential access is way faster than random access. However, it doesn't make sense to me that random write is slower than random read, while sequential write is faster than sequential read. What could cause this?
In addition, am I safe to say memory bandwidth for seqR is (20 * ((1 << 27) - 1) * 4 * 1024 * 1024 * 1024)GB / (2.538)s = 4.12GB/s?
Sounds normal. All x86-64 CPUs (and most other modern CPUs) use write-back / write-allocate caches, so a write costs a read before the line can be committed to cache, plus an eventual write-back.
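For example, a random store to a line that isn't cached first triggers a read-for-ownership of the whole 64-byte line and later a write-back of the dirty line, so the random-write loop moves roughly twice as much DRAM traffic as the random-read loop, whose lines are simply evicted clean.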
with -O0 to prevent optimization
Since you used register on all your locals, this is one of the rare times when this didn't make your benchmark meaningless.
You could have just used volatile on your arrays, though, to make sure every one of those accesses happened in order, while leaving it up to the optimizer how to make that happen.
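As a rough sketch of that idea for the sequential-read loop (reusing the question's globals data, N and RUNS; the function name is made up), declaring the pointer as pointer-to-volatile forces every element access to happen even when compiling with optimization enabled:
int seqR_volatile(void) {
    volatile int *data_p = data;   /* volatile: each load below must really be performed */
    int res = 0;
    int pos = 0;
    for (int j = 0; j < RUNS; j++) {
        for (int i = 0; i < N; i++) {
            pos = (pos + 1) & N;   /* same index pattern as the original seqR */
            res = data_p[pos];     /* cannot be hoisted or optimized away */
        }
    }
    return res;
}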
Am I safe to say memory bandwidth for seqR is (20 * ((1 << 27) - 1) * 4 * 1024 * 1024 * 1024)GB / (2.538)s = 4.12GB/s?
No, you have an extra factor of 2^30 and 10^9 in your numerator. But you did it wrong and got close to the right number anyway.
The correct calculation is RUNS * N * sizeof(int) / time bytes per second, or that divided by 10^9 for GB/s (or divided by 2^30 for base-2 GiB/s). Memory sizes are usually quoted in GiB, but you can take your pick for bandwidth; DRAM clock speeds are normally things like 1600 MHz, so base-10 GB = 10^9 bytes is certainly normal for theoretical max bandwidths in GB/s.
So 4.23 GB/s in base-10 GB.
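(That is 20 * ((1 << 27) - 1) * 4 ≈ 1.074 * 10^10 bytes moved, divided by 2.538 s ≈ 4.23 * 10^9 bytes per second, or about 3.94 GiB/s in base-2 units.)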
Yes, you initialized the array first, so neither timed run is triggering page faults, but I might still have used the second run, after the CPU has warmed up to max turbo (if it hadn't already).
But keep in mind this is un-optimized code. That's how fast your un-optimized code ran; it doesn't tell you much about how fast your memory is. It's probably CPU-bound, not memory-bound.
Especially with a redundant & N in there to match the CPU work of the rndR/W functions. HW prefetching is probably able to keep up with 4GB/s, but it's still not even reading 1 int per clock cycle.
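(For scale: at, say, 3 GHz, loading one 4-byte int every clock cycle would be 12 GB/s, so 4.2 GB/s is roughly one int every three cycles.)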

OpenMP reduction on SSE2 vector

I want to compute the per-channel average of an image (3 channels of interest + 1 alpha channel we ignore here) using SSE2 intrinsics. I tried this:
__m128 average = _mm_setzero_ps();
#pragma omp parallel for reduction(+:average)
for (size_t k = 0; k < roi_out->height * roi_out->width * ch; k += ch)
{
    float *in = ((float *)temp) + k;
    average += _mm_load_ps(in);
}
But I get this error with GCC: "user-defined reduction not found for 'average'".
Is this possible with SSE2? What's wrong?
Edit
This works:
float sum[4] = { 0.0f };
#pragma omp parallel for simd reduction(+:sum[:4])
for (size_t k = 0; k < roi_out->height * roi_out->width * ch; k += ch)
{
    float *in = ((float *)temp) + k;
    for (int i = 0; i < ch; ++i) sum[i] += in[i];
}
const __m128 average = _mm_load_ps(sum) / ((float)roi_out->height * roi_out->width);
You can declare a user-defined reduction like this:
#pragma omp declare reduction \
    (addps : __m128 : omp_out += omp_in) \
    initializer(omp_priv = _mm_setzero_ps())
And then use it like:
#pragma omp parallel for reduction(addps:average)
for (size_t k = 0; k < size * ch; k += ch)
{
    average += _mm_loadu_ps(data + k);
}
I think, most importantly, OpenMP needs to know how to get a neutral element (here _mm_setzero_ps()) for your reduction.
Full working example: https://godbolt.org/z/Fpqttc
Interesting link: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-reduction.html#User-definedreductions
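For reference, a minimal self-contained sketch along the same lines (the array name, its dummy contents and the NPIXELS count are invented here purely for illustration; compile with something like gcc -O2 -fopenmp -msse2):
#include <immintrin.h>
#include <stdio.h>

/* Tell OpenMP how to combine two partial __m128 sums and what the neutral element is. */
#pragma omp declare reduction \
    (addps : __m128 : omp_out = _mm_add_ps(omp_out, omp_in)) \
    initializer(omp_priv = _mm_setzero_ps())

#define NPIXELS 1024

int main(void) {
    static float data[NPIXELS * 4] __attribute__((aligned(16)));
    for (int i = 0; i < NPIXELS * 4; ++i) data[i] = 1.0f;    /* dummy RGBA image */

    __m128 sum = _mm_setzero_ps();
    #pragma omp parallel for reduction(addps : sum)
    for (int k = 0; k < NPIXELS * 4; k += 4)
        sum = _mm_add_ps(sum, _mm_load_ps(data + k));        /* one pixel (4 floats) per step */

    float avg[4] __attribute__((aligned(16)));
    _mm_store_ps(avg, _mm_div_ps(sum, _mm_set1_ps((float)NPIXELS)));
    printf("R=%f G=%f B=%f A=%f\n", avg[0], avg[1], avg[2], avg[3]);
    return 0;
}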

Nested loop in OpenMP performance issue

I have these uninformative nested loops (just as a performance test):
const int N = 300;
for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
The elapsed time was
about 6 s
I tried to parallelize different loops with OpenMP, but I am very confused by the results I got.
As a first step, I used the "parallel for" pragma only on the first (outermost) loop:
#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
The elapsed time was (2 cores):
3.81 s
Then I tried to parallelize the two inner loops with the "collapse" clause (2 cores):
for (int num = 0; num < 10000; num++) {
    #pragma omp parallel for collapse(2) schedule(static) reduction(+:sum1, sum2)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
The elapsed time was
3.76 s
This is faster than the previous case, and I do not understand why.
If I fuse these inner loops manually (which is supposed to be better for performance), like this:
#pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
for (int n = 0; n < N * N; n++) {
    int i = n / N; int j = n % N;
the elapsed time is
5.53 s
This confuses me a lot. The performance is worse in this case, although people usually advise fusing loops for better performance.
Okay, now let's try to parallelize only the middle loop, like this (2 cores):
for (int num = 0; num < 10000; num++) {
    #pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
Again, the performance becomes better:
3.703 s
And the final step - parallelization of the innermost loop only (assuming that this will be the fastest case according to the previous results) (2 cores):
for (int num = 0; num < 10000; num++) {
    for (int i = 0; i < N; i++) {
        #pragma omp parallel for schedule(static) reduction(+:sum1,sum2)
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
But (surprise!) the elapsed time is
about 11 s
This is much slower than in the previous cases. I cannot figure out the reason for any of this.
By the way, I was looking at similar questions, and I found the advice of adding
#pragma omp parallel
before the first loop (for example, in this and that question). But why is that the right approach? If we place
#pragma omp parallel
before a for loop, it means that each thread executes the whole for loop, which is incorrect (excess work). Indeed, I tried to insert
#pragma omp parallel
before the outermost loop with different placements of
#pragma omp parallel for
as described here, and the performance was worse in all cases (moreover, in the last case, when parallelizing the innermost loop only, the answer was also incorrect: sum2 came out different, since there was a race condition).
I would like to know the reasons for this performance behavior (probably the data-exchange time is greater than the actual computation time on each thread, but that would only explain the last case) and which solution is the most correct one.
EDIT: I also disabled compiler optimization (with the -O0 option) and the results are still the same (except that the elapsed time in the last example, parallelizing the innermost loop only, dropped from 11 s to 8 s).
Compiler options:
g++ -std=gnu++0x -fopenmp -O0 test.cpp
Definition of variables:
unsigned int seed;
const int N = 300;
int main()
{
    double arr[N][N];
    double brr[N][N];
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = i * j;
            brr[i][j] = i + j;
        }
    }
    double start = omp_get_wtime();
    double crr[N][N];
    double sum1 = 0;
    double sum2 = 0;
And the final step - parallelization of the innermost loop only (assuming that this will be the fastest case according to the previous results) (2 cores)
But (surprise!) the elapsed time is:
about 11 s
It is not a surprise at all. Parallel regions perform implicit barriers and may even create and join threads (some libraries may use thread pools to reduce the cost of thread creation).
In the end, opening parallel regions is expensive, so you should do it as few times as possible. With a single enclosing parallel region, all threads execute the outer loop at the same time, but they divide the iteration space between them once they reach the omp for block, so the result is still correct (you should make your program check this if you are unsure).
For testing performance, you should always run your experiments with compiler optimizations turned on, as they have a heavy impact on the behavior of the application (you should not draw conclusions about performance from unoptimized programs, because their problems may already be addressed during optimization).
Making a single parallel region that contains all the loops halves the execution time in my setup (from 9.536 s with 2 threads down to 4.757 s).
The omp for block still applies an implicit barrier, which is not needed in your example. Adding the nowait clause reduces the execution time by another half: 2.120 s.
From this point, you can now try to explore the other options.
Parallelizing the middle loop reduces the execution time to only 0.732 s, thanks to much better usage of the memory hierarchy and vectorization. The L1 miss ratio drops from ~29% to ~0.3%.
Using collapse on the two innermost loops made no significant difference with two threads (strong scaling should be checked).
Using other directives such as omp simd does not improve performance in this case, as the compiler is already sure it can vectorize the innermost loop safely.
#pragma omp parallel reduction(+:sum1,sum2)
for (int num = 0; num < 10000; num++) {
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            arr[i][j] = brr[i][j];
            crr[i][j] = arr[i][j] - brr[i][j];
            sum1 += crr[i][j];
            sum2 += arr[i][j];
        }
    }
}
Note: L1 miss ratio computed using perf:
$ perf stat -e cache-references,cache-misses -r 3 ./test
Since variables in a parallel program are shared among threads (cores), you should consider how the processor's cache memory comes into play. At this point your code might be executing with false sharing, which can hurt processor performance.
In your first parallel version you put the pragma right at the first for, which means each thread has its own i and j. Compare this with the second and third versions (differing only by collapse), which parallelize the second for, so that each i has its own j. These two perform better because each thread/core more often hits the cache lines touched by the j loop. The fourth version is a complete disaster for the processor's caches, because nothing can be shared there.
I recommend measuring your code with Intel's PCM or PAPI in order to get a proper analysis.
Regards.

Loop sequence in OpenMP Collapse performance advice

I found Intel's performance suggestion about the collapse clause in OpenMP on Xeon Phi.
#pragma omp parallel for collapse(2)
for (i = 0; i < imax; i++) {
    for (j = 0; j < jmax; j++) a[j + jmax*i] = 1.;
}
Modified example for better performance:
#pragma omp parallel for collapse(2)
for (i = 0; i < imax; i++) {
    for (j = 0; j < jmax; j++) a[k++] = 1.;
}
I tested both cases in Fortran with similar code on a regular CPU using GFortran 4.8, and both give the correct result. The same test with the second version does not pass with GFortran 5.2.0 or Intel 14.0.
But as far as I understand, an OpenMP loop body should avoid variables that depend on the loop sequence, which here is k. So why does the second case give the correct result, and even better performance?
Here is the equivalent code for the two approaches when the collapse clause is applied. You can see that the second one is better.
for (int k = 0; k < imax*jmax; k++) {
    int i = k / jmax;
    int j = k % jmax;
    a[j + jmax*i] = 1.;
}

for (int k = 0; k < imax*jmax; k++) {
    a[k] = 1.;
}
