Intel compiler OpenMP SIMD AVX-512 performance problem - openmp

I am learning the OpenMP SIMD construct and wrote a small program to test the performance of SIMD.
The system is CentOS 7. The CPU is an Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz, which I believe supports AVX-512.
This is my code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#define M 100000
int main(){
    float a[M],b[M];
    double t1,t2;
    t1=omp_get_wtime();
    #pragma omp simd
    for(int j=0;j<M;j++){
        for(int i=0;i<M;i++){
            a[i]=log(pow(2.71828,(pow(sin(pow(1.1,1.1)),1.1)+1.0))+j);
            b[i]=cos(log(pow(2.71828,pow(sin(pow(1.1,1.1)),1.1)+1.0))+j);
        }
    }
    t2=omp_get_wtime();
    printf("Elapsed CPU time = %lf seconds.\n",t2-t1);
    printf("simd:a[10] = %f , b[10] = %f \n",a[10],b[10]);
    printf("\n");
    t1=omp_get_wtime();
    for(int j=0;j<M;j++){
        for(int i=0;i<M;i++){
            a[i]=log(pow(2.71828,(pow(sin(pow(1.1,1.1)),1.1)+1.0))+j);
            b[i]=cos(log(pow(2.71828,pow(sin(pow(1.1,1.1)),1.1)+1.0))+j);
        }
    }
    t2=omp_get_wtime();
    printf("Elapsed CPU time = %lf seconds.\n",t2-t1);
    printf("simd:a[10] = %f , b[10] = %f \n",a[10],b[10]);
    printf("\n");
}
I compiled my program with the Intel compiler 2021.6.0 in four ways:
icc -qopenmp -qopt-report -march=pentium4m 7.c
icc -qopenmp -qopt-report -march=corei7 7.c
icc -qopenmp -qopt-report -march=core-avx2 7.c
icc -qopenmp -qopt-report -march=skylake-avx512 -qopt-zmm-usage=high 7.c
which, if I am correct, use MMX, SSE, AVX2, and AVX-512 respectively.
The MMX build works as expected: the speedup ratio is 2 (64/32). But the other three give speedup ratios of 2, 4, and 8.
Shouldn't they be 4, 8, and 16 (128/32, 256/32, and 512/32)?
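For comparison (this is not the code from the question), here is a minimal sketch of the kind of kernel for which the width/32 reasoning applies more directly: the loop body is plain float arithmetic, so the compiler can map it onto 4-, 8-, or 16-lane vector instructions instead of calls into a vector math library. The array size and repetition count are arbitrary.

#include <stdio.h>
#include <omp.h>
#define N 100000
#define REPS 10000

int main(void){
    /* static arrays so they are zero-initialized and not on the stack */
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i * 0.5f; b[i] = i * 0.25f; }

    double t1 = omp_get_wtime();
    for (int rep = 0; rep < REPS; rep++) {
        #pragma omp simd
        for (int i = 0; i < N; i++)
            c[i] += a[i] * b[i];   /* one fused multiply-add per element */
    }
    double t2 = omp_get_wtime();
    printf("Elapsed time = %lf seconds, c[10] = %f\n", t2 - t1, c[10]);
    return 0;
}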

Related

Raspberry PI4 Segmentation Fault

I have a short program that causes a segmentation fault on an RPi4 after being run several times (e.g. 10 times in a loop).
I am using Raspbian GNU/Linux 10 (buster) and the default gcc compiler (sudo apt install build-essential):
gcc --version
gcc (Raspbian 8.3.0-6+rpi1) 8.3.0
Do you think this is a gcc compiler problem? Maybe I am missing some special setting for the RPi4.
I am using this to build:
gcc threads.c -o threads -l pthread
The output is sometimes (not always) something like this:
...
in thread_dummy, loop: 003
Segmentation fault
The code is here:
#include <stdio.h>   /* for puts() */
#include <unistd.h>  /* for sleep() */
#include <stdlib.h>  /* for EXIT_SUCCESS */
#include <pthread.h>

#define PTR_SIZE (0xFFFFFF)
#define PTR_CNT (10)

void* thread_dummy(void* param)
{
    void* ptr = malloc(PTR_SIZE);
    //fprintf(stderr, "thread num: %03i, stack: %08X, heap: %08X - %08X\n", (int)param, (unsigned int)&param, (unsigned int)ptr, (unsigned int)((unsigned char*)ptr + PTR_SIZE));
    fprintf(stderr, "in thread_dummy, loop: %03i\n", (int)param);
    sleep(1);
    free(ptr);
    pthread_detach(pthread_self());
    return NULL;
}

int main(void)
{
    void* ptrs[PTR_CNT];
    pthread_t threads[PTR_CNT];
    for(int i=0; i<PTR_CNT; ++i)
    {
        ptrs[i] = malloc(PTR_SIZE);
        //fprintf(stderr, "main num: %03i, stack: %08X, heap: %08X - %08X\n", i, (unsigned int)&ptrs, (unsigned int)ptrs[i], (unsigned int)((unsigned char*)ptrs[i] + PTR_SIZE));
        fprintf(stderr, "in main, loop: %03i\n", i);
    }
    fprintf(stderr, "-----------------------------------------------------------\n");
    for(int i=0; i<PTR_CNT; ++i)
        pthread_create(&threads[i], 0, thread_dummy, (void*)i);
    for(int i=0; i<PTR_CNT; ++i)
        pthread_join(threads[i], NULL);
    for(int i=0; i<PTR_CNT; ++i)
        free(ptrs[i]);
    return EXIT_SUCCESS;
}
UPDATE:
I also tested it with a newer gcc, but the problem remains:
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/arm-linux-gnueabihf/11.1.0/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../configure --enable-languages=c,c++,fortran --with-cpu=cortex-a72 --with-fpu=neon-fp-armv8 --with-float=hard --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.1.0 (GCC)
pthread_create is like malloc, and pthread_detach or pthread_join is like free. You are basically doing something like a "double free": you detach the thread and join it at the same time. Either detach the thread or join it, not both.
You could remove pthread_join from main, but logically you should instead remove pthread_detach(...) from inside the thread; it is useless there anyway, because the thread terminates right after it.
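As an illustration of the join-only variant, here is a minimal sketch of the question's program with pthread_detach removed (the ptrs[] bookkeeping from the original is left out for brevity):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>    /* for intptr_t */
#include <unistd.h>
#include <pthread.h>

#define PTR_SIZE (0xFFFFFF)
#define PTR_CNT  (10)

/* Same thread body as in the question, but without pthread_detach():
   main() joins every thread, so the threads must not also detach themselves. */
static void* thread_dummy(void* param)
{
    void* ptr = malloc(PTR_SIZE);
    fprintf(stderr, "in thread_dummy, loop: %03i\n", (int)(intptr_t)param);
    sleep(1);
    free(ptr);
    return NULL;
}

int main(void)
{
    pthread_t threads[PTR_CNT];

    for (int i = 0; i < PTR_CNT; ++i)
        pthread_create(&threads[i], NULL, thread_dummy, (void*)(intptr_t)i);

    for (int i = 0; i < PTR_CNT; ++i)
        pthread_join(threads[i], NULL);   /* join exactly once per thread */

    return EXIT_SUCCESS;
}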

Why do I get 128 teams/thread blocks and 96 threads in each team/thread block when using #pragma omp target teams distribute parallel for in OpenMP?

I am running this code on Ubuntu 18.04 with the clang/llvm compiler and an Nvidia GTX 1070 GPU:
#pragma omp target data map(to: A,B) map(from: C)
{
    #pragma omp target teams distribute
    for(int n=0; n<Row; n++)
    {
        int team_id= omp_get_team_num();
        #pragma omp parallel for default(shared) schedule(auto)
        for(int j = 0; j <Col; j++)
        {
            int thread_id = omp_get_thread_num();
            printf("Iteration= c[ %d ][ %d ], Team=%d, Thread=%d\n",n, j, team_id, thread_id);
            C[n][j] = A[n][j] + B[n][j];
        }
    }
}
In the above code, the maximum team number is 127 and the maximum thread number is 95.
Compile flags: clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_61 -Wall -O3 debug.cpp -o debug
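For reference, the number of teams and the per-team thread limit can be requested explicitly with the num_teams and thread_limit clauses rather than left to the runtime defaults. A minimal self-contained sketch (the sizes 64 and 32 and the ROW/COL dimensions are arbitrary illustrations):

#include <stdio.h>
#include <omp.h>

#define ROW 128
#define COL 96

int main(void)
{
    static float A[ROW][COL], B[ROW][COL], C[ROW][COL];
    for (int n = 0; n < ROW; n++)
        for (int j = 0; j < COL; j++) { A[n][j] = n; B[n][j] = j; }

    /* Ask for at most 64 teams and at most 32 threads per team instead of
       relying on the runtime defaults. */
    #pragma omp target teams distribute map(to: A, B) map(from: C) num_teams(64) thread_limit(32)
    for (int n = 0; n < ROW; n++)
    {
        #pragma omp parallel for
        for (int j = 0; j < COL; j++)
            C[n][j] = A[n][j] + B[n][j];
    }

    printf("C[1][2] = %f\n", C[1][2]);
    return 0;
}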

AVX vs. SSE: expect to see a larger speedup

I expected AVX to be about 1.5x faster than SSE, but the measured difference is much smaller (see the timings below). All three arrays (3 arrays * 16384 elements * 4 bytes/element = 196608 bytes) should fit in the L2 cache (256 KB) of an Intel Core CPU (Broadwell).
Are there any special compiler directives or flags that I should be using?
Compiler Version
$ clang --version
Apple LLVM version 9.0.0 (clang-900.0.38)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Compile line
$ make avx
clang -O3 -fno-tree-vectorize -msse -msse2 -msse3 -msse4.1 -mavx -mavx2 avx.c ; ./a.out 123
n: 123
AVX Time taken: 0 seconds 177 milliseconds
vector+vector:begin int: 1 5 127 0
SSE Time taken: 0 seconds 195 milliseconds
vector+vector:begin int: 1 5 127 0
avx.c
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>
#include <time.h>
#ifndef __cplusplus
#include <stdalign.h>   // C11 defines _Alignas(). This header defines alignas()
#endif
#define REPS 50000
#define AR 16384

// add int vectors via AVX
__attribute__((noinline))
void add_iv_avx(__m256i *restrict a, __m256i *restrict b, __m256i *restrict out, int N) {
    __m256i *x = __builtin_assume_aligned(a, 32);
    __m256i *y = __builtin_assume_aligned(b, 32);
    __m256i *z = __builtin_assume_aligned(out, 32);
    const int loops = N / 8; // 8 is number of int32 in __m256i
    for(int i=0; i < loops; i++) {
        _mm256_store_si256(&z[i], _mm256_add_epi32(x[i], y[i]));
    }
}

// add int vectors via SSE; https://en.wikipedia.org/wiki/Restrict
__attribute__((noinline))
void add_iv_sse(__m128i *restrict a, __m128i *restrict b, __m128i *restrict out, int N) {
    __m128i *x = __builtin_assume_aligned(a, 16);
    __m128i *y = __builtin_assume_aligned(b, 16);
    __m128i *z = __builtin_assume_aligned(out, 16);
    const int loops = N / sizeof(int);
    for(int i=0; i < loops; i++) {
        //out[i]= _mm_add_epi32(a[i], b[i]); // this also works
        _mm_storeu_si128(&z[i], _mm_add_epi32(x[i], y[i]));
    }
}

// printing
void p128_as_int(__m128i in) {
    alignas(16) uint32_t v[4];
    _mm_store_si128((__m128i*)v, in);
    printf("int: %i %i %i %i\n", v[0], v[1], v[2], v[3]);
}

__attribute__((noinline))
void debug_print(int *h) {
    printf("vector+vector:begin ");
    p128_as_int(* (__m128i*) &h[0] );
}

int main(int argc, char *argv[]) {
    int n = atoi (argv[1]);
    printf("n: %d\n", n);
    int *x,*y,*z;
    if (posix_memalign((void**)&x, 32, 16384*sizeof(int))) { free(x); return EXIT_FAILURE; }
    if (posix_memalign((void**)&y, 32, 16384*sizeof(int))) { free(y); return EXIT_FAILURE; }
    if (posix_memalign((void**)&z, 32, 16384*sizeof(int))) { free(z); return EXIT_FAILURE; }
    x[0]=0; x[1]=2; x[2]=4;
    y[0]=1; y[1]=3; y[2]=n;
    // touch each 4K page in x,y,z to avoid copy-on-write optimizations
    for (int i=512; i<AR; i+= 512) { x[i]=1; y[i]=1; z[i]=1; }
    // warmup
    for(int i=0; i<REPS; ++i) { add_iv_avx((__m256i*)x, (__m256i*)y, (__m256i*)z, AR); }
    // AVX
    clock_t start = clock();
    for(int i=0; i<REPS; ++i) { add_iv_avx((__m256i*)x, (__m256i*)y, (__m256i*)z, AR); }
    int msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
    printf(" AVX Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
    debug_print(z);
    // warmup
    for(int i=0; i<REPS; ++i) { add_iv_sse((__m128i*)x, (__m128i*)y, (__m128i*)z, AR); }
    // SSE
    start = clock();
    for(int i=0; i<REPS; ++i) { add_iv_sse((__m128i*)x, (__m128i*)y, (__m128i*)z, AR); }
    msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
    printf("\n SSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
    debug_print(z);
    return EXIT_SUCCESS;
}
The problem is that your data doesn't fit in the L1 cache.
The L1 bandwidth of Broadwell is much larger than the L2 bandwidth.
The L1 bandwidth is large enough to load two 32-byte vectors every CPU cycle, so a better AVX vs. SSE speedup
might be expected if your data set were much smaller. However, note that
the combined L1 read/write bandwidth is less than 2*32(r)+32(w) = 96 bytes per cycle.
In practice 75 bytes per cycle is possible, see here.
The second graph on this page shows that indeed the L2 bandwidth is much smaller:
At Test_block_size=128KB (=32KB per core) the bandwidth is 900GB/s.
At Test_block_size=1MB (=256KB per core) the bandwidth is only 300GB/s.
(Note that Haswell 4770k has more or less the same L1 and L2 cache architecture as Broadwell.)
Try reducing AR to 2000 and increasing REPS to 1000000 and see what happens to the SSE vs. AVX speedup.
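Concretely, that experiment only requires changing the two constants at the top of avx.c, for example:

#define REPS 1000000   // was 50000: more repetitions so the run still takes a measurable time
#define AR   2000      // was 16384: 3 arrays * 2000 ints * 4 bytes is about 24 KB, which fits in a 32 KB L1d cache

The allocations in main are hard-coded to 16384 ints, so they do not need to change; only AR elements are processed and touched.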

Why is a simple for loop without OpenMP faster than with OpenMP?

Here is my test code for OpenMP
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>

int main(int argc, char const *argv[]){
    double x[10000];
    clock_t start, end;
    double cpu_time_used;
    start = clock();
    #pragma omp parallel
    #pragma omp for
    for (int i = 0; i < 10000; ++i){
        x[i] = 1;
    }
    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("%lf\n", cpu_time_used);
    return 0;
}
I compiled the code with the following two commands:
gcc test.c -o main
The output of running main is 0.000039.
Then I compiled with OpenMP:
gcc test.c -o main -fopenmp
and the output is 0.008020.
Could anyone help me understand why this happens? Thanks in advance.
As High Performance Mark so eloquently described in his comment, there is a cost (overhead) associated with creating threads and distributing work. For such a tiny piece of work (39 us), the overhead outweighs any possible gains.
That said, your measurement is also misleading: clock measures CPU time, which is most likely not what you wanted (wall-clock time). For more details, see this question.
Another misconception you might have: as soon as x is large enough, the simple loop becomes memory-bound, and you will likely not see the speedup you expect. For example, on a typical desktop system with four cores you might see a speedup of 1.5x instead of 4x.
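A sketch of such a measurement, using omp_get_wtime() for wall-clock time and a workload large enough that thread startup is no longer dominant (the array size and the sqrt work per element are arbitrary choices for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define N 10000000   /* large enough that per-element work dominates thread startup */

int main(void)
{
    double *x = malloc(N * sizeof(double));
    if (!x) return EXIT_FAILURE;

    double start = omp_get_wtime();    /* wall-clock time, not CPU time */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        x[i] = sqrt((double)i);        /* some real work per element */
    double end = omp_get_wtime();

    printf("wall time: %f s, x[10] = %f\n", end - start, x[10]);
    free(x);
    return EXIT_SUCCESS;
}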

No speed-up after using MKL for Eigen

I use Eigen 3.3 and Intel MKL 2017, and I write and run the program in Visual Studio 2012 on a 64-bit Windows 7 system with an Intel Xeon(R) CPU E5-1620 v2 @ 3.70GHz.
I believe my MKL configuration is correct, because I can successfully run the MKL example codes. The configuration for using Intel MKL from Eigen follows https://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html. In Visual Studio 2012, I compile the code with the Intel C++ Compiler in Release x64 mode.
However, the following code always takes about 400 seconds, no matter whether I define EIGEN_USE_MKL_ALL or not (i.e., whether Intel MKL is used). It seems that MKL is not being used by Eigen.
Could anyone give some suggestions? Thanks.
#define EIGEN_USE_MKL_ALL // Determine if use MKL
#define EIGEN_VECTORIZE_SSE4_2
#include "stdafx.h"
#include <iostream>
#include <Eigen/Core>
#include <Eigen/Dense>
#include <time.h>
using namespace std;
using namespace Eigen;

int main(int argc, char *argv[])
{
    MatrixXd a = MatrixXd::Random(30000, 3000);
    MatrixXd b = MatrixXd::Random(3000, 30000);
    double start = clock();
    MatrixXd c = a * b;
    double endd = clock();
    double thisTime = (double)(endd - start) / CLOCKS_PER_SEC;
    cout << thisTime << endl;
    system("PAUSE");
    return 0;
}
