How to generate SIMD code for the math function "exp" using OpenMP?

I have the following simple C code:
void calculate_exp(float *out, float *in, int size) {
    for (int i = 0; i < size; i++) {
        out[i] = exp(in[i]);
    }
}
I want to optimize it using OpenMP SIMD. I am new to OpenMP and have tried a few pragmas such as 'omp simd' and 'omp simd safelen', but I am unable to get the compiler to generate SIMD code. Can anybody help?

You can use one of the following four alternatives to vectorize the exp function.
Note that I have used expf (float) instead of exp, which is a double function.
This Godbolt link shows that these functions are vectorized: search for call _ZGVdN8v___expf_finite in the compiler-generated code.
#include <math.h>

int exp_vect_a(float* x, float* y, int N) {
    /* Inform the compiler that N is a multiple of 8; this leads to shorter code */
    N = N & 0xFFFFFFF8;
    x = (float*)__builtin_assume_aligned(x, 32); /* gcc 8.2 doesn't need aligned x and y to generate `nice` code */
    y = (float*)__builtin_assume_aligned(y, 32); /* with gcc 7.3 it improves the generated code */
    #pragma omp simd
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);
    return 0;
}

int exp_vect_b(float* restrict x, float* restrict y, int N) {
    N = N & 0xFFFFFFF8;
    x = (float*)__builtin_assume_aligned(x, 32); /* gcc 8.2 doesn't need aligned x and y to generate `nice` code */
    y = (float*)__builtin_assume_aligned(y, 32); /* with gcc 7.3 it improves the generated code */
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);
    return 0;
}

/* This also vectorizes, but it doesn't lead to `nice` code */
int exp_vect_c(float* restrict x, float* restrict y, int N) {
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);
    return 0;
}

/* This also vectorizes, but it doesn't lead to `nice` code */
int exp_vect_d(float* x, float* y, int N) {
    #pragma omp simd
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);
    return 0;
}
Note that Peter Cordes' comment is very relevant here:
Function _ZGVdN8v___expf_finite might give slightly different results than expf
because its focus is on speed, and not on special cases such as inputs which are
infinite, subnormal, or not a number.
Moreover, the accuracy is 4-ulp maximum relative error,
which is probably slightly less accurate than the standard expf function.
Therefore you need optimization level -Ofast (which allows less accurate code)
instead of -O3 to get the code vectorized with gcc.
See this libmvec page for further details.
The following test code compiles and runs successfully with gcc 7.3:
#include <math.h>
#include <stdio.h>

/* gcc expv.c -m64 -Ofast -std=c99 -march=skylake -fopenmp -lm */
int exp_vect_d(float* x, float* y, int N) {
    #pragma omp simd
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);
    return 0;
}

int main() {
    float x[32];
    float y[32];
    int i;
    int N = 32;
    for (i = 0; i < N; i++) x[i] = i / 100.0f;
    x[10] = -89.0f;           /* exp(-89.0f)=2.227e-39, a subnormal number */
    x[11] = -1000.0f;         /* output: 0.0 */
    x[12] = 1000.0f;          /* output: Inf. */
    x[13] = 0.0f/0.0f;        /* input: NaN: Not a number */
    x[14] = 1e20f*1e20f;      /* input: Infinity */
    x[15] = -1e20f*1e20f;     /* input: -Infinity */
    x[16] = 2.3025850929940f; /* exp(2.3025850929940f)=10.0... */
    exp_vect_d(x, y, N);
    for (i = 0; i < N; i++) printf("x=%11.8e, y=%11.8e\n", x[i], y[i]);
    return 0;
}

Related

Error correction on small message (8-Bit) with high resilience, what is the best method?

I need to implement an ECC algorithm on an 8-bit message with 32 bits to work with, i.e. a (32, 8) code. Being new to ECC, I started to google and learn a bit about it, and ended up coming across two methods: Hamming codes and Reed-Solomon. Given that I need my message to be resilient to 4-8 random bit flips on average, I disregarded Hamming and looked into Reed-Solomon. However, after applying it to my problem I realized it is also not suitable for my use case: while a whole symbol (8 bits) could be flipped, my errors tend to spread out (on average), and it can usually only fix a single error...
Therefore in the end I just settled for my first instinct which is to just copy the data over like so:
00111010 --> 0000 0000 1111 1111 1111 0000 1111 0000
This way every bit is resilient to up to 1 error (8 across all bits), by taking the majority value among the four copies of each message bit in the encoded word. Each bit can also suffer two flips while still allowing the error to be detected (which is also usable for my use case, e.g. input 45: return [45, 173] is still useful).
My question then is if there is any better method, while I am pretty sure there is, I am not sure where to go from here.
By "better method" I mean resilient to even more errors given the (32, 8) ratio.
You can get a distance-11 code pretty easily using randomization.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    uint32_t codes[256];
    for (int i = 0; i < 256; i++) {
        printf("%d\n", i);
retry:
        codes[i] = arc4random(); /* BSD/macOS (and glibc >= 2.36); any decent RNG works */
        for (int j = 0; j < i; j++) {
            if (__builtin_popcount(codes[i] ^ codes[j]) < 11) goto retry;
        }
    }
}
I made a test program for David Eisenstat's example to show that it works for 1 to 5 bits in error. The code is for Visual Studio.
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int uint32_t;

/*----------------------------------------------------------------------*/
/* InitCombination - init combination                                   */
/*----------------------------------------------------------------------*/
void InitCombination(int a[], int k, int n) {
    for (int i = 0; i < k; i++)
        a[i] = i;
    --a[k-1];
}

/*----------------------------------------------------------------------*/
/* NextCombination - generate next combination                          */
/*----------------------------------------------------------------------*/
int NextCombination(int a[], int k, int n) {
    int pivot = k - 1;
    while (pivot >= 0 && a[pivot] == n - k + pivot)
        --pivot;
    if (pivot == -1)
        return 0;
    ++a[pivot];
    for (int i = pivot + 1; i < k; ++i)
        a[i] = a[pivot] + i - pivot;
    return 1;
}

/*----------------------------------------------------------------------*/
/* Rnd32 - return pseudo random 32 bit number                           */
/*----------------------------------------------------------------------*/
uint32_t Rnd32()
{
    static uint32_t r = 0;
    r = r*1664525 + 1013904223;
    return r;
}

static uint32_t codes[256];

/*----------------------------------------------------------------------*/
/* main - test random hamming distance 11 code                          */
/*----------------------------------------------------------------------*/
int main() {
    int ptn[5];                             /* error bit indexes */
    int i, j, n;
    uint32_t m;
    int o, p;
    for (i = 0; i < 256; i++) {             /* generate table */
retry:
        codes[i] = Rnd32();
        for (j = 0; j < i; j++) {
            if (__popcnt(codes[i] ^ codes[j]) < 11) goto retry;
        }
    }
    for (n = 1; n <= 5; n++) {              /* test 1 to 5 bit error patterns */
        InitCombination(ptn, n, 32);
        while (NextCombination(ptn, n, 32)) {
            for (i = 0; i < 256; i++) {
                o = m = codes[i];           /* o = m = coded msg */
                for (j = 0; j < n; j++) {   /* add errors to m */
                    m ^= 1 << ptn[j];
                }
                for (j = 0; j < 256; j++) { /* search for code */
                    if ((p = __popcnt(m ^ codes[j])) <= 5)
                        break;
                }
                if (i != j) {               /* check for match */
                    printf("fail %u %u\n", i, j);
                    goto exit0;
                }
            }
        }
    }
exit0:
    return 0;
}

Illegal context for vector clause in simple OpenACC kernel

I'm trying to compile a simple OpenACC benchmark:
void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
    #pragma acc parallel copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
    #pragma acc loop vector(128)
    {
        for (int i = 0; i < 256; ++i) {
            float sum = 0;
            for (int j = 0; j < 256; ++j) {
                sum += *(a + a_stride * i + j);
            }
            *(c + c_stride * i) = sum;
        }
    }
}
with NVIDIA HPC SDK 21.5, and I run into an error:
$ nvc++ -S tmp.cc -Wall -Wextra -O2 -acc -acclibs -Minfo=all -g -gpu=cc80
NVC++-S-0155-Illegal context for gang(num:) or worker(num:) or vector(length:) (tmp.cc: 7)
NVC++/x86-64 Linux 21.5-0: compilation completed with severe errors
Any idea what may cause this? From what I can tell my syntax for vector(128) is legal.
It's illegal OpenACC syntax to use "vector(value)" with a parallel construct. You need a "vector_length" clause on the parallel directive to define the vector length. The reason is that "parallel" defines a single compute region to be offloaded, and hence all vector loops in this region need to have the same vector length.
You can use "vector(value)" only with a "kernels" construct since the compiler can then split the region into multiple kernels each having a different vector length.
Option 1:
% cat test.c
void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
    #pragma acc parallel vector_length(128) copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
    #pragma acc loop vector
    {
        for (int i = 0; i < 256; ++i) {
            float sum = 0;
            for (int j = 0; j < 256; ++j) {
                sum += *(a + a_stride * i + j);
            }
            *(c + c_stride * i) = sum;
        }
    }
}
% nvc -acc -c test.c -Minfo=accel
foo:
4, Generating copyout(c[:c_stride*256]) [if not already present]
Generating copyin(a[:a_stride*256]) [if not already present]
Generating Tesla code
5, #pragma acc loop vector(128) /* threadIdx.x */
7, #pragma acc loop seq
5, Loop is parallelizable
7, Loop is parallelizable
Option 2:
% cat test.c
void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
    #pragma acc kernels copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
    #pragma acc loop independent vector(128)
    {
        for (int i = 0; i < 256; ++i) {
            float sum = 0;
            for (int j = 0; j < 256; ++j) {
                sum += *(a + a_stride * i + j);
            }
            *(c + c_stride * i) = sum;
        }
    }
}
% nvc -acc -c test.c -Minfo=accel
foo:
4, Generating copyout(c[:c_stride*256]) [if not already present]
Generating copyin(a[:a_stride*256]) [if not already present]
5, Loop is parallelizable
Generating Tesla code
5, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
7, #pragma acc loop seq
7, Loop is parallelizable

How to distribute teams on GPU using OpenMP?

I'm trying to utilize my NVIDIA GeForce GT 740M for parallel programming using OpenMP and the clang-3.8 compiler.
When processed in parallel on the CPU, I manage to get the desired result. However, when processed on the GPU, my results are some almost random numbers.
Therefore, I figured that I'm not correctly distributing my thread teams and that there might be some data races. I guess I have to do my for-loops differently but I have no idea where the mistake could be.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    const int n = 100; float a = 3.0f; float b = 2.0f;
    float *x = (float *) malloc(n * sizeof(float));
    float *y = (float *) malloc(n * sizeof(float));
    int i;
    int j;
    int k;
    double start;
    double end;
    start = omp_get_wtime();
    for (k = 0; k < n; k++) {
        x[k] = 2.0f;
        y[k] = 3.0f;
    }
    #pragma omp target data map(to:x[0:n]) map(tofrom:y[0:n]) map(to:i) map(to:j)
    {
        #pragma omp target teams
        #pragma omp distribute
        for (i = 0; i < n; i++) {
            #pragma omp parallel for
            for (j = 0; j < n; j++) {
                y[j] = a*x[j] + y[j];
            }
        }
    }
    end = omp_get_wtime();
    printf("Work took %f seconds.\n", end - start);
    free(x); free(y);
    return 0;
}
I guess that it might have something to do with the architecture of my GPU, so I'm adding this:
I'm fairly new to the topic, so thanks for your help :)
Yes, there is a race here. Different teams are reading and writing to the same element of the array 'y'. Perhaps you want something like this?
for (i = 0; i < n; i++) {
    #pragma omp target teams distribute parallel for
    for (j = 0; j < n; j++) {
        y[j] = a*x[j] + y[j];
    }
}

GCC fails to vectorize a simple 2-level nested loop while Intel compiler succeeds

I have the following two versions of the same loop:
// version 1
for (int x = 0; x < size_x; ++x)
{
    for (int y = 0; y < size_y; ++y)
    {
        data[y*size_x + x] = value;
    }
}

// version 2
for (int y = 0; y < size_y; ++y)
{
    for (int x = 0; x < size_x; ++x)
    {
        data[y*size_x + x] = value;
    }
}
I compile the above codes using two compilers:
Intel (17.0.1): I compile the code using: icc -qopenmp -O3 -qopt-report main.cpp. Both are vectorized successfully.
GCC (5.1): I compile the code using: g++ -fopenmp -ftree-vectorize -fopt-info-vec -O3 main.cpp. Only version 2 is vectorized.
Here are my questions:
Why does GCC fail to vectorize version 1? Is it because the inner loop in version 1 doesn't access contiguous memory?
If the answer to the above is 'yes': is it impossible for GCC to vectorize it, or does it choose not to because it wouldn't have any performance benefit? If it is the latter, can I somehow force GCC to vectorize it no matter what?
Apparently in version 1 the vectorization report of the Intel compiler includes these lines: Loopnest Interchanged: ( 1 2 ) --> ( 2 1 ) and PERMUTED LOOP WAS VECTORIZED; while in version 2 I get this: LOOP WAS VECTORIZED. So it appears that the Intel compiler rearranges the order of the loops in order to vectorize version 1. Do I understand this correctly?
Can I achieve something similar with GCC?
EDIT 1:
Thanks to MarcGlisse I investigated further by creating a simplified example of my code, and realized that different combinations of my data size and compilation flags on GCC give different vectorization results. At this point I am more confused, and I think it is better to create a new post to first understand how GCC vectorization works. In case someone is curious, you can check the code below and try the values 1 through 7 for size_x and size_y, once with MarcGlisse's compilation flags and once without. Different combinations might give different vectorization results.
void foo1(int size_x, int size_y, float value, float* data)
{
    for (int x = 0; x < size_x; ++x)
    {
        for (int y = 0; y < size_y; ++y)
        {
            data[y*size_x + x] = value;
        }
    }
}

void foo2(int size_x, int size_y, float value, float* data)
{
    for (int y = 0; y < size_y; ++y)
    {
        for (int x = 0; x < size_x; ++x)
        {
            data[y*size_x + x] = value;
        }
    }
}

int main(int argc, char** argv)
{
    int size_x = 7;
    int size_y = 7;
    int size = size_x*size_y;
    float* data1 = new float[size];
    float* data2 = new float[size];
    foo1(size_x, size_y, 1, data1);
    foo2(size_x, size_y, 1, data2);
    delete [] data1;
    delete [] data2;
    return 0;
}

PGI Compiler Parallelization +=

I am working on getting a vector and matrix class parallelized and have run into an issue. Any time I have a loop in the form of
for (int i = 0; i < n; i++)
    b[i] += a[i];
the code has a data dependency and will not parallelize. The Intel compiler is smart enough to handle this without any pragmas. I would like to avoid the no-dependency-check pragma, both because of the vast number of loops similar to this and because the real cases are more complicated than this; I would still like the compiler to check in case a dependency does exist.
Does anyone know of a compiler flag for the PGI compiler that would allow this?
Thank you,
Justin
Edit: there was an error in the for loop; I wasn't copy-pasting an actual loop.
I think the problem is you're not using the restrict keyword in these routines, so the C compiler has to worry about pointer aliasing.
Compiling this program:
#include <stdlib.h>
#include <stdio.h>

void dbpa(double *b, double *a, const int n) {
    for (int i = 0; i < n; i++) b[i] += a[i];
    return;
}

void dbpa_restrict(double *restrict b, double *restrict a, const int n) {
    for (int i = 0; i < n; i++) b[i] += a[i];
    return;
}

int main(int argc, char **argv) {
    const int n = 10000;
    double *a = malloc(n*sizeof(double));
    double *b = malloc(n*sizeof(double));
    for (int i = 0; i < n; i++) {
        a[i] = 1;
        b[i] = 2;
    }
    dbpa(b, a, n);
    double error = 0.;
    for (int i = 0; i < n; i++)
        error += (3 - b[i]);
    if (error < 0.1)
        printf("Success\n");
    dbpa_restrict(b, a, n);
    error = 0.;
    for (int i = 0; i < n; i++)
        error += (4 - b[i]);
    if (error < 0.1)
        printf("Success\n");
    free(b);
    free(a);
    return 0;
}
with the PGI compiler:
$ pgcc -o tryautop tryautop.c -Mconcur -Mvect -Minfo
dbpa:
5, Loop not vectorized: data dependency
dbpa_restrict:
11, Parallel code generated with block distribution for inner loop if trip count is greater than or equal to 100
main:
21, Loop not vectorized: data dependency
28, Loop not parallelized: may not be beneficial
36, Loop not parallelized: may not be beneficial
gives us the information that the dbpa() routine without the restrict keyword wasn't parallelized, but the dbpa_restrict() routine was.
Really, for this sort of stuff, though, you're better off just using OpenMP (or TBB or ABB or...) rather than trying to convince the compiler to autoparallelize for you; probably better still is just to use existing linear algebra packages, either dense or sparse, depending on what you're doing.
