OpenMP pragma with a meaning: don't vectorize

I have a loop with a really low trip count where vectorization doesn't pay off, but the compiler vectorizes it nevertheless. Is there a portable way to tell a compiler not to vectorize a loop, an opposite of #pragma omp simd?

If you have a compiler that supports the OpenMP API version 5.1, then it should accept this:
#pragma omp simd if(simd:0)
for (...) { ... }
That should then disable vectorization for the loop associated with the simd construct.
See simd Construct in the OpenMP specification.
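As a minimal sketch of how that could look on a concrete short loop (the function name add_short is made up for illustration, and this assumes a compiler that accepts the if clause on simd, built with -fopenmp or -fopenmp-simd). A compiler without OpenMP enabled simply ignores the pragma, so the result is the same either way:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: adds two short arrays element by element.
 * The if(simd:0) clause asks the compiler to treat the loop as
 * non-SIMD, as described in the answer above. */
void add_short(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);

    #pragma omp simd if(simd:0)
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Whether the generated code really stays scalar is best checked in the compiler's vectorization report or the assembly output.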

Related

How OpenMP macros work behind the scenes in collaboration with the preprocessor/compiler and the library itself?

I'm trying to implement similar functionality in one of my projects and I was wondering how it works.
For example, I was wondering how #pragma omp parallel default(shared) private(iam, np) works in the following example from the compiler's/preprocessor's perspective. I mention the compiler because I have read that #pragma macros are there to give side information to the compiler.
If I take into account that all macros are handled by the preprocessor, it gets really confusing to me.
How is the macro expanded, and how does the OpenMP library get access to the information in those macros? Is there a specific compiler extension that OpenMP uses to fetch that information for every compiler that it supports, or is it just simple macro invocation?
#include <stdio.h>
#include <mpi.h>
#include <omp.h>
int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hybrid: Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }
    MPI_Finalize();
    return 0;
}
I got this example from here.

For example, I was wondering how #pragma omp parallel default(shared) private(iam, np) works in the following example from the compiler's/preprocessor's perspective?
This depends strongly on the compiler implementation. In practice, for Clang and GCC (and probably ICC), the pragma annotation gives information to the compiler that enables it to transform the code in a front-end pass. Put simply, the front end of a compiler does the preprocessing, tokenization, syntactic analysis and semantic analysis, while the back end does the optimizations and code generation.
For most steps, mainstream compilers let you inspect the intermediate output. For example, Clang and GCC have the -E flag to stop after preprocessing and -S to stop after code generation. Low-level intermediate representations (IR) depend more on the compiler implementation, so the flags are not the same (nor are the optimizations or the intermediate language). GCC uses the GENERIC/GIMPLE languages for its high-level IR, while Clang uses LLVM IR. AFAIK, the GIMPLE code can be dumped using the -fdump-* flags. For Clang, -emit-llvm can be used to dump the IR code.
In Clang, the transformation is done after the AST generation, but before the first IR generation. Note that some other compilers do an AST transformation, while others do it in later steps. When OpenMP is enabled (with -fopenmp), Clang replaces the pragma region with a call to __kmpc_fork_call and generates a function for the region, which is passed to that KMP function. KMP is the prefix of the IOMP runtime shared by both Clang and ICC. GCC has its own runtime called GOMP. There are many other runtimes, but the mainstream ones are GOMP and IOMP. GCC uses a similar strategy: it calls GOMP_parallel with a compiler-generated function that is passed to the runtime. The IOMP/GOMP runtimes take care of initializing the region and the ICVs (internal control variables) before calling the compiler-generated function.
Note that the processor is not aware of the use of OpenMP (at least not for all OpenMP implementations I am aware of).
How is the macro expanded and how the OpenMP library gets access to the information in those macros?
Note that pragma annotations are not macros; they are more powerful than that: they provide information to the compiler, which can then perform non-trivial changes during any compilation step. For example, a pragma can change the way code generation is performed, which is impossible with preprocessor macros (e.g. #pragma GCC unroll n for loop unrolling in GCC, and #pragma ivdep to tell ICC that there are no loop-carried dependencies, which enables auto-vectorization).
The information is passed to the main runtime fork function (i.e. __kmpc_fork_call or GOMP_parallel) as arguments, just like the compiler-generated user function.
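To make the outlining idea concrete, here is a rough, serial sketch of the shape of the transformation. Everything in it is hypothetical: mock_fork_call, region_fn, struct shared_vars and outlined_region are names invented for this illustration and are not the real __kmpc_fork_call or GOMP_parallel interfaces, and the mock runs the "threads" one after another instead of in parallel.

```c
#include <assert.h>
#include <stdio.h>

/* Variables that are shared in the region are gathered into one struct. */
struct shared_vars {
    int rank;
    int numprocs;
    int calls;   /* counts executions of the region body, for illustration */
};

/* Hypothetical signature of a compiler-generated ("outlined") region body. */
typedef void (*region_fn)(int thread_id, int num_threads, void *shared);

/* Outlined body of the parallel region. iam and np were private in the
 * original code, so they become plain locals of this function. */
void outlined_region(int thread_id, int num_threads, void *shared)
{
    struct shared_vars *s = shared;
    int iam = thread_id;
    int np = num_threads;

    printf("Hello from thread %d out of %d from process %d out of %d\n",
           iam, np, s->rank, s->numprocs);
    s->calls++;
}

/* Mock stand-in for the runtime's fork entry point. A real runtime would
 * start or wake worker threads; this one just loops over the thread ids. */
void mock_fork_call(int num_threads, region_fn fn, void *shared)
{
    assert(fn != NULL && num_threads > 0);

    for (int tid = 0; tid < num_threads; tid++) {
        fn(tid, num_threads, shared);
    }
}
```

In this picture the pragma line in main would turn into something like mock_fork_call(4, outlined_region, &vars), with vars holding rank and numprocs.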
Is there a specific compiler extension that OpenMP uses to fetch those information for every compiler that it supports or is it just simple macros invocation?
It is not just simple macro invocation, and AFAIK there is no external module for GCC and Clang. OpenMP support is integrated directly into the compiler (though it may be modular, especially in Clang). This is important because compilers need to analyse the pragma annotations at compile time. The pragmas are not just a way to automatically generate runtime calls and abstract them behind a standard language/interface; they also affect the compiler steps. For example, #pragma omp simd should affect the auto-vectorization steps of compilers (back-end steps).
AFAIK, there are some (research) OpenMP implementations based on source-to-source compilation so as to be compiler-independent, but I am not sure they support all OpenMP features (especially the SIMD ones).

Is batching same functions with SIMD instruction possible?

I have a scenario where many instances of the exact same function (for simplicity, let's just consider C/C++ and Python here) are executed at the same time on my machine. Intuitively, I would use multi-threading and treat each instance of the function as a thread to exploit the parallelism; the instances do not contend for the same resources, but they do perform many branch operations (e.g., for loops). However, since they are actually the same function, I'm thinking about batching them using SIMD instructions, e.g., AVX-512. Of course, it should be automatic so that users do not have to modify their code.
The reason? Every thread/process/container/VM occupies resources, but AVX needs only one instruction. So I could serve more users with the same hardware.
Most articles I find online focus on using AVX instructions inside the function, for example to accelerate stream data processing or to deal with some large calculation. None of them mentions batching different instances of the same function.
I know there are some challenges, such as different execution paths caused by different inputs, and it is not easy to turn a normal function into a batched version automatically, but I think it is indeed technically possible.
Here are my questions:
1. Is it hard (or even possible) to automatically change a normal function into a batched version?
2. If 1 is no, what restrictions should I put on the function to make it possible? For example, what if the function has only one path regardless of the data?
3. Are there other technologies that solve the problem better? I don't think a GPU is a good option for me because a GPU cannot support I/O or branch instructions, although its SIMT model fits my goal perfectly.
Thanks!

SSE/AVX is basically a vector unit: it allows simple operations (like + - * /, AND, OR, XOR etc.) on arrays of multiple elements at once. AVX1 and AVX2 have 256-bit registers, so you can process e.g. 8 32-bit singles or 4 doubles at once. AVX-512 is coming but quite rare atm.
So if your functions are all operations on arrays of basic types, it is a natural fit. Rewriting your function using AVX intrinsics is doable if the operations are very simple. Complex things (like data that doesn't match the vector width), or doing it in assembler, are a challenge though.
If your function is not operating on vectors then it becomes difficult, and the possibilities are mostly theoretical. Autovectorizing compilers can sometimes do this, but it's rare and limited, and extremely complex.

There are two ways to fix this: vectorization (SIMD) and parallelization (threads).
GCC can already do the SIMD vectorization you want provided that the function is inlined, and the types and operations are compatible (and it will automatically inline smallish functions without you asking it to).
E.g.
inline void func (int i) {
    somearray[i] = someotherarray[i] * athirdarray[i];
}
for (int i = 0; i < ABIGNUMBER; i++)
    func (i);
Vectorization and inlining are enabled at -O3.
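If the function cannot be inlined, OpenMP's declare simd directive is one way to ask the compiler for a vector variant of a scalar function, which is close to the "batching" idea from the question. A small sketch, with clamp_scale and run_batch as made-up names, assuming a compiler run with -fopenmp or -fopenmp-simd (without those the pragmas are ignored and the code runs as plain scalar C):

```c
#include <assert.h>
#include <stddef.h>

/* Ask the compiler to also emit a vector variant of this scalar function. */
#pragma omp declare simd
float clamp_scale(float x, float factor)
{
    float y = x * factor;
    /* In a vector variant, a branch like this is typically handled with
     * masking or blending rather than a real jump. */
    return (y > 1.0f) ? 1.0f : y;
}

/* Each loop iteration plays the role of one "instance" of the function. */
void run_batch(const float *in, float *out, size_t n, float factor)
{
    assert(in != NULL && out != NULL);

    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        out[i] = clamp_scale(in[i], factor);
    }
}
```

This only pays off when the instances mostly follow the same path; heavily divergent branches make every lane execute both sides.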
If the functions are too complex, and/or GCC doesn't vectorize it yet, then you can use OpenMP or OpenACC to parallelize it.
OpenMP uses special markup to tell the compiler where to spawn threads.
E.g.
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
And yes, you can do that on a GPU too! You do have to do a bit more typing to get the data copied in and out correctly. Only the marked up areas run on the GPU. Everything else runs on the CPU, so I/O etc. is not a problem.
#pragma omp target map(somearray,someotherarray,athirdarray)
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
OpenACC is a similar idea, but more specialized towards GPUs.
You can find OpenMP and OpenACC compilers in many places. Both GCC and LLVM support NVIDIA GPUs. LLVM has some support for AMD GPUs, and there are unofficial GCC builds available too (with official support coming soon).

best way to mix OpenACC with OpenMP

I have a code that is mainly developed using OpenACC. I would like to compare P100 results with OpenACC to Intel's KNL nodes using OpenMP. I tried the compiler flag -ta=multi_core, but it basically serialized all the loops (as per -acc info). Is the only way to use a preprocessor directive around every loop? Are there any other more efficient or cleaner ways?
#ifndef _OPENACC
#pragma omp .....
#else
#pragma acc ......
#endif
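One possible way to keep the #ifdef out of every loop is to hide it behind a macro built on the standard C99 _Pragma operator. This is only a sketch: DO_PRAGMA and PARALLEL_LOOP are names invented here, and the clauses passed in (copy, schedule) are placeholders for whatever each loop actually needs.

```c
#include <assert.h>
#include <stddef.h>

/* Turn a token sequence into a pragma; _Pragma is standard C99. */
#define DO_PRAGMA(x) _Pragma(#x)

/* Hypothetical macro: the first argument holds OpenACC clauses, the second
 * OpenMP clauses. Only the set matching the active model is used. */
#ifdef _OPENACC
#define PARALLEL_LOOP(acc_clauses, omp_clauses) \
    DO_PRAGMA(acc parallel loop acc_clauses)
#else
#define PARALLEL_LOOP(acc_clauses, omp_clauses) \
    DO_PRAGMA(omp parallel for omp_clauses)
#endif

void scale(double *x, size_t n, double factor)
{
    assert(x != NULL);

    PARALLEL_LOOP(copy(x[0:n]), schedule(static))
    for (size_t i = 0; i < n; i++) {
        x[i] *= factor;
    }
}
```

The choice between the two models is then made once, in the macro definition, and each loop carries a single annotation line.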

OpenMP: what is the difference between "taskloop" and "omp for" performance wise?

"taskloop" was introduced in OpenMP 4.5. It can take clauses from both loop and task constructs (except the depend clause, AFAIK).
However, I'm wondering if "taskloop" and "omp for" constructs differ performance wise too.

I think it may depend on the actual problem. To parallelize a for loop, omp for can be faster than tasks because it offers several scheduling schemes to choose from. In my experience (solving a particular problem using the clang12 compiler), omp for produces slightly faster code than tasks (on Ryzen 5 7800X).
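For reference, this is roughly what the two variants look like side by side, so they can be timed against each other on your own problem. The function names and the grainsize value are arbitrary choices for this sketch; it compiles with or without -fopenmp (without it the pragmas are ignored).

```c
#include <assert.h>
#include <stddef.h>

/* Worksharing loop: iterations are divided among the threads of the team
 * according to the schedule clause. */
void square_with_for(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * in[i];
    }
}

/* taskloop: one thread creates tasks that each cover a chunk of iterations,
 * and the threads of the team execute those tasks. */
void square_with_taskloop(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);

    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(1024)
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * in[i];
    }
}
```

Both functions compute the same result; only the way the iterations are handed to threads differs.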

Intel's pragma simd vs OpenMP's pragma omp simd

The Intel compiler allows us to vectorize loops via
#pragma simd
for ( ... )
However, you also have the option to do this with OpenMP 4's directive:
#pragma omp simd
for ( ... )
Is there any difference between the two?

For all intents and purposes they should be identical. The difference is that the OpenMP 4.0 #pragma omp simd directive is portable and should work with other compilers that support OpenMP 4.0 as well as Intel's.
Furthermore, there are several clauses in the OpenMP version which allow you to vectorize instructions in a more robust manner (safelen(), linear(), aligned(), reduction(), and collapse() come to mind).
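As a small illustration of one of those clauses, here is a dot product using reduction() on the OpenMP form. The function name dot is just an example; without the clause the compiler would have to prove on its own that reordering the additions into sum is acceptable.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product; reduction(+:sum) tells the compiler that each SIMD lane may
 * keep its own partial sum and that the partial sums are combined at the end. */
float dot(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);

    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Note that a vectorized reduction may add the terms in a different order than the scalar loop, so floating-point results can differ slightly for general inputs.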

#pragma simd
- is positioned as part of the Intel C++ Cilk SIMD extensions (along with Array Notation). This positioning looks slightly strange, because Cilk is not for Fortran, while the Fortran compiler supports a mostly identical simd directive.
#pragma omp simd
- is part of the OpenMP standard, so it's naturally more portable across compilers and platforms.
Normally Intel rolls out new capabilities in Cilk first and in OpenMP second. The reason is natural: it takes time for something to be accepted into the OpenMP standard. For example, simdlen was only added in OpenMP 4.5 in November 2015, while the equivalent clause (vectorlength) had already been part of the Cilk pragma for 3 or 4 years. simdlen vs. vectorlength highlights another observation: the syntax of some clauses may differ between simd and omp simd.
So, if you need portability across compilers, use the OpenMP pragma. But if it's more important for you to get access to new simd compiler capabilities as early as possible, then you may prefer, or optionally use, the Cilk (proprietary) pragma. All the same arguments and considerations apply equally to #pragma omp declare simd vs. #pragma declare simd (which should answer your potential second question).
Reference of "Cilk" pragma simd clauses is available here: https://software.intel.com/en-us/node/524555
(I think it's slightly out of date; I've heard about new capabilities of pragma simd not reflected in this link yet).
