I'm trying to use OpenMP in Cython. I need to do two things:
i) use a #pragma omp single { } block in my Cython code;
ii) use #pragma omp barrier.
Does anyone know how to do this in Cython?
Here are more details. I have a nogil cdef function my_func() which I call in an OpenMP for-loop:
from cython.parallel cimport prange
cimport openmp
cdef int i
with nogil:
    for i in prange(10, schedule='static', num_threads=10):
        my_func(i)
Inside my_func I need to place a barrier to wait for all threads to catch up, then execute a time-consuming operation only in one of the threads and with the gil acquired, and then release the barrier so all threads resume simultaneously.
cdef int my_func(...) nogil:
    ...
    # put a barrier until all threads catch up, e.g. #pragma omp barrier
    with gil:
        # execute time consuming operation in one thread only, e.g. pragma omp single{}
    # remove barrier after the above single thread has finished and continue the operation over all threads in parallel, e.g. #pragma omp barrier
    ...
Cython has some support for OpenMP, but if OpenMP pragmas are used extensively it is probably easier to write the code in C and wrap the result with Cython.
As an alternative, you could use verbatim C code and tricks with defines to bring some of the functionality to Cython. Using pragmas in defines isn't straightforward, though (_Pragma is the C99 solution; MSVC does its own thing, as always, with __pragma). Here are some examples as a proof of concept for Linux/gcc:
cdef extern from *:
    """
    #define START_OMP_PARALLEL_PRAGMA() _Pragma("omp parallel") {
    #define END_OMP_PRAGMA() }
    #define START_OMP_SINGLE_PRAGMA() _Pragma("omp single") {
    #define START_OMP_CRITICAL_PRAGMA() _Pragma("omp critical") {
    """
    void START_OMP_PARALLEL_PRAGMA() nogil
    void END_OMP_PRAGMA() nogil
    void START_OMP_SINGLE_PRAGMA() nogil
    void START_OMP_CRITICAL_PRAGMA() nogil
We make Cython believe that START_OMP_PARALLEL_PRAGMA() and co. are nogil functions, so it puts them into the C code, where they get picked up by the preprocessor.
We must use the syntax
#pragma omp single
{
   //do_something
}
and not
#pragma omp single
do_something
because of the way Cython generates C-code.
The usage could look as follows (I'm avoiding here from cython.parallel.parallel as it does too much magic for this simple example):
%%cython -c=-fopenmp --link-args=-fopenmp
cdef extern from *:  # as listed above
    ...
def test_omp():
    cdef int a=0
    cdef int b=0
    with nogil:
        START_OMP_PARALLEL_PRAGMA()
        START_OMP_SINGLE_PRAGMA()
        a+=1
        END_OMP_PRAGMA()
        START_OMP_CRITICAL_PRAGMA()
        b+=1
        END_OMP_PRAGMA() # CRITICAL
        END_OMP_PRAGMA() # PARALLEL
    print(a,b)
Calling test_omp prints "1 2" on my machine with 2 threads, as expected (one could change the number of threads using openmp.omp_set_num_threads(10)).
However, the above is still very brittle - some error checking by Cython can lead to invalid code (Cython uses goto for control flow and it is not possible to jump out of openmp-block). Something like this happens in your example:
cimport numpy as np
import numpy as np
def test_omp2():
    cdef np.int_t[:] a=np.zeros(1,dtype=int)
    START_OMP_SINGLE_PRAGMA()
    a[0]+=1
    END_OMP_PRAGMA()
    print(a)
Because of bounds checking, Cython will produce:
START_OMP_SINGLE_PRAGMA();
...
//check bounds:
if (unlikely(__pyx_t_6 != -1)) {
    __Pyx_RaiseBufferIndexError(__pyx_t_6);
    __PYX_ERR(0, 30, __pyx_L1_error) // HERE WE GO A GOTO!
}
...
END_OMP_PRAGMA();
In this special case, setting boundscheck to False, i.e.
cimport cython
@cython.boundscheck(False)
def test_omp2():
    ...
would solve the issue for the above example, but probably not in general.
Once again: using openmp in C (and wrapping the functionality with Cython) is a more enjoyable experience.
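To make the "write it in C" route concrete, here is a minimal sketch of the barrier-then-single pattern from the question. The names run_phases and expensive_step are made up for illustration, and the atomic counter is only a placeholder for real per-thread work:

```c
/* Hypothetical stand-in for the time-consuming step that must run once. */
static void expensive_step(int *calls) {
    *calls += 1;
}

/* Runs two phases of per-thread work with a one-thread step in between.
   Returns how often the expensive step ran; *work_done receives the number
   of per-thread work units (two per participating thread). */
int run_phases(int num_threads, int *work_done) {
    int single_calls = 0;
    int total = 0;
    #pragma omp parallel num_threads(num_threads)
    {
        #pragma omp atomic
        total += 1;                      /* phase 1: placeholder work */

        #pragma omp barrier              /* wait until all threads caught up */

        #pragma omp single
        expensive_step(&single_calls);   /* executed by one thread only */
        /* the implicit barrier at the end of single releases all threads */

        #pragma omp atomic
        total += 1;                      /* phase 2: placeholder work */
    }
    *work_done = total;
    return single_calls;
}
```

Such a function could then be declared in a cdef extern block and called from Cython like any other C function. If it is compiled without -fopenmp the pragmas are ignored and it simply runs on one thread.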
As a side note: Python threads (the ones governed by the GIL) and OpenMP threads are different and know nothing about each other. The above example would also compile and run correctly without releasing the GIL: OpenMP threads do not care about the GIL, and as there are no Python objects involved nothing can go wrong. Thus I have added nogil to the wrapped "functions", so they can also be used in nogil blocks.
However, as the code gets more complicated, it becomes less obvious that variables shared between different Python threads aren't accessed (above all because those accesses could happen in the generated C code, which isn't apparent from the Cython code), so it might be wiser not to release the GIL while using OpenMP.
I have a shared variable s and private variable p inside parallel region.
How can I do the following atomically (or at least better than with critical section):
if ( p > s )
    s = p;
else
    p = s;
I.e., I need to update the global maximum (if local maximum is better) or read it, if it was updated by another thread.
OpenMP 5.1 introduced the compare clause which allows compare-and-swap (CAS) operations such as
#pragma omp atomic compare
if (s < p) s = p;
In combination with a capture clause, you should be able to achieve what you want:
int s_cap;
// here we capture the shared variable and also update it if p is larger
#pragma omp atomic compare capture
{
   s_cap = s;
   if (s < p) s = p;
}
// update p if the captured shared value is larger
if (s_cap > p) p = s_cap;
The only problem? The 5.1 spec is very new and, as of today (2020-11-27), none of the widespread compilers, i.e., those available on Godbolt, supports OpenMP 5.1. See here for a more or less up-to-date list. Adding compare is still listed as an unclaimed task on Clang's OpenMP page. GCC is still working on full OpenMP 5.0 support and the trunk build on Godbolt doesn't recognise compare. Intel's oneAPI compiler may or may not support it - it's not available on Godbolt and I can't get it to compile OpenMP code.
Your best bet for now is to use atomic capture combined with a compiler-specific CAS atomic, possibly in a loop.
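As a rough sketch of the CAS-loop idea, assuming GCC or Clang (the __atomic builtins are compiler-specific; other compilers need their own primitive). The function name atomic_fetch_max is made up here:

```c
#include <stdbool.h>

/* Atomically raises *s to at least p and returns the resulting maximum.
   Assumes the GCC/Clang __atomic builtins are available. */
int atomic_fetch_max(int *s, int p) {
    int seen = __atomic_load_n(s, __ATOMIC_RELAXED);
    while (seen < p) {
        /* Try to install p. On failure, `seen` is refreshed with the
           current value of *s and the loop condition is re-checked. */
        if (__atomic_compare_exchange_n(s, &seen, p, true,
                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED)) {
            return p;    /* this thread published the new maximum */
        }
    }
    return seen;         /* the shared value was already at least p */
}
```

Inside the parallel region, p = atomic_fetch_max(&s, p); would then cover both branches of the original if/else: s is raised when p is larger, otherwise p picks up the shared value.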
I am trying to understand the exact difference between #pragma omp critical and #pragma omp single in OpenMP:
Microsoft definitions for these are:
Single: Lets you specify that a section of code should be executed on
a single thread, not necessarily the master thread.
Critical: Specifies that code is only executed on one thread at a
time.
So it means that in both, the exact section of code afterwards would be executed by just one thread and other threads will not enter that section e.g. if we print something, we will see the result on screen once, right?
What about the difference? It looks like critical takes care of the time of execution, but single doesn't! But I don't see any difference in practice! Does it mean that some kind of waiting or synchronization for the other threads (which do not enter that section) is involved in critical, but nothing holds the other threads in single? How can it change the outcome in practice?
I would appreciate it if anyone could clarify this for me, especially with an example. Thanks!
single and critical are two very different things. As you mentioned:
single specifies that a section of code should be executed by single thread (not necessarily the master thread)
critical specifies that code is executed by one thread at a time
So the former will be executed only once, while the latter will be executed as many times as there are threads.
For example the following code
int a=0, b=0;
#pragma omp parallel num_threads(4)
{
    #pragma omp single
    a++;
    #pragma omp critical
    b++;
}
printf("single: %d -- critical: %d\n", a, b);
will print
single: 1 -- critical: 4
I hope you see the difference better now.
For the sake of completeness, I can add that:
master is very similar to single with two differences:
master will be executed by the master only while single can be executed by whichever thread reaching the region first; and
single has an implicit barrier upon completion of the region, where all threads wait for synchronization, while master doesn't have any.
atomic is very similar to critical, but is restricted for a selection of simple operations.
I added these clarifications since these two pairs of directives are the ones people most often mix up...
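For what it's worth, the four constructs can be put side by side in a small counting sketch. The struct and function names below are invented for this illustration:

```c
/* How often each construct's body ran in one parallel region. */
typedef struct {
    int single_runs;
    int master_runs;
    int critical_runs;
    int atomic_runs;
} construct_counts;

construct_counts count_constructs(int num_threads) {
    int single_runs = 0, master_runs = 0, critical_runs = 0, atomic_runs = 0;
    #pragma omp parallel num_threads(num_threads)
    {
        #pragma omp single
        single_runs++;      /* one arbitrary thread, barrier afterwards */

        #pragma omp master
        master_runs++;      /* thread 0 only, no barrier */

        #pragma omp critical
        critical_runs++;    /* every thread, one at a time */

        #pragma omp atomic
        atomic_runs++;      /* every thread, as one indivisible update */
    }
    construct_counts c = { single_runs, master_runs, critical_runs, atomic_runs };
    return c;
}
```

With N threads in the team, single_runs and master_runs should both end up as 1, while critical_runs and atomic_runs should both equal N.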
single and critical belong to two completely different classes of OpenMP constructs. single is a worksharing construct, alongside for and sections. Worksharing constructs are used to distribute a certain amount of work among the threads. Such constructs are "collective" in the sense that in correct OpenMP programs all threads must encounter them while executing and moreover in the same sequential order, also including the barrier constructs. The three worksharing constructs cover three different general cases:
for (a.k.a. loop construct) distributes automatically the iterations of a loop among the threads - in most cases all threads get work to do;
sections distributes a sequence of independent blocks of code among the threads - some threads get work to do. This is a generalisation of the for construct as a loop with 100 iterations could be expressed as e.g. 10 sections of loops with 10 iterations each.
single singles out a block of code for execution by one thread only, often the first one to encounter it (an implementation detail) - only one thread gets work. single is to a great extent equivalent to sections with a single section only.
A common trait of all worksharing constructs is the presence of an implicit barrier at their end. The barrier might be turned off by adding the nowait clause to the corresponding OpenMP construct, but the standard does not require such behaviour, and with some OpenMP runtimes the barrier might remain despite the presence of nowait. Incorrectly ordered (i.e. out of sequence in some of the threads) worksharing constructs might therefore lead to deadlocks. A correct OpenMP program will never deadlock when the barriers are present.
critical is a synchronisation construct, alongside master, atomic, and others. Synchronisation constructs are used to prevent race conditions and to bring order in the execution of things.
critical prevents race conditions by preventing the simultaneous execution of code among the threads in the so-called contention group. This means all threads from all parallel regions encountering similarly named critical constructs get serialised;
atomic turns certain simple memory operations into atomic ones, usually by utilising special assembly instructions. Atomics complete at once as a single non-breakable unit. For example, an atomic read from some location by one thread, which happens concurrently with an atomic write to the same location by another thread, will either return the old value or the updated value, but never some kind of an intermediate mash-up of bits from both the old and the new values;
master singles out a block of code for execution by the master thread (thread with ID of 0) only. Unlike single, there is no implicit barrier at the end of the construct and also there is no requirement that all threads must encounter the master construct. Also, the lack of implicit barrier means that master does not flush the shared memory view of the threads (this is an important but very poorly understood part of OpenMP). master is basically a shorthand for if (omp_get_thread_num() == 0) { ... }.
critical is a very versatile construct as it is able to serialise different pieces of code in very different parts of the program code, even in different parallel regions (significant in the case of nested parallelism only). Each critical construct has an optional name provided in parenthesis immediately after. Anonymous critical constructs share the same implementation-specific name. Once a thread enters such a construct, any other thread encountering another construct of the same name is put on hold until the original thread exits its construct. Then the serialisation process continues with the rest of the threads.
An illustration of the concepts above follows. The following code:
#pragma omp parallel num_threads(3)
{
    foo();
    bar();
    ...
}
results in something like:
thread 0: -----< foo() >< bar() >-------------->
thread 1: ---< foo() >< bar() >---------------->
thread 2: -------------< foo() >< bar() >------>
(thread 2 is purposely a latecomer)
Having the foo(); call within a single construct:
#pragma omp parallel num_threads(3)
{
    #pragma omp single
    foo();
    bar();
    ...
}
results in something like:
thread 0: ------[-------|]< bar() >----->
thread 1: ---[< foo() >-|]< bar() >----->
thread 2: -------------[|]< bar() >----->
Here [ ... ] denotes the scope of the single construct and | is the implicit barrier at its end. Note how the latecomer thread 2 makes all other threads wait. Thread 1 executes the foo() call as the example OpenMP runtime chooses to assign the job to the first thread to encounter the construct.
Adding a nowait clause might remove the implicit barrier, resulting in something like:
thread 0: ------[]< bar() >----------->
thread 1: ---[< foo() >]< bar() >----->
thread 2: -------------[]< bar() >---->
Having the foo(); call within an anonymous critical construct:
#pragma omp parallel num_threads(3)
{
    #pragma omp critical
    foo();
    bar();
    ...
}
results in something like:
thread 0: ------xxxxxxxx[< foo() >]< bar() >-------------->
thread 1: ---[< foo() >]< bar() >------------------------->
thread 2: -------------xxxxxxxxxxxx[< foo() >]< bar() >--->
Here xxxxx... shows the time a thread spends waiting for other threads executing a critical construct of the same name before it can enter its own construct.
Critical constructs of different names do not synchronise with each other. E.g.:
#pragma omp parallel num_threads(3)
{
    if (omp_get_thread_num() > 1) {
        #pragma omp critical(foo2)
        foo();
    }
    else {
        #pragma omp critical(foo01)
        foo();
    }
    bar();
    ...
}
results in something like:
thread 0: ------xxxxxxxx[< foo() >]< bar() >---->
thread 1: ---[< foo() >]< bar() >--------------->
thread 2: -------------[< foo() >]< bar() >----->
Now thread 2 does not synchronise with the other threads because its critical construct is named differently and therefore makes a potentially dangerous simultaneous call into foo().
On the other hand, anonymous critical constructs (and in general constructs with the same name) synchronise with one another no matter where in the code they are:
#pragma omp parallel num_threads(3)
{
    #pragma omp critical
    foo();
    ...
    #pragma omp critical
    bar();
    ...
}
and the resulting execution timeline:
thread 0: ------xxxxxxxx[< foo() >]< ... >xxxxxxxxxxxxxxx[< bar() >]------------>
thread 1: ---[< foo() >]< ... >xxxxxxxxxxxxxxx[< bar() >]----------------------->
thread 2: -------------xxxxxxxxxxxx[< foo() >]< ... >xxxxxxxxxxxxxxx[< bar() >]->
What are the gcc command line statements to know the pthread calls for openmp directives? I know about the -fdump command line statements for generating IR file in assembly, gimple, rtl, trees. But I am unable to get any pthread dumps for openmp directives.
GCC does not directly convert OpenMP pragmas into Pthreads code. Rather it converts each OpenMP construct into a set of calls to the GNU OpenMP run-time library libgomp. You could get the intermediate representation by compiling with -fdump-tree-all. Look for a file (or files) with extension .ompexp.
Example:
#include <stdio.h>
int main() {
    int i;
    #pragma omp parallel for
    for(i=0; i<100; i++) {
        printf("asdf\n");
    }
}
The corresponding section of the .ompexp file that implements the parallel region:
<bb 2>:
  __builtin_GOMP_parallel_start (main.omp_fn.0, 0B, 0);
  main.omp_fn.0 (0B);
  __builtin_GOMP_parallel_end ();
GCC implements parallel regions via code outlining and in that case main.omp_fn.0 is the function that contains the body of the parallel region. In the function itself (omitted here for brevity) the for worksharing construct is implemented by using some simple mathematical calculations that determine the range of iterations for the corresponding thread.
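The kind of range calculation meant here can be pictured roughly as follows. This is an illustrative sketch of a static block distribution, not the code GCC actually emits; static_chunk and its parameters are made-up names:

```c
/* Splits iterations [0, n) into contiguous blocks, one per thread.
   Thread `tid` gets the half-open range [*start, *end). */
void static_chunk(long n, int nthreads, int tid, long *start, long *end) {
    long base = n / nthreads;     /* iterations every thread gets */
    long extra = n % nthreads;    /* the first `extra` threads get one more */
    *start = tid * base + (tid < extra ? tid : extra);
    *end = *start + base + (tid < extra ? 1 : 0);
}
```

For the 100-iteration loop above and three threads this would hand out the ranges [0, 34), [34, 67) and [67, 100).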
My code has following structure
<serial-code-1>
#pragma omp parallel
{
<parallel-code>
}
<serial-code-2>
I want to remove the implicit barrier synchronization at the end of the parallel region, something like nowait, so that any thread that finishes first can start doing serial-code-2 (this will require some changes in serial-code-2, but it's possible). How can I achieve something like this?
Perhaps
<serial-code-1>
#pragma omp parallel
{
    <parallel-code>
    #pragma omp single
    {
        <serial-code-2>
    }
}
The code inside the scope of the single directive, i.e. the serial code, will be executed by only one thread, probably the first one to finish executing the parallel code.
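A compilable sketch of that idea might look like this. The function name run_overlapped and the counters are placeholders invented for the example, and the nowait clause is an optional addition that merely avoids a second barrier right before the one that ends the parallel region:

```c
/* Returns how often the "serial" part ran; *parallel_units receives the
   number of threads that executed the parallel part. */
int run_overlapped(int num_threads, int *parallel_units) {
    int serial_runs = 0;
    int units = 0;
    #pragma omp parallel num_threads(num_threads)
    {
        #pragma omp atomic
        units++;                 /* stand-in for <parallel-code> */

        #pragma omp single nowait
        serial_runs++;           /* stand-in for <serial-code-2> */
    }
    *parallel_units = units;
    return serial_runs;
}
```

Note that the barrier at the end of the parallel region itself cannot be removed, so control only returns to the caller once every thread is done.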
One OpenMP directive I have never used and don't know when to use is flush(with and without a list).
I have two questions:
1.) When is an explicit `omp flush` or `omp flush(var1, ...)` necessary?
2.) Is it sometimes not necessary but helpful (i.e. can it make the code fast)?
The main reason I can't understand when to use an explicit flush is that flushes are done implicitly after many directives (e.g. barrier, single, ...) which synchronize the threads. I can't, for example, see why using flush and not synchronizing (e.g. with nowait) would be helpful.
I understand that different compilers may implement omp flush in different ways. Some may interpret a flush with a list as one without (i.e. flush all shared objects); see "OpenMP flush vs flush(list)". But I only care about what the specification requires. In other words, I want to know where an explicit flush may in principle be necessary or helpful.
Edit: I think I need to clarify my second question. Let me give an example. I would like to know if there are cases where removing an implicit flush (e.g. with nowait) and using an explicit flush instead, but only on certain shared variables, would be faster (and still give the correct result). Something like the following:
float a,b;
#pragma omp parallel
{
    #pragma omp for nowait // No barrier. Do not flush on exit.
    //code which uses only shared variable a
    #pragma omp flush(a) // Flush only variable a rather than all shared variables.
    #pragma omp for
    //Code which uses both shared variables a and b.
}
I think that code still needs a barrier after the first for loop, but all barriers have an implicit flush, so that defeats the purpose. Is it possible to have a barrier which does not do a flush?
The flush directive tells the OpenMP compiler to generate code to make the thread's private view on the shared memory consistent again. OpenMP usually handles this pretty well and does the right thing for typical programs. Hence, there's no need for flush.
However, there are cases where the OpenMP compiler needs some help. One of these cases is when you try to implement your own spin lock. In these cases, you would need a combination of flushes to make things work, since otherwise the spin variables will not be updated. Getting the sequence of flushes correct will be tough and will be very, very error prone.
The general recommendation is that flushes should not be used. If they are used at all, programmers should by all means avoid flush with a list (flush(var,...)). Some folks are actually talking about deprecating it in future OpenMP.
Performance-wise the impact of flush should be more negative than positive. Since it causes the compiler to generate memory fences and additional load/store operations, I would expect it to slow down things.
EDIT: For your second question, the answer is no. OpenMP makes sure that each thread has a consistent view on the shared memory when it needs to. If threads do not synchronize, they do not need to update their view on the shared memory, because they do not see any "interesting" change there. That means that any read a thread makes does not read any data that has been changed by some other thread. If that would be the case, then you'd have a race condition and a potential bug in your program. To avoid the race, you need to synchronize (which then implies a flush to make each participating thread's view consistent again). A similar argument applies to barriers. You use barriers to start a new epoch in the computation of a parallel region. Since you're keeping the threads in lock-step, you will very likely also have some shared state between the threads that has been computed in the previous epoch.
BTW, OpenMP may keep private data for a thread, but it does not have to. So, it is likely that the OpenMP compiler will keep variables in registers for a while, which causes them to be out of sync with the shared memory. However, updates to array elements are typically reflected pretty soon in the shared memory, since the amount of private storage for a thread is usually small (register sets, caches, scratch memory, etc.). OpenMP only gives you some weak restrictions on what you can expect. An actual OpenMP implementation (or the hardware) may be as strict as it wishes to be (e.g., write back any change immediately and do flushes all the time).
Not exactly an answer, but Michael Klemm's question is closed for comments. I think an excellent example of why flushes are so hard to understand and use properly is the following one copied (and shortened a bit) from the OpenMP Examples:
//http://www.openmp.org/wp-content/uploads/openmp-examples-4.0.2.pdf
//Example mem_model.2c, from Chapter 2 (The OpenMP Memory Model)
#include <stdio.h>
#include <omp.h>
int main() {
    int data, flag = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num()==0) {
            /* Write to the data buffer that will be read by thread 1 */
            data = 42;
            /* Flush data to thread 1 and strictly order the write to data
               relative to the write to the flag */
            #pragma omp flush(flag, data)
            /* Set flag to release thread 1 */
            flag = 1;
            /* Flush flag to ensure that thread 1 sees the change */
            #pragma omp flush(flag)
        }
        else if (omp_get_thread_num()==1) {
            /* Loop until we see the update to the flag */
            #pragma omp flush(flag, data)
            while (flag < 1) {
                #pragma omp flush(flag, data)
            }
            /* Values of flag and data are undefined */
            printf("flag=%d data=%d\n", flag, data);
            #pragma omp flush(flag, data)
            /* Value of data will be 42, value of flag still undefined */
            printf("flag=%d data=%d\n", flag, data);
        }
    }
    return 0;
}
Read the comments and try to understand.