OpenMP parallel for - what is default schedule? - parallel-processing

What schedule algorithm is used when no schedule clause is specified? I.e.:
#pragma omp parallel for
for (int i = 0; i < n; ++i)
Foo(i);

Start from the documentation that you have linked to. Section 2.7.1.1 Determining the Schedule of a Worksharing Loop reads:
If the loop directive does not have a schedule clause then the current value of the def-sched-var ICV determines the schedule.
The sentence preceding the quoted one refers to Section 2.3.1 which reads:
def-sched-var - controls the implementation defined default scheduling of loop regions. There is one copy of this ICV per device.
The table in Section 2.3.2 ICV Initialization states that the initial value of def-sched-var is implementation defined and that there is no environment variable that affects that value. Therefore the default loop schedule is implementation defined. Q.E.D.

Related

OpenMP atomic compare and swap

I have a shared variable s and private variable p inside parallel region.
How can I do the following atomically (or at least better than with critical section):
if ( p > s )
s = p;
else
p = s;
I.e., I need to update the global maximum (if local maximum is better) or read it, if it was updated by another thread.
OpenMP 5.1 introduced the compare clause which allows compare-and-swap (CAS) operations such as
#pragma omp atomic compare
if (s < p) s = p;
In combination with a capture clause, you should be able to achieve what you want:
int s_cap;
// here we capture the shared variable and also update it if p is larger
#pragma omp atomic compare capture
{
s_cap = s;
if (s < p) s = p;
}
// update p if the captured shared value is larger
if (s_cap > p) p = s_cap;
The only problem? The 5.1 spec is very new and, as of today (2020-11-27), none of the widespread compilers, i.e., those available on Godbolt, supports OpenMP 5.1. See here for a more or less up-to-date list. Adding compare is still listed as an unclaimed task on Clang's OpenMP page. GCC is still working on full OpenMP 5.0 support and the trunk build on Godbolt doesn't recognise compare. Intel's oneAPI compiler may or may not support it - it's not available on Godbolt and I can't get it to compile OpenMP code.
Your best bet for now is to use atomic capture combined with a compiler-specific CAS atomic, possibly in a loop.

Controlling number of cores being employed by parallel_run function of Scilab

How can I set the number of cores parallel_run function of Scilab uses?
From the parallel_run documentation:
Tuning the Parallelization with Configuration Parameters
As we have seen in the calling sequence, it is possible to add a
configuration parameter as a last argument to parallel_run. This
argument is handled by the params module and created with init_param()
(further information on how to handle parameters can be found in the
help pages of add_param, set_param and remove_param).
Number of workers
The number of computing resources used in parallel can be set by the
parameter nb_workers. The default value (0) uses as many workers as
there are cores available.
Code sample
function a=g(arg1)
a=arg1*arg1
endfunction
res=parallel_run(1:10, g, init_param('nb_workers', 2));

Does the main thread surely enters an OpenMP parallel section?

I need to check continuously through a callback provided by a closed source module a terminating condition. The thread then enters a parallel section. I don't know if this callback is safe to call from other threads other the one that received it, so if I want to use it from within the parallel section, I should have only the main, "originating" thread call it. I can do that, but I need the assumption that the main thread always enters it, or the callback won't be called. Does that hold?
From the OpenMP 4.5 Complete Specifications
When any thread encounters a parallel construct, the thread creates a
team of itself and zero or more additional threads and becomes the
master of the new team.
So yes the main thread enters the oMP section.
Note that the term section may not be appropriate here, as if you have several #pragma omp section, you won't know which thread will execute which one.
What you describe fits
#pragma omp master
See What is the benefit of '#pragma omp master' as opposed to '#pragma omp single'? for more details

explict flush directive with OpenMP: when is it necessary and when is it helpful

One OpenMP directive I have never used and don't know when to use is flush(with and without a list).
I have two questions:
1.) When is an explicit `omp flush` or `omp flush(var1, ...) necessary?
2.) Is it sometimes not necessary but helpful (i.e. can it make the code fast)?
The main reason I can't understand when to use an explicit flush is that flushes are done implicitly after many directives (e.g. as barrier, single, ...) which synchronize the threads. I can't, for example, see way using flush and not synchronizing (e.g. with nowait) would be helpful.
I understand that different compilers may implement omp flush in different ways. Some may interpret a flush with a list as as one without (i.e. flush all shared objects) OpenMP flush vs flush(list). But I only care about what the specification requires. In other words, I want to know where an explicit flush in principle may be necessary or helpful.
Edit: I think I need to clarify my second question. Let me give an example. I would like to know if there are cases where removing an implicit flush (e.g. with nowait) and instead using an explicit flush instead but only on certain shared variables would be faster (and still give the correct result). Something like the following:
float a,b;
#pragma omp parallel
{
#pragma omp for nowait // No barrier. Do not flush on exit.
//code which uses only shared variable a
#pragma omp flush(a) // Flush only variable a rather than all shared variables.
#pragma omp for
//Code which uses both shared variables a and b.
}
I think that code still needs a barrier after the the first for loop but all barriers have an implicit flush so that defeats the purpose. Is it possible to have a barrier which does not do a flush?
The flush directive tells the OpenMP compiler to generate code to make the thread's private view on the shared memory consistent again. OpenMP usually handles this pretty well and does the right thing for typical programs. Hence, there's no need for flush.
However, there are cases where the OpenMP compiler needs some help. One of these cases is when you try to implement your own spin lock. In these cases, you would need a combination of flushes to make things work, since otherwise the spin variables will not be updated. Getting the sequence of flushes correct will be tough and will be very, very error prone.
The general recommendation is that flushes should not be used. If at all, programmers should avoid flush with a list (flush(var,...)) at all means. Some folks are actually talking about deprecating it in future OpenMP.
Performance-wise the impact of flush should be more negative than positive. Since it causes the compiler to generate memory fences and additional load/store operations, I would expect it to slow down things.
EDIT: For your second question, the answer is no. OpenMP makes sure that each thread has a consistent view on the shared memory when it needs to. If threads do not synchronize, they do not need to update their view on the shared memory, because they do not see any "interesting" change there. That means that any read a thread makes does not read any data that has been changed by some other thread. If that would be the case, then you'd have a race condition and a potential bug in your program. To avoid the race, you need to synchronize (which then implies a flush to make each participating thread's view consistent again). A similar argument applies to barriers. You use barriers to start a new epoch in the computation of a parallel region. Since you're keeping the threads in lock-step, you will very likely also have some shared state between the threads that has been computed in the previous epoch.
BTW, OpenMP may keep private data for a thread, but it does not have to. So, it is likely that the OpenMP compiler will keep variables in registers for a while, which causes them to be out of sync with the shared memory. However, updates to array elements are typically reflected pretty soon in the shared memory, since the amount of private storage for a thread is usually small (register sets, caches, scratch memory, etc.). OpenMP only gives you some weak restrictions on what you can expect. An actual OpenMP implementation (or the hardware) may be as strict as it wishes to be (e.g., write back any change immediately and to flushes all the time).
Not exactly an answer, but Michael Klemm's question is closed for comments. I think an excellent example of why flushes are so hard to understand and use properly is the following one copied (and shortened a bit) from the OpenMP Examples:
//http://www.openmp.org/wp-content/uploads/openmp-examples-4.0.2.pdf
//Example mem_model.2c, from Chapter 2 (The OpenMP Memory Model)
int main() {
int data, flag = 0;
#pragma omp parallel num_threads(2)
{
if (omp_get_thread_num()==0) {
/* Write to the data buffer that will be read by thread */
data = 42;
/* Flush data to thread 1 and strictly order the write to data
relative to the write to the flag */
#pragma omp flush(flag, data)
/* Set flag to release thread 1 */
flag = 1;
/* Flush flag to ensure that thread 1 sees S-21 the change */
#pragma omp flush(flag)
}
else if (omp_get_thread_num()==1) {
/* Loop until we see the update to the flag */
#pragma omp flush(flag, data)
while (flag < 1) {
#pragma omp flush(flag, data)
}
/* Values of flag and data are undefined */
printf("flag=%d data=%d\n", flag, data);
#pragma omp flush(flag, data)
/* Values data will be 42, value of flag still undefined */
printf("flag=%d data=%d\n", flag, data);
}
}
return 0;
}
Read the comments and try to understand.

Difference between OpenMP threadprivate and private

I am trying to parallelize a C program using OpenMP.
I would like to know more about:
The differences between the threadprivate directive and the private clause and
In which cases we must use any of them.
As far as I know, the difference is the global scope with threadprivate and the preserved value across parallel regions. I found in several examples that when a piece of code contains some global/static variables that must be privatized, these variables are included in a threadprivate list and their initial values are copied into the private copies using copyin.
However, is there any rule that prevents us to use the private clause to deal with global/static variables? perhaps any implementation detail?
I couldn't find any explanation in the OpenMP3.0 specification.
The most important differences you have to memorize:
A private variable is local to a region and will most of the time be placed on the stack. The lifetime of the variable's privacy is the duration defined of the data scoping clause. Every thread (including the master thread) makes a private copy of the original variable (the new variable is no longer storage-associated with the original variable).
A threadprivate variable on the other hand will be most likely placed in the heap or in the thread local storage (that can be seen as a global memory local to a thread). A threadprivate variable persist across regions (depending on some restrictions). The master thread uses the original variable, all other threads make a private copy of the original variable (the master variable is still storage-associated with the original variable).
There are also more tricky differences:
Variables defined as private are undefined for each thread upon entering the construct and the corresponding shared variable is undefined when the parallel construct is exited; the initial status of a private pointer is undefine.
But data in the threadprivate common blocks should be assumed to be undefined on entry to the first parallel region unless a copyin clause is specified. When a common block appears in a threadprivate directive, each thread copy is initialized once prior to its first use.
The OpenMP Specifications (section 2.14.2) actually give a very good description (and also more detailled) of the threadprivate directive:
Each copy of a threadprivate variable is initialized once, in the manner specified by the program, but at an unspecified point in the program prior to the first reference to that copy. The storage of all copies of a threadprivate variable is freed according to how static variables are handled in the base language, but at an unspecified point in the program.
A program in which a thread references another thread’s copy of a threadprivate variable is non-conforming.
The content of a threadprivate variable can change across a task scheduling point if the executing thread switches to another task that modifies the variable. For more details on task scheduling, see Section 1.3 on page 14 and Section 2.11 on page 113.
In parallel regions, references by the master thread will be to the copy of the variable in the thread that encountered the parallel region.
During a sequential part references will be to the initial thread’s copy of the variable. The values of data in the initial thread’s copy of a threadprivate variable are guaranteed to persist between any two consecutive references to the variable in the program.
The values of data in the threadprivate variables of non-initial threads are guaranteed to persist between two consecutive active parallel regions only if all the following conditions hold:
Neither parallel region is nested inside another explicit parallel region.
The number of threads used to execute both parallel regions is the same.
The thread affinity policies used to execute both parallel regions are the same.
The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions.
If these conditions all hold, and if a threadprivate variable is referenced in both regions, then threads with the same thread number in their respective regions will reference the same copy of that variable.

Resources