How are firstprivate and lastprivate different than private clauses in OpenMP? - openmp

I've looked at the official definitions, but I'm still quite confused.
firstprivate: Specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value of the variable as it exists before the parallel construct.
To me, that sounds a lot like private. I've looked for examples, but I don't seem to understand how it's special or how it can be used.
lastprivate: Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).
I feel like I understand this one a bit better because of the following example:
#pragma omp parallel
{
    #pragma omp for lastprivate(i)
    for (i = 0; i < n - 1; i++)
        a[i] = b[i] + b[i+1];
}
a[i] = b[i];
So, in this example, I understand that lastprivate makes the value i had in the final iteration available outside the loop.
I just started learning OpenMP today.

private variables are not initialised, i.e. they start with random values like any other local automatic variable (and they are often implemented using automatic variables on the stack of each thread). Take this simple program as an example:
#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i = 10;

    #pragma omp parallel private(i)
    {
        printf("thread %d: i = %d\n", omp_get_thread_num(), i);
        i = 1000 + omp_get_thread_num();
    }

    printf("i = %d\n", i);
    return 0;
}
With four threads it outputs something like:
thread 0: i = 0
thread 3: i = 32717
thread 1: i = 32717
thread 2: i = 1
i = 10
(another run of the same program)
thread 2: i = 1
thread 1: i = 1
thread 0: i = 0
thread 3: i = 32657
i = 10
This clearly demonstrates that the value of i is random (not initialised) inside the parallel region and that any modifications to it are not visible after the parallel region (i.e. the variable keeps its value from before entering the region).
If i is made firstprivate, then it is initialised with the value that it has before the parallel region:
thread 2: i = 10
thread 0: i = 10
thread 3: i = 10
thread 1: i = 10
i = 10
Still, modifications to the value of i inside the parallel region are not visible after it.
You already know about lastprivate (and it is not applicable to the simple demonstration program as it lacks worksharing constructs).
So yes, firstprivate and lastprivate are just special cases of private. The first one results in bringing in values from the outside context into the parallel region while the second one transfers values from the parallel region to the outside context. The rationale behind these data-sharing classes is that inside the parallel region all private variables shadow the ones from the outside context, i.e. it is not possible to use an assignment operation to modify the outside value of i from inside the parallel region.
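To see both clauses working together, here is a minimal sketch (an illustrative example, not taken from the posts above): offset is copied into every thread by firstprivate, and lastprivate copies the value that last had in the final loop iteration back out to the enclosing context.

#include <stdio.h>

int main(void)
{
    int offset = 100;   /* firstprivate: each thread starts with this value */
    int last = -1;      /* lastprivate: receives the value from the last iteration */
    int i;

    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < 8; i++)
        last = offset + i;   /* every thread sees offset == 100 on entry */

    printf("last = %d\n", last);   /* prints 107, the value from iteration i == 7 */
    return 0;
}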

Note that reading a local variable before it has been initialized is undefined behaviour in C and C++; a compiler may warn about it, but it is not required to reject the program.


Difference between OMP_NUM_THREADS and OMP_THREAD_LIMIT [duplicate]

Heluuu,
I have a rather large program that I'm attempting to thread. So far, this has been successful, and the basics are all working as intended.
I now want to do some fancy work with cascading threads in nested mode. Essentially, I want the main parallel region to use any free threads in lower parallel regions.
To detail the current system, the main parallel region starts 10 threads. I have 12 cores, so I can use 2 more threads. There is a second parallel region where some heavy computing happens, and I want the first two threads that reach this point to start a new team there, each with 2 threads. Every new entry to the lower parallel region after this will continue in serial.
So, this should look like the following.
Main region: 10 threads started.
Lower region: 2 new threads started.
Thread 1: 2 threads in lower region.
Thread 2: 2 threads in lower region.
Thread 3-10: 1 thread in lower region.
Please keep in mind that these numbers are for the sake of clarity in providing a concrete description of my situation, and not the absolute and only case in which the program operates.
The code:
main() {
    ...
    ...
    omp_set_num_threads(n);
    omp_set_dynamic(x);

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < iterations; i++) {
            ...
            Compute();
            ...
        }
    }
}
And in Compute
bool Compute() {
    ...
    float nThreads = omp_get_thread_limit() - omp_get_num_threads();
    nThreads = ceil(nThreads / omp_get_num_threads());
    omp_set_num_threads((int)nThreads);

    #pragma omp parallel
    {
        ...
        #pragma omp for
        for (int i = 0; i < nReductSize; i++) {
            ...
        }
    }
}
Now, my problem is that setting the uppermost limit for the whole program (i.e. OMP_THREAD_LIMIT) only works from outside the program. Using
export OMP_THREAD_LIMIT=5
from the bash command line works great. But I want to do it internally. So far, I've tried
putenv("OMP_THREAD_LIMIT=12");
setenv("OMP_THREAD_LIMIT", "12", 1);
but when I call omp_get_thread_limit() or getenv("OMP_THREAD_LIMIT") I get wacky return values. Even when I set the variable with export, calling getenv("OMP_THREAD_LIMIT"); returns 0.
So, I would ask for your help in this: How do I properly set OMP_THREAD_LIMIT at runtime?
This is the main function where I set the thread defaults. It is executed well before any threading occurs:
#ifdef _OPENMP
const char *name = "OMP_THREAD_LIMIT";
const char *value = "5";
int overwrite = 1;
int success = setenv(name, value, overwrite);
cout << "Var set (0 is success): " << success << endl;
#endif
Oh, and setenv reports success in setting the variable.
Compiler says
gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
Flags
CCFLAGS = -c -O0 -fopenmp -g -msse -msse2 -msse3 -mfpmath=sse -std=c++0x
OpenMP version is 3.0.
This is a correct OpenMP implementation, and it is required to ignore changes made to the environment from inside the program. As stated in the OpenMP 3.1 standard, page 159:
Modifications to the environment variables after the program has started, even if
modified by the program itself, are ignored by the OpenMP implementation.
You are doing exactly what is said in this paragraph.
OpenMP allows such parameters to be changed only via the omp_set_* functions, but there is no such function for the thread-limit-var ICV:
However, the settings of some of the ICVs can be modified during the execution of the OpenMP
program by the use of the appropriate directive clauses or OpenMP API routines.
I think you can use the num_threads clause of #pragma omp parallel to achieve what you want.
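A minimal sketch of that suggestion (illustrative only, not the asker's code): the inner team size is chosen per outer thread with the num_threads clause, assuming nested parallelism is enabled.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);        /* allow nested parallel regions (OpenMP 3.0) */
    omp_set_num_threads(4);   /* outer team size */

    #pragma omp parallel
    {
        /* give the first two outer threads an inner team of 2; the rest run serially */
        int inner = (omp_get_thread_num() < 2) ? 2 : 1;

        #pragma omp parallel num_threads(inner)
        {
            printf("outer thread %d, inner thread %d of %d\n",
                   omp_get_ancestor_thread_num(1),
                   omp_get_thread_num(), omp_get_num_threads());
        }
    }
    return 0;
}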
Changing the behavior of OpenMP using OMP_THREAD_LIMIT (or any other OMP_* environment variable) is not possible after the program has started; these are intended for use by the user. You could have the user invoke your program through a script that sets OMP_THREAD_LIMIT and then calls your program, but that's probably not what you need to do in this case.
OMP_NUM_THREADS, omp_set_num_threads, and the num_threads clause are usually used to set the number of threads operating in a region.
It might be off-topic, but you may want to try OpenMP's collapse clause instead of handcrafting the nesting here.
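For reference, a minimal sketch of the collapse approach (illustrative only; it assumes the work can be rewritten as two perfectly nested loops, which is not the case while Compute() opens its own parallel region; iterations and nReductSize are names borrowed from the question):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int iterations = 8, nReductSize = 1000;
    double sum = 0.0;

    /* collapse(2) fuses both loops into one iteration space of
       iterations * nReductSize chunks, shared among all threads */
    #pragma omp parallel for collapse(2) reduction(+:sum)
    for (int i = 0; i < iterations; i++)
        for (int j = 0; j < nReductSize; j++)
            sum += i + 0.001 * j;

    printf("sum = %f\n", sum);
    return 0;
}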

Difference between num_threads vs. omp_set_num_threads vs OMP_NUM_THREADS

I am quite confused about the ways to specify the number of threads in parallel part of a code.
I know I can use:
the environment variable OMP_NUM_THREADS
function omp_set_num_threads(int)
num_threads(int) in #pragma omp parallel for num_threads(NB_OF_THREADS)
From what I have gathered so far, the first two are equivalent. But what about the third one?
Can someone provide a more detailed explanation of the difference? I could not find any information on the internet regarding the difference between 1/2 and 3.
OMP_NUM_THREADS and omp_set_num_threads() are not equivalent. The environment variable is only used to set the initial value of the nthreads-var ICV (internal control variable) which controls the maximum number of threads in a team. omp_set_num_threads() can be used to change the value of nthreads-var at any time (outside of any parallel regions, of course) and affects all subsequent parallel regions. Therefore setting a value, e.g. n, to OMP_NUM_THREADS is equivalent to calling omp_set_num_threads(n) before the very first parallel region is encountered.
The algorithm to determine the number of threads in a parallel region is very clearly described in the OpenMP specification that is available freely on the OpenMP website:
if a num_threads clause exists
then let ThreadsRequested be the value of the num_threads clause expression;
else let ThreadsRequested = value of the first element of nthreads-var;
The priority of the different ways to set nthreads-var is listed in the ICV Override Relationships part of the specification:
The num_threads clause and omp_set_num_threads() override the value of the OMP_NUM_THREADS environment variable and the initial value of the first element of the nthreads-var ICV.
Translated into human language, that is:
OMP_NUM_THREADS (if present) specifies initially the number of threads;
calls to omp_set_num_threads() override the value of OMP_NUM_THREADS;
the presence of the num_threads clause overrides both other values.
The actual number of threads used is also affected by whether dynamic team sizes are enabled (dyn-var ICV settable via OMP_DYNAMIC and/or omp_set_dynamic()), by whether a thread limit is imposed by thread-limit-var (settable via OMP_THREAD_LIMIT), as well as by whether nested parallelism (OMP_NESTED / omp_set_nested()) is enabled or not.
Think of it like scope. Option 3 (num_threads) sets the number of threads for that one parallel region only; the other options are global/state settings. I generally don't set the number of threads and just use the defaults. When I do change it, it's usually for a special case, so I use option 3 so that the next parallel region goes back to the global (default) setting. See the code below: after option 3 is used, the next team of threads returns to the last global setting.
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("%d\n", omp_get_num_threads());
        }
    }

    omp_set_num_threads(8);
    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("%d\n", omp_get_num_threads());
        }
    }

    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        {
            printf("%d\n", omp_get_num_threads());
        }
    }

    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("%d\n", omp_get_num_threads());
        }
    }
}
4
8
2
8

Is it possible to inject values in the frama-c value analyzer?

I'm experimenting with the frama-c value analyzer to evaluate C-Code, which is actually threaded.
I want to ignore any threading problems that might occur and just inspect the possible values for a single thread. So far this works by setting the entry point to where the thread starts.
Now to my problem: inside one thread I read values that are written by another thread. Because frama-c does not (and should not?) currently consider threading, it assumes my variable is in some broad range, but I know that the range is in fact much smaller.
Is it possible to tell the value analyzer the value range of this variable?
Example:
volatile int x = 0;

void f() {
    while (x == 0)
        sleep(100);
    ...
}
Here frama-c detects that x is volatile and thus has range [--..--], but I know what the other thread will write into x, and I want to tell the analyzer that x can only be 0 or 1.
Is this possible with frama-c, especially in the gui?
Thanks in advance
Christian
This is currently not possible automatically. The value analysis considers that volatile variables always contain the full range of values of their underlying type. There however exists a proprietary plug-in that transforms accesses to volatile variables into calls to a user-supplied function. In your case, your code would be transformed into essentially this:
int x = 0;

void f() {
    while (1) {
        x = f_volatile_x();
        if (x != 0)
            break;
        sleep(100);
    }
    ...
}
By specifying f_volatile_x correctly, you can ensure it returns values between 0 and 1 only.
If the variable 'x' is not modified in the thread you are studying, you could also initialize it at the beginning of the 'main' function with:
x = Frama_C_interval (0, 1);
This is a function defined by Frama-C in ...../share/frama-c/builtin.c, so you have to add this file to your inputs when you use it.
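Putting the two suggestions together, a sketch of what the user-supplied function could look like (f_volatile_x and its prototype here are assumptions; only Frama_C_interval is named in the answers above, and builtin.c must still be added to the analyzed files):

/* sketch: model for reads of the volatile variable x */
int Frama_C_interval(int, int);   /* defined in Frama-C's builtin.c */

int f_volatile_x(void)
{
    return Frama_C_interval(0, 1);   /* the analyzer now sees x in [0, 1] */
}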

C/OpenMP - issue with threadprivate and vectors of pointers

I'm new to the world of parallel programming and openmp, so this may be a futile question, but I can't really come up with good answer to what I'm experiencing, so I hope someone will be able to shed some light on the matter.
What I am trying to achieve is to have a private copy of a dynamically allocated matrix (of integers) for every thread that will handle the following parallel section, but as soon as the flow of execution enters said region, the reference to the supposedly private matrix holds a null value.
Is there any limitation of this directive I'm not aware of? Everything seems to work just fine with monodimensional dynamic arrays.
A snippet of the code is the following one...
#define n 10000

int **matrix;
#pragma omp threadprivate(matrix)

int main()
{
    matrix = (int**) calloc(n, sizeof(int*));
    for (i = 0; i < n; i++) matrix[i] = (int*) calloc(n, sizeof(int));
    AdjacencyMatrix(n, matrix);
    ...

    /* Explicitly turn off dynamic threads */
    omp_set_dynamic(0);

    #pragma omp parallel
    {
        // From now on, matrix is NULL...
        executor_p(matrix, n);
    }
....
Look at the OpenMP documentation regarding what happens with the threadprivate clause:
On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined, unless a COPYIN clause is specified in the PARALLEL directive
There's no guarantee of what value is going to be stored in the matrix variable in the parallel region.
OpenMP can privatise only variables with known storage size. That is, you can have a private copy of an array if it was defined like double matrix[N][M]. In your case, not only is the storage size unknown (a pointer doesn't store the number of elements it points to), but your matrix is also not a contiguous area in memory; rather, it is a pointer to a list of dynamically allocated rows.
What you would end up with is having a private copy of the top-level pointer, not a private copy of the matrix data itself.
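One possible workaround, sketched under the assumption that every thread really needs its own independent copy of the data: keep matrix threadprivate, but allocate (and free) it inside the parallel region, so each thread fills in its own pointer.

#include <stdlib.h>

#define n 1000   /* smaller than the question's 10000, just for the sketch */

int **matrix;
#pragma omp threadprivate(matrix)

int main(void)
{
    #pragma omp parallel
    {
        /* every thread allocates rows into its own threadprivate pointer */
        matrix = calloc(n, sizeof(int *));
        for (int i = 0; i < n; i++)
            matrix[i] = calloc(n, sizeof(int));

        /* ... per-thread work on matrix, e.g. executor_p(matrix, n) ... */

        for (int i = 0; i < n; i++)
            free(matrix[i]);
        free(matrix);
    }
    return 0;
}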

How do you measure the time a function takes to execute?

How can you measure the amount of time a function will take to execute?
This is a relatively short function and the execution time would probably be in the millisecond range.
This particular question relates to an embedded system, programmed in C or C++.
The best way to do that on an embedded system is to set an external hardware pin when you enter the function and clear it when you leave the function. This is done preferably with a little assembly instruction so you don't skew your results too much.
Edit: One of the benefits is that you can do it in your actual application and you don't need any special test code. External debug pins like that are (should be!) standard practice for every embedded system.
There are three potential solutions:
Hardware Solution:
Use a free output pin on the processor and hook an oscilloscope or logic analyzer to the pin. Initialize the pin to a low state; just before calling the function you want to measure, assert the pin to a high state, and just after returning from the function, deassert it.
*io_pin = 1;
myfunc();
*io_pin = 0;
Bookworm solution:
If the function is fairly small and you can manage the disassembled code, you can crack open the processor architecture databook and count the cycles it will take the processor to execute every instruction. This will give you the number of cycles required.
Time = # cycles / processor clock rate
This is easier to do for smaller functions, or for code written in assembler (for a PIC microcontroller, for example).
Timestamp counter solution:
Some processors have a timestamp counter which increments at a rapid rate (every few processor clock ticks). Simply read the timestamp before and after the function.
This will give you the elapsed time, but beware that you might have to deal with the counter rollover.
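On a desktop-class x86 part, this is the classic rdtsc approach; a minimal sketch assuming GCC/Clang and the __rdtsc() intrinsic (not applicable to small microcontrollers without such a counter; myfunc is a placeholder):

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(): GCC/Clang intrinsic, x86 only */

static void myfunc(void)
{
    /* placeholder for the function under test */
    volatile int sink = 0;
    for (int i = 0; i < 100000; i++)
        sink += i;
}

int main(void)
{
    unsigned long long start = __rdtsc();
    myfunc();
    unsigned long long end = __rdtsc();

    /* elapsed timestamp-counter ticks; divide by the TSC frequency for seconds */
    printf("elapsed ticks: %llu\n", end - start);
    return 0;
}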
Invoke it in a loop with a ton of invocations, then divide by the number of invocations to get the average time.
so:
// begin timing
for (int i = 0; i < 10000; i++) {
invokeFunction();
}
// end time
// divide by 10000 to get actual time.
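A runnable sketch of that averaging idea using standard C's clock() (invokeFunction is a placeholder; clock() measures CPU time, which is usually fine for this purpose):

#include <stdio.h>
#include <time.h>

static void invokeFunction(void)
{
    /* placeholder for the function under test */
    volatile int sink = 0;
    for (int i = 0; i < 1000; i++)
        sink += i;
}

int main(void)
{
    const int runs = 10000;

    clock_t begin = clock();
    for (int i = 0; i < runs; i++)
        invokeFunction();
    clock_t end = clock();

    /* average CPU time per call, in seconds */
    double per_call = (double)(end - begin) / CLOCKS_PER_SEC / runs;
    printf("average: %g s per call\n", per_call);
    return 0;
}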
If you're using Linux, you can time a program's runtime by typing on the command line:
time [program_name]
if you run only the function in main() (assuming C++), the rest of the app's time should be negligible.
I repeat the function call a lot of times (millions) but also employ the following method to discount the loop overhead:
start = getTicks();
repeat n times {
myFunction();
myFunction();
}
lap = getTicks();
repeat n times {
myFunction();
}
finish = getTicks();
// overhead + function + function
elapsed1 = lap - start;
// overhead + function
elapsed2 = finish - lap;
// overhead + function + function - overhead - function = function
ntimes = elapsed1 - elapsed2;
once = ntimes / n; // Average time it took for one function call, sans loop overhead
Instead of calling function() twice in the first loop and once in the second loop, you could just call it once in the first loop and don't call it at all (i.e. empty loop) in the second, however the empty loop could be optimized out by the compiler, giving you negative timing results :)
start_time = timer
function()
exec_time = timer - start_time
Windows XP/NT Embedded or Windows CE/Mobile
You can use QueryPerformanceCounter() to get the value of a VERY FAST counter before and after your function. Then you subtract those 64-bit values to get a delta of "ticks". Using QueryPerformanceFrequency() you can convert the "delta ticks" to an actual time unit. You can refer to the MSDN documentation about those WIN32 calls.
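A minimal Win32 sketch of that (myfunc is a placeholder; availability of the high-resolution counter on a given CE/Mobile device is an assumption):

#include <windows.h>
#include <stdio.h>

static void myfunc(void)
{
    /* placeholder for the function under test */
    volatile int sink = 0;
    for (int i = 0; i < 100000; i++)
        sink += i;
}

int main(void)
{
    LARGE_INTEGER freq, start, end;

    QueryPerformanceFrequency(&freq);   /* counter ticks per second */
    QueryPerformanceCounter(&start);
    myfunc();
    QueryPerformanceCounter(&end);

    double seconds = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("elapsed: %f s\n", seconds);
    return 0;
}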
Other embedded systems
Without operating systems or with only basic OSes you will have to:
program one of the internal CPU timers to run and count freely.
configure it to generate an interrupt when the timer overflows, and in this interrupt routine increment a "carry" variable (this is so you can actually measure time longer than the resolution of the timer chosen).
before your function you save BOTH the "carry" value and the value of the CPU register holding the running ticks for the counting timer you configured.
do the same just after your function.
subtract them to get a delta counter tick.
from there it is just a matter of knowing how long a tick means on your CPU/Hardware given the external clock and the de-multiplication you configured while setting up your timer. You multiply that "tick length" by the "delta ticks" you just got.
VERY IMPORTANT: Do not forget to disable interrupts before and restore them after reading those timer values (both the carry and the register value), otherwise you risk saving incorrect values.
NOTES
This is very fast because it takes only a few assembly instructions to disable interrupts, save two integer values and re-enable interrupts. The actual subtraction and conversion to real time units occurs OUTSIDE the zone of time measurement, that is AFTER your function.
You may wish to put that code into a function so you can reuse it all around, but it may slow things down a bit because of the function call and the pushing of all the registers and parameters onto the stack, then popping them again. In an embedded system this may be significant. In C it may be better to use macros instead, or to write your own assembly routine that saves/restores only the relevant registers.
It depends on your embedded platform and what type of timing you are looking for. For embedded Linux, there are several ways you can accomplish this. If you wish to measure the amount of CPU time used by your function, you can do the following:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

#define SEC_TO_NSEC(s) ((s) * 1000 * 1000 * 1000)

int work_function(int c) {
    // do some work here
    int i, j;
    int foo = 0;
    for (i = 0; i < 1000; i++) {
        for (j = 0; j < 1000; j++) {
            foo ^= i + j;
        }
    }
    return foo;
}

int main(int argc, char *argv[]) {
    struct timespec pre;
    struct timespec post;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &pre);
    work_function(0);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &post);
    printf("time %ld\n",
           (SEC_TO_NSEC(post.tv_sec) + post.tv_nsec) -
           (SEC_TO_NSEC(pre.tv_sec) + pre.tv_nsec));
    return 0;
}
You will need to link this with the realtime library, just use the following to compile your code:
gcc -o test test.c -lrt
You may also want to read the man page on clock_gettime; there are some issues with running this code on SMP-based systems that could invalidate your testing. You could use something like sched_setaffinity() or the command-line cpuset to force the code onto only one core.
If you are looking to measure user and system time, then you could use times(NULL), which returns something like jiffies. Or you can change the parameter for clock_gettime() from CLOCK_THREAD_CPUTIME_ID to CLOCK_MONOTONIC...but be careful of wrap-around with CLOCK_MONOTONIC.
For other platforms, you are on your own.
Drew
I always implement an interrupt driven ticker routine. This then updates a counter that counts the number of milliseconds since start up. This counter is then accessed with a GetTickCount() function.
Example:
#define TICK_INTERVAL 1   // milliseconds between ticker interrupts

static unsigned long tickCounter;

interrupt ticker (void)
{
    tickCounter += TICK_INTERVAL;
    ...
}

unsigned long GetTickCount(void)
{
    return tickCounter;
}
In your code you would time the code as follows:
int function(void)
{
    unsigned long start = GetTickCount();

    do something ...

    printf("Time is %lu", GetTickCount() - start);
}
In OS X terminal (and probably Unix, too), use "time":
time python function.py
If the code is .NET, use the Stopwatch class (.NET 2.0+), NOT DateTime.Now. DateTime.Now isn't updated accurately enough and will give you crazy results.
If you're looking for sub-millisecond resolution, try one of these timing methods. They'll all get you resolution in at least the tens or hundreds of microseconds:
If it's embedded Linux, look at Linux timers:
http://linux.die.net/man/3/clock_gettime
Embedded Java, look at nanoTime(), though I'm not sure this is in the embedded edition:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#nanoTime()
If you want to get at the hardware counters, try PAPI:
http://icl.cs.utk.edu/papi/
Otherwise you can always go to assembler. You could look at the PAPI source for your architecture if you need some help with this.
