Difference between OpenMP threadprivate and private

I am trying to parallelize a C program using OpenMP.
I would like to know more about:
The differences between the threadprivate directive and the private clause, and
the cases in which we must use each of them.
As far as I know, the differences are the global scope of threadprivate variables and their preserved values across parallel regions. In several examples I found that when a piece of code contains global/static variables that must be privatized, these variables are included in a threadprivate list and their initial values are copied into the private copies using copyin.
However, is there any rule that prevents us from using the private clause to deal with global/static variables? Perhaps some implementation detail?
I couldn't find any explanation in the OpenMP 3.0 specification.

The most important differences you have to memorize:
A private variable is local to a region and will most of the time be placed on the stack. The lifetime of the variable's privacy is the duration of the construct to which the data-scoping clause applies. Every thread (including the master thread) makes a private copy of the original variable (the new variable is no longer storage-associated with the original variable).
A threadprivate variable, on the other hand, will most likely be placed on the heap or in thread-local storage (which can be seen as global memory local to each thread). A threadprivate variable persists across regions (subject to some restrictions). The master thread uses the original variable; all other threads make a private copy of the original variable (the master copy is still storage-associated with the original variable).
There are also more tricky differences:
Variables defined as private are undefined for each thread upon entering the construct, and the corresponding shared variable is undefined when the parallel construct is exited; the initial status of a private pointer is likewise undefined.
Data in threadprivate common blocks, in contrast, should be assumed to be undefined on entry to the first parallel region unless a copyin clause is specified. When a common block appears in a threadprivate directive, each thread's copy is initialized once prior to its first use.
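To make the distinction concrete, here is a minimal C sketch (assuming a compiler with OpenMP support, e.g. gcc -fopenmp; the names counter and local are made up for illustration):

#include <stdio.h>
#include <omp.h>

int counter = 10;                  /* file-scope variable with static storage */
#pragma omp threadprivate(counter) /* every thread gets a persistent copy */

int main(void)
{
    int local = 10;

    /* private: each thread's copy of 'local' starts out UNINITIALIZED */
    #pragma omp parallel private(local)
    {
        local = omp_get_thread_num();    /* must assign before reading */
    }

    /* copyin: each thread's 'counter' is initialized from the master copy */
    #pragma omp parallel copyin(counter)
    {
        counter += omp_get_thread_num(); /* per-thread copy, may persist
                                            into later regions */
    }

    printf("counter = %d\n", counter);   /* the master thread worked on the
                                            original variable, so this prints 10 */
    return 0;
}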
The OpenMP specification (section 2.14.2) actually gives a very good (and more detailed) description of the threadprivate directive:
Each copy of a threadprivate variable is initialized once, in the manner specified by the program, but at an unspecified point in the program prior to the first reference to that copy. The storage of all copies of a threadprivate variable is freed according to how static variables are handled in the base language, but at an unspecified point in the program.
A program in which a thread references another thread’s copy of a threadprivate variable is non-conforming.
The content of a threadprivate variable can change across a task scheduling point if the executing thread switches to another task that modifies the variable. For more details on task scheduling, see Section 1.3 on page 14 and Section 2.11 on page 113.
In parallel regions, references by the master thread will be to the copy of the variable in the thread that encountered the parallel region.
During a sequential part references will be to the initial thread’s copy of the variable. The values of data in the initial thread’s copy of a threadprivate variable are guaranteed to persist between any two consecutive references to the variable in the program.
The values of data in the threadprivate variables of non-initial threads are guaranteed to persist between two consecutive active parallel regions only if all the following conditions hold:
Neither parallel region is nested inside another explicit parallel region.
The number of threads used to execute both parallel regions is the same.
The thread affinity policies used to execute both parallel regions are the same.
The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions.
If these conditions all hold, and if a threadprivate variable is referenced in both regions, then threads with the same thread number in their respective regions will reference the same copy of that variable.
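To illustrate those persistence rules in practice, here is a hedged C sketch: with dynamic adjustment disabled and the same thread count in both regions, each thread should see its own value again in the second region (names are illustrative):

#include <stdio.h>
#include <omp.h>

int tp_val = 0;
#pragma omp threadprivate(tp_val)

int main(void)
{
    omp_set_dynamic(0);      /* dyn-var is false at entry to both regions */
    omp_set_num_threads(4);  /* same thread count for both regions */

    #pragma omp parallel
    {
        tp_val = omp_get_thread_num() + 100;
    }

    /* Neither region is nested and the conditions above hold, so each
       thread with the same thread number should see its earlier copy. */
    #pragma omp parallel
    {
        printf("thread %d sees tp_val = %d\n", omp_get_thread_num(), tp_val);
    }
    return 0;
}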

Related

Does Ada deallocate memory automatically under some circumstances?

I was trying to find some information as to why the keyword new can be used to dynamically allocate objects but there is no keyword like delete that could be used to deallocate them. Going through mentions of Ada.Unchecked_Deallocation in the Ada 2012 Reference Manual I found a few interesting excerpts:
Every object is finalized before being destroyed (for example, by leaving a subprogram_body containing an object_declaration, or by a call to an instance of Unchecked_Deallocation).

Each access-to-object type has an associated storage pool. The storage allocated by an allocator comes from the pool; instances of Unchecked_Deallocation return storage to the pool.

The Deallocate procedure of a user-defined storage pool object P may be called by the implementation to deallocate storage for a type T whose pool is P only at the places when an Allocate call is allowed for P, during the execution of an instance of Unchecked_Deallocation for T, or as part of the finalization of the collection of T.
If I had to guess, what that means is that it is possible for an implementation to automatically deallocate an object designated by an access value when execution leaves the scope in which the access type was declared. No need for explicit calls to Unchecked_Deallocation.
This seems to be supported by a section in Ada 95 Quality and Style Guide which states:
The unchecked storage deallocation mechanism is one method for overriding the default time at which allocated storage is reclaimed. The earliest default time is when an object is no longer accessible, for example, when control leaves the scope where an access type was declared (the exact point after this time is implementation-dependent). Any unchecked deallocation of storage performed prior to this may result in an erroneous Ada program if an attempt is made to access the object.
But the wording is rather unclear. If I were to run this code, what exactly would happen on the memory side of things?
with Ada.Text_IO; use Ada.Text_IO;

procedure Main is
   procedure Run is
      X : access Integer := new Integer'(64);
   begin
      Put (Integer'Image (X.all));
   end Run;
begin
   for I in 1 .. 16 loop
      Run;
   end loop;
end Main;
with Ada.Text_IO; use Ada.Text_IO;

procedure Main is
   procedure Outer is
      type Integer_Access is not null access Integer;
      procedure Run is
         Y : Integer_Access := new Integer'(64);
      begin
         Put (Integer'Image (Y.all));
      end Run;
   begin
      for I in 1 .. 16 loop
         Run;
      end loop;
   end Outer;
begin
   Outer;
end Main;
Is there a guaranteed memory leak or is X deallocated when Run finishes?
As outlined in Memory Management with Ada 2012, cited here, a local variable is typically allocated on the stack; its memory is automatically released when the variable's scope exits. In contrast, a dynamic variable is typically allocated on the heap; its memory is allocated using new, and it must be reclaimed, usually:
Explicitly, e.g. using an instance of Unchecked_Deallocation.
Implicitly, e.g. using a controlled type derived from Finalization; as noted here, when the scope of a controlled instance exits, automatic finalization calls Finalize, which reclaims storage in a manner suitable to the type's design.
The children of Ada.Containers use controlled types internally to encapsulate access values and manage memory automatically. For reference, compare your compiler's implementation of a particular container to the corresponding functional container cited here.
Ada offers a variety of ways to manage memory, summarized on slide 28 in the author's order of preference:
Stack-based.
Container-based.
Finalization-based.
Subpool-based.
Manual allocate/deallocate.
In the particular case of Main, the program allocates storage for 16 instances of Integer. As noted on slide 12, "A compiler may reclaim allocated memory when the corresponding access type goes out of scope." For example, a recent version of the GNAT reference manual indicates that the following storage management implementation advice is followed:
A storage pool for an anonymous access type should be created at the point of an allocator for the type, and be reclaimed when the designated object becomes inaccessible.
Absent such an indication, the storage is not required to be reclaimed. It is typically reclaimed by the host operating system when the program exits.
Do your programs leak memory? It depends on the compiler.
AFAIK, there are only two times when a compiler is required to reclaim allocated memory:
when an access type with Storage_Size specified goes out of scope
when an instance of Ada.Unchecked_Deallocation is called with a non-null value
However, a compiler is allowed to reclaim memory in other cases. For example, a compiler may implement garbage collection, but I don't know of any that do.
FWIW, I don't know of any compiler for which your programs don't leak memory.

Registers usage during compilation

I found information that general purpose registers r1-r23 and r26-r28 are used by the compiler to store local variables, but do they have any other purpose? Also, which memory are these registers part of (cache/RAM)?
Finally, what does the global pointer gp in register r26 point to?
Also, which memory are these registers part of (cache/RAM)?
Registers are on-processor storage allowing fast data transfer (typically two reads and one write per cycle). They store variables that can represent memory addresses but, besides that, are completely unrelated to memory or cache.
I found information that general purpose registers r1-r23 and r26-r28 are used by the compiler to store local variables, but do they have any other purpose?
Registers are used according to hardware or software conventions. Hardware conventions are tied to the instruction set architecture. For instance, the call instruction transfers control to a subroutine and stores the return address in register r31 (ra). Very nasty things are likely to happen if you overwrite r31 by any means without precautions. Software conventions are supposed to ensure proper behavior if used consistently within software. They indicate which registers have special uses, which must be saved when context switching, etc. These conventions can be changed without hardware modifications, but doing so will probably require changes in several software tools (compiler, linker, loader, OS, ...).
general purpose registers r1-r23 and r26-r28 are used by the compiler to store local variables
Actually, some registers are reserved.
r1 is used by the assembler for macro expansion. (sw)
r2-r7 are used by the compiler to pass arguments to functions or to get return values. (sw)
r24-r25 can only be used by exception handlers. (sw)
r26-r28 hold different pointers (global, stack, frame) that are set either by the runtime or the compiler and cannot be modified by the programmer. (sw)
r29-r31 are hardware-coded return addresses for subprograms or interrupts/exceptions. (hw)
So only r8-r23 can be used by the compiler.
but do they have any other purpose?
No, and that's why they can be freely used by the compiler or programmer.
Finally, what does the global pointer in register r26 point to?
Memory accesses with loads or stores use based addressing: the effective address for ldx or stx (where 'x' is b, bu, h, etc., depending on the data characteristics) is computed by adding a 16-bit immediate to a register. This only allows reaching an address within +/-32k of the register's content.
If the processor has the address of a variable in a register (for instance, the value returned by a malloc), the immediate provides a displacement to access fields in a struct, the next array element, etc.
If the variable is local or global, its address must be computed by the program. Pointer registers are used for that purpose. Addresses of local variables are computed by adding an immediate to the stack pointer (r27, or sp).
Addresses of global or static variables are computed by adding an integer to the global pointer (r26, or gp). The content of gp corresponds to the start of the memory data segment; it is initialized by the loader just before program execution and must not be modified. The immediate displacement with respect to the start of the data segment is computed by the linker when it defines the memory layout.
Note that this only allows access to 64k of memory because of the 16-bit immediate width. If the size of the global/static variables exceeds this value and a variable is not within range, a couple of instructions are required to load the full 32-bit address of the variable before the data transfer. With gp this is not required, so it provides faster access to global variables.
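To see how that plays out in compiled code, here is an illustrative C sketch; the comments describe the kind of base+offset accesses a compiler for such an architecture might emit (the instruction forms in the comments are indicative, not an exact listing from any particular toolchain):

/* Illustrative sketch only. */

int g_count;                    /* global: reached as a gp-relative offset */

int sum4(const int *arr)        /* 'arr' arrives in a register             */
{
    int local = 0;              /* local: reached as an sp-relative offset */
    for (int i = 0; i < 4; i++)
        local += arr[i];        /* ldw rX, imm(rArr): base register plus a
                                   16-bit immediate displacement           */
    g_count = local;            /* stw rX, imm(gp): the linker computes the
                                   displacement from the data segment start */
    return local;
}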

When calling an internal subroutine in a parallel Fortran do loop, correct value of iteration variable is not accessible to the subroutine

I attempted to write a Fortran program in which an internal subroutine is called inside a parallel do loop. Because the subroutine is not called anywhere except in this loop, and because the iteration variable i is global, I didn't see the need to pass it to the subroutine. Here's a simplified outline of the program which highlights the problem:
program test
   integer :: i
   i = 37
   !$omp parallel do private(i)
   do i = 1, 5
      call do_work
   enddo
   !$omp end parallel do
contains
   subroutine do_work
      print *, i
   end subroutine do_work
end program test
I'm compiling this program using:
gfortran -O0 -fopenmp -o test test.f90
I compiled it using gfortran 4.4.6 on a machine with 8 cores, and using gfortran 5.4.0 on another machine with 8 cores, and got:
37
37
37
37
37
Of course, when compiled without the -fopenmp flag, I get the expected output:
1
2
3
4
5
So it seems that the pre-loop value of i is what do_work is seeing in every thread. Why does the subroutine not see its thread's local value for i? And why does passing i as an argument to the subroutine resolve the problem? I'm very new to OpenMP, so I apologize if the answer is obvious.
The OpenMP standard does not specify the behaviour of your program.
If you don't pass i as an argument, and you want i to be private to each thread both within the construct (the source that physically appears between the parallel and end parallel directives) and within the region (the source that is executed in between those directives), then you need to give i the OpenMP threadprivate attribute.
Inside the procedure do_work, the variable i is referenced by host association, and, inside the procedure, it does not appear lexically within the OpenMP construct - hence inside the procedure it is a variable that is referenced in a region but not in a construct.
Ordinarily, section 2.15.1.2 of OpenMP 4.5 specifies that the reference to i in the procedure would be to the shared variable.
But because i is implicitly (because it is a do loop index) and explicitly private within the construct, 2.15.3.3 states that it is unspecified whether references to i in the region but not in the construct are to the original (shared) item or the private copy.
When you pass i as an argument "by reference", the dummy argument has the same data sharing attribute as the actual argument - i.e. if you pass i to the procedure it becomes private.
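The same trap can be reproduced in C with a file-scope variable playing the role of the host-associated i; a hedged sketch (do_work_fixed is a made-up name for the argument-passing variant):

#include <stdio.h>
#include <omp.h>

int i = 37;             /* file-scope, like the host-associated variable */

void do_work(void)      /* references i in the region, not in the construct */
{
    printf("%d\n", i);  /* unspecified whether this sees the original i
                           or the thread's private copy */
}

void do_work_fixed(int iter)  /* passing the index gives defined behaviour */
{
    printf("%d\n", iter);
}

int main(void)
{
    #pragma omp parallel for private(i)
    for (i = 1; i <= 5; i++) {
        do_work();          /* may well print 37 every time */
        do_work_fixed(i);   /* prints the thread's private i: 1..5 */
    }
    return 0;
}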
With OpenMP, when your program enters the do loop, a team of threads is created. This is similar to having a subprogram called by your main program, with the exception that the variables of the main program are available to the subprogram.
The parallel region delimited by the loop will, however, create copies of the private variables, so that every thread has its own version of i. Your subroutine only sees the i of the "supervisor" program, not the threads' local copies. When using an explicit argument, the subroutine is told explicitly to use the thread-local value of i.
In general (for OpenMP), it is important to consider carefully what variables are local to the parallel region and what variables can remain "global".

OpenMP Location of Private Variables?

Where do OpenMP private variables get allocated? On each thread's stack, dynamically, or through some shared array or something?
The OpenMP specification doesn't say whether those variables are to be allocated on the stack or on the heap (and, if on the heap, whether they live in a shared array or as one object allocated per thread). Generally I would assume that private variables are allocated on the stack (there is no reason not to, and it's generally more efficient). According to the manual, that is the behaviour used in libgomp (the implementation used by gcc) at least; I have no clue about other implementations (although I see little reason why they shouldn't do the same thing).
OpenMP does not specify anything about the allocation of private variables.
There are two options : heap and stack.
If we think about each thread executing fewer instructions, it makes sense for the master thread to allocate the private variables as shown below.
Code:
1: set_threads(n)
2: #pragma omp parallel private(var)
3: {
4:     var = ...
5: }
Machine code:
line 2: var_ptr = new variables[n]
line 4: var_ptr[get_thread_id()] = ...
But the above code would induce a lot of false sharing among the private variables of different threads. So I think it makes more sense for the compiler to allocate them on the stack of each thread.
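One way to check what a given implementation actually does is to compare the address of the private copy with the address of a variable that is certainly on the thread's stack; a minimal C sketch (output will vary by implementation):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int var = 0;

    #pragma omp parallel private(var) num_threads(4)
    {
        int stack_probe;              /* certainly on this thread's stack */
        printf("thread %d: &var = %p, &stack_probe = %p\n",
               omp_get_thread_num(), (void *)&var, (void *)&stack_probe);
        /* With libgomp the two addresses typically come out close
           together, suggesting the private copy lives on the stack. */
    }
    return 0;
}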

Is there a real performance gain when I turn {$IMPORTEDDATA} off?

The manual only says this: "The {$G-} directive disables creation of imported data references. Using {$G-} increases memory-access efficiency, but prevents a packaged unit where it occurs from referencing variables in other packages."
Update:
Here is more info I could find:
"The Debugging section has the new option Use imported data references (mapped to $G), which
controls the creation of imported data references (increasing memory efficiency but preventing the
access of global variables defined in other runtime packages)"
Almost never
This directive only refers to accessing global unit variables from another unit.
If you use {$G+}:
unit Unit1;

interface

var
  Global1: integer;  // <-- this is a global var in Unit1
  Form1: TForm1;     // <-- also a global var, but really a pointer
Global1 will be accessed indirectly via a pointer (if and when accessed from outside Unit1).
Form1 will also be accessed indirectly (i.e. it changes from a direct pointer to an indirect pointer).
If you use {$G-}, the access to the integer global will be direct and thus slightly faster.
This will only make a difference if you use global public unit variables in another unit and in time-critical code, i.e. almost never.
See this article: http://hallvards.blogspot.com/2006/09/hack13-access-globals-faster.html
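The cost difference is the classic direct-versus-indirect addressing one; a rough C analogy of the two access patterns (this is not Delphi semantics, just an illustration):

int global_direct = 42;                    /* {$G-} case: address fixed at
                                              link time                     */
int *global_indirect_ptr = &global_direct; /* {$G+} case: stand-in for the
                                              pointer the loader fills in   */

int read_direct(void)
{
    return global_direct;         /* one memory access                     */
}

int read_indirect(void)
{
    return *global_indirect_ptr;  /* two: load the pointer, then the data  */
}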
