OpenMP ensemble execution - openmp

I am new to the OpenMP and at the moment with no access to my workstation where I can check the details. Had a quick question to set the basics right before moving on to the hands on part.
Suppose I have a serial program written in FORTRAN90 which evolves a map with iterations and gives the final value of the variable after the evolution, the code looks like:
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
do i=1,50000 !! ITERATION OF THE SYSTEM
xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
xi=xf
enddo !! END OF SYSTEM ITERATION
print*, xf
I want to run the same code as independent processes on a cluster for 100 different random initial conditions and see how the output changes with the initial conditions. A serial program for this purpose would look like:
do iter=1,100 !! THE INITIAL CONDITION LOOP
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
do i=1,50000 !! ITERATION OF THE SYSTEM
xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
xi=xf
enddo !! END OF SYSTEM ITERATION
print*, xf
Will the OpenMP implementation that I could think of work? The code I could come up with is as follows:
!$ OMP PARALLEL PRIVATE(xi,xf,i)
!$ OMP DO
do iter=1,100 !! THE INITIAL CONDITION LOOP
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
do i=1,50000 !! ITERATION OF THE SYSTEM
xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
xi=xf
enddo !! END OF SYSTEM ITERATION
print*, xf
!$ OMP ENDDO
!$ OMP END PARALLEL
Thank you in advance for any suggestions or help.

I think that this line
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
might cause some problems. Is the implementation of random_number on your system thread-safe ? I haven't a clue, I know nothing about your compiler or operating system. If it isn't thread-safe then your program might do a number of things when the OpenMP threads all start using the random number generator; those things include crashing or deadlocking.
If the implementation is thread-safe you will want to figure out how to ensure that the threads either do or don't all generate the same sequence of random numbers. It's entirely sensible to write programs which use the same random numbers in each thread, or that use different sequences in different threads, but you ought to figure out that what you get is what you want.
And if the random number generator is thread safe and generates different sequences for each thread, do those sequences pass the sort of tests for randomness that a single-threaded random number generator might pass ?
It's quite tricky to generate properly independent sequences of pseudo-random numbers in parallel programs; certainly not something I can cover in the space of an SO answer.
While you figure all that out one workaround which might help would be to generate, in a sequential part of your code, all the random numbers you need (into an array perhaps) and let the different threads read different elements out of the array.

I want to run the same code as independent processes on a cluster
Then you do not want OpenMP. OpenMP is about exploiting parallelism inside a single address space.
I suggest you look at MPI, if you want to operate on a cluster

Related

Can reading a variable be a data race in OpenMP?

Why does this OpenMP fortran program work (every element of out is equal to num)? Each thread in the parallel loop might read the variable num simultaneously. I thought this was not acceptable?
program example
implicit none
integer i
integer, parameter :: n = 100000
double precision :: num
double precision, dimension(n) :: out
num = 1.123456789123456789123456d-5
out = 0.d0
!$OMP PARALLEL
!$OMP DO
do i=1,n
out(i) = num
enddo
!$OMP END DO
!$OMP END PARALLEL
do i=1,n
if (out(i).ne.num) print*,'Problem with ',i
enddo
end program
Thanks so much for any insights.
Can reading a variable be a data race in OpenMP?
Any race is between two things happening, so a read can be part of a race. However for the competition between two actions to be a race, there has to be a different outcome depending on the order in which the two actions occur.
Given that the possible actions in a parallel program which we are considering are read and write occurring in different threads, we have four possible cases:
Read, Read: no values are changed, and no code can detect which order the two reads occurred in (at least, not without looking at meta-data such as code performance in a system with caches :-)).
Read, Write: this clearly can be a race; whether the write wins the race or not affects the value which will be read.
Write, Read: as with case 2 (Read,Write), the result seen by the read is affected by the order.
Write, Write: here we have a race too, since we asssume that someone will ultimately read the value, and which value they see will depend on the order of the writes.
So, reading a variable can be part of a race.
However, if your question is really "Is there a race if a variable is only read?", then the answer is "No".
Variables are shared by default in openMP so they are accessible from all the threads. Furthermore, you're not writing to num so even if all the threads were accessing the same memory (which here they probably aren't) there would be no issue.

Can Ruby threads not collide on a write?

From past work in C# and Java, I am accustomed to a statement such as this not being thread-safe:
x += y;
However, I have not been able to observe any collision among threads when running the above code in parallel with Ruby.
I have read that Ruby automatically prevents multiple threads from writing to the same data concurrently. Is this true? Is the += operator therefore thread-safe in Ruby?
Well, it depends on your implementation and a lot of things. In MRI, there is a such thing as the GVL (Giant VM Lock) which controls which thread is actually executing code at a time. You see, in MRI, only one thread can execute Ruby code at a time. So while the C librarys underneath can let another thread run while they use CPU in C code to multiply giant numbers, the code itself can't execute at the same time. That means, a statement such as the assignment might not run at the same time as another one of the assignments (though the additions may run in parallel). The other thing that could be happening is this: I think I heard that assignments to ints are atomic on Linux, so if you're on Linux, that might be something too.
x += 1
is equivalent in every way to
x = x + 1
(if you re-define +, you also automatically redefine the result of +=)
In this notation, it's clearer this is not an atomic operation, and is therefore not guaranteed thread-safe.

Parallelizing an algorithm with many exit points?

I'm faced with parallelizing an algorithm which in its serial implementation examines the six faces of a cube of array locations within a much larger three dimensional array. (That is, select an array element, and then define a cube or cuboid around that element 'n' elements distant in x, y, and z, bounded by the bounds of the array.
Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):
do n1=nlo,nhi
do o1=olo,ohi
if (somecondition(n1,o1) .eq. .TRUE.) then
retval =.TRUE.
RETURN
endif
end do
end do
Or C pseudocode:
for (n1=nlo,n1<=nhi,n++) {
for (o1=olo,o1<=ohi,o++) {
if(somecondition(n1,o1)!=0) {
return (bool)true;
}
}
}
There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.
What I think would be best would be to schedule six or more threads of execution, round-robin if there aren't that many CPU cores, ideally with the loops executing in parallel, with the goal the same as the serial algorithm: somecondition() becomes True, execution among all the threads must immediately stop and a value of True set in a shared location.
What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.
Are there message passing mechanisms in OpenMP?
EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?
Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I'm thinking this from your comment "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." The design of implementing the parallelelization of the array elements is also flexible in terms of the number threads. Unless there is a reason that the array elements have to be checked in some order?
It seems that the portion that you are showing us doesn't take that long to execute so making it take less clock time by making it parallel might not be easy ... there is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.
One possibility is to use OpenMP to parallelize over the 6 loops -- declare logical :: array(6), allow each loop to run to completion, and then retval = any(array). Then you can check this value and return outside the parallelized loop. Add a schedule(dynamic) to the parallel do statement if you do this. Or, have a separate !$omp parallel and then put !$omp do schedule(dynamic) ... !$omp end do nowait around each of the 6 loops.
Or, you can follow the good advice by #M.S.B. and parallelize the outermost loop over the whole array. The problem here is that you cannot have a RETURN inside a parallel loop -- so label the second outermost loop (the largest one within the parallel part), and EXIT that loop -- smth like
retval = .FALSE.
!$omp parallel do default(private) shared(BIGARRAY,retval) schedule(dynamic,1)
do k=1,NN
if(.not. retval) then
outer2: do j=1,NN
do i=1,NN
! --- your loop #1
do n1=nlo,nhi
do o1=olo,ohi
if (somecondition(BIGARRAY(i,j,k),n1,o1)) then
retval =.TRUE.
exit outer2
endif
end do
end do
! --- your loops #2 ... #6 go here
end do
end do outer2
end if
end do
!$omp end parallel do
[edit: the if statement is there presuming that you need to find out if there is at least one element like that in the big array. If you need to figure the condition for every element, you can similarly either add a dummy loop exit or goto, skipping the rest of the processing for that element. Again, use schedule(dynamic) or schedule(guided).]
As a separate point, you might also want to check if it may be a good idea to go through the innermost loop by some larger step (depending on float size), compute a vector of logicals on each iteration and then aggregate the results, eg. smth like if(count(somecondition(x(o1:o1+step,n1,k)))>0); in this case the compiler may be able to vectorize somecondition.
I believe you can do what you want with the task construct introduced in OpenMP 3; Intel Fortran supports tasking in OpenMP. I don't use tasks often so I won't offer you any wonky pseudocode.
You already mentioned the obvious way to stop all threads as soon as any thread finds the ending condition: have each check some shared variable which gives the status of the ending condition, thereby determining whether to break out of the loops. Obviously this is an overhead, so if you decide to take this approach I would suggest a few things:
Use atomics to check the ending condition, this avoids expensive memory flushing as just the variable in question is flushed. Move to OpenMP 3.1, there are some new atomic operations supported.
Check infrequently, maybe like once per outer iteration. You should only be parallelizing large cases to overcome the overhead of multithreading.
This one is optional, but you can try adding compiler hints, e.g. if you expect a certain condition to be false most of the time, the compiler will optimize the code accordingly.
Another (somewhat dirty) approach is to use shared variables for the loop ranges for each thread, maybe use a shared array where index n is for thread n. When one thread finds the ending condition, it changes the loop ranges of all the other threads so that they stop. You'll need the appropriate memory synchronization. Basically the overhead has now moved from checking a dummy variable to synchronizing/checking loop conditions. Again probably not so good to do this frequently, so maybe use shared outer loop variables and private inner loop variables.
On another note, this reminds me of the classic polling versus interrupt problem. Unfortunately I don't think OpenMP supports interrupts where you can send some kind of kill signal to each thread.
There are hacking work-arounds like using a child process for just this parallel work and invoking the operating system scheduler to emulate interrupts, however this is rather tricky to get correct and would make your code extremely unportable.
Update in response to comment:
Try something like this:
char shared_var = 0;
#pragma omp parallel
{
//you should have some method for setting loop ranges for each thread
for (n1=nlo; n1<=nhi; n1++) {
for (o1=olo; o1<=ohi; o1++) {
if (somecondition(n1,o1)!=0) {
#pragma omp atomic write
shared_var = 1; //done marker, this will also trigger the other break below
break; //could instead use goto to break out of both loops in 1 go
}
}
#pragma omp atomic read
private_var = shared_var;
if (private_var!=0) break;
}
}
A suitable parallel approach might be, to let each worker examine a part of the overall problem, exactly as in the serial case and use a local (non-shared) variable for the result (retval). Finally do a reduction over all workers on these local variables into a shared overall result.

Allocatable arrays performance

There is an mpi-version of a program which uses COMMON blocks to store arrays that are used everywhere through the code. Unfortunately, there is no way to declare arrays in COMMON block size of which would be known only run-time. So, as a workaround I decided to move that arrays in modules which accept ALLOCATABLE arrays inside. That is, all arrays in COMMON blocks were vanished, instead ALLOCATE was used. So, this was the only thing I changed in my program. Unfortunately, performance of the program was awful (when compared to COMMON blocks realization). As to mpi-settings, there is a single mpi-process on each computational node and each mpi-process has a single thread.
I found similar question asked here but don't think (don't understand :) ) how it could be applied to my case (where each process has a single thread). I appreciate any help.
Here is a simple example which illustrates what I was talking about (below is a pseudocode):
"SOURCE FILE":
SUBROUTINE ZEROSET()
INCLUDE 'FILE_1.INC'
INCLUDE 'FILE_2.INC'
INCLUDE 'FILE_3.INC'
....
INCLUDE 'FILE_N.INC'
ARRAY_1 = 0.0
ARRAY_2 = 0.0
ARRAY_3 = 0.0
ARRAY_4 = 0.0
...
ARRAY_N = 0.0
END SUBROUTINE
As you may see, ZEROSET() has no parallel or MPI stuff. FILE_1.INC, FILE_2, ... , FILE_N.INC are files where ARRAY_1, ARRAY_2 ... ARRAY_N are defined in COMMON blocks. Something like that
REAL ARRAY_1
COMMON /ARRAY_1/ ARRAY_1(NX, NY, NZ)
Where NX, NY, NZ are well defined parameters described with help of PARAMETER directive.
When I use modules, I just destroyed all COMMON blocks, so FILE_I.INC looks like
REAL, ALLOCATABLE:: ARRAY_I(:,:,:)
And then just changed "INCLUDE 'FILE_I.INC'" statement above to "USE FILE_I". Actually, when parallel program is executed, one particular process does not need a whole (NX, NY, NZ) domain, so I calculate parameters and then allocate ARRAY_I (only ONCE!).
Subroutine ZEROSET() is executed 0.18 seconds with COMMON blocks and 0.36 with modules (when array's dimensions are calculated runtime). So, the performance worsened by two times.
I hope that everything is clear now. I appreciate you help very much.
Using allocatable arrays in modules can often hurt performance because the compiler has no idea about sizes at compile time. You will get much better performance with many compilers with this code:
subroutine X
use Y ! Has allocatable array A(N,N) in it
call Z(A,N)
end subroutine
subroutine Z(A,N)
Integer N
real A(N,N)
do stuff here
end
Then this code:
subroutine X
use Y ! Has allocatable array A(N,N) in it
do stuff here
end subroutine
The compiler will know that the array is NxN and the do loops are over N and be able to take advantage of that fact (most codes work that way on arrays). Also, after any subroutine calls in "do stuff here", the compiler will have to assume that array "A" might have changed sizes or moved locations in memory and recheck. That kills optimization.
This should get you most of your performance back.
Common blocks are located in a specific place in memory also, and that allows optimizations also.
Actually I guess, your problem here is, in combination with stack vs. heap memory, indeed compiler optimization based. Depending on the compiler you're using, it might do some more efficient memory blanking, and for a fixed chunk of memory it does not even need to check the extent and location of it within the subroutine. Thus, in the fixed sized arrays there won't be nearly no overhead involved.
Is this routine called very often, or why do you care about these 0.18 s?
If it is indeed relevant, the best option would be to get rid of the 0 setting at all, and instead for example separate the first iteration loop and use it for the initialization, this way you do not have to introduce additional memory accesses, just for initialization with 0. However it would duplicate some code...
I could think of just these reasons when it comes to fortran performance using arrays:
arrays on the stack VS heap, but I doubt this could have a huge performance impact.
passing arrays to a subroutine, because the best way to do that depends on the array, see this page on using arrays efficiently

Thread-safe uniform random number generator

I have some parallel Fortran90 code in which each thread needs to generate the same sequence of random numbers.
I have a random number generator that seems to be thread-unsafe, since, for a given seed, I'm completely unable to repeat the same results each time I run the program.
I surfed unsuccessfully (almost) the entire web looking for some code of a thread-safe RNG. Could anyone provide me with (the link to) the code of one?
Thanks in advance!
A good Pseudorandom number generator for Fortran90 can be found in the Intel Math Kernel Vector Statistical Library. They are thread safe. Also, why does it need to be threadsafe? If you want each thread to get the same list, instantiate a new PRNG for each thread with the same seed.
Most repeatable random number generators need state in some form. Without state, they can't do what comes next. In order to be thread safe, you need a way to hold onto the state yourself (ie, it can't be global).
When you say "needs to generate the same sequence of random numbers" do you mean that
Each thread needs to generate a stream of numbers identical to the other thread? This implies choosing the seed before peeling off threads, then instantiating the a thread-local PRNG in each thread with the same seed.
or
You want to be able to repeat the same sequence of numbers between different runs of the programs, but each thread generates it's own independent sequence? In this case, you still can't share a single PRNG because the thread operation sequence is non-deterministic. So seed a single PRNG with a known seed before launching threads, and use it to generate the initial seeds for the threads. Then you instantiate thread-local generators in each thread...
In each of these cases you should note what Neil Butterworth say about the statistics: most of the usual guarantees that the PRNG like to claim are not reliable when mix streams generated in this way.
In both cases you need a thread-local PRNG. I don't know what is available in f90...but you can also write you own (lookup Mersenne Twister, and write a routne that takes the saved state as a parameter...).
In fortran 77, this would look something like
function PRNGthread (state)
double state(statesize)
c stuff happens here which uses and manipulates the state vector...
PRNGthread = result
return
and each of your threads should maintain a separate state vector, though all will use the same initial value.
I understand you need every thread to produce the same stream of random numbers.
A very good Pseudo Random Generator that will generate a reproducable stream of numbers and is quite fast is the MT19937. Just make sure that you generate the seed before spawning off the threads, but generate a separate instance of the MT in every thread (make the instance of the MT thread local). That way it will be guaranteed that every MT will produce the same stream of numbers.
How about SPRNG? I have not tried it myself though.
I coded a thread-safe Fortran 90 version of the Mersenne Twister/MT19973. The state of the PRNG is saved in a derived type (randomNumberSequence), and you use procedures to seed the generator or get the next element in the sequence.
See http://code.google.com/p/i3rc-monte-carlo-model/source/browse/trunk/Code/RandomNumbersForMC.f95
The alternatives seem to be:
Use a synchronisation object (such as
a mutex) on the generator's seed
value. This will unfortunately
serialise your code on accesses to
generator
Use thread-local storage in the
generator so each thread gets its own
seed - this may cause statstical
problems for your app
If your platform supports a suitable
atomic operation, use that on the
seed (it probably won't, however)
Not a very encouraging list, I know. And to add to it, I have no idea how to implement any of them in FORTRAN!
This article https://www.cmiss.org/openCMISS/wiki/RandomNumberGenerationWithOpenMP does not only link to a Fortran implementation, but mentions key points needed to make a PRNG usable with threads. The most important point is:
The Fortran90 version of Ziggurat has several variables and arrays with the 'SAVE' attribute. In order to parallelize the uniform RNG, then, it appears that the required changes are to make these variables arrays with a separate value for each thread (beware of false sharing). Then when the PRNG function is called, we must pass the thread number, and use the corresponding state value.

Resources