How does random_number() work in parallel?

How does random_number() work in parallel with OpenMP?
If I run my program without parallelization I always get the same result, but with parallelization I get different (but similar) results every time.

There is no general guarantee about the thread safety or threading performance of random_number. The Fortran standard does not know about OpenMP at all.
Individual compilers may offer you some guarantees, but they will only be valid for the version present in that particular compiler. For example, the current gfortran version supplies a thread-safe random number generator and states: "Note that in a multi-threaded program (e.g. using OpenMP directives), each thread will have its own random number state." Other compilers may differ. Notably, the compiler your users end up using may differ, and you may not even know which one that is.
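As a quick illustration, here is a minimal sketch of what that gfortran behaviour means in practice. It relies on gfortran's documented per-thread generator state; a compiler with a single shared state makes no such promise and the same code could race or serialize there.

program omp_rng_demo
  use omp_lib
  implicit none
  real :: r
  !$omp parallel private(r)
  ! With gfortran, each thread gets its own generator state the first
  ! time it calls random_number, so these calls do not interfere.
  call random_number(r)
  print *, 'thread', omp_get_thread_num(), 'drew', r
  !$omp end parallel
end program omp_rng_demo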
There are dedicated parallel random number generators available. For example, I use a modified version of a library that provides the Ziggurat method for several random number distributions; it was parallelized by Gib Bogle, and I added an implementation of xoroshiro128+ as the underlying algorithm, similar to the one used by gfortran. There are other implementations of similar algorithms available, and the C++ standard library contains newer generators that are actually defined to use a specific algorithm, so you could call those.

If your goal is to have reproducible random numbers, take a look at this answer: https://stackoverflow.com/a/52884455/12845922
It's in C, but it gives you an effective way to get reproducible results for any number of threads, and it could easily be converted to Fortran.
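I have not reproduced that C answer here, but the usual idea is to tie the generator state to the work chunk rather than to the thread, so the output no longer depends on how many threads run or how the chunks are scheduled. A minimal Fortran sketch of that idea (assuming a compiler with per-thread random_number state, such as gfortran; the base seed 12345 is arbitrary):

program reproducible_chunks
  implicit none
  integer, parameter :: nchunks = 8, n = 1000
  integer :: i, nseed
  integer, allocatable :: seed(:)
  real :: x(n), partial(nchunks)

  call random_seed(size=nseed)
  allocate(seed(nseed))

  !$omp parallel do private(x) firstprivate(seed) schedule(static)
  do i = 1, nchunks
     seed = 12345 + i            ! seed depends only on the chunk, never on the thread
     call random_seed(put=seed)  ! re-seeds the calling thread's own state
     call random_number(x)
     partial(i) = sum(x) / n
  end do
  !$omp end parallel do

  print *, partial               ! identical for any thread count
end program reproducible_chunks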

Related

MPI and message passing in Julia

I have never used MPI before. Now, for my project in Julia, I need to learn how to write my code with MPI so that several runs with different parameters execute in parallel and, from time to time, send some data from each calculation to the others.
I am completely at a loss about how to do this in Julia, and I have never done it in any language before. I installed the MPI package, but I could not find a good tutorial, documentation, or available examples for it.
There are different ways to do parallel programming with Julia.
If your problem is very simple, then it might be sufficient to use parallel for loops and shared arrays:
https://docs.julialang.org/en/v1/manual/parallel-computing/
Note, however, that you cannot use multiple compute nodes (such as a cluster) in this case.
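For completeness, a minimal sketch of that shared-memory route (assuming Julia 1.x, where Distributed and SharedArrays ship with the standard library; the worker count 4 is arbitrary):

using Distributed
addprocs(4)                        # extra worker processes on the same machine
@everywhere using SharedArrays

a = SharedArray{Float64}(100)
@sync @distributed for i in 1:100
    a[i] = i^2                     # each worker fills its own part of the shared array
end
println(sum(a))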
To me, the other native constructs in Julia are difficult to work with for more complex programs; in my case, I would have needed to restructure my serial code significantly to use them.
The advantage of MPI is that you will find a lot of documentation on MPI-style (single-program, multiple-data) programming in general (though not necessarily documentation specific to Julia). You might also find the MPI style more obvious.
On a large cluster it is also possible that you will find optimized MPI libraries.
A good starting point is the set of examples distributed with MPI.jl:
https://github.com/JuliaParallel/MPI.jl/tree/master/examples
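To give a flavour, a minimal MPI.jl sketch (assuming MPI.jl is installed and an MPI launcher is available; run it with something like mpiexec -n 4 julia script.jl):

using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

# every rank runs the same program, here with its own parameter
local_result = sum(rand(1000)) * (rank + 1)

# from time to time the ranks combine/exchange their data
total = MPI.Allreduce(local_result, +, comm)

if rank == 0
    println("combined result over $nproc ranks: $total")
end

MPI.Finalize()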

Using fortran RANDOM_SEED in parallel MPI

I am trying to use the Fortran intrinsic PRNG in an MPI code.
I understand from this link that GFortran implements the PRNG using xorshift1024*, which has a period of 2^1024 - 1. It also says:
Note that in a multi-threaded program (e.g. using OpenMP directives),
each thread will have its own random number state.
Then reading this I found:
When a new thread uses
RANDOM_NUMBER for the first time, the seed is copied from the master
seed, and forwarded N * 2^512 steps to guarantee that the random
stream does not alias any other stream in the system, where N is the
number of threads that have used RANDOM_NUMBER so far during the
program execution
If this is an automatic feature of GFortran, does it work only with OpenMP? What if I want a parallel PRNG using MPI? How can I ensure the portability of the code to other compilers?
In other words: is there any way to do what GFortran says it does (i.e. guarantee truly parallel PRNG streams) in a portable way, using only the Fortran intrinsic procedures?
NOTE: I was using the PRNG from Numerical Recipes in MPI. That worked well for some years, but now I am getting errors related to assumptions about the integer model that, as Numerical Recipes itself notes, go beyond what Fortran guarantees... I don't see how to solve that, which is why I want to use the intrinsic PRNG if possible.
Note that the use of xorshift1024* is a very new feature in GFortran; it's only available in the development trunk version, and no released version has it yet at the time of writing. It will be released as part of GCC 7, probably in spring 2017.
So when you're using MPI, each MPI rank is a separate process, and the random number generators in the different processes are completely separate, with no communication between the PRNGs in different processes (unless you handle it yourself with MPI, of course). The forwarding of the PRNG stream by 2^512 steps happens only when the PRNG is used from multiple threads within the same process.
That being said, xorshift1024* has a fairly long period (2^1024 - 1), and the first time the PRNG is used in a process (again, think MPI rank) it is initialized with random data from the OS (/dev/urandom on POSIX systems), unless it has already been explicitly initialized with RANDOM_SEED. So in practice I think you'll be fine; it's exceedingly unlikely that the PRNG streams for different MPI ranks will alias.
And no, the above describes the PRNG in GFortran version 7. If you want something portable, you cannot rely on anything beyond what the standard guarantees. Beyond the parallel aspects, for portable high-quality random numbers you're probably better off using a known good PRNG rather than relying on the one provided by the compiler (I have personal experience of at least one compiler producing poor-quality random numbers from the RANDOM_NUMBER intrinsic, but I'll refrain from naming the vendor since it was many years ago and they might have fixed it since, if they are even still in business; I don't know).
(If you find the semantics of the new xorshift1024* implementation difficult, blame (a) me, since I devised and implemented it, or (b) the Fortran standard, which makes it impossible to have a parallel PRNG with simple semantics.)
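If you nevertheless want to stay with the intrinsic generator, one portable (standard-conforming) option is to seed each rank explicitly from its MPI rank. A minimal sketch follows; note that, unlike gfortran's 2^512 forwarding, a simple rank offset gives no hard guarantee that the streams never overlap, it just makes each rank deterministic and distinct in practice (the base seed 20180815 and the multiplier 37 are arbitrary choices):

program mpi_rank_seed
  use mpi
  implicit none
  integer :: ierr, rank, nseed, i
  integer, allocatable :: seed(:)
  real :: r

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  call random_seed(size=nseed)
  allocate(seed(nseed))
  seed = [ (20180815 + 37*rank + i, i = 1, nseed) ]   ! arbitrary, rank-dependent seed
  call random_seed(put=seed)

  call random_number(r)
  print *, 'rank', rank, 'drew', r

  call MPI_Finalize(ierr)
end program mpi_rank_seed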
If you want a portable multi-stream random number generator for a Fortran program, there is a multi-stream Fortran version of the Mersenne Twister. See http://theo.phys.sci.hiroshima-u.ac.jp/~ishikawa/PRNG/mt_stream_en.html . It uses the concept of advancing the PRNG by a very large number of steps for the different threads. It is set up and configured through subroutine calls, so you should be able to use it from various multi-threading environments.

Swap two variables: which way is faster?

Let's say we have two integers a and b. Which way is faster for swapping their values?
c=a;
a=b;
b=c;//(edited typo)
or
a=a+b;
b=a-b;
a=a-b;
or bitwise xor
a=a^b;
b=a^b;
a=a^b;
I'll test the performance differences when I'm able to, but I'd like to know now. Is it the bitwise one?
Firstly, you cannot quantify the speed of an algorithm independent of the programming language, the compiler and the platform on which it is run. An algorithm is a mathematical abstraction.
Having said that:
for a typical programming language,
and a typical compiler, and
a typical execution platform,
the first version will typically be faster because it will typically compile to fewer native instructions that take fewer clock cycles to execute. The first version only requires load and store operations. The other two versions have (at least) the same number of loads and stores, plus additional arithmetic or bit-manipulation instructions.
However, even that is not cut-and-dried.
The 2nd and 3rd examples perform the swap without using a temporary variable. This is something you might do if using an extra temporary variable were expensive, for instance on a machine that didn't provide enough general-purpose registers and where the relative cost of loading from / storing to memory was large. In some circumstances, the native-code equivalents could be optimal.
However ... and this is the real point ... the best strategy is to leave this kind of decision to the compiler. Unless you are prepared to put a huge amount of effort into micro-optimizing, the compiler is likely to be able to do a better job than you can. Indeed, writing code in "cunning ways" is liable to make it harder for the compiler to optimize. (In the 3rd case, for example, the compiler would need to figure out that that sequence is actually swapping 2 variables before it can substitute the optimal instruction sequence. Chances are that the optimizer won't be able to do that.)
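To make that concrete, a small C++ sketch (illustrative only): the plain temporary, which is what std::swap does, is both the safest and the likeliest to compile well, while the XOR trick has a correctness trap on top of its usual performance disadvantage.

#include <utility>

void plain_swap(int& a, int& b) {
    std::swap(a, b);   // a temporary under the hood; the optimizer keeps it in registers
}

void xor_swap(int& a, int& b) {
    a ^= b;
    b ^= a;
    a ^= b;            // if a and b refer to the same object, this zeroes it instead of swapping
}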

What do we need to define when using the parallel optimization flag?

I have a program with more than 100 subroutines, and I am trying to make this code run faster by compiling it with the parallel flag. I was wondering what variables or parameters I need to define in the program if I want to use the parallel flag. Just using the parallel optimization flag actually increased the run time of my program compared to the build without it.
Any suggestions are highly appreciated. Thanks a lot.
Best Regards,
Jdbaba
I can give you some general guidelines, but without knowing your specific compiler and platform/OS I won't be able to help you specifically. As far as I know, all of the autoparallelization schemes used in Fortran compilers end up using either OpenMP or MPI under the hood to split loops out into threads or processes. The issue is that there is a certain amount of overhead associated with those schemes. For instance, in one case I had a program that used an optimization library which was provided by a vendor as a compiled library, with no optimization inside it. Since all of my subroutines and functions were either outside or inside the optimizer's large loop, and since only object code was available, the autoparallelizer wasn't able to perform interprocedural optimization, and as such it failed to use more than one core. In that case, due to the DLL that was loaded for OpenMP, /qparallel actually added ~10% to the run time.
As a note, autoparallelizers aren't magic. Essentially they do the same kind of thing as autovectorization: they look for loops whose iterations have no data dependencies on previous iterations. If the compiler detects that variables are changed between iterations, or if it can't tell, it will not attempt to parallelize the loop.
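As an illustration (a sketch only, not tied to any particular compiler), the first loop below is the kind an autoparallelizer can split across threads, while the second carries a dependence on the previous iteration and will be left serial:

subroutine autopar_example(a, b, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  real, intent(in)    :: b(n)
  integer :: i

  ! Iterations are independent: a candidate for automatic parallelization
  do i = 1, n
     a(i) = 2.0*b(i) + 1.0
  end do

  ! Each iteration reads the result of the previous one: not parallelizable as written
  do i = 2, n
     a(i) = a(i) + a(i-1)
  end do
end subroutine autopar_example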
If you are using the Intel Fortran compiler, you can turn on a diagnostic switch "/qpar-report3" or "-par-report3" to give you information as to the dependency tree of loops to see why they failed to optimize. If you don't have access to large sections of the code you are using, in particular parts with major loops, there is a good chance that there won't be much opportunity in your code to use the auto-parallelizer.
In any case, you can always attempt to reduce dependencies and reformulate your code such that it is more friendly to autoparallelization.

Parallel STL algorithms in OS X

I am working on converting an existing program to take advantage of some of the parallel functionality of the STL.
Specifically, I've re-written a big loop to work with std::accumulate. It runs nicely.
Now, I want to have that accumulate operation run in parallel.
The documentation I've seen for GCC outlines two specific steps.
Include the compiler flag -D_GLIBCXX_PARALLEL
Possibly add the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything. The execution time is the same, and I don't see any indication of multiple core usage when monitoring the system.
I get an error when adding the parallel/algorithm header. I thought it would be included with the latest version of gcc (4.7).
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!
See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are going to be disappointed in the results in general. Those algorithms generally only begin to show a scalability advantage when working over very large datasets (e.g. N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).
A second alternative is to use OpenMP, which is supported by both GCC and Clang; it is not STL by any means, but it is cross-platform.
A third alternative is to use Grand Central Dispatch, the official multicore API in OS X; again, hardly STL.
A fourth alternative is to wait for C++17, which will have a parallelism module.
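For reference, the C++17 version of the parallel accumulate will look roughly like this (a sketch; it needs a standard library that actually ships the parallel algorithms, which on OS X may mean a recent GCC or a TBB-backed implementation):

#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(10'000'000, 1.0);
    // std::reduce with std::execution::par is the parallel counterpart of std::accumulate
    double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
    return sum == 10'000'000.0 ? 0 : 1;
}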
