Using Fortran RANDOM_SEED in parallel MPI

I am trying to use the Fortran intrinsic PRNG in an MPI code.
I understand from this link that GFortran implements the PRNG using xorshift1024*, which has a period of 2^1024 - 1. It also says:
Note that in a multi-threaded program (e.g. using OpenMP directives), each thread will have its own random number state.
Then reading this I found:
When a new thread uses RANDOM_NUMBER for the first time, the seed is copied from the master seed, and forwarded N * 2^512 steps to guarantee that the random stream does not alias any other stream in the system, where N is the number of threads that have used RANDOM_NUMBER so far during the program execution.
If this is an automatic feature of GFortran, does it work only with OpenMP? What if I want parallel PRNG streams using MPI? How can I ensure the portability of the code to other compilers?
In other words: is there any way to do what GFortran says it does (i.e. guarantee truly independent parallel PRNG streams) in a portable way, using only the Fortran intrinsic procedures?
NOTE: I was using the PRNG from Numerical Recipes with MPI. That worked well for some years, but now I am getting errors from assumptions about the integer model that Numerical Recipes itself says go beyond Fortran... I don't see how to solve that, which is why I want to use the intrinsic PRNG if possible.

Note that the use of xorshift1024* is a very new feature in GFortran; it's only available in the development trunk version, and no released version has it yet at the time of writing. It will be released as part of GCC 7, probably in spring 2017.
So when you're using MPI, each MPI rank is a separate process, and the random number generators in the processes are completely separate, with no communication between the PRNGs of different processes (unless you handle it yourself with MPI, of course). The forwarding of the PRNG stream by 2^512 steps happens only when the PRNG is used from multiple threads within the same process.
That being said, xorshift1024* has a fairly long period (2^1024 - 1), and the first time the PRNG is used in a process (again, think MPI rank) it is initialized with random data from the OS (/dev/urandom on POSIX systems), unless it has already been explicitly initialized with RANDOM_SEED. So in practice I think you'll be fine; it's exceedingly unlikely that the PRNG streams for different MPI ranks will alias.
And no, the above describes the PRNG in GFortran version 7. If you want something portable, you cannot rely on anything beyond what the standard guarantees. Beyond the parallel aspects, for portable high-quality random numbers you're probably better off using a known-good PRNG rather than relying on the one provided by the compiler. (I have personal experience of at least one compiler producing poor-quality random numbers with the RANDOM_NUMBER intrinsic, but I'll refrain from naming the vendor since it was many years ago and they might have fixed it since, if they are even still in business; I don't know.)
(If you find the semantics of the new xorshift1024* implementation difficult, blame a) me, since I devised and implemented it, and b) the Fortran standard, which makes it impossible to have a parallel PRNG with simple semantics.)
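If you do want behaviour that does not depend on any one compiler, the portable fallback is to seed each rank explicitly yourself. Here is a minimal sketch of that idea (my own illustration, not anything the standard or GFortran prescribes; the mixing constants are arbitrary, and hand-mixed seeds only decorrelate the streams statistically, they cannot guarantee non-overlap the way stream forwarding does):

    program mpi_rank_seed
       use mpi
       implicit none
       integer :: ierr, rank, n, i
       integer, allocatable :: seed(:)
       real :: r

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

       call random_seed(size=n)            ! seed length is compiler-dependent
       allocate(seed(n))
       seed = [(104729*(rank + 1) + 37*i, i = 1, n)]  ! distinct seed per rank
       call random_seed(put=seed)

       call random_number(r)
       print *, 'rank', rank, 'drew', r

       call MPI_Finalize(ierr)
    end program mpi_rank_seed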

If you want a portable multi-stream random number generator for a Fortran program, there is a multi-stream Fortran version of the Mersenne Twister. See http://theo.phys.sci.hiroshima-u.ac.jp/~ishikawa/PRNG/mt_stream_en.html . It uses the concept of advancing the PRNG by a very large number of steps for the different threads. It is set up and configured by subroutine calls, so you should be able to use it from various multi-threading environments.
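Usage is along these lines (a sketch from memory of the library's interface; check the module, type, and procedure names below against the documentation on the page above before relying on them). Each derived stream is jumped far enough ahead of the master that the streams cannot overlap:

    program mt_stream_demo
       use mt_stream                  ! module name per the library docs
       implicit none
       integer, parameter :: nstream = 4
       type(mt_state) :: mts(0:nstream-1)
       integer :: i
       real(8) :: r

       call set_mt19937               ! select the MT19937 parameter set
       call new(mts(0))
       call init(mts(0), 12345)       ! seed the master stream once
       do i = 1, nstream - 1
          call create_stream(mts(0), mts(i), i)  ! jump stream i far ahead
       end do

       r = genrand_double1(mts(2))    ! each thread/rank draws from its own state
       print *, r
    end program mt_stream_demo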

Related

Java vs. Julia: Differences in JIT Compiling and Resulting Performance

I've recently started reading about JIT compilation. On another note, I've read that well-written Julia code often performs on-par with statically compiled languages (see, e.g., paragraph 2 of the introduction section of the Julia docs) while I've recurrently heard Java often does not. Why is that?
On the surface, they seem to have in common that they both run JIT-compiled bytecode in a VM (although I am aware that Java dynamically infers which code to JIT). While I can rationalize the performance difference in Julia vs. (purely) interpreted languages like (vanilla) Python, how come two JIT-compiled languages have such different reputations for performance? Speaking of performance, I am particularly referring to scientific computing applications.
Please note that this question is intentionally phrased broadly. I feel like its possible answers could give me insights into what defines fast Julia code, given how Julia's compiler works in comparison to other JIT-compiled languages.
While AFAIK there is currently one implementation of Julia, there are several implementations of Java, and not all of them behave the same or use the same techniques internally, so it does not mean much to compare the languages themselves. For example, GCJ was a GNU compiler meant to compile Java code to native code (i.e. no JIT or bytecode). It is now a defunct project, since the open-source JIT-based implementations superseded it (AFAIK even performance-wise).
The primary reference Java VM implementation is HotSpot (made by Oracle). The HotSpot JIT uses an adaptive strategy for compiling functions so as to reduce compilation latency. Code can be interpreted for a short period of time, and if it is executed many times, the JIT applies more aggressive optimizations at multiple levels. As a result, hot loops are very aggressively optimized, while glue code executed once is mostly interpreted. Meanwhile, Julia is based on the LLVM compiler stack, which is capable of producing very efficient code (it is used by Clang, a compiler that compiles C/C++ to native code), but LLVM is not yet very well suited for very dynamic code (it works, but the latency is quite large compared to other existing JIT implementations).
The thing is, Java and Julia target different domains. Java is used, for example, on embedded systems where latency matters a lot. It is also used for GUI applications and web servers, where introducing high latency during execution is not very reasonable. This is largely why Java implementations have spent a lot of effort optimizing the GC (garbage collector) in order to reduce the latency of collections. Julia mainly targets HPC/scientific applications that do not care much about latency. The main goal of Julia is to minimize the wall-clock time, not the responsiveness of the application.
I've read that well-written Julia code often performs on-par with statically compiled languages
Well, optimizing JITs like Julia's or HotSpot's are very good nowadays at compiling scalar code in hot loops. Their weakness lies in their ability to apply expensive high-level transformations. For example, optimizing compilers like ICC/PGI can use the polyhedral model to completely rewrite loops and vectorize them efficiently using SIMD instructions. This is frequent in HPC (numerically intensive) applications but very rare in embedded/web/GUI ones. The use of the best platform-specific instructions is not always great in most JIT implementations (e.g. bit operations), though the situation is rapidly improving. On the other hand, JITs can outperform static compilers by using runtime information. For example, they can assume a value is a constant and optimize expressions based on that (e.g. a runtime-dependent stride of 1 in a multi-dimensional array does not need additional multiplications). Still, static compilers can do similar optimizations with profile-guided optimization (unfortunately rarely used in practice).
However, there is a catch: languages like C/C++ compiled natively have access to lower-level features barely available in Java. This is a bit different in Julia, since linking with native code is easier and inline assembly is possible (as pointed out by @OscarSmith), enabling skilled developers to write efficient wrappers. Julia and Java use a GC that can speed up some unoptimized code a bit, but can also slow other code down a lot (typically code manipulating big data structures with many references, like trees and graphs, especially in parallel code). This is why a C/C++ code can significantly outperform a Julia/Java code. While JIT implementations can sometimes (but rarely) outperform static C/C++ compilers, no compiler is perfect and there are cases where nearly all implementations perform poorly. The use of (low-level) intrinsics enables developers to write possibly faster code at the expense of lower portability and higher complexity. This is not rare for SIMD code, since auto-vectorization is far from great so far. More generally, access to lower-level features (e.g. operating-system-specific functions, parallel tools) helps skilled programmers write faster code.
The chosen algorithms and methods often matter far more than the target language implementation. The best algorithm/method in one language implementation may not be the best in another, and the best algorithms/methods of two different implementations are generally hard to compare (is it fair to compare only the performance of two codes if one is nearly impossible to maintain and very hard/slow to write without bugs?). This is partially why comparing language implementations is so difficult, even on one specific problem.
(purely) interpreted languages like (vanilla) Python
While the standard implementation of Python is the CPython interpreter, there are fast JITs for Python, like PyPy or Pyston.
Speaking of performance, I am particularly referring to scientific computing applications
Note that "scientific computing" is still quite broad. While physicists tend to write heavily numerically intensive applications operating on large contiguous arrays, where the use of multiple threads and SIMD instructions is critical, biologists tend to write code requiring very different optimizations. For example, genomics code tends to do a lot of string-matching operations, and it often makes use of complex data structures/algorithms (e.g. phylogenetic trees, compression).
Some Java features like boxing are performance killers for such applications, though there are often complex ways to mitigate their cost.
You may be interested in this famous language benchmark:
Julia VS C-GCC (one can see that Julia and Java are slow for binary trees, as expected, certainly due to the GC, though Java's GC is more efficient at the expense of a bigger memory usage)
Julia VS Java-OpenJDK
C-GCC VS C-Clang
As you can see in the benchmark, the fastest implementations are generally the more complex and/or bigger ones, using the best algorithms and lower-level methods/tricks.

How does random_number() work in parallel?

How does random_number() work in parallel with OpenMP?
If I run my program without parallelization I always get the same result, but with parallelization I get different (but similar) results every time.
There is no general guarantee about the thread safety or threading performance of random_number. The Fortran standard does not know about OpenMP at all.
Individual compilers may offer you some guarantees, but they will only be valid for that particular compiler (and version). For example, the current gfortran supplies a thread-safe random number generator, and "Note that in a multi-threaded program (e.g. using OpenMP directives), each thread will have its own random number state." Other compilers may differ. Notably, the compiler your user wants to use may differ, and you may not know about that.
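A quick probe of this behaviour (relying on the GFortran-specific guarantee quoted above, not on anything portable; compile with gfortran -fopenmp): each thread draws from its own stream, so the printed values differ between threads.

    program omp_rng_probe
       use omp_lib
       implicit none
       real :: r
       !$omp parallel private(r)
       call random_number(r)              ! uses this thread's own state
       print *, 'thread', omp_get_thread_num(), 'drew', r
       !$omp end parallel
    end program omp_rng_probe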
There are dedicated parallel random number generators available. For example, I use a modified version of a library that uses the ziggurat method for several random number distributions; it was parallelized by Gib Bogle, and I added xoroshiro128+ as the underlying algorithm, similar to the one used by GFortran. There are other implementations of similar algorithms available, and standard C++ contains some newer generators which are actually defined to use a specific algorithm, so you could call those.
If your goal is to have reproducible random numbers, take a look at this answer: https://stackoverflow.com/a/52884455/12845922
It's in C, but it gives you an effective way to get reproducible results for any number of threads, and it could easily be converted to Fortran.
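The idea in that answer, transplanted to Fortran (a sketch assuming per-thread generator state as in current gfortran; the seed-mixing constants are arbitrary): tie the seed to the work item rather than to the thread, so the result does not depend on how iterations are scheduled.

    program reproducible_draws
       implicit none
       integer, parameter :: nwork = 8
       integer :: i, j, n
       integer, allocatable :: seed(:)
       real :: r(nwork)

       call random_seed(size=n)

       !$omp parallel private(seed, j)
       allocate(seed(n))
       !$omp do
       do i = 1, nwork
          seed = [(1000003*i + 17*j, j = 1, n)]  ! depends only on i, not the thread
          call random_seed(put=seed)
          call random_number(r(i))
       end do
       !$omp end do
       deallocate(seed)
       !$omp end parallel

       print *, r   ! identical for any number of threads
    end program reproducible_draws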

MPI and message passing in Julia

I have never used MPI before, and now for my project in Julia I need to learn how to write my code with MPI, so that several runs with different parameters execute in parallel and, from time to time, send some data from each calculation to the others.
I am absolutely blank on how to do this in Julia, and I have never done it in any other language before. I installed the MPI library but couldn't find a good tutorial, documentation, or available example for it.
There are different ways to do parallel programming with Julia.
If your problem is very simple, it might be sufficient to use parallel for loops and shared arrays:
https://docs.julialang.org/en/v1/manual/parallel-computing/
Note, however, that you cannot use multiple compute nodes (such as a cluster) in this case.
To me, the other native constructs in Julia are difficult to work with for more complex programs; in my case, I needed to restructure my serial code significantly to use them.
The advantage of MPI is that you will find a lot of documentation on MPI-style (single-program, multiple-data) programming in general (though not necessarily documentation specific to Julia). You might also find the MPI style more obvious.
On a large cluster it is also possible that you will find optimized MPI libraries.
A good starting point is the set of examples distributed with MPI.jl:
https://github.com/JuliaParallel/MPI.jl/tree/master/examples

Parallel STL algorithms in OS X

I'm working on converting an existing program to take advantage of some of the parallel functionality of the STL.
Specifically, I've re-written a big loop to work with std::accumulate. It runs nicely.
Now, I want to have that accumulate operation run in parallel.
The documentation I've seen for GCC outlines two specific steps:
Include the compiler flag -D_GLIBCXX_PARALLEL
Possibly add the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything. The execution time is the same, and I don't see any indication of multiple core usage when monitoring the system.
I get an error when adding the parallel/algorithm header. I thought it would be included with the latest version of gcc (4.7).
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!
See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are going to be disappointed with the results in general. Those algorithms generally only begin to show a scalability advantage when working over very large datasets (e.g. N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).
A second alternative is to use OpenMP, which is supported by both GCC and Clang; it is not STL by any means, but it is cross-platform.
A third alternative is to use Grand Central Dispatch, the official multicore API on OS X; again, hardly STL.
A fourth alternative is to wait for C++17, which will have a parallelism module (parallel versions of many standard algorithms).

Fastest math programming language?

I have an application that requires millions of subtractions and remainders. I originally programmed this algorithm in C#/.NET, but it takes five minutes to process the data and I need it faster than that.
I have considered Perl, and that seems to be the best alternative right now. VB.NET was slower in testing. C++ may be better as well. Any advice would be greatly appreciated.
You need a compiled language like Fortran, C, or C++. Other languages are designed to give you flexibility, object orientation, or other advantages, and assume that absolute maximum performance is not your highest priority.
Know how to get maximum performance out of a single thread, and only after you have done so, investigate sharing the work across multiple cores, for example with MPI. To get maximum performance in a single thread, one thing I do is single-step the code at the machine-instruction level to make sure it's not dawdling in work that could be removed.
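As a toy illustration of the kind of kernel the question describes (millions of subtractions and remainders; the constants and loop bound are made up), this is the sort of tight loop a compiled language handles well:

    program tight_kernel
       implicit none
       integer(8) :: i, acc
       integer(8), parameter :: n = 100000000_8   ! 1e8 iterations
       real :: t0, t1

       acc = 0
       call cpu_time(t0)
       do i = 1, n
          acc = acc + mod(i, 97_8) - mod(i, 89_8)  ! remainders and a subtraction
       end do
       call cpu_time(t1)
       print *, 'result', acc, 'seconds', t1 - t0
    end program tight_kernel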
Some calculations are regular enough to benefit from GPGPUs: recent graphics cards are essentially specialized, massively parallel numerical co-processors. For instance, you could code your numerical kernels in OpenCL. Otherwise, learn C++11 (not some earlier version of the C++ standard) or C. In many cases OCaml can be nearly as fast as C++ but much easier to code in.
Perhaps your problem could be handled by Scilab or R; I did not understand it well enough to help more.
And you might take advantage of your multi-core processor by using, e.g., Pthreads or MPI.
Finally, the Linux operating system is perhaps better suited to massive calculations; it is telling that most supercomputers run it today.
If execution speed is the highest priority, that usually means Fortran.
Try Julia: its killer feature is being easy to code in, in a high-level and concise way, while keeping performance on the same order of magnitude as Fortran/C.
PARI/GP is the best I have used so far. It's written in C.
Take a look at the DMelt mathematical program. The program calls Java libraries, and the Java virtual machine can optimize long mathematical calculations for you.
The standard tool for mathematical numerical operations in engineering is often MATLAB (or, as free alternatives, Octave or the already-mentioned Scilab).
