I am having trouble implementing this. I want to run some Monte Carlo simulations on a GPU cluster and need a random number generator for them. I want to print the random numbers generated. Could someone please explain with an example? I am working on Unix.
Thanks!
It has been available since CUDA 4.1 via the cuRAND library.
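For illustration, here is a minimal sketch using the cuRAND host API to generate uniform random numbers on the device and print them from the host (the buffer size and seed are arbitrary choices; link with -lcurand, e.g. nvcc gen.cu -lcurand):

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <curand.h>

    int main() {
        const size_t n = 10;
        float hostData[n], *devData;
        cudaMalloc(&devData, n * sizeof(float));

        // Create a pseudo-random generator and seed it
        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

        // Fill the device buffer with uniform floats in (0, 1]
        curandGenerateUniform(gen, devData, n);

        // Copy back to the host and print
        cudaMemcpy(hostData, devData, n * sizeof(float), cudaMemcpyDeviceToHost);
        for (size_t i = 0; i < n; ++i)
            printf("%f\n", hostData[i]);

        curandDestroyGenerator(gen);
        cudaFree(devData);
        return 0;
    }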
I am trying to use the Fortran intrinsic PRNG in an MPI code.
I understand from this link that GFortran implements the PRNG using xorshift1024*, which has a period of 2^1024 - 1. It also says:
Note that in a multi-threaded program (e.g. using OpenMP directives), each thread will have its own random number state.
Then reading this I found:
When a new thread uses RANDOM_NUMBER for the first time, the seed is copied from the master seed, and forwarded N * 2^512 steps to guarantee that the random stream does not alias any other stream in the system, where N is the number of threads that have used RANDOM_NUMBER so far during the program execution.
If this is an automatic feature of GFortran, does it work only with OpenMP? What if I want parallel PRNG streams using MPI? How can I ensure the portability of the code to other compilers?
In other words: is there any way to do what GFortran says it does (i.e. guarantee truly parallel PRNG streams) in a portable way using the Fortran intrinsic procedures?
NOTE: I was using the PRNG from Numerical Recipes in MPI. That worked well for some years, but now I am getting errors caused by assumptions about the integer model that Numerical Recipes makes beyond what Fortran guarantees. I don't see how to solve that, which is why I want to use the intrinsic PRNG if possible.
Note that the use of xorshift1024* is a very new feature in GFortran; it's only available in the development trunk version, and no released version has it yet at the time of writing this. It will be released as part of GCC 7, probably in spring 2017.
So when you're using MPI, each MPI rank is a separate process, and the random number generator in each process is completely separate, with no communication between the PRNGs in different processes (unless you handle it yourself with MPI, of course). The forwarding of the PRNG stream by 2^512 steps happens only when the PRNG is used from multiple threads within the same process.
That being said, xorshift1024* has a fairly long period (2^1024-1), and the first time the PRNG is used in a process (again, think MPI rank) it is initialized with random data from the OS (/dev/urandom on POSIX systems), unless it has already been explicitly initialized with RANDOM_SEED. So in practice I think you'll be fine, it's exceedingly unlikely that the PRNG streams for different MPI ranks will alias.
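The per-rank idea can be illustrated outside Fortran as well. Here is a minimal C++/MPI sketch of the same scheme, with each rank owning an independent generator seeded from a nondeterministic source (analogous to GFortran seeding from /dev/urandom); the choice of generator here is arbitrary:

    #include <mpi.h>
    #include <random>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Each process has its own PRNG state; seeding from the OS
        // entropy source makes overlap between the ranks' streams
        // astronomically unlikely, as described above.
        std::random_device rd;
        std::mt19937_64 gen(rd());
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        std::printf("rank %d drew %f\n", rank, dist(gen));
        MPI_Finalize();
        return 0;
    }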
And no, the above describes the PRNG in GFortran version 7. If you want something portable you cannot rely on anything beyond what the standard guarantees. Beyond the parallel aspects, for portable high-quality random numbers you're probably better off using a known good PRNG rather than relying on the one provided by the compiler (I have personal experience of at least one compiler producing poor-quality random numbers with the RANDOM_NUMBER intrinsic, but I'll refrain from naming the vendor since it was many years ago and they might have fixed it since, if they are even in business anymore, I don't know).
(If you find the semantics of the new xorshift1024* implementation difficult, blame a) me, since I devised it and implemented it b) the Fortran standard which makes it impossible to have a parallel PRNG with simple semantics)
If you want a portable multi-stream random number generator for a Fortran program, there is a multi-stream Fortran version of the Mersenne Twister. See http://theo.phys.sci.hiroshima-u.ac.jp/~ishikawa/PRNG/mt_stream_en.html . It uses the concept of advancing the PRNG by a very large number of steps for the different threads. It is set up and configured by subroutine calls, so you should be able to use it from various multi-threading environments.
I have searched a lot of websites and resources but couldn't find any C or Fortran code example of parallel matrix multiplication using the PBLAS PDGEMM function. Could you please help me find such resources?
Thank you in advance.
I got the pblas.tar.gz example from the Netlib website, ran make, and executed it on a Linux cluster using MPI, but the program performs the same run on all nodes without splitting the matrices.
A classic case would be the ScaLAPACK software and associated examples, such as http://www.netlib.org/scalapack/examples/example1.f
In case you misunderstood, PDGEMM will not "split the matrices", it expects the input data to already be distributed properly (i.e., 2D block-cyclic distribution).
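For reference, here is a hedged C sketch of a complete PDGEMM call on a 2x2 process grid (run with mpirun -np 4). The extern prototypes are written from memory and may differ slightly between ScaLAPACK builds, and the matrices are filled with constants rather than read from a real distributed source:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Fortran-style BLACS/ScaLAPACK symbols; check your build's headers. */
    extern void Cblacs_pinfo(int *mypnum, int *nprocs);
    extern void Cblacs_get(int icontxt, int what, int *val);
    extern void Cblacs_gridinit(int *icontxt, const char *order, int nprow, int npcol);
    extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol, int *myrow, int *mycol);
    extern void Cblacs_gridexit(int icontxt);
    extern int  numroc_(const int *n, const int *nb, const int *iproc,
                        const int *isrcproc, const int *nprocs);
    extern void descinit_(int *desc, const int *m, const int *n, const int *mb,
                          const int *nb, const int *irsrc, const int *icsrc,
                          const int *ictxt, const int *lld, int *info);
    extern void pdgemm_(const char *transa, const char *transb, const int *m,
                        const int *n, const int *k, const double *alpha,
                        const double *a, const int *ia, const int *ja, const int *desca,
                        const double *b, const int *ib, const int *jb, const int *descb,
                        const double *beta, double *c, const int *ic, const int *jc,
                        const int *descc);

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int mypnum, nprocs, ctxt, nprow = 2, npcol = 2, myrow, mycol;
        Cblacs_pinfo(&mypnum, &nprocs);
        Cblacs_get(0, 0, &ctxt);
        Cblacs_gridinit(&ctxt, "Row", nprow, npcol);   /* 2x2 process grid */
        Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

        /* Global 8x8 matrices in 4x4 blocks: each rank owns one block. */
        int n = 8, nb = 4, izero = 0, ione = 1, info;
        int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
        int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
        int desc[9];
        descinit_(desc, &n, &n, &nb, &nb, &izero, &izero, &ctxt, &mloc, &info);

        double *A = (double *)malloc(mloc * nloc * sizeof(double));
        double *B = (double *)malloc(mloc * nloc * sizeof(double));
        double *C = (double *)malloc(mloc * nloc * sizeof(double));
        for (int i = 0; i < mloc * nloc; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* C := 1.0*A*B + 0.0*C, operating on the already-distributed blocks */
        double alpha = 1.0, beta = 0.0;
        pdgemm_("N", "N", &n, &n, &n, &alpha, A, &ione, &ione, desc,
                B, &ione, &ione, desc, &beta, C, &ione, &ione, desc);

        printf("rank %d: local C[0] = %f\n", mypnum, C[0]);   /* expect 16.0 */
        Cblacs_gridexit(ctxt);
        MPI_Finalize();
        return 0;
    }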
Part of my task is to generate, let's say, 256 quasi-random numbers using CUDA. I've read the cuRAND docs, and from them I've learnt that I need to use a set of direction vectors, which I can get using the curandGetDirectionVectors32 function. But the problem is that I still cannot understand what a 'set of direction vectors' is, especially how to use it, how to limit its length, etc.
Also, there is no Sobol example among the Device API Examples, and there's no working example on Google. I've found some explanation, but scrambled_sobol_v_host is not declared in that scope and is unclear to me.
So, my question is: could anyone please provide me with a tiny working example of the usage of this generator?
I also have trouble understanding the difference between the Sobol generator and the scrambled Sobol generator.
Thank you in advance.
Direction vectors are the seeding mechanism for that generator: each dimension of the Sobol sequence gets its own precomputed set of 32 (or 64) integers that determines the sequence for that dimension. For the implementation you should be able to follow QuasirandomGenerator (for dummies). As for the difference: the scrambled Sobol generator additionally applies a randomizing scramble to each output, which improves the statistical properties; in cuRAND it takes extra scramble constants (curandGetScrambleConstants32) alongside the direction vectors.
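Since a tiny working example was requested, here is a minimal sketch of the device API with a plain (unscrambled) 32-bit Sobol generator, drawing 256 numbers from dimension 0 with one thread per sequence element. Treat it as an illustration rather than official sample code (compile with something like nvcc sobol.cu -lcurand):

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <curand.h>          // host-side curandGetDirectionVectors32
    #include <curand_kernel.h>   // device-side Sobol state

    #define N 256

    __global__ void sobol_kernel(unsigned int *dirVectors, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        curandStateSobol32_t state;
        // All threads use the dimension-0 vectors; the offset i makes
        // thread i produce the i-th element of the sequence.
        curand_init(dirVectors, i, &state);
        out[i] = curand_uniform(&state);
    }

    int main()
    {
        // A "set of direction vectors" is a precomputed table: 32 unsigned
        // ints per dimension for the 32-bit generator. The host call below
        // returns a pointer to that table; copy as many dimensions as you
        // need to the device (here: one dimension = 32 uints).
        curandDirectionVectors32_t *hostVectors;
        curandGetDirectionVectors32(&hostVectors, CURAND_DIRECTION_VECTORS_32_JOEKUO6);

        unsigned int *devVectors;
        cudaMalloc(&devVectors, sizeof(curandDirectionVectors32_t));
        cudaMemcpy(devVectors, hostVectors, sizeof(curandDirectionVectors32_t),
                   cudaMemcpyHostToDevice);

        float *devOut, hostOut[N];
        cudaMalloc(&devOut, N * sizeof(float));
        sobol_kernel<<<1, N>>>(devVectors, devOut);
        cudaMemcpy(hostOut, devOut, N * sizeof(float), cudaMemcpyDeviceToHost);

        for (int i = 0; i < N; ++i)
            printf("%f\n", hostOut[i]);

        cudaFree(devOut);
        cudaFree(devVectors);
        return 0;
    }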
I am dealing with matrices of size up to N x N with N = 10^7; the number of nonzero elements is about 6N. (Those elements are grouped around the diagonal.) My RAM is 16 GB, so I clearly need a sparse matrix solver. I run Ubuntu Linux and use Fortran 90 (gfortran), or precisely speaking, ratfor90.
I have LAPACK, but it doesn't seem to support sparse matrix solving (am I wrong about that?). MATLAB must be good, but I don't want to spend much time getting familiar with it; the time is pressing. I have the old/gold SLATEC installed and use it for special functions; does it have sparse matrix routines?
I have heard about ARPACK, but can it be used as a plain solver? Can it be called from gfortran?
Any other suggestion?
Thanks, -- Alex
You are right: LAPACK is not applicable to this problem.
Direct sparse solvers are provided by the MUMPS, UMFPACK, and SuperLU libraries; see the UMFPACK sketch below.
PETSc is also a library collection where you can find a lot of relevant functionality.
Ubuntu packages are available for all of these libraries.
ARPACK is a package for solving eigenvalue problems; it is not a linear solver by itself.
I am not sure you can solve your problem in 16 GB. I recommend having a look at FreeFem++.
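To give a concrete starting point, here is the classic small UMFPACK example in C, solving a 5x5 system in compressed-sparse-column form (on Ubuntu the library comes with the SuiteSparse packages; link with -lumfpack). For matrices of your size you would use the long-integer umfpack_dl_* variants instead:

    #include <stdio.h>
    #include <umfpack.h>

    int main(void) {
        /* A (5x5) in compressed sparse column form, plus right-hand side b */
        int    Ap[] = {0, 2, 5, 9, 10, 12};
        int    Ai[] = {0, 1, 0, 2, 4, 1, 2, 3, 4, 2, 1, 4};
        double Ax[] = {2., 3., 3., -1., 4., 4., -3., 1., 2., 2., 6., 1.};
        double b[]  = {8., 45., -3., 3., 19.};
        double x[5];
        int n = 5, i;
        void *Symbolic, *Numeric;

        /* Symbolic analysis, numeric factorization, then solve Ax = b */
        umfpack_di_symbolic(n, n, Ap, Ai, Ax, &Symbolic, NULL, NULL);
        umfpack_di_numeric(Ap, Ai, Ax, Symbolic, &Numeric, NULL, NULL);
        umfpack_di_free_symbolic(&Symbolic);
        umfpack_di_solve(UMFPACK_A, Ap, Ai, Ax, x, b, Numeric, NULL, NULL);
        umfpack_di_free_numeric(&Numeric);

        for (i = 0; i < n; ++i)
            printf("x[%d] = %g\n", i, x[i]);   /* expect 1 2 3 4 5 */
        return 0;
    }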
I posted here not too long ago about a model I am trying to build using pycuda which solves about 9000 coupled ODEs. My model is too slow, however, and an SO member suggested that memory transfers from host to GPU are probably the culprit.
Right now CUDA is being used only to calculate the rate of change of each of the 9000 species I am dealing with. Since I am passing an array from the host to the GPU to perform this calculation and returning an array from the GPU to integrate on the host, I can see how this would slow things down.
Would Boost be the solution to my problem? From what I read, Boost allows interoperability between C++ and Python. It also includes odeint, which, I read, partnered with Thrust allows fast reduction and integration entirely on the GPU. Is my understanding correct?
Thank you,
Karsten
Yes, boost.odeint and boost.python should solve your problem. You can use odeint with Thrust. There are also some OpenCL libraries (VexCL, ViennaCL) which might be easier to use than Thrust. Have a look at this paper for a comparison and for use cases of odeint on GPUs.
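As a minimal illustration of odeint running on the GPU via Thrust, here is a sketch integrating the toy system dx/dt = -x for 9000 species entirely on the device; your actual right-hand side would replace the transform (compile with nvcc, with the Boost headers on the include path):

    #include <cstdio>
    #include <cmath>
    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <boost/numeric/odeint.hpp>
    #include <boost/numeric/odeint/external/thrust/thrust.hpp>

    typedef thrust::device_vector<double> state_type;

    struct decay_system {
        // dx/dt = -x, evaluated elementwise on the GPU
        void operator()(const state_type &x, state_type &dxdt, double /*t*/) const {
            thrust::transform(x.begin(), x.end(), dxdt.begin(),
                              thrust::negate<double>());
        }
    };

    int main() {
        using namespace boost::numeric::odeint;
        const size_t N = 9000;          // number of species, as in the question
        state_type x(N, 1.0);           // initial condition x_i(0) = 1

        // RK4 stepper wired to Thrust's algebra, so all arithmetic stays on the GPU
        runge_kutta4<state_type, double, state_type, double,
                     thrust_algebra, thrust_operations> stepper;
        integrate_const(stepper, decay_system(), x, 0.0, 1.0, 0.01);

        double x0 = x[0];               // single value copied back to the host
        std::printf("x[0](t=1) = %f, exact = %f\n", x0, std::exp(-1.0));
        return 0;
    }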
Boost.Python can handle the communication between the C++ application and Python. Another approach would be a very slim command-line application for solving the ODE (using boost.odeint) that is entirely controlled by your Python application.