I am working on an implementation of a Markov chain Monte Carlo (MCMC) algorithm on an NVIDIA CUDA GPU.
The CPU version of the MCMC algorithm uses the high-quality Mersenne Twister random number generator, and I would like to use the same generator in the GPU kernels I wrote.
I have been searching for cuRAND Mersenne Twister code examples for a long time, but I have never found an example of kernel code that uses it. The cuRAND library documentation provides a set of functions for MTGP (Mersenne Twister for Graphic Processors), but it is not clear how to use them.
The CUDA Samples provide MersenneTwisterGP11213.tar.gz with an example, but it appears to be exclusively host code that requests fast generation of an array of random numbers on the GPU, copies them back to CPU memory, and continues on the CPU.
There is also a paper, "Massively Parallel RNG using CUDA C, Thrust and C#". Again, in its last section, "A Mersenne Twister implementation using CUDA C", the author provides only a simplified piece of the aforementioned host code from the CUDA Samples.
So my first question is: can anybody give me an example of a __global__ or __device__ function that uses the cuRAND Mersenne Twister?
I have one more question. Currently I use a cuRAND library random number generator, and I have no idea which generator it is! Let me show a couple of pieces of my code. This is the generator initialization:
__global__ void init_rng(Cmcmcfit *mc) {
    int ist = threadIdx.x*gridDim.x + blockIdx.x;
    if (ist >= mc->nrndst) return; // The last block can have extra threads
    unsigned long long offset = 0;
    curand_init(mc->seed, ist, offset, &mc->rndst[ist]);
}
In other kernels I sample numbers from the uniform and normal distributions. The array of states for all blockDim.x*gridDim.x generators is kept in global memory, in the array mc->rndst[]. For example, curand_uniform() is used:
. . . . . .
do {   /* Randomly select parameter number k to make a step */
    r = curand_uniform(&mc->rndst[ist]);
    k = (int) (mc->nprm*r);   /* Random parameter index 0..nprm-1 into ivar[] */
} while (k >= mc->nprm);
. . . . . . . . .
Or, to sample from the Gaussian distribution, curand_normal() is used:
std = mc->pstp[(Nbeta*k + Ibeta)*Nseq + Iseq]; /* pstp[k,ibeta,iseq] */
randn = curand_normal(&mc->rndst[ist]);
p = p + std*randn;
Can anybody tell me which of the cuRAND generators (XORWOW, MRG32k3a, MTGP32, ...) is actually used here by default?
The cuRAND documentation includes a section with device API examples. The second example there uses MTGP to generate random numbers in device code, and then, in the same kernel, performs a basic computation on the generated numbers (counting how many have the low bit set). This seems to be what you're asking for (how to generate random numbers on the device and use them in device code). Is something missing there?
Also, the documentation indicates that the default generator used by cuRAND is XORWOW:
The default pseudorandom generator, XORWOW,...
The same statement appears elsewhere in the documentation as well.
Related
I'm trying to find out exactly how java.util.Random in Java 8 generates its random numbers, specifically the algorithm behind it. All I keep finding is how to generate random numbers in Java 8, not the machinery driving it.
If you could point me to any documentation on the PRNG that java.util.Random uses, that would be perfect.
Also, in case it's been done already, is there a way of replicating the output of java.util.Random in Python?
A quick test using a seed of 5 and an int range of 0 to 100 gives different results from Python's random module.
Quoting the Java docs:
An instance of this class is used to generate a stream of pseudorandom numbers. The class uses a 48-bit seed, which is modified using a linear congruential formula. (See Donald Knuth, The Art of Computer Programming, Volume 2, Section 3.2.1.)
So it seems that a linear congruential generator with a 48-bit seed is used.
I do not have access to the mentioned book, but I would guess it gives more detailed information.
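Because the recurrence is so short, it is easy to replicate java.util.Random outside Java. Below is a minimal C++ sketch of the documented 48-bit LCG (multiplier 0x5DEECE66D, increment 0xB); the class name JavaRandom is my own invention for illustration, not part of any library.

```cpp
#include <cstdint>

// Minimal replica of the java.util.Random core algorithm:
// a 48-bit linear congruential generator, as documented by Oracle.
class JavaRandom {
    uint64_t seed_;                                    // 48-bit state
    static constexpr uint64_t MULT = 0x5DEECE66DULL;
    static constexpr uint64_t MASK = (1ULL << 48) - 1;
public:
    explicit JavaRandom(uint64_t seed) : seed_((seed ^ MULT) & MASK) {}

    // Java's protected next(bits): step the LCG, return the top `bits` bits.
    int32_t next(int bits) {
        seed_ = (seed_ * MULT + 0xBULL) & MASK;
        return (int32_t)(seed_ >> (48 - bits));
    }

    int32_t nextInt() { return next(32); }             // Java's nextInt()
};
```

With seed 42, the first call to nextInt() returns -1170105035, matching new java.util.Random(42).nextInt() in Java. The same recurrence is straightforward to port to Python; the differences the poster saw with seed 5 come from Python's random module using the Mersenne Twister, a completely different algorithm.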
I have written a short monte carlo integration algorithm to calculate an integral in Fortran 90. I once compared the result obtained by solving the integral with respect to some parameter using the intrinsic random number generator with the random number generator method ran1 presented in Numerical Recipes for Fortran90 Volume 2.
Running the same algorithm twice, once calling the intrinsic random_seed() followed by repeated calls to random_number(), and once calling the ran1() method provided in the Numerical Recipes book, I obtain in principle the same shape, but the intrinsic result is a continuous curve, in contrast to the ran1 result. In both cases I call the function with random parameters 10,000 times for a parameter value q, sum the results, then go on to the next q value, call the function 10,000 times again, and so on.
A comparative image of the result can be found here:
If I increase the number of calls, both curves converge. But I was wondering: why does the intrinsic random number generator produce this smoothness? Is it still generally advisable to use it, or are there other, more recommended RNGs? I suppose the continuous result comes from the intrinsic generator being "less random".
(I left out the source code as I don't think that there is a lot of input from it. If somebody cares I can hand it in later.)
There are NO guarantees about the quality of the pseudorandom generator in standard Fortran. If you care about the quality of the implementation, for cryptography or for science sensitive to random numbers (Monte Carlo), you should use a library you have control over.
You can study your compiler's manual to find out what it says about the random number generator, but every compiler can implement a completely different algorithm to generate random numbers.
Numerical Recipes is actually not well received by some people in the numerical-mathematics community: http://www.uwyo.edu/buerkle/misc/wnotnr.html
This site is not for software recommendations, but this article (link given by roygvib in a comment), https://arxiv.org/abs/1005.4117, is a good review with examples of bad and good algorithms, methods to test them, ways to generate arbitrary number distributions, and example calls of two libraries in C (one of which can also be called from Fortran).
Personally I use this parallel PRNG: https://bitbucket.org/LadaF/elmm/src/master/src/rng_par_zig.f90. I didn't test its quality; I personally just need speed.
The particular random number generator used depends on the compiler. It is perhaps documented, perhaps not, and perhaps subject to change. For high-quality work I would use library or source code from elsewhere. A web page with Fortran implementations of the Mersenne Twister: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/FORTRAN/fortran.html. I have used http://theo.phys.sci.hiroshima-u.ac.jp/~ishikawa/PRNG/mt_stream_en.html and verified it against the original implementation. This version is useful for multi-threaded programs.
I agree with Vladimir F. But...
To help you in your quest for better random numbers, consider the fairly recent addition to C++: the C++11 pseudo-random number generators. The Mersenne Twister and many others are there. It's a pretty well-thought-out system. I see two ways you could test those sequences:
make a system call from Fortran to a little C++ utility that generates a bunch of them for you or
bind the random number generators to Fortran through ISO_C_BINDING.
I prefer the second, and because the bindings are a little tricky, I have prepared an example. The files are:
cxx11_rand.h and cxx11_rand.cpp which define calls to the random number generator
c_rand.cpp which calls the C++ functions in C functions
f_rand_M.f90 which binds the C functions to Fortran
f_main.f90 which uses the module functions to generate 10 random numbers in [0,1).
cxx11_rand.h
#ifndef CXX11_RAND_H
#define CXX11_RAND_H
void cxx11_init();
double cxx11_rand();
#endif
cxx11_rand.cpp
#include <algorithm>
#include <cstdio>
#include <memory>
#include <random>
#include <set>
#include <vector>
#include <chrono>
#include <thread>
#include <iostream>
#include "cxx11_rand.h"
static std::unique_ptr<std::uniform_real_distribution<>> dist;
static std::unique_ptr<std::mt19937_64> eng;
void cxx11_init(){
eng = std::unique_ptr<std::mt19937_64>( new std::mt19937_64(std::random_device{}()) );
dist = std::unique_ptr< std::uniform_real_distribution<> >( new std::uniform_real_distribution<>(0.0,1.0) );
}
double cxx11_rand(){
return (*dist)( *eng );
}
c_rand.cpp
#include "cxx11_rand.h"
#ifdef __cplusplus
extern "C" {
#endif
void c_init(){
cxx11_init();
}
double c_rand(){
return cxx11_rand();
}
#ifdef __cplusplus
}
#endif
f_rand_M.f90
module f_rand_M
implicit none
!define fortran interface bindings to C functions
interface
subroutine fi_init() bind(C, name="c_init")
end subroutine
real(C_DOUBLE) function fi_rand() bind(C, name="c_rand")
use ISO_C_BINDING, only: C_DOUBLE
end function
end interface
contains
subroutine f_init()
call fi_init()
end subroutine
real(C_DOUBLE) function f_rand()
use ISO_C_BINDING, only: C_DOUBLE
f_rand = fi_rand()
end function
end module
f_main.f90
program main
use f_rand_M
implicit none
integer :: i
call f_init()
do i=1,10
write(*,*)f_rand()
end do
end program
You can compile/link with the following GNU commands
echo "compiling objects"
g++ -c --std=c++11 cxx11_rand.cpp c_rand.cpp
gfortran -c f_rand_M.f90
echo "building & executing fortran main"
gfortran f_main.f90 f_rand_M.o c_rand.o cxx11_rand.o -lstdc++ -o f_main.exe
./f_main.exe
Your output should look like this (with different random numbers, of course; the seed here was drawn from a source of entropy, std::random_device).
compiling objects
building & executing fortran main
0.47439556226575341
0.11177335018127127
0.10417488557661241
0.77378163596792404
0.20780793755332663
0.27951447624366532
0.66920698086955666
0.80676663600103105
0.98028384008440417
0.88893587108730432
I used GCC 4.9 on a Mac for testing.
A PRNG is a bad option when you are doing MC, and it does not matter in which programming language. For MC simulations it is always a good idea to use a service like Random.org or a hardware random number generator. The book Effective Java, in Item 47, clearly shows an example of a problematic PRNG. With PRNGs you are never sure that their internal implementation is OK.
I have a Deck vector with 52 Cards, and I want to shuffle it.
vector<Card^> cards;
So I used this:
random_shuffle(cards.begin(), cards.end());
The problem was that it gave me the same result every time, so I used srand to randomize it:
srand(unsigned(time(NULL)));
random_shuffle(cards.begin(),cards.end());
This was still not truly random. When I started dealing cards, the deals were the same as in the last run. For example: "1st deal: A,6,3,2,K; 2nd deal: Q,8,4,J,2", and when I restarted the program I got exactly the same order of deals.
Then I used srand() together with random_shuffle and its third parameter:
int myrandom (int i) {
return std::rand()%i;
}
srand(unsigned(time(NULL)));
random_shuffle(cards.begin(),cards.end(), myrandom);
Now it works and always gives me different results on re-runs, but I don't know why. How do these functions work, and what did I do here?
This answer required some investigation: looking at the C++ Standard Library headers in VC++ and at the C++ standard itself. I knew what the standard said, but I was curious how VC++ (including C++/CLI) did its implementation.
First, what does the standard say about std::random_shuffle? We can find it here. In particular, it says:
Reorders the elements in the given range [first, last) such that each possible permutation of those elements has equal probability of appearance.
1) The random number generator is implementation-defined, but the function std::rand is often used.
The bolded part is key. The standard says that the RNG is implementation-defined (so results across different compilers will vary). The standard notes that std::rand is often used, but this isn't a requirement, and if an implementation doesn't use std::rand, it follows that it likely won't use std::srand for a starting seed either. An interesting footnote is that std::random_shuffle is deprecated as of C++14 (and removed in C++17), while std::shuffle remains. My guess is that since std::shuffle requires you to provide a function object, you are explicitly defining the behavior you want when generating random numbers, and that is an advantage over the older std::random_shuffle.
I took my VS2013 and looked at the C++ Standard Library headers, and discovered that <algorithm> uses a template class with a completely different PRNG than std::rand, with an index (seed) set to zero. Although the details may vary between versions of VC++ (including C++/CLI), I think it is probable that most versions do something similar. This would explain why each run of your application produced the same shuffled decks.
The option I would opt for if I am looking for a Pseudo RNG and I'm not doing cryptography is to use something well established like Mersenne Twister:
Advantages: The commonly used version of the Mersenne Twister, MT19937, which produces a sequence of 32-bit integers, has the following desirable properties:
It has a very long period of 2^19937 − 1. While a long period is not a guarantee of quality in a random number generator, short periods (such as the 2^32 common in many older software packages) can be problematic.
It is k-distributed to 32-bit accuracy for every 1 ≤ k ≤ 623 (see definition below).
It passes numerous tests for statistical randomness, including the Diehard tests.
Luckily for us, the C++11 Standard Library (which should be available in VS2010 and later, including C++/CLI) includes a Mersenne Twister function object that can be used with std::shuffle. Please see this C++ documentation for more details. The C++ Standard Library reference provided earlier actually contains code that does this:
std::random_device rd;
std::mt19937 g(rd());
std::shuffle(v.begin(), v.end(), g);
The thing to note is that std::random_device produces non-deterministic (non-repeatable) unsigned integers. We need non-deterministic data to seed our Mersenne Twister (std::mt19937) PRNG with. This is similar in concept to seeding rand with srand(time(NULL)), the latter not being an overly good source of randomness.
This looks all well and good but has one disadvantage when dealing with card shuffling. An unsigned integer on the Windows platform is 4 bytes (32 bits) and can store 2^32 values. This means there are only 4,294,967,296 possible starting points (seeds) therefore only that many ways to shuffle the deck. The problem is that there are 52! (52 factorial) ways to shuffle a standard 52 card deck. That happens to be 80658175170943878571660636856403766975289505440883277824000000000000 ways, which is far bigger than the number of unique ways we can get from setting a 32-bit seed.
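As a quick sanity check on how many seed bits 52! actually requires, log2(52!) can be computed directly. This small helper (my own, for illustration) just sums log2(k) for k = 2..52:

```cpp
#include <cmath>

// Bits needed to index every permutation of a 52-card deck:
// log2(52!) = log2(2) + log2(3) + ... + log2(52).
double deck_entropy_bits() {
    double bits = 0.0;
    for (int k = 2; k <= 52; ++k)
        bits += std::log2((double)k);
    return bits;   // ~225.6, so a 226-bit (29-byte) seed suffices
}
```

The result lies between 225 and 226, which is where the 226-bit (29-byte) seed figure used below comes from.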
Thankfully, the Mersenne Twister can accept seeds between 0 and 2^19937-1. 52! is a big number, but every permutation can be indexed with a 226-bit (~29-byte) seed. The Standard Library allows std::mt19937 to be seeded with up to 19937 bits (~2.5 KB) of data via a seed sequence if we so choose. But since we need only 226 bits, the following code creates 29 bytes of non-deterministic data to be used as a suitable seed for std::mt19937:
/* seed_data holds 29 bytes of seed data, covering the 226 bits we need */
std::array<unsigned char, 29> seed_data;
std::random_device rd;
std::generate_n(seed_data.data(), seed_data.size(), std::ref(rd));
std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
// Seed the Mersenne Twister using the 29-byte sequence
std::mt19937 g(seq);
Then all you need to do is call shuffle with code like:
std::shuffle(cards.begin(),cards.end(), g);
On Windows VC++/CLI you will get a warning with the code above that you'll want to suppress. At the top of the file (before other includes) you can add this:
#define _SCL_SECURE_NO_WARNINGS 1
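Putting the pieces together, here is a self-contained sketch in plain standard C++ (using ints 0..51 in place of the C++/CLI Card^ handle type) that seeds std::mt19937 from 29 bytes of std::random_device output and shuffles a 52-card deck:

```cpp
#include <algorithm>
#include <array>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

// Seed a mt19937 with 29 bytes (>226 bits) of entropy, then shuffle a deck.
// Plain ints 0..51 stand in for the C++/CLI Card^ handles.
std::vector<int> shuffled_deck() {
    std::array<unsigned char, 29> seed_data;
    std::random_device rd;
    std::generate_n(seed_data.data(), seed_data.size(), std::ref(rd));
    std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
    std::mt19937 g(seq);

    std::vector<int> cards(52);
    std::iota(cards.begin(), cards.end(), 0);   // 0, 1, ..., 51
    std::shuffle(cards.begin(), cards.end(), g);
    return cards;
}
```

Each call produces an independently shuffled permutation of 0..51; sorting the result recovers the original deck, which makes the permutation property easy to verify.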
Today I talked with a friend of mine who told me he is trying to run some Monte Carlo simulations on a GPU. Interestingly, he wanted to draw numbers randomly on different processors and assumed they would be uncorrelated. But they were not.
The question is whether there exists a method to draw independent sets of numbers on several GPUs. He thought that taking a different seed for each of them would solve the problem, but it does not.
If any clarifications are needed, please let me know and I will ask him to provide more details.
To generate completely independent random numbers, you need to use a parallel random number generator. Essentially, you choose a single seed and it generates M independent streams of random numbers. On each of the M GPUs you can then draw from an independent stream.
When dealing with multiple GPUs you need to be aware that you want:
independent streams within a GPU (if RNs are generated by each GPU)
independent streams between GPUs.
It turns out that generating random numbers on each GPU core is tricky (see this question I asked a while back). When I was playing about with GPUs and RNs, you only got a speed-up generating random numbers on the GPU if you generated many of them at once.
Instead, I would generate the random numbers on the CPU, since:
It's easier, and sometimes quicker, to generate them on the CPU and transfer them across.
You can use well-tested parallel random number generators.
The types of off-the-shelf random number generators available for GPUs are very limited.
Current GPU random number libraries only generate RNs from a small number of distributions.
To answer your question in the comments: What do random numbers depend on?
A very basic random number generator is the linear congruential generator (LCG). Although this generator has been surpassed by newer methods, it should give you an idea of how they work: basically, the i-th random number depends on the (i-1)-th random number. As you point out, if you run two streams long enough, they will overlap. The big problem is that you don't know when they will overlap.
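To make the overlap concrete, here is a toy illustration (my own, with deliberately tiny parameters): a full-period LCG modulo 2^16. Any stream walks through every state, so it inevitably runs straight through any other seed's starting point; two "different" seeds just produce shifted copies of one and the same cycle.

```cpp
#include <cstdint>

// Toy LCG: x <- (5*x + 1) mod 2^16. By the Hull-Dobell theorem it has
// full period 2^16 (c = 1 is odd, a - 1 = 4 is divisible by 4), so a
// stream started anywhere visits every one of the 65536 states.
uint16_t lcg_step(uint16_t x) { return (uint16_t)(5u * x + 1u); }

// Steps until the stream started at seed_a lands on seed_b, i.e. the
// point from which it replays seed_b's stream exactly. Returns -1 if
// seed_b is never reached (impossible here: the period is full).
int steps_until_collision(uint16_t seed_a, uint16_t seed_b) {
    uint16_t x = seed_a;
    for (int i = 1; i <= 65536; ++i) {
        x = lcg_step(x);
        if (x == seed_b) return i;
    }
    return -1;
}
```

For example, steps_until_collision(1, 12345) returns a positive count below the period; from that point on, the seed-1 stream reproduces the seed-12345 stream verbatim. Real generators have vastly longer periods, but the same effect occurs; you just cannot predict where.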
For generating i.i.d. uniform variables, you just have to initialize your generators with different seeds. With CUDA, you may use the NVIDIA cuRAND library. (Note that the plain curandState device API used below is cuRAND's default XORWOW generator; the Mersenne Twister variant, MTGP32, has a separate set of curand_mtgp32 device functions.)
For example, the following code, executed by 100 threads in parallel, will set up the generator states needed to draw 10 samples of an (R^10)-uniform:
__global__ void setup_kernel(curandState *state, int pseed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int seed = id % 10 + pseed;
    /* 10 different seeds for uncorrelated rv's,
       a different sequence number, no offset */
    curand_init(seed, id, 0, &state[id]);
}
If you take any "good" generator (e.g. the Mersenne Twister), two sequences with different random seeds will be uncorrelated, be it on a GPU or a CPU. Hence I'm not sure what you mean by saying that taking different seeds on different GPUs was not enough. Could you elaborate?
I've tried for hours to find the implementation of the rand() function used in gcc.
I would much appreciate it if someone could point me to the file containing its implementation, or to a website with the implementation.
By the way, which directory (I'm using Ubuntu, if that matters) contains the C standard library implementation for the gcc compiler?
rand consists of a call to a function __random, which mostly just calls another function, __random_r, in random_r.c.
Note that the function names above are hyperlinks to the glibc source repository, at version 2.28.
The glibc random library supports two kinds of generators: a simple linear congruential one, and a more sophisticated linear-feedback shift register one. It is possible to construct instances of either, but the default global generator, used when you call rand, uses the linear-feedback shift register generator (see the definition of unsafe_state.rand_type).
You will find the C library implementation used by GCC in the GNU glibc project.
You can download its sources, and in them you will find the rand() implementation. Sources with function definitions are usually not installed on a Linux distribution; only the header files, which I guess you already know, are stored in the /usr/include directory.
If you are familiar with the Git source code management system, you can fetch the glibc source code with:
$ git clone git://sourceware.org/git/glibc.git
The files are available via FTP. I found that there is more to the rand() used in stdlib, which comes from glibc. From the 2.32 version (glibc-2.32.tar.gz) obtained from here, the stdlib folder contains a random.c file whose leading comment (quoted below) explains the generator: a linear additive-feedback scheme by default, with a simple linear congruential algorithm only when the state is small. The folder also has rand.c and rand_r.c, which show more of the source code. stdlib.h, contained in the same folder, shows the values used for macros like RAND_MAX.
/* An improved random number generation package. In addition to the
standard rand()/srand() like interface, this package also has a
special state info interface. The initstate() routine is called
with a seed, an array of bytes, and a count of how many bytes are
being passed in; this array is then initialized to contain
information for random number generation with that much state
information. Good sizes for the amount of state information are
32, 64, 128, and 256 bytes. The state can be switched by calling
the setstate() function with the same array as was initialized with
initstate(). By default, the package runs with 128 bytes of state
information and generates far better random numbers than a linear
congruential generator. If the amount of state information is less
than 32 bytes, a simple linear congruential R.N.G. is used.
Internally, the state information is treated as an array of longs;
the zeroth element of the array is the type of R.N.G. being used
(small integer); the remainder of the array is the state
information for the R.N.G. Thus, 32 bytes of state information
will give 7 longs worth of state information, which will allow a
degree seven polynomial. (Note: The zeroth word of state
information also has some other information stored in it; see setstate
for details). The random number generation technique is a linear
feedback shift register approach, employing trinomials (since there
are fewer terms to sum up that way). In this approach, the least
significant bit of all the numbers in the state table will act as a
linear feedback shift register, and will have period 2^deg - 1
(where deg is the degree of the polynomial being used, assuming
that the polynomial is irreducible and primitive). The higher order
bits will have longer periods, since their values are also
influenced by pseudo-random carries out of the lower bits. The
total period of the generator is approximately deg*(2**deg - 1); thus
doubling the amount of state information has a vast influence on the
period of the generator. Note: The deg*(2**deg - 1) is an
approximation only good for large deg, when the period of the shift
register is the dominant factor. With deg equal to seven, the
period is actually much longer than the 7*(2**7 - 1) predicted by
this formula. */
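The additive-feedback scheme described in this comment can be sketched compactly. The following C++ sketch follows the widely documented reconstruction of glibc's default TYPE_3 generator (degree-31 trinomial with separation 3, state filled by a Lehmer LCG, first 310 outputs discarded); it is an illustration of the technique, not the glibc source itself.

```cpp
#include <cstdint>

// Reconstruction of glibc's default rand(): an additive linear-feedback
// generator r[i] = r[i-31] + r[i-3] (mod 2^32), returning r[i] >> 1.
// The state is filled by a Lehmer LCG, and glibc discards the first
// 310 outputs, so the first value rand() returns corresponds to i = 344.
int32_t glibc_rand_first(uint32_t seed) {
    uint32_t r[344];
    r[0] = seed ? seed : 1;                      // glibc maps seed 0 to 1
    for (int i = 1; i < 31; ++i) {
        // Lehmer LCG: r[i] = 16807 * r[i-1] mod (2^31 - 1)
        r[i] = (uint32_t)((16807LL * r[i - 1]) % 2147483647LL);
    }
    for (int i = 31; i < 34; ++i) r[i] = r[i - 31];
    for (int i = 34; i < 344; ++i) r[i] = r[i - 31] + r[i - 3];
    return (int32_t)((r[313] + r[341]) >> 1);    // r[344] >> 1
}
```

On glibc systems, srand(1); rand(); famously returns 1804289383; if this reconstruction is faithful, glibc_rand_first(1) yields the same value.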