I have implemented a random sequence generator in Python and want to test its output with TestU01, but I don't understand how to feed input to the test suite. Also, how many bits should I generate to test the sequence?
TestU01 is a library and doesn't come with executables. It mostly provides functions for testing C generators that implement the unif01_Gen interface defined in unif01.h. See guideshorttest01.pdf.
However, it does come with a few methods which test binary files. Here is a short program which calls them:
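#include <stdio.h>
#include "gdef.h"
#include "swrite.h"
#include "bbattery.h"

int main (int argc, char *argv[])
{
   if (argc != 2) {
      fprintf(stderr, "Specify binary file of random bits as ./test <path>\n");
      return 1;
   }
   /* Open in binary mode and measure the file size in bytes */
   FILE *fp = fopen(argv[1], "rb");
   if (fp == NULL) {
      perror("fopen");
      return 1;
   }
   fseek(fp, 0L, SEEK_END);
   long bytes = ftell(fp);
   fclose(fp);
   double nbits = (double) bytes * 8.0;   /* the *File batteries take the bit count as a double */
   printf("Reading binary file %s of size %.0f bits\n", argv[1], nbits);
   swrite_Basic = FALSE;                   /* suppress the detailed per-test output */
   bbattery_RabbitFile (argv[1], nbits);
   bbattery_AlphabitFile (argv[1], nbits);
   bbattery_FIPS_140_2File (argv[1]);
   return 0;
}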
After installing TestU01 (it's in the Arch/Manjaro AUR, in case that helps), compile it with: gcc test.c -o test -ltestu01
Here is a Python program that generates a random binary file. Note that the tests work on 32-bit blocks, and I suggest sticking to that when generating the file.
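import struct
from random import Random

size = 1024 * 1024                        # file size in bytes (1 MiB = 8 Mi bits)
rand = Random()
with open("bits", "wb") as f:
    for i in range(size // 4):            # 4 bytes per 32-bit word
        value = rand.getrandbits(32)
        s = struct.pack('I', value)       # native-endian unsigned 32-bit integer
        f.write(s)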
There is also a version of SmallCrush that reads a text file of about 5 million floats; see bbattery_SmallCrushFile. I haven't tried it, but make sure the floats are written with enough digits, since precision lost in the conversion to and from text can break the test.
I don't know much about the theory of testing RNGs, so I can't tell you how long a sequence you need. The TestU01 guide is detailed and might answer your questions.
Is it safe to pass function pointers via MPI as a way of telling another node to call a function? Some may say that passing any kind of pointer via MPI is meaningless, but I wrote some code to check.
//test.cpp
#include <cstdio>
#include <iostream>
#include <mpi.h>
#include <cstring>
using namespace std;

int f1(int a){return a + 1;}
int f2(int a){return a + 2;}
int f3(int a){return a + 3;}
using F = int (*)(int);

int main(int argc, char *argv[]){
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Status state;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // test: rank 0 sends the raw bytes of the address of f2 to every other rank
    char data[sizeof(F)];
    if( 0 == rank ){
        F fp = &f2;
        memcpy(data, &fp, sizeof(F));
        for(int i = 1 ; i < size ; ++i)
            MPI_Send(data, sizeof(F), MPI_CHAR, i, 0, MPI_COMM_WORLD);
    }else{
        MPI_Recv(data, sizeof(F), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &state);
        F fp;
        memcpy(&fp, data, sizeof(F));
        int ans = fp(10);   // calls whatever code lives at that address in *this* process
        cout << ans << endl;
    }
    MPI_Finalize();
    return 0;
}
Here is the output:
12
12
12
12
12
12
12
12
12
I ran it via MVAPICH, and it works well. But I just don't know why, since separate address spaces mean that the pointer value is USELESS in any process other than the one that generated it.
P.S. Here is my hostfile:
blade11:1
blade12:1
blade13:1
blade14:1
blade15:1
blade16:1
blade17:1
blade18:2
blade19:1
I ran it with mpiexec -n 10 -f hostfile ./test and compiled it as C++11.
You are lucky in the sense that your cluster environment is homogeneous and no address space randomisation for ordinary executables is in place. As a consequence, all images are loaded at the same base address and laid out similarly in memory, hence functions have the same virtual addresses in all MPI ranks (note that this is rarely true for symbols from dynamically linked libraries as those are usually loaded at random addresses).
If you compile the source twice, using different compilers or the same compiler with different options, and then have some ranks run the first executable and the rest run the second one, the program will most likely crash.
Try this:
$ mpicxx -std=c++11 -O0 -o test_O0 test.cpp
$ mpicxx -std=c++11 -O2 -o test_O2 test.cpp
$ mpiexec -f hostfile -n 5 ./test_O0 : -n 5 ./test_O2
12
12
12
12
<crash>
The different levels of optimisation result in function code of different size in test_O0 and test_O2. Consequently, f2 will no longer have the same virtual address in all ranks. The ranks that run the same executable as rank 0 will print 12, while the rest will segfault.
Is it safe to pass function pointers via MPI as a way of telling another node to call a function?
No, it is not. Address space is not shared among processes.
However, MPI processes running programs built from the same source can be organised to call a specific function when a certain message is received:
unsigned char data = 0;
MPI_Recv(&data, 1, MPI_UNSIGNED_CHAR, 0, 0, MPI_COMM_WORLD, &state);
if (data == 255) {
    f2(10); /* and so forth */
}
No.
However, there is a trick involving macros that map an identifier for a function to a local function pointer/callback, which every process recognizes in the same way.
For example, HPX uses this (http://stellar.cct.lsu.edu/files/hpx_0.9.5/html/HPX_PLAIN_ACTION.html) to run functions across heterogeneous systems.
Is it possible to generate random numbers within a device function without preallocating all the states? I would like to generate and use them in "realtime". I need them for Monte Carlo simulations; which generators are most suitable for this purpose? The numbers generated below are single precision; is it possible to have them in double precision?
#include <cstdio>
#include <ctime>
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>

__global__ void cudaRand(float *d_out, unsigned long seed)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init(seed, i, 0, &state);
    d_out[i] = curand_uniform(&state);
}

int main(int argc, char** argv)
{
    size_t N = 1 << 4;
    float *v = new float[N];
    float *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(float));
    // generate random numbers
    cudaRand<<<1, N>>>(d_out, time(NULL));
    cudaMemcpy(v, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < N; i++)
    {
        printf("out: %f\n", v[i]);
    }
    cudaFree(d_out);
    delete[] v;
    return 0;
}
UPDATE
#include <cstdio>
#include <ctime>
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>

__global__ void cudaRand(double *d_out)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init((unsigned long long)clock() + i, 0, 0, &state);
    d_out[i] = curand_uniform_double(&state);
}

int main(int argc, char** argv)
{
    size_t N = 1 << 4;
    double *h_v = new double[N];
    double *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(double));
    // generate random numbers
    cudaRand<<<1, N>>>(d_out);
    cudaMemcpy(h_v, d_out, N * sizeof(double), cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < N; i++)
        printf("out: %f\n", h_v[i]);
    cudaFree(d_out);
    delete[] h_v;
    return 0;
}
Here is how I dealt with a similar situation in the past, inside a __device__/__global__ function:
int tId = threadIdx.x + (blockIdx.x * blockDim.x);
curandState state;
curand_init((unsigned long long)clock() + tId, 0, 0, &state);
double rand1 = curand_uniform_double(&state);
double rand2 = curand_uniform_double(&state);
So just use curand_uniform_double to generate random doubles. I also believe you don't want the same seed for all threads, which is what I am trying to achieve by using clock() + tId instead. This way the odds of two threads getting the same rand1/rand2 are close to nil.
EDIT:
However, based on the comments below, the proposed approach may lead to biased results:
JackOLantern pointed me to this part of the cuRAND documentation:
Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may give statistically correlated sequences.
There is also a devtalk thread about improving the performance of curand_init, in which the proposed solution for speeding up cuRAND initialization is:
One thing you can do is use different seeds for each thread and a fixed subsequence of 0 and offset of 0.
But the same poster later states:
The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don't know of any problems like this, and even if they do exist they will most likely be rare.
So it is basically up to you whether you want better performance (as I did) or fully unbiased results. If you want the latter, then the solution proposed by JackOLantern is the correct one, i.e. initialize cuRAND as:
curand_init((unsigned long long)clock(), tId, 0, &state);
Using non-zero values for the subsequence and offset parameters does, however, decrease performance. For more info on these parameters, see this SO thread and the cuRAND documentation.
I see that JackOLantern stated in a comment that:
I would say it is not recommandable to call curand_init and curand_uniform_double from within the same kernel from two reasons ........ Second, curand_init initializes the pseudorandom number generator and sets all of its parameters, so I'm afraid your approach will be somewhat slow.
I dealt with this over several pages of my thesis. I tried various approaches to get different random numbers in each thread, and creating a curandState in each thread turned out to be the most viable solution for me. I needed to generate ~10 random numbers per thread, and among other things I tried:
developing my own simple random number generator (a linear congruential generator) whose initialization was basically free; however, performance suffered greatly when generating numbers, so in the end having a curandState in each thread turned out to be superior,
pre-allocating curandStates and reusing them: this was memory-heavy, and when I decreased the number of preallocated states I had to use non-zero values for the subsequence/offset parameters of curand_init to get rid of the bias, which decreased performance when generating numbers.
So after a thorough analysis I decided to call curand_init and curand_uniform_double in each thread after all. The only problem was the number of registers these states occupied, so I had to choose block sizes carefully so as not to exceed the maximum number of registers available to each block.
That's what I have to say about the provided solution, which I was finally able to test, and it works just fine on my machine/GPU. I ran the code from the UPDATE section of the question above and 16 different random numbers were correctly displayed in the console. I therefore advise you to perform proper error checking after executing the kernel to see what went wrong inside. This topic is very well covered in this SO thread.
I wrote a program to test the speed of memcpy(). However, the way the memory is allocated greatly influences the speed.
CODE
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

int main(int argc, char *argv[]){
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;
    unsigned long iters = 1000*1000;
    if(argc != 3){
        fprintf(stderr, "usage: %s <type> <size-in-KB>\n", argv[0]);
        return 1;
    }
    int type = atoi(argv[1]);
    int buff_size = atoi(argv[2])*1024;
    if(type == 1){
        /* one allocation: the two buffers are adjacent */
        pbuff_1 = malloc(2*buff_size);
        pbuff_2 = pbuff_1+buff_size;
    }else{
        /* two separate allocations */
        pbuff_1 = malloc(buff_size);
        pbuff_2 = malloc(buff_size);
    }
    for(unsigned long i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buff_size);
    }
    if(type == 1){
        free(pbuff_1);
    }else{
        free(pbuff_1);
        free(pbuff_2);
    }
    return 0;
}
The OS is linux-2.6.35 and the compiler is GCC-4.4.5 with options "-std=c99 -O3".
Results on my computer (memcpy of 4 KB, iterated 1 million times):
time ./test.test 1 4
real 0m0.128s
user 0m0.120s
sys 0m0.000s
time ./test.test 0 4
real 0m0.422s
user 0m0.420s
sys 0m0.000s
This question is related to a previous question:
Why does the speed of memcpy() drop dramatically every 4KB?
UPDATE
The reason is related to the GCC compiler; I compiled and ran this program with different versions of GCC:
GCC version     4.1.3       4.4.5       4.6.3
Time (type 1)   0m0.183s    0m0.128s    0m0.110s
Time (type 0)   0m1.788s    0m0.422s    0m0.108s
It seems GCC is getting smarter.
The specific addresses returned by malloc are chosen by the implementation and are not always optimal for the code that uses them. You already know that the speed of moving memory around depends greatly on cache and page effects.
Here, the specific pointers returned by malloc are not known; you could print them with printf("%p", (void *)ptr). What is known, however, is that using a single malloc for both blocks avoids wasting page and cache space between the two blocks. That alone may explain the speed difference.
The documentation for GMP seems to list only the following algorithms for random number generation:
gmp_randinit_mt, the Mersenne Twister;
gmp_randinit_lc_2exp and gmp_randinit_lc_2exp_size, linear congruential.
There is also gmp_randinit_default, but it points to gmp_randinit_mt.
Neither the Mersenne Twister nor linear congruential generators should be used for cryptography.
What do people usually do, then, when they want to use the GMP to build some cryptographic code?
(Using a cryptographic API for encryption/decryption etc. doesn't help, because I'm actually implementing a new algorithm, which crypto libraries don't have.)
Disclaimer: I have only "tinkered" with RNGs, and that was over a year ago.
If you are on a Linux box, the solution is relatively simple and non-deterministic: just open /dev/urandom and read the desired number of bits. If your program needs a large number of random bits, however, you might want to read a smaller number of bits from /dev/urandom and use them as seeds for a PRNG.
Boost offers a number of PRNGs and a non-deterministic RNG, random_device. random_device uses the very same /dev/urandom on Linux and a similar function (IIRC) on Windows, so consider it if you need Windows or cross-platform support.
Of course, you just might want/need to write a function based on your favored RNG using GMP's types and functions.
Edit:
#include <stdio.h>
#include <gmp.h>
#include <boost/random/random_device.hpp>

int main(int argc, char *argv[]){
    unsigned min_digits = 30;
    unsigned max_digits = 50;
    unsigned quantity = 1000; // How many numbers do you want?
    unsigned sequence = 10;   // How many numbers before reseeding?

    mpz_t rmin;
    mpz_init(rmin);
    mpz_ui_pow_ui(rmin, 10, min_digits-1);   // smallest number with min_digits digits
    mpz_t rmax;
    mpz_init(rmax);
    mpz_ui_pow_ui(rmax, 10, max_digits);     // exclusive upper bound: at most max_digits digits

    gmp_randstate_t rstate;
    gmp_randinit_mt(rstate);
    mpz_t rnum;
    mpz_init(rnum);
    boost::random::random_device rdev;

    for( unsigned i = 0; i < quantity; i++){
        if(!(i % sequence))
            gmp_randseed_ui(rstate, rdev());  // reseed MT from the OS entropy source
        do{
            mpz_urandomm(rnum, rstate, rmax); // uniform in [0, rmax)
        }while(mpz_cmp(rnum, rmin) < 0);      // reject values below rmin
        gmp_printf("%Zd\n", rnum);
    }

    mpz_clear(rmin);
    mpz_clear(rmax);
    mpz_clear(rnum);
    gmp_randclear(rstate);
    return 0;
}
I want to use GSL's uniform random number generator. On their website, they include this sample code:
#include <stdio.h>
#include <gsl/gsl_rng.h>
int
main (void)
{
const gsl_rng_type * T;
gsl_rng * r;
int i, n = 10;
gsl_rng_env_setup();
T = gsl_rng_default;
r = gsl_rng_alloc (T);
for (i = 0; i < n; i++)
{
double u = gsl_rng_uniform (r);
printf ("%.5f\n", u);
}
gsl_rng_free (r);
return 0;
}
However, this does not set any seed, so the generator starts from its default seed and the same random numbers are produced each time.
They also specify the following:
The generator itself can be changed using the environment variable GSL_RNG_TYPE. Here is the output of the program using a seed value of 123 and the multiple-recursive generator mrg,
$ GSL_RNG_SEED=123 GSL_RNG_TYPE=mrg ./a.out
But I don't understand how to implement this. Any ideas as to what modifications I can make to the above code to incorporate the seed?
The problem is that a new seed is not being generated. If you just want a function that returns a darn random number and don't care about the sticky details of how it's generated, try this. It assumes that you have GSL installed.
#include <iostream>
#include <gsl/gsl_math.h>
#include <gsl/gsl_rng.h>
#include <sys/time.h>
float keithRandom() {
// Random number function based on the GNU Scientific Library
// Returns a random number in [0,1): gsl_rng_uniform can return 0 but never 1
// (the cast to float below can still round values very close to 1 up to 1.0f)
const gsl_rng_type * T;
gsl_rng * r;
gsl_rng_env_setup();
struct timeval tv; // Seed generation based on time
gettimeofday(&tv,0);
unsigned long mySeed = tv.tv_sec + tv.tv_usec;
T = gsl_rng_default; // Generator setup
r = gsl_rng_alloc (T);
gsl_rng_set(r, mySeed);
double u = gsl_rng_uniform(r); // Generate it!
gsl_rng_free (r);
return (float)u;
}
Read section 18.6, Random number environment variables, to see what gsl_rng_env_setup() does: it reads the generator type and seed from the environment variables GSL_RNG_TYPE and GSL_RNG_SEED.
Then see 18.3 Random number generator initialization - if you don't want to get the seed from an environment variable, you can use gsl_rng_set() to set the seed.
A complete answer to this question, with sample code, can be found at this link.
Just for completeness I am putting a copy of the code for a function to create a seed here. It is written by Robert G. Brown: http://www.phy.duke.edu/~rgb/ .
#include <stdio.h>
#include <sys/time.h>
unsigned long int random_seed()
{
unsigned int seed;
struct timeval tv;
FILE *devrandom;
if ((devrandom = fopen("/dev/random","r")) == NULL) {
gettimeofday(&tv,0);
seed = tv.tv_sec + tv.tv_usec;
} else {
fread(&seed,sizeof(seed),1,devrandom);
fclose(devrandom);
}
return(seed);
}
From my own experience with this function, though, the /dev/random solution is very time-consuming compared to gettimeofday() (reads from /dev/random may block until enough entropy is available); you can check it yourself. So the gettimeofday() solution might be better for you if its quality is good enough:
#include <stdio.h>
#include <sys/time.h>
unsigned long int random_seed()
{
struct timeval tv;
gettimeofday(&tv,0);
return (tv.tv_sec + tv.tv_usec);
}