How Many cuRand States Required for Thread-Unique Random Numbers? (CUDA) - random

How many cuRand states are required to get unique random numbers in every thread? From other questions posted on the site, some have said that you need one per thread and others say you need one per block.
Does using one cuRAND state per thread mean better random numbers?
Does using one cuRAND state per thread significantly slow down CUDA applications (5000+ threads)?
Also, for an implementation using one cuRAND state per thread, does this kernel look correct and efficient?
__global__ void myKernel (const double *seeds) // seeds is an array of length = #threads
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x; // global thread ID
    curandState s;
    curand_init((unsigned long long)seeds[tid], 0, 0, &s);
    // ...
    double r = curand_uniform_double(&s); // curand_uniform() returns float; the _double variant matches r
    // ...
}

Assuming all your threads stay synchronized, you want to generate the random numbers in all threads at the same time, as your sample code does. However, from what I understand, you do not need a different seed in each thread: you can give every thread the same seed and use a different subsequence number per thread (for example, the thread ID), which still gives each thread its own statistically independent stream. I may be wrong on that one though...
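For example, a minimal sketch of that scheme, with one state per thread, a single shared seed, and the thread ID as the subsequence (kernel and buffer names are made up):

#include <curand_kernel.h>

__global__ void generate(double *out, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState s;
    curand_init(seed, tid, 0, &s); // same seed, per-thread subsequence
    out[tid] = curand_uniform_double(&s);
}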
Now, the documentation uses the term "block" as in "create all your random numbers in one block". It does not mean that one block of threads will do the work; it means that one block of memory will hold all the random numbers, all generated in one single call. So if you need, say, 4096 random numbers in your loop, you should create them all at once at the start, then load them back from memory later. You'll have to test whether that makes things faster in your case. Many memory accesses slow things down, but calling the generator many times may well be slower still, since each call has to reload a heavy set of state values to compute the next pseudo-random number(s).
Source:
http://docs.nvidia.com/cuda/curand/host-api-overview.html#performance-notes2
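For illustration, a rough sketch of that host-API pattern (the generator type, seed, and buffer size are arbitrary choices, not from the question):

#include <cuda_runtime.h>
#include <curand.h>

int main()
{
    const size_t n = 4096;
    double *d_rand;
    cudaMalloc(&d_rand, n * sizeof(double));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT); // default XORWOW generator
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);       // one seed for the whole run
    curandGenerateUniformDouble(gen, d_rand, n);            // fill the block in one call

    // ... launch kernels that consume d_rand ...

    curandDestroyGenerator(gen);
    cudaFree(d_rand);
    return 0;
}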

srand(time(NULL)) c++ isn't working properly [duplicate]

This question is about a comment on another question,
Recommended way to initialize srand? The first comment there says that srand() should be called only ONCE in an application. Why is that?
That depends on what you are trying to achieve.
Randomization is performed as a function of a starting value, namely the seed.
So, for the same seed, you will always get the same sequence of values.
If you set the seed every time you need a random value, and the seed is the same number, you will always get the same "random" value.
The seed is usually taken from the current time in seconds, as in time(NULL), so if you set the seed before taking every random number, you will get the same number for as long as you call the srand/rand combo within the same second.
To avoid this problem, srand is set only once per application, because it is unlikely that two instances of the application will start within the same second, so each instance then gets a different sequence of random numbers.
However, there is a slight possibility that you will run your app (especially a short one, such as a command line tool) many times in a second; in that case you will have to resort to some other way of choosing a seed (unless the same sequence in different application instances is fine for you). But as I said, that depends on your application's context of usage.
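A minimal sketch of the seed-once pattern (C++ here; the C version differs only in the headers):

#include <cstdio>
#include <cstdlib>
#include <ctime>

int main()
{
    std::srand(static_cast<unsigned>(std::time(nullptr))); // seed once, at startup
    for (int i = 0; i < 5; ++i)
        std::printf("%d\n", std::rand()); // each call advances the internal state
}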
Also, you may want to increase the precision to microseconds (minimizing the chance of an identical seed); this requires <sys/time.h>:
struct timeval t1;
gettimeofday(&t1, NULL);
srand(t1.tv_usec * t1.tv_sec);
Random numbers are actually pseudo-random. A seed is set first; each call to rand derives a number from the current internal state and then modifies that state, and the new state is used by the next rand call to produce another number. Because a fixed formula generates these "random" numbers, setting the same seed before every call to rand makes every call return the same number. For example, srand(1234); rand(); will always return the same value. Initializing the state once with a seed, and never resetting it with srand, lets the state evolve from call to call, which is what makes the sequence look random.
Generally we use the seconds value returned by time(NULL) to initialize the seed. Say srand(time(NULL)); is inside a loop, and the loop iterates more than once in a second: every rand call in those iterations will return the same "random number", which is not desired. Initializing once at program start sets the seed once; each time rand is called, a new number is generated and the internal state is updated, so the next rand call returns a number that is random enough.
For example this code from http://linux.die.net/man/3/rand:
static unsigned long next = 1;

/* RAND_MAX assumed to be 32767 */
int myrand(void) {
    next = next * 1103515245 + 12345;
    return ((unsigned)(next / 65536) % 32768);
}

void mysrand(unsigned seed) {
    next = seed;
}
The internal state, next, is declared as a global. Each myrand call modifies the internal state, updates it, and returns a random number. Every call of myrand sees a different next value, so the method returns a different number on every call.
Look at the mysrand implementation: it simply assigns the seed value you pass to next. Therefore, if you set next to the same value every time before calling rand, it will return the same random value, because the identical formula is applied to it. That is not desirable, since the function is meant to be random.
But depending on your needs, you can set the seed to a fixed value to reproduce the same "random" sequence on each run, say for a benchmark.
Short answer: calling srand() is not like "rolling the dice" for the random number generator. Nor is it like shuffling a deck of cards. If anything, it's more like just cutting a deck of cards.
Think of it like this. rand() deals from a big deck of cards, and every time you call it, all it does is pick the next card off the top of the deck, give you the value, and return that card to the bottom of the deck. (Yes, that means the "random" sequence will repeat after a while. It's a very big deck, though: typically 4,294,967,296 cards.)
Furthermore, every time your program runs, a brand-new pack of cards is bought from the game shop, and every brand-new pack of cards always has the same sequence. So unless you do something special, every time your program runs, it will get exactly the same "random" numbers back from rand().
Now, you might say, "Okay, so how do I shuffle the deck?" And the answer -- at least as far as rand and srand are concerned -- is that there is no way of shuffling the deck.
So what does srand do? Based on the analogy I've been building here, calling srand(n) is basically like saying, "cut the deck n cards from the top". But wait, one more thing: it actually starts with another brand-new deck and cuts it n cards from the top.
So if you call srand(n), rand(), srand(n), rand(), ..., with the same n every time, you won't just get a not-very-random sequence, you'll actually get the same number back from rand() every time. (Probably not the same number you handed to srand, but the same number back from rand over and over.)
So the best you can do is to cut the deck once, that is, call srand() once, at the beginning of your program, with an n that's reasonably random, so that you'll start at a different random place in the big deck each time your program runs. With rand(), that really is the best you can do.
[P.S. Yes, I know, in real life, when you buy a brand-new deck of cards it's typically in order, not in random order. For the analogy here to work, I'm imagining that each deck you buy from the game shop is in a seemingly random order, but the exact same seemingly-random order as every other deck of cards you buy from that same shop. Sort of like the identically shuffled decks of cards they use in bridge tournaments.]
Addendum: For a very cute demonstration of the fact that for a given PRNG algorithm and a given seed value, you always get the same sequence, see this question (which is about Java, not C, but anyway).
The reason is that srand() sets the initial state of the random generator, and all the values that generator produces are only "random enough" if you don't touch the state yourself in between.
For example you could do:
int getRandomValue()
{
    srand(time(0));
    return rand();
}
and then if you call that function repeatedly, so that time() returns the same value in adjacent calls, you just get the same value generated; that's by design.
A simpler way to get different seeds for application instances started within the same second is shown below (getpid() requires <unistd.h>):
srand(time(NULL)-getpid());
This makes the seed much harder to repeat, since there is no way to guess at what time your process started, and two instances started in the same second will still have different PIDs.
srand seeds the pseudorandom number generator. If you call it more than once, you will reseed the RNG. And if you call it with the same argument, it will restart the same sequence.
To prove it, if you do something simple like this, you will see the same number printed 100 times:
#include <stdlib.h>
#include <stdio.h>

int main() {
    for (int i = 0; i != 100; ++i) {
        srand(0);
        printf("%d\n", rand());
    }
}
Every time rand() runs, it advances the internal state, which effectively becomes the seed for the next rand() call.
If srand() runs multiple times, the problem is that when two of those runs happen within the same second (so time(NULL) does not change), the next rand() will return the same value as the rand() right after the previous srand().

lock free programming - c++ atomic

I am trying to develop the following lock-free code (C++11):
int val_max;
std::array<std::atomic<int>, 255> vector;

if (vector[i] > val_max) {
    val_max = vector[i];
}
The problem is that with many threads (128), the result is not correct: if, for example, val_max is 1 and three threads (with vector[i] = 5, 15, and 20) all pass the comparison at the same time, they then race on the assignment on the next line.
So I don't know the best way to solve the problem with the facilities available in C++11 (I cannot use mutexes or locks) while protecting the whole read-compare-write sequence.
Any suggestions? Thanks in advance!
You need to describe the bigger problem you need to solve, and why you want multiple threads for this.
If you have a lot of data and want to find the maximum, and you want to split that problem, you're doing it wrong. If all threads try to access a shared maximum, not only is it hard to get right, but by the time you have it right, you have fully serialized accesses, thus making the entire thing an exercise in adding complexity and thread overhead to a serial program.
The correct way to make it parallel is to give every thread a chunk of the array to work on (and the array members are not atomics), for which the thread calculates a local maximum, then when all threads are done have one thread find the maximum of the individual results.
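A sketch of that chunked approach in C++11 (parallel_max and the thread count are illustrative choices, not from the question):

#include <algorithm>
#include <climits>
#include <thread>
#include <vector>

int parallel_max(const std::vector<int>& data, unsigned nthreads)
{
    std::vector<int> local_max(nthreads, INT_MIN); // one private result per thread
    std::vector<std::thread> workers;
    const size_t chunk = data.size() / nthreads + 1;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            size_t lo = t * chunk;
            size_t hi = std::min(data.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i)           // no sharing inside the loop
                local_max[t] = std::max(local_max[t], data[i]);
        });
    }
    for (auto& w : workers) w.join();
    // one final serial reduction over the per-thread results
    return *std::max_element(local_max.begin(), local_max.end());
}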
1. Do an atomic fetch of val_max.
2. If the value fetched is greater than or equal to vector[i], stop; you are done.
3. Do an atomic compare-exchange: compare val_max against the value you read in step 1, and exchange it for the value of vector[i] if they still compare equal.
4. If the compare succeeded, stop; you are done.
5. Go to step 1; you raced with another thread that made forward progress.
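A sketch of that loop in C++11, assuming val_max is made a std::atomic<int> (the function name is mine):

#include <atomic>

void update_max(std::atomic<int>& val_max, int candidate)
{
    int current = val_max.load();                   // step 1: atomic fetch
    while (candidate > current) {                   // step 2: done if not greater
        // step 3: try to swap in candidate; on failure, 'current' is
        // refreshed with the value another thread just wrote (step 5)
        if (val_max.compare_exchange_weak(current, candidate))
            return;                                 // step 4: success
    }
}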

Or-equals on constant as reduction operation (ex. value |= 1 ) thread-safe?

Let's say that I have a variable x.
x = 0
I then spawn some number of threads, and each of them may or may not run the following expression WITHOUT the use of atomics.
x |= 1
After all threads have joined with my main thread, the main thread branches on the value.
if(x) { ... } else { ... }
Is it possible for there to be a race condition in this situation? My thoughts say no, because it doesn't seem to matter whether or not a thread is interrupted by another thread between reading and writing 'x': either way, every write stores 1, so the final value is x == 1. That said, I want to make sure I'm not missing something stupidly obvious or ridiculously subtle.
Also, if you happen to provide an answer to the contrary, please provide an instruction-by-instruction example!
Context:
I'm trying to, in OpenCL, have my threads indicate the presence or absence of a feature among any of their work-items. If any of the threads indicate the presence of the feature, my host ought to be able to branch on the result. I'm thinking of using the above method. If you guys have a better suggestion, that works too!
Detail:
I'm trying to add early-exit to my OpenCL radix-sort implementation, to skip radix passes if the data is banded (i.e. 'x' above would be x[RADIX] and I'd have all work groups, right after partial reduction of the data, indicate presence or absence of elements in the RADIX bins via 'x').
It may work within a work-group. You will need to insert a barrier before testing x. I'm not sure it will be faster than using atomic increments.
It will not work across several work-groups. Imagine you have 1000 work-groups to run on 20 cores. Typically, only a small number of work-groups can be resident on a single core, for example 4, meaning only 80 work-groups can be in flight inside the GPU at a given time. Once a work-group is done executing, it is retired, and another one is started. Halting a kernel in the middle of execution to wait for all 1000 work-groups to reach the same point is impossible.

Parallelizing an algorithm with many exit points?

I'm faced with parallelizing an algorithm which, in its serial implementation, examines the six faces of a cube of array locations within a much larger three-dimensional array. (That is, select an array element, and then define a cube or cuboid around that element, 'n' elements distant in x, y, and z, bounded by the bounds of the array.)
Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):
do n1 = nlo, nhi
  do o1 = olo, ohi
    if (somecondition(n1,o1)) then
      retval = .TRUE.
      RETURN
    endif
  end do
end do
Or C pseudocode:
for (n1 = nlo; n1 <= nhi; n1++) {
    for (o1 = olo; o1 <= ohi; o1++) {
        if (somecondition(n1, o1) != 0) {
            return true;
        }
    }
}
There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.
What I think would be best is to schedule six or more threads of execution (round-robin if there aren't that many CPU cores), ideally with the loops executing in parallel, with the same goal as the serial algorithm: as soon as somecondition() becomes true, execution among all the threads must stop immediately, and a value of true must be set in a shared location.
What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.
Are there message passing mechanisms in OpenMP?
EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?
Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I suggest this based on your comment, "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." Parallelizing over the array elements is also flexible in the number of threads. Unless there is a reason the array elements have to be checked in some order?
The portion you are showing us doesn't seem to take that long to execute, so reducing its wall-clock time by making it parallel might not be easy: there is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.
One possibility is to use OpenMP to parallelize over the six loops: declare logical :: array(6), allow each loop to run to completion, and then compute retval = any(array). You can then check this value and return outside the parallelized region. Add schedule(dynamic) to the parallel do statement if you do this. Or have a separate !$omp parallel and put !$omp do schedule(dynamic) ... !$omp end do nowait around each of the six loops.
Or you can follow the good advice from @M.S.B. and parallelize the outermost loop over the whole array. The problem here is that you cannot have a RETURN inside a parallel loop, so label the second-outermost loop (the largest one within the parallel part) and EXIT that loop; something like this:
retval = .FALSE.
!$omp parallel do default(private) shared(BIGARRAY,retval) schedule(dynamic,1)
do k = 1, NN
   if (.not. retval) then
      outer2: do j = 1, NN
         do i = 1, NN
            ! --- your loop #1
            do n1 = nlo, nhi
               do o1 = olo, ohi
                  if (somecondition(BIGARRAY(i,j,k), n1, o1)) then
                     retval = .TRUE.
                     exit outer2
                  endif
               end do
            end do
            ! --- your loops #2 ... #6 go here
         end do
      end do outer2
   end if
end do
!$omp end parallel do
[edit: the if statement is there presuming that you need to find out whether there is at least one such element in the big array. If you need to evaluate the condition for every element, you can similarly add a dummy loop exit or goto to skip the rest of the processing for that element. Again, use schedule(dynamic) or schedule(guided).]
As a separate point, you might also want to check whether it is a good idea to step through the innermost loop in larger strides (depending on float size), compute a vector of logicals on each iteration, and then aggregate the results, e.g. something like if (count(somecondition(x(o1:o1+step,n1,k))) > 0); in this case the compiler may be able to vectorize somecondition.
I believe you can do what you want with the task construct introduced in OpenMP 3; Intel Fortran supports tasking in OpenMP. I don't use tasks often so I won't offer you any wonky pseudocode.
You already mentioned the obvious way to stop all threads as soon as any thread finds the ending condition: have each thread check a shared variable that holds the status of the ending condition, and break out of the loops accordingly. Obviously this adds overhead, so if you decide to take this approach I would suggest a few things:
Use atomics to check the ending condition; this avoids expensive memory flushing, since only the variable in question is flushed. Consider moving to OpenMP 3.1, which supports some new atomic operations (such as atomic read and write).
Check infrequently, maybe like once per outer iteration. You should only be parallelizing large cases to overcome the overhead of multithreading.
This one is optional, but you can try adding compiler hints, e.g. if you expect a certain condition to be false most of the time, the compiler will optimize the code accordingly.
Another (somewhat dirty) approach is to use shared variables for the loop ranges for each thread, maybe use a shared array where index n is for thread n. When one thread finds the ending condition, it changes the loop ranges of all the other threads so that they stop. You'll need the appropriate memory synchronization. Basically the overhead has now moved from checking a dummy variable to synchronizing/checking loop conditions. Again probably not so good to do this frequently, so maybe use shared outer loop variables and private inner loop variables.
On another note, this reminds me of the classic polling versus interrupt problem. Unfortunately I don't think OpenMP supports interrupts where you can send some kind of kill signal to each thread.
There are hacking work-arounds like using a child process for just this parallel work and invoking the operating system scheduler to emulate interrupts, however this is rather tricky to get correct and would make your code extremely unportable.
Update in response to comment:
Try something like this:
char shared_var = 0;
#pragma omp parallel
{
    char private_var = 0;
    // you should have some method for setting loop ranges for each thread
    for (n1 = nlo; n1 <= nhi; n1++) {
        for (o1 = olo; o1 <= ohi; o1++) {
            if (somecondition(n1, o1) != 0) {
                #pragma omp atomic write
                shared_var = 1; // done marker; this also triggers the break below
                break;          // could instead use goto to break out of both loops in one go
            }
        }
        #pragma omp atomic read
        private_var = shared_var;
        if (private_var != 0) break;
    }
}
A suitable parallel approach might be to let each worker examine a part of the overall problem, exactly as in the serial case, and use a local (non-shared) variable for the result (retval). Finally, do a reduction over all workers' local variables into a shared overall result.
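A minimal sketch of that idea with an OpenMP reduction in C++ (any_hit is a made-up name; somecondition stands in for the test from the question):

bool somecondition(int i); // assumed defined elsewhere

// Each thread accumulates into its own private copy of 'found';
// OpenMP combines the copies with logical OR after the loop.
bool any_hit(int n)
{
    bool found = false;
    #pragma omp parallel for reduction(||: found)
    for (int i = 0; i < n; ++i)
        found = found || somecondition(i);
    return found;
}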

Thread-safe uniform random number generator

I have some parallel Fortran90 code in which each thread needs to generate the same sequence of random numbers.
I have a random number generator that seems to be thread-unsafe, since, for a given seed, I'm completely unable to repeat the same results each time I run the program.
I have searched (almost) the entire web without success, looking for the code of a thread-safe RNG. Could anyone provide me with (a link to) the code of one?
Thanks in advance!
A good pseudo-random number generator for Fortran 90 can be found in the Vector Statistical Library of the Intel Math Kernel Library; its generators are thread-safe. Also, why does it need to be thread-safe? If you want each thread to get the same list, instantiate a new PRNG for each thread with the same seed.
Most repeatable random number generators need state in some form. Without state, they can't know what comes next. In order to be thread-safe, you need a way to hold onto the state yourself (i.e., it can't be global).
When you say "needs to generate the same sequence of random numbers" do you mean that
Each thread needs to generate a stream of numbers identical to the other threads'? This implies choosing the seed before spawning the threads, then instantiating a thread-local PRNG in each thread with the same seed.
or
You want to be able to repeat the same sequence of numbers between different runs of the program, but each thread generates its own independent sequence? In this case you still can't share a single PRNG, because the thread operation sequence is non-deterministic. So seed a single PRNG with a known seed before launching threads, and use it to generate the initial seeds for the threads. Then you instantiate thread-local generators in each thread...
In each of these cases you should note what Neil Butterworth says about the statistics: most of the usual guarantees that PRNGs like to claim are not reliable when you mix streams generated in this way.
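For illustration, a sketch of the second scheme in C++ (the question is about Fortran, so this only shows the structure; all names are mine):

#include <random>
#include <thread>
#include <vector>

int main()
{
    std::mt19937 master(12345); // known seed: runs are repeatable
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        const auto seed = master(); // per-thread seed, drawn before launch
        workers.emplace_back([seed] {
            std::mt19937 local(seed); // thread-local generator
            std::uniform_real_distribution<double> u(0.0, 1.0);
            double r = u(local); // use r...
            (void)r;
        });
    }
    for (auto& w : workers) w.join();
}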
In both cases you need a thread-local PRNG. I don't know what is available in F90... but you can also write your own (look up the Mersenne Twister, and write a routine that takes the saved state as a parameter...).
In Fortran 77, this would look something like:
      function PRNGthread (state)
      double precision state(statesize)
c     stuff happens here which uses and manipulates the state vector...
      PRNGthread = result
      return
      end
and each of your threads should maintain a separate state vector, though all will use the same initial value.
I understand you need every thread to produce the same stream of random numbers.
A very good pseudo-random generator that will produce a reproducible stream of numbers, and is quite fast, is the MT19937. Just make sure that you generate the seed before spawning the threads, but create a separate instance of the MT in every thread (make the MT instance thread-local). That way, every MT is guaranteed to produce the same stream of numbers.
How about SPRNG? I have not tried it myself though.
I coded a thread-safe Fortran 90 version of the Mersenne Twister (MT19937). The state of the PRNG is saved in a derived type (randomNumberSequence), and you use procedures to seed the generator or get the next element in the sequence.
See http://code.google.com/p/i3rc-monte-carlo-model/source/browse/trunk/Code/RandomNumbersForMC.f95
The alternatives seem to be:
Use a synchronisation object (such as a mutex) on the generator's seed value. This will unfortunately serialise your code on accesses to the generator.
Use thread-local storage in the generator so each thread gets its own seed; this may cause statistical problems for your app.
If your platform supports a suitable atomic operation, use that on the seed (it probably won't, however).
Not a very encouraging list, I know. And to add to it, I have no idea how to implement any of them in FORTRAN!
This article, https://www.cmiss.org/openCMISS/wiki/RandomNumberGenerationWithOpenMP, not only links to a Fortran implementation but also covers the key points needed to make a PRNG usable with threads. The most important point is:
The Fortran 90 version of Ziggurat has several variables and arrays with the SAVE attribute. In order to parallelize the uniform RNG, then, it appears that the required change is to turn these variables into arrays with a separate value for each thread (beware of false sharing). Then, when the PRNG function is called, we must pass the thread number and use the corresponding state value.
