In GSL there are many random number generators. For example, an implementation of the maximally equidistributed combined Tausworthe generator lives in gsl/taus.c. The random seed is set in taus_set below (taus_get is shown for context):
static inline unsigned long
taus_get (void *vstate)
{
taus_state_t *state = (taus_state_t *) vstate;
#define MASK 0xffffffffUL
#define TAUSWORTHE(s,a,b,c,d) (((s &c) <<d) &MASK) ^ ((((s <<a) &MASK)^s) >>b)
state->s1 = TAUSWORTHE (state->s1, 13, 19, 4294967294UL, 12);
state->s2 = TAUSWORTHE (state->s2, 2, 25, 4294967288UL, 4);
state->s3 = TAUSWORTHE (state->s3, 3, 11, 4294967280UL, 17);
return (state->s1 ^ state->s2 ^ state->s3);
}
static void
taus_set (void *vstate, unsigned long int s)
{
taus_state_t *state = (taus_state_t *) vstate;
if (s == 0)
s = 1; /* default seed is 1 */
#define LCG(n) ((69069 * n) & 0xffffffffUL)
state->s1 = LCG (s);
state->s2 = LCG (state->s1);
state->s3 = LCG (state->s2);
/* "warm it up" */
taus_get (state);
taus_get (state);
taus_get (state);
taus_get (state);
taus_get (state);
taus_get (state);
return;
}
My question is: why do they need six warm-up calls? Would something go wrong if there were no warm-up?
Six is likely an arbitrary number: not so small that there is insufficient randomness, and not so large that it slows down startup.
There is no mention of it in the paper referenced in the code you posted. Another paper (http://www0.cs.ucl.ac.uk/staff/d.jones/GoodPracticeRNG.pdf) does discuss the need for "warming up" random number generators.
Essentially, it is used when "your seed values have very low entropy". The discussion is on page 9 of the referenced paper.
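As a rough illustration (a minimal sketch, assuming GSL is installed and the program is linked with -lgsl -lgslcblas), you can seed the taus generator with two adjacent seeds and look at the first few outputs; the warm-up inside taus_set is what lets such low-entropy, nearly identical seeds diverge before the generator is first used:
#include <stdio.h>
#include <gsl/gsl_rng.h>

int main(void)
{
    gsl_rng *a = gsl_rng_alloc(gsl_rng_taus);
    gsl_rng *b = gsl_rng_alloc(gsl_rng_taus);
    gsl_rng_set(a, 1);   /* two seeds that differ only in the lowest bit */
    gsl_rng_set(b, 2);
    for (int i = 0; i < 3; i++)
        printf("%lu\t%lu\n", gsl_rng_get(a), gsl_rng_get(b));
    gsl_rng_free(a);
    gsl_rng_free(b);
    return 0;
}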
Does anyone have an implementation of drand48() or an equivalent that can work in an OpenCL kernel?
I have been sending random numbers generated on the host through a buffer but I need random numbers generated on the device if there is any way to do this.
Here's an OpenCL device function which you can call from an OpenCL kernel:
uint rng_next(__global ulong *states, uint index) {
/* Assume 32 bits */
uint bits = 32;
/* Get current state */
ulong state = states[index];
/* Update state */
state = (state * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
/* Keep new state */
states[index] = state;
/* Return value */
return (uint) (state >> (48 - bits));
}
The states array contains the state of the PRNG for each work-item and the index is basically - but not necessarily - the work-item ID (which you can get with get_global_id()).
The states array can be generated in the host (using another PRNG) and copied to the device, or it can be initialized in the device using some kind of hash function applied to the work-item global IDs. If you use the work-item global IDs as initial seeds, the random streams for each work-item will be very low quality (due to high correlation between them). Here's a kernel to apply a hash function to decorrelate the initial seeds (note you need a main initial seed, passed by the host):
__kernel void rng_init(
const ulong main_seed,
__global ulong *seeds) {
/* Get initial seed for this workitem. */
ulong seed = get_global_id(0) + main_seed;
/* Apply basic xor-shift hash, better ones probably exist. */
seed = ((seed >> 16) ^ seed) * 0x45d9f3b;
seed = ((seed >> 16) ^ seed) * 0x45d9f3b;
seed = ((seed >> 16) ^ seed);
/* Update seeds array. */
seeds[get_global_id(0)] = seed;
}
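For the host-side option mentioned above, a rough sketch in plain C of filling the states buffer before the kernels run might look like this (create_seeded_states is a hypothetical helper; it assumes an already-created cl_context, and rand() stands in for whatever host PRNG you prefer):
#include <stdlib.h>
#include <CL/cl.h>

/* Hypothetical helper: build a device buffer with one 64-bit state per work-item. */
cl_mem create_seeded_states(cl_context ctx, size_t n_items, unsigned host_seed)
{
    cl_ulong *host_states = malloc(n_items * sizeof(cl_ulong));
    srand(host_seed);                 /* rand() is only for illustration */
    for (size_t i = 0; i < n_items; i++)
        host_states[i] = ((cl_ulong) rand() << 32) ^ (cl_ulong) rand();

    cl_int err;
    cl_mem states = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   n_items * sizeof(cl_ulong), host_states, &err);
    free(host_states);                /* the buffer keeps its own copy */
    return states;
}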
Note that, as pointed out in the comments, drand48 is of very low quality, and if you use a lot of work-items you will see artifacts in your rendering. This post explains this in more detail.
This code is taken from the cl_ops library, which I'm the author of.
I've been using the Intel-provided RNG feature for some time, to provide myself with some randomness by means of a C++/CLI program I wrote myself.
However, after some time, something struck me as particularly suspicious. Among other uses, I asked for a random number between 1 and 4 and wrote the result down on paper each time. Here are the results:
2, 3, 3, 2, 1, 3, 4, 2, 3, 2, 3, 1, 3, 2, 3, 1, 2, 4, 2, 2, 1, 2, 1, 3, 1, 3, 3, 3, 3.
Number of 1s : 6
Number of 2s : 9
Number of 3s : 12
Number of 4s : 2
Total : 29
I'm actually wondering if there's a problem with Intel's RNG, my algorithm, my methodology, or something else. Or do you consider the bias not significant enough yet?
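For reference, here is a quick Pearson chi-squared check of the tallies above against a uniform distribution (a rough sketch; the expected count is 29/4 = 7.25 per value, and the 5% critical value for 3 degrees of freedom is about 7.81):
#include <stdio.h>

int main(void)
{
    double observed[4] = { 6, 9, 12, 2 };   /* counts of 1s, 2s, 3s, 4s */
    double expected = 29.0 / 4.0;           /* 7.25 if the draws were uniform */
    double chi2 = 0.0;
    for (int i = 0; i < 4; i++) {
        double d = observed[i] - expected;
        chi2 += d * d / expected;
    }
    printf("chi-squared = %.2f\n", chi2);   /* prints about 7.55 */
    return 0;
}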
I'm using Windows 10 Pro, my CPU is an Intel Core i7-4710MQ.
Compiled with VS2017.
Methodology :
Start a Powershell command prompt
Load my assembly with Add-Type -Path <mydll>
Invoke [rdrw.Random]::Next(4)
Add one to the result
A detail that may be of importance: I don't ask for that number very often, so there's some time between draws, and it usually happens when the RNG hasn't been used for some time (one hour at least).
And yes, it's a lazy algorithm; I didn't want to bother myself with exceptions.
Algorithm follows :
#include <immintrin.h>
namespace rdrw {
#pragma managed(push,off)
unsigned long long getRdRand() {
unsigned long long val = 0;
while (!_rdrand64_step(&val));
return val;
}
#pragma managed(pop)
public ref class Random abstract sealed
{
public:
// Returns a random 64 bit unsigned integer
static unsigned long long Next() {
return getRdRand();
}
// Return a random unsigned integer between 0 and max-1 (inclusive)
static unsigned long long Next(unsigned long long max) {
unsigned long long nb = max - 1;
unsigned long long mask = 1;
unsigned long long draw = 0;
if (max <= 1)
return 0;
// Create a bitmask that's at least as big as the biggest acceptable value
while ((nb&mask) != nb)
{
mask <<= 1;
mask |= 1;
}
do
{
// Throw unnecessary bits
draw = Next() & mask;
} while (draw>nb);
return draw;
}
// return a random unsigned integer between min and max-1 inclusive
static unsigned long long Next(unsigned long long min, unsigned long long max) {
if (max == min)
return min;
if (max < min)
return 0;
unsigned long long diff = max - min;
return Next(diff) + min;
}
};
}
Thanks for your insights !
Using a C# script in the Unity3D game engine to control an HLSL compute shader, I'm trying to generate pseudorandom numbers on the GPU and store them in a Texture2D. Following along with
GPU Gems 3 Hybrid Tausworthe method
and another thread Pseudo Random Number Generation on the GPU, I've come across an issue.
The problem:
The resulting texture appears to be one solid color. If I run the shader multiple times, I get a different solid-color texture every time, but the entire texture is always that one color.
Compute shader code
#pragma kernel CSMain
RWTexture2D<float4> result; // 256 resolution texture to write to
uint4 seed; //four uniform random numbers generated on the CPU in a C# script
struct RandomResult
{
uint4 state;
float value;
};
uint TausStep(uint z, int S1, int S2, int S3, uint M)
{
uint b = (((z << S1) ^ z) >> S2);
return ((z & M) << S3) ^ b;
}
uint LCGStep(uint z, uint A, uint C)
{
return A * z + C;
}
RandomResult HybridTaus(uint4 state)
{
state.x = TausStep(state.x, 13, 19, 12, 4294967294);
state.y = TausStep(state.y, 2, 25, 4, 4294967288);
state.z = TausStep(state.z, 3, 11, 17, 4294967280);
state.w = LCGStep(state.w, 1664525, 1013904223);
RandomResult rand;
rand.state = state;
rand.value = 2.3283064365387e-10 * (state.x ^ state.y ^ state.z ^ state.w);
return rand;
}
[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
result[id.xy] = HybridTaus(seed).value;
}
Do I need to save the state on the GPU? If so, how would I do that? Do I need to deallocate the memory afterwards?
I tried to assign the result of the HybridTaus() function to seed in hopes that it would use the new value in the following HybridTaus(seed) call to see if that would make a difference. I also tried to add unique arbitrary numbers based on the thread id, which is the id parameter. This gave some improved results, but I suspect the randomness is only as good as I can make it, coming from maths performed on the thread ids and not effectively from the random number generator.
[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
//first thing I tried
//RandomResult rand = HybridTaus(seed);
//seed = rand.state; // re-assign seed with the new state
//result[id.xy] = rand.value;
//second thing I tried
RandomResult rand = HybridTaus(seed * uint4(id.x*id.y*id.x*id.y,
id.x*id.y/id.x*id.y,
id.x*id.y+id.x*id.y,
id.x*id.y-id.x*id.y));
result[id.xy] = rand.value;
}
First of all, I don't know about the algorithm you posted, but I found this simple one online for generating random numbers on the GPU. Here seed is a 32-bit uint.
uint wang_hash(uint seed)
{
seed = (seed ^ 61) ^ (seed >> 16);
seed *= 9;
seed = seed ^ (seed >> 4);
seed *= 0x27d4eb2d;
seed = seed ^ (seed >> 15);
return seed;
}
In most cases this is sufficient: you can pass your compute shader's local invocation ID, which is unique, and get a random number per thread or per invocation. However, if you need multiple random numbers per invocation (for example, inside a loop or a nested loop), this doesn't work because the seed stays the same. So I tweaked the function a little and came up with this:
uint wang_hash(uint seed, float x, float y)
{
seed = seed + 76.897898 * 48.789789 * cos(x) * sin(y) * 20.79797;
seed = (seed ^ 61) ^ (seed >> 16);
seed *= 9;
seed = seed ^ (seed >> 4);
seed *= 0x27d4eb2d;
seed = seed ^ (seed >> 15);
return seed;
}
Here x and y are my nested for-loop counters, passed in as extra parameters. This works for me: now you can get multiple random numbers per invocation.
In your case, however, I don't think you need the latter one. If I understood correctly, you just need to store a random number for every texel, so you can try the first version and use the unique local invocation ID to get a random number for every texel.
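As a sketch of that idea in plain C (the same expressions translate directly to HLSL; random_for_texel is a hypothetical helper name), you can hash the texel coordinate into a seed and scale it to [0, 1):
#include <stdint.h>

/* The same hash as above, written with C types. */
static uint32_t wang_hash(uint32_t seed)
{
    seed = (seed ^ 61u) ^ (seed >> 16);
    seed *= 9u;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2du;
    seed = seed ^ (seed >> 15);
    return seed;
}

/* Hypothetical helper: one value in [0, 1) per texel. */
float random_for_texel(uint32_t x, uint32_t y, uint32_t width)
{
    uint32_t h = wang_hash(y * width + x);   /* unique seed per texel */
    return h * (1.0f / 4294967296.0f);       /* map a 32-bit uint to [0, 1) */
}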
A friend of mine was asked at an interview the following question: "Given a binary number, find the most significant bit". I immediately thought of the following solution but am not sure if it is correct.
Namely, divide the string into two parts and convert both parts into decimal. If the left subarray is 0 in decimal, then do a binary search in the right subarray, looking for a 1.
That leads to my other question: is the most significant bit the left-most 1 in a binary number? Can you show me an example, with an explanation, where a 0 is the most significant bit?
EDIT:
There seems to be a bit of confusion in the answers below, so I am updating the question to make it more precise. The interviewer said: "You have a website that you receive data from until the most significant bit indicates to stop transmitting data. How would you go about telling the program to stop the data transfer?"
You could also use bit shifting. Pseudo-code:
number = gets
bitpos = 0
while number != 0
bitpos++ # increment the bit position
number = number >> 1 # shift the whole thing to the right once
end
puts bitpos
If the number is zero, bitpos is zero.
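A direct C translation of that pseudo-code, as a sketch:
#include <stdio.h>

int main(void)
{
    unsigned int number;
    if (scanf("%u", &number) != 1)
        return 1;
    int bitpos = 0;
    while (number != 0) {
        bitpos++;          /* increment the bit position */
        number >>= 1;      /* shift the whole thing to the right once */
    }
    printf("%d\n", bitpos);   /* 0 if the input was zero */
    return 0;
}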
Finding the most significant bit in a word (i.e. computing log2 rounded down) using only C-like language instructions can be done with a rather well-known method based on De Bruijn sequences. For example, for a 32-bit value:
#include <stdint.h>
unsigned ulog2(uint32_t v)
{ /* Evaluates [log2 v] */
static const unsigned MUL_DE_BRUIJN_BIT[] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
return MUL_DE_BRUIJN_BIT[(v * 0x07C4ACDDu) >> 27];
}
However, in practice simpler methods (like an unrolled binary search) often work just as well or better.
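For comparison, a sketch of such an unrolled binary search for the same floor(log2) task:
#include <stdint.h>

unsigned ulog2_binsearch(uint32_t v)
{   /* Evaluates floor(log2 v); returns 0 for v == 0 */
    unsigned r = 0;
    if (v >= 1u << 16) { v >>= 16; r += 16; }
    if (v >= 1u << 8)  { v >>= 8;  r += 8;  }
    if (v >= 1u << 4)  { v >>= 4;  r += 4;  }
    if (v >= 1u << 2)  { v >>= 2;  r += 2;  }
    if (v >= 1u << 1)  {           r += 1;  }
    return r;
}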
The edited question is really quite different, though not very clear. Who are "you"? The website or the programmer of the program that reads data from the website? If you're the website, you make the program stop by sending a value (probably a byte) with its most-significant bit set; just OR or ADD that bit in. If you're the programmer, you test the most-significant bit of the values you receive, and stop reading when it becomes set. For unsigned bytes, you could do the test like this:
bool stop = received_byte >= 128;
or
bool stop = received_byte & 128;
For signed bytes, you could use
bool stop = received_byte < 0;
or
bool stop = received_byte & 128;
If you're not reading bytes but, say, 32bit words, the 128 changes to (1 << 31).
This is one approach (not necessarily the most efficient, especially if your platform has a single-instruction solution such as find-first-one or count-leading-zeros), assuming two's complement signed integers and a 32-bit integer width.
int mask = (int)(1U << 31); // signed integer with only the sign bit (bit 31) set
while (!(n & mask))         // n is the int we're testing against
    mask >>= 1;             // take advantage of sign fill on right shift of a negative number
mask = mask ^ (mask << 1);  // isolate the highest bit that matched n
If you want the bit position of that first one, simply add an integer counter that starts at 31 and gets decremented on each loop iteration.
One downside to this is if n == 0, it's an infinite loop, so test for zero beforehand.
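Putting the loop and the counter together, a small sketch (it assumes n != 0 and an arithmetic right shift, as above):
#include <stdio.h>

int main(void)
{
    int n = 0x00400000;            /* example input */
    int mask = (int)(1U << 31);
    int pos = 31;
    while (!(n & mask)) {
        mask >>= 1;                /* sign fill keeps the high bits set */
        pos--;
    }
    printf("highest set bit: %d\n", pos);   /* prints 22 for this input */
    return 0;
}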
If you are interested in a C/C++ solution, you can have a look at the book "Matters Computational" by Jörg Arndt, where these functions are defined in section "1.6.1 Isolating the highest one and finding its index":
static inline ulong highest_one_idx(ulong x)
// Return index of highest bit set.
// Return 0 if no bit is set.
{
#if defined BITS_USE_ASM
return asm_bsr(x);
#else // BITS_USE_ASM
#if BITS_PER_LONG == 64
#define MU0 0x5555555555555555UL // MU0 == ((-1UL)/3UL) == ...01010101_2
#define MU1 0x3333333333333333UL // MU1 == ((-1UL)/5UL) == ...00110011_2
#define MU2 0x0f0f0f0f0f0f0f0fUL // MU2 == ((-1UL)/17UL) == ...00001111_2
#define MU3 0x00ff00ff00ff00ffUL // MU3 == ((-1UL)/257UL) == (8 ones)
#define MU4 0x0000ffff0000ffffUL // MU4 == ((-1UL)/65537UL) == (16 ones)
#define MU5 0x00000000ffffffffUL // MU5 == ((-1UL)/4294967297UL) == (32 ones)
#else
#define MU0 0x55555555UL // MU0 == ((-1UL)/3UL) == ...01010101_2
#define MU1 0x33333333UL // MU1 == ((-1UL)/5UL) == ...00110011_2
#define MU2 0x0f0f0f0fUL // MU2 == ((-1UL)/17UL) == ...00001111_2
#define MU3 0x00ff00ffUL // MU3 == ((-1UL)/257UL) == (8 ones)
#define MU4 0x0000ffffUL // MU4 == ((-1UL)/65537UL) == (16 ones)
#endif
ulong r = (ulong)ld_neq(x, x & MU0)
+ ((ulong)ld_neq(x, x & MU1) << 1)
+ ((ulong)ld_neq(x, x & MU2) << 2)
+ ((ulong)ld_neq(x, x & MU3) << 3)
+ ((ulong)ld_neq(x, x & MU4) << 4);
#if BITS_PER_LONG > 32
r += ((ulong)ld_neq(x, x & MU5) << 5);
#endif
return r;
#undef MU0
#undef MU1
#undef MU2
#undef MU3
#undef MU4
#undef MU5
#endif
}
where asm_bsr is implemented depending on your processor architecture
// i386
static inline ulong asm_bsr(ulong x)
// Bit Scan Reverse: return index of highest one.
{
asm ("bsrl %0, %0" : "=r" (x) : "0" (x));
return x;
}
or
// AMD64
static inline ulong asm_bsr(ulong x)
// Bit Scan Reverse
{
asm ("bsrq %0, %0" : "=r" (x) : "0" (x));
return x;
}
Go here for the code: http://jjj.de/bitwizardry/bitwizardrypage.html
EDIT:
This is the definition in the source for function ld_neq:
static inline bool ld_neq(ulong x, ulong y)
// Return whether floor(log2(x))!=floor(log2(y))
{ return ( (x^y) > (x&y) ); }
I don't know if this is too tricky :)
I would convert the binary number to decimal and then return the base-2 logarithm of that number directly (converted from float to int).
The most significant bit is then bit (returned number + 1), counting from the right.
To answer your other question: as far as I know, it's the left-most 1.
I think this is kind of a trick question. The most significant bit is always going to be a 1 :-). If interviewers like lateral thinking, that answer should be a winner!
I am trying to use the CURAND library to generate random numbers from 0 to 100 that are completely independent of each other. Hence I am giving the time as a seed to each thread and specifying "id = threadIdx.x + blockDim.x * blockIdx.x" as the sequence and the offset.
Then after getting the random number as float, I multiply it by 100 and take its integer value.
Now, the problem I am facing is that threads [0,0] and [0,1] get the same random number (11), no matter how many times I run the code. I am unable to understand what I am doing wrong. Please help.
I am pasting my code below:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include<curand_kernel.h>
#include "util/cuPrintf.cu"
#include<time.h>
#define NE WA*HA //Total number of random numbers
#define WA 2 // Matrix A width
#define HA 2 // Matrix A height
#define SAMPLE 100 //Sample number
#define BLOCK_SIZE 2 //Block size
__global__ void setup_kernel ( curandState * state, unsigned long seed )
{
int id = threadIdx.x + blockIdx.x + blockDim.x;
curand_init ( seed, id , id, &state[id] );
}
__global__ void generate( curandState* globalState, float* randomMatrix )
{
int ind = threadIdx.x + blockIdx.x * blockDim.x;
if(ind < NE){
curandState localState = globalState[ind];
float stopId = curand_uniform(&localState) * SAMPLE;
cuPrintf("Float random value is : %f",stopId);
int stop = stopId ;
cuPrintf("Random number %d\n",stop);
for(int i = 0; i < SAMPLE; i++){
if(i == stop){
float random = curand_normal( &localState );
cuPrintf("Random Value %f\t",random);
randomMatrix[ind] = random;
break;
}
}
globalState[ind] = localState;
}
}
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
// 1. allocate host memory for matrix A
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float* ) malloc(mem_size_A);
time_t t;
// 2. allocate device memory
float* d_A;
cudaMalloc((void**) &d_A, mem_size_A);
// 3. create random states
curandState* devStates;
cudaMalloc ( &devStates, size_A*sizeof( curandState ) );
// 4. setup seeds
int n_blocks = size_A/BLOCK_SIZE;
time(&t);
printf("\nTime is : %u\n",(unsigned long) t);
setup_kernel <<< n_blocks, BLOCK_SIZE >>> ( devStates, (unsigned long) t );
// 4. generate random numbers
cudaPrintfInit();
generate <<< n_blocks, BLOCK_SIZE >>> ( devStates,d_A );
cudaPrintfDisplay(stdout, true);
cudaPrintfEnd();
// 5. copy result from device to host
cudaMemcpy(h_A, d_A, mem_size_A, cudaMemcpyDeviceToHost);
// 6. print out the results
printf("\n\nMatrix A (Results)\n");
for(int i = 0; i < size_A; i++)
{
printf("%f ", h_A[i]);
if(((i + 1) % WA) == 0)
printf("\n");
}
printf("\n");
// 7. clean up memory
free(h_A);
cudaFree(d_A);
}
The output that I get is:
Time is : 1347857063
[0, 0]: Float random value is : 11.675105[0, 0]: Random number 11
[0, 0]: Random Value 0.358356 [0, 1]: Float random value is : 11.675105[0, 1]: Random number 11
[0, 1]: Random Value 0.358356 [1, 0]: Float random value is : 63.840496[1, 0]: Random number 63
[1, 0]: Random Value 0.696459 [1, 1]: Float random value is : 44.712799[1, 1]: Random number 44
[1, 1]: Random Value 0.735049
There are a few things wrong here; I'm addressing the main ones to get you started:
General points
Please check the return values of all CUDA API calls, see here for more info.
Please run cuda-memcheck to check for obvious things like out-of-bounds accesses.
Specific points
When allocating space for the RNG state, you should have space for one state per thread (not one per matrix element as you have now).
Your thread ID calculation in setup_kernel() is wrong, should be threadIdx.x + blockIdx.x * blockDim.x (* instead of +).
You use the thread ID as both the sequence number and the offset; you should just set the offset to zero, as described in the cuRAND manual (a corrected sketch of setup_kernel follows the last point below):
For the highest quality parallel pseudorandom number generation, each experiment should be assigned a unique seed. Within an experiment, each thread of computation should be assigned a unique sequence number.
Finally, you're running two threads per block, which is incredibly inefficient. Check out the CUDA C Programming Guide, in the "maximize utilization" section, for more information, but you should be looking to launch a multiple of 32 threads per block (e.g. 128, 256) and a large number of blocks (e.g. tens of thousands). If your problem is small, then consider running multiple problems at once (either batched in a single kernel launch or as kernels in different streams to get concurrent execution).
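Putting the kernel-side points together, a corrected setup_kernel might look roughly like this (a sketch, not tested; it relies on the same #include <curand_kernel.h> as your original code):
__global__ void setup_kernel(curandState *state, unsigned long long seed)
{
    // unique seed per run, thread ID as the sequence number, offset 0
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, id, 0, &state[id]);
}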