For a noise shader I'm looking for a pseudo random number algorithm with 3d vector argument, i.e.,
for every integer vector it returns a value in [0,1].
It should be as fast as possible without producing visual artifacts and giving the same results on every GPU.
Two variants (pseudo code) I found are
rand1(vec3 (x,y,z)){
return xorshift32(x ^ xorshift32(y ^ xorshift32(z)));
}
which already uses 20 arithmetic operations and still has to be casted and normalized and
rand2(vec3 v){
return fract(sin(dot(v, vec3(12.9898, 78.233, ?))) * 43758.5453);
};
which might be faster but uses sin causing precision problems and different results on different GPU's.
Do you know any other algorithms requiring less arithmetic operations?
Thanks in advance.
Edit: Another standard PRNG is XORWOW as implemented in C as
xorwow() {
int x = 123456789, y = 362436069, z = 521288629, w = 88675123, v = 5783321, d = 6615241;
int t = (x ^ (x >> 2));
x = y;
y = z;
z = w;
w = v;
v = (v ^ (v << 4)) ^ (t ^ (t << 1));
return (d += 362437) + v;
}
Can we rewrite it to fit in our context?
I've been trying to create a generalized Gradient Noise generator (which doesn't use the hash method to get gradients). The code is below:
class GradientNoise {
std::uint64_t m_seed;
std::uniform_int_distribution<std::uint8_t> distribution;
const std::array<glm::vec2, 4> vector_choice = {glm::vec2(1.0, 1.0), glm::vec2(-1.0, 1.0), glm::vec2(1.0, -1.0),
glm::vec2(-1.0, -1.0)};
public:
GradientNoise(uint64_t seed) {
m_seed = seed;
distribution = std::uniform_int_distribution<std::uint8_t>(0, 3);
}
// 0 -> 1
// just passes the value through, origionally was perlin noise activation
double nonLinearActivationFunction(double value) {
//return value * value * value * (value * (value * 6.0 - 15.0) + 10.0);
return value;
}
// 0 -> 1
//cosine interpolation
double interpolate(double a, double b, double t) {
double mu2 = (1 - cos(t * M_PI)) / 2;
return (a * (1 - mu2) + b * mu2);
}
double noise(double x, double y) {
std::mt19937_64 rng;
//first get the bottom left corner associated
// with these coordinates
int corner_x = std::floor(x);
int corner_y = std::floor(y);
// then get the respective distance from that corner
double dist_x = x - corner_x;
double dist_y = y - corner_y;
double corner_0_contrib; // bottom left
double corner_1_contrib; // top left
double corner_2_contrib; // top right
double corner_3_contrib; // bottom right
std::uint64_t s1 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y) + m_seed);
std::uint64_t s2 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y + 1) + m_seed);
std::uint64_t s3 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y + 1) + m_seed);
std::uint64_t s4 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y) + m_seed);
// each xy pair turns into distance vector from respective corner, corner zero is our starting corner (bottom
// left)
rng.seed(s1);
corner_0_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y});
rng.seed(s2);
corner_1_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y - 1});
rng.seed(s3);
corner_2_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y - 1});
rng.seed(s4);
corner_3_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y});
double u = nonLinearActivationFunction(dist_x);
double v = nonLinearActivationFunction(dist_y);
double x_bottom = interpolate(corner_0_contrib, corner_3_contrib, u);
double x_top = interpolate(corner_1_contrib, corner_2_contrib, u);
double total_xy = interpolate(x_bottom, x_top, v);
return total_xy;
}
};
I then generate an OpenGL texture to display with like this:
int width = 1024;
int height = 1024;
unsigned char *temp_texture = new unsigned char[width*height * 4];
double octaves[5] = {2,4,8,16,32};
for( int i = 0; i < height; i++){
for(int j = 0; j < width; j++){
double d_noise = 0;
d_noise += temp_1.noise(j/octaves[0], i/octaves[0]);
d_noise += temp_1.noise(j/octaves[1], i/octaves[1]);
d_noise += temp_1.noise(j/octaves[2], i/octaves[2]);
d_noise += temp_1.noise(j/octaves[3], i/octaves[3]);
d_noise += temp_1.noise(j/octaves[4], i/octaves[4]);
d_noise/=5;
uint8_t noise = static_cast<uint8_t>(((d_noise * 128.0) + 128.0));
temp_texture[j*4 + (i * width * 4) + 0] = (noise);
temp_texture[j*4 + (i * width * 4) + 1] = (noise);
temp_texture[j*4 + (i * width * 4) + 2] = (noise);
temp_texture[j*4 + (i * width * 4) + 3] = (255);
}
}
Which give good results:
But gprof is telling me that the Mersenne twister is taking up 62.4% of my time and growing with larger textures. Nothing else individual takes any where near as much time. While the Mersenne twister is fast after initialization, the fact that I initialize it every time I use it seems to make it pretty slow.
This initialization is 100% required for this to make sure that the same x and y generates the same gradient at each integer point (so you need either a hash function or seed the RNG each time).
I attempted to change the PRNG to both the linear congruential generator and Xorshiftplus, and while both ran orders of magnitude faster, they gave odd results:
LCG (one time, then running 5 times before using)
Xorshiftplus
After one iteration
After 10,000 iterations.
I've tried:
Running the generator several times before utilizing output, this results in slow execution or simply different artifacts.
Using the output of two consecutive runs after initial seed to seed the PRNG again and use the value after wards. No difference in result.
What is happening? What can i do to get faster results that are of the same quality as the mersenne twister?
OK BIG UPDATE:
I don't know why this works, I know it has something to do with the prime number utilized, but after messing around a bit, it appears that the following works:
Step 1, incorporate the x and y values as seeds separately (and incorporate some other offset value or additional seed value with them, this number should be a prime/non trivial factor)
Step 2, Use those two seed results into seeding the generator again back into the function (so like geza said, the seeds made were bad)
Step 3, when getting the result, instead of using modulo number of items (4) trying to get, or & 3, modulo the result by a prime number first then apply & 3. I'm not sure if the prime being a mersenne prime matters or not.
Here is the result with prime = 257 and xorshiftplus being used! (note I used 2048 by 2048 for this one, the others were 256 by 256)
LCG is known to be inadequate for your purpose.
Xorshift128+'s results are bad, because it needs good seeding. And providing good seeding defeats the whole purpose of using it. I don't recommend this.
However, I recommend using an integer hash. For example, one from Bob's page.
Here's a result of the first hash of that page, it looks OK to me, and it is fast (I think it is much faster than Mersenne Twister):
Here's the code I've written to generate this:
#include <cmath>
#include <stdio.h>
unsigned int hash(unsigned int a) {
a = (a ^ 61) ^ (a >> 16);
a = a + (a << 3);
a = a ^ (a >> 4);
a = a * 0x27d4eb2d;
a = a ^ (a >> 15);
return a;
}
unsigned int ivalue(int x, int y) {
return hash(y<<16|x)&0xff;
}
float smooth(float x) {
return 6*x*x*x*x*x - 15*x*x*x*x + 10*x*x*x;
}
float value(float x, float y) {
int ix = floor(x);
int iy = floor(y);
float fx = smooth(x-ix);
float fy = smooth(y-iy);
int v00 = ivalue(iy+0, ix+0);
int v01 = ivalue(iy+0, ix+1);
int v10 = ivalue(iy+1, ix+0);
int v11 = ivalue(iy+1, ix+1);
float v0 = v00*(1-fx) + v01*fx;
float v1 = v10*(1-fx) + v11*fx;
return v0*(1-fy) + v1*fy;
}
unsigned char pic[1024*1024];
int main() {
for (int y=0; y<1024; y++) {
for (int x=0; x<1024; x++) {
float v = 0;
for (int o=0; o<=9; o++) {
v += value(x/64.0f*(1<<o), y/64.0f*(1<<o))/(1<<o);
}
int r = rint(v*0.5f);
pic[y*1024+x] = r;
}
}
FILE *f = fopen("x.pnm", "wb");
fprintf(f, "P5\n1024 1024\n255\n");
fwrite(pic, 1, 1024*1024, f);
fclose(f);
}
If you want to understand, how a hash function work (or better yet, which properties a good hash have), check out Bob's page, for example this.
You (unknowingly?) implemented a visualization of PRNG non-random patterns. That looks very cool!
Except Mersenne Twister, all your tested PRNGs do not seem fit for your purpose. As I have not done further tests myself, I can only suggest to try out and measure further PRNGs.
The randomness of LCGs are known to be sensitive to the choice of their parameters. In particular, the period of a LCG is relative to the m parameter - at most it will be m (your prime factor) & for many values it can be less.
Similarly, the careful parameters selection is required to get a long period from Xorshift PRNGs.
You've noted that some PRNGs give good procedural generation results while other do not. In order to isolate the cause, I would factor out the proc gen stuff & examine the PRNG output directly. An easy way to visualize the data is to build a grey scale image where each pixel value is a (possibly scaled) random value. For image based stuff, I find this to be an easy way to find stuff that may lead to visual artifacts. Any artifacts you see with this are likely to cause issues with your proc gen output.
Another option is to try something like the Diehard tests. If the aforementioned image test failed to reveal any problems, I might use this just to be sure my PRNG techniques were trustworthy.
Note that your code seeds the PRNG, then generates one pseudorandom number from the PRNG. The reason for the nonrandomness in xorshift128+ that you discovered is that xorshift128+ simply adds the two halves of the seed (and uses the result mod 264 as the generated number) before changing its state (review its source code). This makes that PRNG considerably different from a hash function.
What you see is the practical demonstration of quality of PRNG. Mersenne Twister is one of the best PRNGs with good performance, it passes DIEHARD tests. One should know that generating a random numbers is not an easy computational task, so looking for a better performance will inevitably result in poor quality. LCG is known to be simplest and worst PRNG ever designed and it clearly shows two-dimensional correlation as in your picture. The quality of Xorshift generators largely depend on bitness and parameters. They are definitely worse than Mersenne Twister, but some (xorshift128+) may work good enough to pass BigCrush battery of TestU01 tests.
In other words, if you are making an important physical modelling numerical experiment, you better continue to use Mersenne Twister as known to be a good trade-off between speed and quality and it comes in many standard libraries. On a less important case you may try to use xorshift128+ generator. For an ultimate results you need to use cryptographical-quality PRNG (none of mentioned here may be used for cryptographical purposes).
So I need to solve for the linear system (A + i * mu * I) x = b, where A is dense Hermitian matrix (6x6 complex numbers), mu is a real scalar, and I is identity matrix.
Obviously if mu=0, I should just use Cholesky and be done with it. With non-zero mu though, the matrix ceases to be Hermitian and Cholesky fails.
Possible solutions:
Solve normal operator using Cholesky and multiply by the conjugate
Solve directly using LU decomposition
This is in a time-critical performance routine, where I need the most efficient method. Any thoughts on the optimum approach, or if there is a specific method for solving the above shifted Hermitian system?
This is to be deployed in a CUDA kernel, where I'll be solving many linear systems in parallel, e.g., one per thread. This means that I need a solution that minimizes thread divergence. Given the small system size, pivoting can be ignored without too much issue: this removes a possible source of thread divergence. I've already implemented an in-place Cholesky normal method, and while it's working ok, the performance isn't great in double precision.
I can't vouch for the stability of the method below, but if your matrix is reasonably well conditioned, it might be worth a try.
We want to solve
A*X = B
If we pick out the first row and column, say
A = ( a y )
( z A_ )
X = ( x )
( X_)
B = ( b )
( B_ )
The requirement is
a*x + y*X_ = b
z*x + A_*X_ = B_
so
x = (b - y*X_ )/a
(A_ - zy/a) * X_ = B_ - (b/a)z
The solution goes in two stages. First use the second equation to transform A and b, then use the second to form the solution x.
In C:
static void nhsol( int dim, complx* A, complx* B, complx* X)
{
int i, j, k;
complx a, fb, fa;
complx* z;
complx* acol;
// update A and B
for( i=0; i<dim; ++i)
{ z = A + i*dim;
a = z[i];
// update B
fb = B[i]/a;
for( j=i+1; j<dim; ++j)
{ B[j] -= fb*z[j];
}
// update A
for( k=i+1; k<dim; ++k)
{ acol = A + k*dim;
fa = acol[i]/a;
for( j=i+1; j<dim; ++j)
{ acol[j] -= fa*z[j];
}
}
}
// compute x
i = dim-1;
X[i] = B[i] / A[i+dim*i];
while( --i>=0)
{
complx s = B[i];
for( j=i+1; j<dim; ++j)
{ s -= A[i+j*dim]*X[j];
}
X[i] = s/A[i+i*dim];
}
}
where
typedef _Complex double complx;
If code space is not at a premuim it might be worth unrolling the loops. Personally I would do this by writing a program whose sole job was to write the code.
I've taken code from "Midpoint displacement algorithm example", cleaned it up a bit, and resuited it to work as a 1D linear terrain generator. Below is my new version of the doMidpoint() method:
public boolean setMidpointDisplacement(int x1, int x2) {
// Exit recursion if points are next to eachother
if (x2 - x1 < 2) {
return false;
}
final int midX = (x1 + x2) / 2;
final int dist = x2 - x1;
final int distHalf = dist / 2;
final int y1 = map[x1];
final int y2 = map[x2];
final int delta = random.nextInt(dist) - distHalf; // +/- half the distance
final int sum = y1 + y2;
map[midX] = (sum + delta) / 2; // Sets the midpoint
// Divide and repeat
setMidpointDisplacement(x1, midX);
setMidpointDisplacement(midX, x2);
return true;
}
The code seems to work well and produces workable terrain (you can see how I've tested it, with a rudimentary GUI)
After reading "Generating Random Fractal Terrain" and "Mid Point Displacement Algorithm", my question is:
How can I identify the 'roughness constant' implicitly utilized by this code? And then, how can I change it?
Additionally, and this may or may not be directly related to my major question, but I've noticed that the code adds the sum of the y-values to the "delta" (change amount) and divides this by 2 -- although this is the same as averaging the sum and then adding delta/2. Does this have any bearing on the 'roughness constant'? I'm thinking that I could do
map[midX] = sum/2 + delta/K;
and K would now be representative of the 'roughness constant', but I'm not sure if this is accurate or not, since it seems to allow me to control smoothing but doesn't directly control "how much the random number range is reduced each time through the loop" as defined by "Generating Random Fractal Terrain".
Like I've said before, I ported the 2D MDP noise generator I found into a 1D version -- but I'm fairly certain I did it accurately, so that is not the source of any problems.
How can I identify the 'roughness constant' implicitly utilized by this code?
In the cited, roughness is the amount you diminish the max random displacement. As your displacement is random.nextInt(dist) = dist*random.nextDouble(), your dist = x2-x1 and you go from one recursion step to the other with half of this dist, it follows that the roughness == 1 (in the cited terminology)
And then, how can I change it?
public boolean setMidpointDisplacement(int x1, int x2, int roughness) {
// Exit recursion if points are next to eachother
if (x2 - x1 < 2) {
return false;
}
// this is 2^-roughness as per cited
// you can pass it precalculated as a param, using it as such here
// is only to put it into a relation with the cited
double factor=1.0/(1<<roughness);
final int midX = (x1 + x2) / 2;
final int dist = x2 - x1;
final int distHalf = dist / 2;
final int y1 = map[x1];
final int y2 = map[x2];
// and you apply it here. A cast will be necessary though
final int delta = factor*(random.nextInt(dist) - distHalf); // +/- half the distance
final int sum = y1 + y2;
map[midX] = (sum + delta) / 2; // Sets the midpoint
// Divide and repeat
setMidpointDisplacement(x1, midX, roughness);
setMidpointDisplacement(midX, x2, roughness);
return true;
}
Additionally, and this may or may not be directly related to my major question, but I've noticed that the code adds the sum of the y-values to the "delta" (change amount) and divides this by 2
Their way has the advantage of doing it with a single division. As you work with ints, the accumulated truncation errors will be smaller with a single div (not to mention slightly faster).
I'm currently working on processing an image (2D array) that is read in as a binary file and should be 512x4096 with CUDA. My question is how am I to handle the specific indices (with respect to blocks and threads) when everything is really stored as a 1D array. As an example, I'm trying to create a function that circshifts everythin to the right by . My code is
//CircShift, without scaling
__global__ void circShift(cufftComplex* input, cufftComplex* output, int numK, int numA) {
int w = blockDim.x * blockIdx.x + threadIdx.x;
int h = blockDim.y * blockIdx.y + threadIdx.y;
if ((h < numA) && (w < numK)) {
int idx = h * numK + w;
if (w+1<numK) {
output[h*numK+w+1].x = input[idx].x;
output[h*numK+w+1].y = input[idx].y;
} else {
output[h*numK].x = input[idx].x;
output[h*numK].y = input[idx].y;
}
}
}
My current blockdim is (256, 64) and my threaddim is (16,8).
I am wondering if this is the correct way to implement something like this in terms of indexing. Would my w be equivalent to the column #, and would my h be equivalent to the row number. Let us say I am using something like python and load the image as a 2D array M. Is indexing in CUDA via h*numK+w correct for trying to access M[h][w].
Your indexing seems correct assuming the image matrix has a row major ordering: w will give you the absolute x coordinate in the image (or column number) and h*numK+w is equivalent to y * imageWidth + x which is a common idiom in accessing cells in CUDA. Just make sure your entire image grid is covered (you just wrote 512x4096, I assume 4096 is the image width).
As a sidenote you should access your cufftComplex elements once per thread instead of duplicating your operator[] access code (also as a refactoring point).