OpenCL for-loop doing strange things

I'm currently implementing terrain generation in OpenCL using layered octaves of noise and I've stumbled upon this problem:
float multinoise2d(float2 position, float scale, int octaves, float persistence)
{
    float result = 0.0f;
    float sample = 0.0f;
    float coefficient = 1.0f;
    for(int i = 0; i < octaves; i++){
        // get a sample of a simple signed perlin noise
        sample = sgnoise2d(position/scale);
        if(i > 0){
            // Here is the problem:
            // Implementation A, this works correctly.
            coefficient = pown(persistence, i);
            // Implementation B, using this only the first
            // noise octave is visible in the terrain.
            coefficient = persistence;
            persistence = persistence*persistence;
        }
        result += coefficient * sample;
        scale /= 2.0f;
    }
    return result;
}
Does OpenCL parallelize for-loops, leading to synchronization issues here or am I missing something else?
Any help is appreciated!

The problem in your code is with the lines
    coefficient = persistence;
    persistence = persistence*persistence;
They should be changed to
    coefficient = coefficient * persistence;
With that change (or with implementation A) the coefficient gains one factor of persistence per iteration:
pow(persistence, 1); pow(persistence, 2); pow(persistence, 3); ...
Implementation B, however, squares "persistence" on every iteration, so the coefficient goes
pow(persistence, 1); pow(persistence, 2); pow(persistence, 4); pow(persistence, 8); ...
For a persistence below 1 this shrinks toward zero very quickly, so the later octaves contribute almost nothing and only the first ones remain visible. For a persistence above 1, "persistence" will soon exceed the range of float and overflow.
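To make the difference concrete, here is a small Python sketch (not OpenCL) that only reproduces the coefficient sequences of the loop. The function names are hypothetical and chosen for illustration; the noise sampling itself is left out.

```python
def coefficients_direct(persistence, octaves):
    """Implementation A: the coefficient for octave i is persistence**i."""
    return [persistence ** i for i in range(octaves)]


def coefficients_squaring(persistence, octaves):
    """Implementation B: persistence itself is squared after each use."""
    coeffs = []
    coefficient = 1.0
    for i in range(octaves):
        if i > 0:
            coefficient = persistence
            persistence = persistence * persistence
        coeffs.append(coefficient)
    return coeffs


def coefficients_running_product(persistence, octaves):
    """Suggested fix: multiply the running coefficient by persistence."""
    coeffs = []
    coefficient = 1.0
    for i in range(octaves):
        if i > 0:
            coefficient = coefficient * persistence
        coeffs.append(coefficient)
    return coeffs


if __name__ == "__main__":
    for fn in (coefficients_direct,
               coefficients_squaring,
               coefficients_running_product):
        print(fn.__name__, fn(0.5, 5))
```

With a persistence of 0.5, the fifth octave gets a weight of 0.0625 in the direct and running-product versions but only 0.00390625 in the squaring version, which matches the symptom that the later octaves vanish.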
EDIT
Two more things:
Accumulation (implementation B) is not a good idea, especially with floating-point numbers and with algorithms that require accuracy. You might lose a small fraction of your information every time you accumulate into "persistence" (e.g. due to rounding). Prefer direct calculation (implementation A) over accumulation whenever you can. (Plus, the direct form has no dependency between iterations, so if this were serial code it would be readily parallelizable.)
If you are working with AMD OpenCL, pay attention to the pow() functions. I have had problems with those on multiple machines on multiple occasions: the functions sometimes seem to hang for no apparent reason. Just FYI.

I'm assuming this is some kind of utility method that is called in your CL kernel. Vivek is correct in his comment above: OpenCL does not parallelize your code for you. You have to leverage OpenCL's facilities for dividing your problem into data-parallel chunks.
Also, I don't see a potential synchronization issue in the above code. All of your variables are in work-item private memory space.

Related

"Intrinsics" possible on GPU on OpenGL?

I had this idea for something "intrinsic-like" in OpenGL, but googling around brought no results.
So basically I have a Compute Shader for calculating the Mandelbrot set (each thread does one pixel). Part of my main-function in GLSL looks like this:
float XR, XI, XR2, XI2, CR, CI;
uint i;
CR = float(minX + gl_GlobalInvocationID.x * (maxX - minX) / ResX);
CI = float(minY + gl_GlobalInvocationID.y * (maxY - minY) / ResY);
XR = 0;
XI = 0;
for (i = 0; i < MaxIter; i++)
{
    XR2 = XR * XR;
    XI2 = XI * XI;
    XI = 2 * XR * XI + CI;
    XR = XR2 - XI2 + CR;
    if ((XR * XR + XI * XI) > 4.0)
    {
        break;
    }
}
So my thought was to use vec4s instead of floats, doing 4 calculations/pixels at once and hopefully getting a 4x speed boost (analogous to "real" CPU intrinsics). But my code seems to run MUCH slower than the float version. There are still some mistakes in there (if anyone would still like to see the code, please say so), but I don't think they are what slows the code down. Before I spend ages experimenting, can anybody tell me right away whether this endeavour is futile?
CPUs and GPUs work quite differently.
CPUs need explicit vectorization in the machine code, either coded manually by the programmer (through what you call "CPU intrinsics") or generated automatically by the compiler.
GPUs, on the other hand, vectorize by running multiple invocations of your shader (aka kernel) on their cores in parallel.
AFAIK, on modern GPUs, additional vectorization within a thread is neither needed nor supported: instead of manufacturing a single core that can add 4 floats per clock (for example), it's more beneficial to have four times as many simpler cores, each able to add a single float per clock. This way you still get the same peak FLOPS for the entire chip, while also enabling full utilization of the circuitry even when the individual shader code cannot be vectorized. The thing is that most code will, of necessity, contain at least some scalar computations.
The bottom line is: it's likely that your code already squeezes as much out of the GPU as is possible for this specific task.
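As a rough illustration of that execution model, here is a CPU-side Python sketch (not GLSL). The per-pixel function stays scalar, and the "vectorization" comes entirely from calling it once per pixel, which is the role the GPU dispatch plays. The names `escape_iterations` and `render` are hypothetical, and the nested list comprehension is only a sequential stand-in for the parallel invocations.

```python
def escape_iterations(cr, ci, max_iter):
    """Scalar escape-time loop for one pixel, mirroring the shader loop."""
    xr = 0.0
    xi = 0.0
    for i in range(max_iter):
        xr2 = xr * xr
        xi2 = xi * xi
        xi = 2.0 * xr * xi + ci
        xr = xr2 - xi2 + cr
        if xr * xr + xi * xi > 4.0:
            return i  # escaped at iteration i
    return max_iter  # never escaped within the iteration budget


def render(min_x, max_x, min_y, max_y, res_x, res_y, max_iter):
    """Stand-in for the dispatch: one independent call per pixel."""
    return [
        [
            escape_iterations(
                min_x + x * (max_x - min_x) / res_x,
                min_y + y * (max_y - min_y) / res_y,
                max_iter,
            )
            for x in range(res_x)
        ]
        for y in range(res_y)
    ]


if __name__ == "__main__":
    for row in render(-2.0, 1.0, -1.0, 1.0, 12, 6, 30):
        print(" ".join(f"{v:2d}" for v in row))
```

No pixel depends on any other, so the hardware is free to run many of these scalar loops side by side without any vec4 packing in the shader source.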

How inefficient is my ray-box-intersection algorithm?

I am experimenting a little with shaders and with computing ray-box collisions, which I do in the following way:
inline bool hitsCube(in Ray ray, in Cube cube,
                     out float tMin, out float tMax,
                     out float3 signMin, out float3 signMax)
{
    float3 biggerThan0 = ray.odir > 0; // ray.odir = (1.0/ray.dir)
    float3 lessThan0 = 1.0f - biggerThan0;
    float3 tMinXYZ = cube.center + biggerThan0 * cube.minSize + lessThan0 * cube.maxSize;
    float3 tMaxXYZ = cube.center + biggerThan0 * cube.maxSize + lessThan0 * cube.minSize;
    float3 rMinXYZ = (tMinXYZ - ray.origin) * ray.odir;
    float3 rMaxXYZ = (tMaxXYZ - ray.origin) * ray.odir;
    float minV = max(rMinXYZ.x, max(rMinXYZ.y, rMinXYZ.z));
    float maxV = min(rMaxXYZ.x, min(rMaxXYZ.y, rMaxXYZ.z));
    tMin = minV;
    tMax = maxV;
    signMin = (rMinXYZ == minV) * lessThan0; // important calculation for another algorithm, but no context provided here
    signMax = (rMaxXYZ == maxV) * lessThan0;
    return maxV > minV * (minV + maxV >= 0); // last multiplication makes sure the origin of the ray is outside the cube
}
Considering that this function could be called inside an HLSL shader many, many times (for some pixels, let's say at least 200 to 300 times): is my implementation of the collision logic inefficient?
Not really an easily answerable "question", and hard to say without knowing everything else that's going on around it, but here are a few random thoughts:
a) if you're really interested in knowing what this code would look like on the GPU, I'd suggest "porting" it to a CUDA kernel, then using CUDA to generate PTX and SASS for a modern GPU (say, sm75 for Turing or sm86 for Ampere), and then comparing two or three variants of it in the SASS output.
b) the "converting logic to multiplications" trick might give you less than you think. If the logic isn't too complicated, there's a good chance you end up with a few predicates and not much warp divergence at all, so it might not be too bad. The only way to tell is to look at the PTX and/or SASS output, see 'a'.
c) your formulation of tMinXYZ/tMaxXYZ is (IMHO) unnecessarily complicated: just express it with min/max operations, which are really cheap on GPUs. Also see the chapter "ray/box intersection" in the Ray Tracing Gems 2 book (which is free to download). That version is also more numerically stable, btw.
d) re "lags... is my logic inefficient": the actual assembly "efficiency" will rarely have such gigantic effects; usually the culprit for noticeable "lags" is either memory stalls (hard to guess what's going on), or something going horribly wrong for other reasons (see next bullet).
e) just a hunch: I would check rays where some of the direction components are 0. In that case you're dividing by 0 (never a good idea), and in particular if the result gets multiplied by 0.f (which in your case can happen) you'll get NaNs, and since "comparison with NaN is always false" you may end up with cases where your traversal logic always goes down instead of skipping. That is not the same as the "efficiency" of your logic, but something to look out for. A good fix is to always change each ray.dir component that's 0.f to 1e-6f or so.
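Points (c) and (e) can be sketched together as a plain min/max slab test. This is a Python illustration rather than HLSL, and it makes several assumptions that are not in the question: the box is given by hypothetical `box_min`/`box_max` corners instead of a center plus sizes, the `signMin`/`signMax` outputs are omitted, and a ray starting inside the box counts as a hit (the original function rejects that case).

```python
import math

EPS = 1e-6  # replacement for zero direction components, as suggested in (e)


def safe_inverse(d):
    """Reciprocal of a direction component, nudging exact zeros to EPS."""
    if d == 0.0:
        d = EPS
    return 1.0 / d


def hits_box(origin, direction, box_min, box_max):
    """Slab test for an axis-aligned box; returns (hit, t_near, t_far)."""
    t_near = -math.inf
    t_far = math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        inv = safe_inverse(d)
        t0 = (lo - o) * inv
        t1 = (hi - o) * inv
        # min/max sorts the two slab distances regardless of the sign of d
        t_near = max(t_near, min(t0, t1))
        t_far = min(t_far, max(t0, t1))
    # hit if the interval is non-empty and not entirely behind the origin
    hit = t_far >= max(t_near, 0.0)
    return hit, t_near, t_far


if __name__ == "__main__":
    box = ((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0))
    print(hits_box((-5.0, 0.0, 0.0), (1.0, 0.0, 0.0), *box))
    print(hits_box((-5.0, 5.0, 0.0), (1.0, 0.0, 0.0), *box))
```

Because the zero components are replaced before the division, no infinities or NaNs enter the comparisons, and the per-axis `min`/`max` replaces the sign masks used to pick the near and far planes.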

Dot product vs Direct vector components sum performance in shaders

I'm writing Cg shaders for advanced lighting calculations in a Unity-based game. Sometimes it is necessary to sum all components of a vector. There are two ways to do it:
Just write something like:
float sum = v.x + v.y + v.z;
Or do something like:
float sum = dot(v,float3(1,1,1));
I am really curious which is faster and which is the better code style.
If we asked the same question for CPU calculations, the first, simple way would obviously be much better, because:
a) There is no need to allocate another float3(1,1,1) vector
b) There is no need to multiply every component of the original vector "v" by 1.
But since this is shader code, which runs on the GPU, I believe there may be some hardware optimization for the dot product function, and perhaps the float3(1,1,1) will be translated into no allocation at all.
float4 _someVector;
void surf (Input IN, inout SurfaceOutputStandard o){
    float sum = _someVector.x + _someVector.y + _someVector.z + _someVector.w;
    // VS
    float sum2 = dot(_someVector, float4(1,1,1,1));
}
Check this link.
A Vec3 dot has a cost of 3 cycles, while a scalar add has a cost of 1.
Thus, on almost all platforms (AMD and NVIDIA):
float sum = v.x + v.y + v.z; has a cost of 2
float sum = dot(v,float3(1,1,1)); has a cost of 3
The first implementation should be faster.
Implementation of the Dot product in cg: https://developer.download.nvidia.com/cg/dot.html
IMHO the difference is immeasurable in 98% of cases, but the first one should be faster, because multiplication is a "more expensive" operation.

Coherent Spherical Noise?

Does anyone know how I might be able to generate the following kind of noise?
Three inputs, three outputs
The outputs must always result in a vector of the same magnitude
If it receives the same input as some other time, it must return the same output
It must be continuous (best if it appears smooth, like perlin noise)
It must appear to be fairly random
EDIT: It would also be nice if it were isotropic, but that's not entirely necessary.
I've found a way, and it might not be very fast, but it does the job (this is c-like pseudocode designed to make porting to other languages easy).
vec3 sphereNoise(vec3 input, float radius)
{
    vec3 result;
    result.x = simplex(input.x, input.y); //could use perlin instead of simplex
    result.y = simplex(input.y, input.z); //but I prefer simplex for its speed
    result.z = simplex(input.z, input.x); //and its lack of directional artifacts
    //uncomment the following line to make it a spherical-shell noise
    //result.normalize();
    result *= radius;
    return result;
}
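A runnable Python sketch of the same idea follows. The standard library has no simplex noise, so `smooth_noise` is a hypothetical stand-in (a product of sines, which is smooth and deterministic but not real coherent noise). The point is only to show the normalization step that fixes the output magnitude.

```python
import math


def smooth_noise(a, b):
    """Hypothetical stand-in for simplex(a, b): smooth, deterministic, in [-1, 1]."""
    return math.sin(1.7 * a + 0.3) * math.cos(2.3 * b - 1.1)


def sphere_noise(p, radius, on_shell=True):
    """Three inputs, three outputs; magnitude equals radius when on_shell is set."""
    x, y, z = p
    v = (smooth_noise(x, y), smooth_noise(y, z), smooth_noise(z, x))
    if on_shell:
        length = math.sqrt(sum(c * c for c in v))
        if length == 0.0:
            # Direction is undefined here; fall back to a fixed axis.
            v = (1.0, 0.0, 0.0)
        else:
            v = tuple(c / length for c in v)
    return tuple(radius * c for c in v)


if __name__ == "__main__":
    print(sphere_noise((0.1, 0.2, 0.3), 2.0))
```

One caveat with normalizing: wherever the three raw noise values pass close to zero at the same time, the normalized direction can change abruptly, so the result is continuous only away from those points.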

Implementing Bezier Curves

I am trying to implement Bezier curves for an assignment. I am trying to move a ball (using Bezier curves) by giving my function an array of key frames. The function should give me all the frames in between the key frames (or control points), but although I'm using the formula found on Wikipedia, it is not really working :s
Here's my code:
private void interpolate(){
    float x,y,b, t = 0;
    frames = new Frame[keyFrames.length];
    for(int i =0;i<keyFrames.length;++i){
        t+=0.001;
        b = Bint(i,keyFrames.length,t);
        x = b*keyFrames[i].x;
        y = b*keyFrames[i].y;
        frames[i] = new Frame(x,y);
    }
}
private float Bint(int i, int n, float t){
    float Cni = fact(n)/(fact(i) * fact(n-i));
    return Cni * pow(1-t,n-i) * pow(t,i);
}
Also I've noticed that the frames[] array should be much bigger but I can't find any other text which is more programmer friendly
Thanks in advance.
There are lots of things that don't look quite right here.
1. Doing it this way, your interpolation will pass exactly through the first and last control points, but not through the others. Is that what you want?
2. If you have lots of key frames, you're using a very-high-degree polynomial for your interpolation. Polynomials of high degree are notoriously badly behaved: you may get your position oscillating wildly in between the key frame positions. (This is one reason why the answer to question 1 should probably be no.)
3. Assuming for the sake of argument that you really do want to do this, your value of t should go from 0 at the start to 1 at the end. Do you happen to have exactly 1001 of these key frames? If not, you'll be doing the wrong thing.
4. Evaluating these polynomials with lots of calls to fact and pow is likely to be inefficient, especially if n is large.
I'm reluctant to go into much detail about what you should do without knowing more about the scope of your assignment -- it will do no one any good for Stack Overflow to do your homework for you! What have you already been told about Bezier curves? What exactly does your assignment ask you to do?
EDITED to add:
The simplest way to do interpolation using Bezier curves is probably this. Have one (cubic) Bezier curve between each pair of key-points. The endpoints (first and last control points) of each Bezier curve are those keypoints. You need two more control points. For motion to be smooth as you move through a given keypoint, you need (keypoint minus previous control point) = (next control point minus keypoint). So you're choosing a single vector at each keypoint, which will determine where the previous and subsequent control points go. As you move through each keypoint, you'll be moving in the direction of that vector, and the longer the vector is the faster you'll be moving. (If the vector is zero then your cubic Bezier degenerates into a simple straight-line path.)
Choosing that vector so that everything looks nice is highly nontrivial, but you probably aren't really being asked to do that at this stage. So something pretty simple will probably be good enough. You might, e.g., take the vector to be proportional to (next keypoint minus previous keypoint). You'll need to do something a bit different at the start and end of your path if you do that.
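The scheme described in the last two paragraphs can be sketched in Python as follows. The function names are hypothetical, and two choices are assumptions rather than part of the answer: the proportionality factor of 1/6 (which happens to give the Bezier form of a Catmull-Rom spline) and the handling of the path ends, where the missing neighbour is replaced by the endpoint itself.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at t in [0, 1]."""
    u = 1.0 - t
    b0, b1, b2, b3 = u * u * u, 3 * u * u * t, 3 * u * t * t, t * t * t
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))


def tangents(keys, scale=1.0 / 6.0):
    """One vector per keypoint, proportional to (next keypoint - previous keypoint).

    At the ends the missing neighbour is replaced by the endpoint itself.
    """
    last = len(keys) - 1
    result = []
    for i in range(len(keys)):
        prev_k = keys[max(i - 1, 0)]
        next_k = keys[min(i + 1, last)]
        result.append(tuple(scale * (b - a) for a, b in zip(prev_k, next_k)))
    return result


def interpolate(keys, steps_per_segment):
    """Sample a path through all keypoints, one cubic Bezier per pair."""
    tan = tangents(keys)
    path = []
    for i in range(len(keys) - 1):
        p0, p3 = keys[i], keys[i + 1]
        # keypoint - previous control point == next control point - keypoint
        p1 = tuple(a + v for a, v in zip(p0, tan[i]))
        p2 = tuple(a - v for a, v in zip(p3, tan[i + 1]))
        for s in range(steps_per_segment):
            path.append(cubic_bezier(p0, p1, p2, p3, s / steps_per_segment))
    path.append(tuple(float(c) for c in keys[-1]))
    return path


if __name__ == "__main__":
    for point in interpolate([(0, 0), (1, 2), (3, 1)], 4):
        print(point)
```

Each segment starts and ends exactly on a keypoint, and because both neighbouring segments share the same vector at that keypoint, the direction of motion does not jump there.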
I finally got what I needed! Here's what I did:
private void interpolate() {
    float t = 0;
    float x,y,b;
    for(int f =0;f<frames.length;f++) {
        x=0;
        y=0;
        for(int i = 0; i<keyFrames.length; i++) {
            b = Bint(i,keyFrames.length-1,map(t,0,time,0,1));
            x += b*keyFrames[i].x;
            y += b*keyFrames[i].y;
        }
        frames[f] = new Frame(x,y);
        t+=partialTime;
    }
}
private void createInterpolationData() {
    time = keyFrames[keyFrames.length-1].time -
           keyFrames[0].time;
    noOfFrames = 60*time;
    partialTime = time/noOfFrames;
    frames = new Frame[ceil(noOfFrames)];
}
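For comparison, the same approach (every frame is a Bernstein-weighted sum over all control points, with the degree being the number of control points minus one) can be written as a short Python sketch. The names are hypothetical. Unlike the code above, this version spaces t so that the last frame lands exactly on t = 1, which is a design choice of the sketch and not something the original does.

```python
from math import comb


def bernstein(i, n, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * (1.0 - t) ** (n - i) * t ** i


def bezier_point(control_points, t):
    """Point on the Bezier curve of degree len(control_points) - 1."""
    n = len(control_points) - 1
    x = sum(bernstein(i, n, t) * p[0] for i, p in enumerate(control_points))
    y = sum(bernstein(i, n, t) * p[1] for i, p in enumerate(control_points))
    return x, y


def sample_curve(control_points, frame_count):
    """frame_count samples with t running from 0 to 1 inclusive."""
    if frame_count < 2:
        raise ValueError("need at least two frames to span t = 0..1")
    return [bezier_point(control_points, f / (frame_count - 1))
            for f in range(frame_count)]


if __name__ == "__main__":
    for frame in sample_curve([(0, 0), (1, 2), (2, 0)], 5):
        print(frame)
```

As noted in the answer, the curve passes through the first and last control points only; the middle ones merely pull it toward them.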

Resources