Dot product vs. direct vector component sum in shaders - performance

I'm writing Cg shaders for advanced lighting calculations for a Unity-based game. Sometimes I need to sum all of a vector's components. There are two ways to do it:
Just write something like:
float sum = v.x + v.y + v.z;
Or do something like:
float sum = dot(v,float3(1,1,1));
I am really curious which is faster and which looks better style-wise.
If the same question were asked about CPU code, the answer would seem obvious: the first, simple way is better, because:
a) there is no need to create another float3(1,1,1) vector, and
b) there is no need to multiply every component of the original vector v by 1.
But since this is shader code running on the GPU, I believe there is good hardware support for the dot product, and perhaps the float3(1,1,1) constant is translated into no allocation at all.
float4 _someVector;
void surf (Input IN, inout SurfaceOutputStandard o){
float sum = _someVector.x + _someVector.y + _someVector.z + _someVector.w;
// VS
float sum2 = dot(_someVector, float4(1,1,1,1));
}

Check this link.
A Vec3 Dot has a cost of 3 cycles, while a scalar Add has a cost of 1.
Thus, on almost all platforms (AMD and NVIDIA):
float sum = v.x + v.y + v.z; has a cost of 2 (two scalar adds), while
float sum = dot(v,float3(1,1,1)); has a cost of 3.
The first implementation should be faster.

Implementation of the Dot product in cg: https://developer.download.nvidia.com/cg/dot.html
IMHO the difference is immeasurable in 98% of cases, but the first one should be faster, because multiplication is a "more expensive" operation.
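For reference, the Cg page linked above documents dot() as a plain component-wise sum of products, roughly along these lines (a sketch in Cg/HLSL syntax; dot3 is just an illustrative name for the built-in dot):
float dot3(float3 a, float3 b)
{
    // what the built-in dot() boils down to for float3
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
// dot(v, float3(1, 1, 1)) therefore expands to v.x*1 + v.y*1 + v.z*1:
// the same three adds as the manual sum plus three multiplies by 1,
// unless the compiler (or a hardware dot/MAD unit) folds them away.
So whether the dot() version costs anything extra in practice comes down to whether those multiplies by 1 get folded away, which is why the measured difference is usually negligible.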

Related

"Intrinsics" possible on GPU on OpenGL?

I had this idea for something "intrinsic-like" on OpenGL, but googling around brought no results.
So basically I have a compute shader that calculates the Mandelbrot set (each thread does one pixel). Part of my main function in GLSL looks like this:
float XR, XI, XR2, XI2, CR, CI;
uint i;
CR = float(minX + gl_GlobalInvocationID.x * (maxX - minX) / ResX);
CI = float(minY + gl_GlobalInvocationID.y * (maxY - minY) / ResY);
XR = 0;
XI = 0;
for (i = 0; i < MaxIter; i++)
{
XR2 = XR * XR;
XI2 = XI * XI;
XI = 2 * XR * XI + CI;
XR = XR2 - XI2 + CR;
if ((XR * XR + XI * XI) > 4.0)
{
break;
}
}
So my thought was to use vec4's instead of floats and thus do 4 calculations/pixels at once, hopefully getting a 4x speed boost (analogous to "real" CPU intrinsics). But my code seems to run MUCH slower than the float version. There are still some mistakes in there (if anyone would still like to see the code, please say so), but I don't think they are what slows down the code. Before I experiment for ages, can anybody tell me right away whether this endeavour is futile?
CPUs and GPUs work quite differently.
CPUs need explicit vectorization in the machine code, either coded manually by the programmer (through what you call 'CPU intrinsics') or automatically vectorized by the compiler.
GPUs, on the other hand, vectorize by means of running multiple invocations of your shader (aka kernel) on their cores in parallel.
AFAIK, on modern GPUs, additional vectorization within a thread is neither needed nor supported: instead of manufacturing a single core that can add 4 floats per clock (for example), it's more beneficial to have four times as many simpler cores, each of them able to add a single float per clock. This way you still get the same peak FLOPS for the entire chip, while at the same time enabling full utilization of the circuitry even when the individual shader code cannot be vectorized. The thing is that most code, by necessity, will have at least some scalar computations in it.
The bottom line is: it's likely that your code already squeezes as much out of the GPU as possible for this specific task.

How inefficient is my ray-box-intersection algorithm?

I am experimenting a little with shaders and with the calculation of a ray-box collision, which is done the following way:
inline bool hitsCube(in Ray ray, in Cube cube,
out float tMin, out float tMax,
out float3 signMin, out float3 signMax)
{
float3 biggerThan0 = ray.odir > 0; // ray.odir = (1.0/ray.dir)
float3 lessThan0 = 1.0f - biggerThan0;
float3 tMinXYZ = cube.center + biggerThan0 * cube.minSize + lessThan0 * cube.maxSize;
float3 tMaxXZY = cube.center + biggerThan0 * cube.maxSize + lessThan0 * cube.minSize;
float3 rMinXYZ = (tMinXYZ - ray.origin) * ray.odir;
float3 rMaxXYZ = (tMaxXZY - ray.origin) * ray.odir;
float minV = max(rMinXYZ.x, max(rMinXYZ.y, rMinXYZ.z));
float maxV = min(rMaxXYZ.x, min(rMaxXYZ.y, rMaxXYZ.z));
tMin = minV;
tMax = maxV;
signMin = (rMinXYZ == minV) * lessThan0; // important calculation for another algorithm, but no context provided here
signMax = (rMaxXYZ == maxV) * lessThan0;
return maxV > minV * (minV + maxV >= 0); // last multiplication makes sure the origin of the ray is outside the cube
}
Considering this function could be called inside an HLSL shader many, many times (for some pixels, let's say at least 200-300 times): is my implementation of the collision logic inefficient?
Not really an easily answerable "question", and hard to say without knowing everything else that's going on around it, but just a few random thoughts:
a) if you're really interested in knowing what this code would look like on the GPU, I'd suggest "porting" it to a CUDA kernel, then using CUDA to generate PTX and SASS for a modern GPU (say, sm_75 for Turing or sm_86 for Ampere); then compare two or three variants of it in the SASS output.
b) the "converting logic to multiplications" might give you less than you think - if the logic isn't too complicated there's a good chance you'll end up with a few predicates and not much warp divergence at all, so it might not be too bad. The only way to tell is to look at the PTX and/or SASS output; see (a).
c) your formulation of tMinXYZ/tMaxXYZ is (IMHO) unnecessarily complicated: just express it with min/max operations, which are really cheap on GPUs (see the sketch after these notes). Also see the respective chapter on ray/box intersection in the Ray Tracing Gems II book (which is free to download); that formulation is also more numerically stable, btw.
d) re "lags... is my logic inefficient": actual assembly "efficiency" will rarely have such gigantic effects; usually the culprit for noticeable "lags" is either memory stalls (hard to guess what's going on there) or something going horribly wrong for other reasons (see the next bullet).
e) just a hunch: I would check rays where some of the direction components are 0. In that case you're dividing by 0 (never a good idea), and in particular if the result gets multiplied by 0.f (which in your case can happen) you'll get NaNs, and since comparison with NaN is always false, you may end up with cases where your traversal logic always descends instead of skipping. Not the same as "efficiency" of your logic, but something to look out for. A good fix is to change any ray.dir component that's 0.f to 1e-6f or so.
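To make bullet (c) concrete, here is a minimal slab-test sketch (my own formulation, not code from the question). It assumes, like the question's code, that cube.minSize/cube.maxSize are the box extents relative to cube.center and that ray.odir = 1.0/ray.dir; it also drops the signMin/signMax outputs that the original function needs for its other algorithm:
struct Ray  { float3 origin; float3 dir; float3 odir; };
struct Cube { float3 center; float3 minSize; float3 maxSize; };

inline bool hitsCubeSlab(in Ray ray, in Cube cube, out float tMin, out float tMax)
{
    float3 boxMin = cube.center + cube.minSize;
    float3 boxMax = cube.center + cube.maxSize;
    float3 t0 = (boxMin - ray.origin) * ray.odir;
    float3 t1 = (boxMax - ray.origin) * ray.odir;
    float3 tNear = min(t0, t1); // per-component entry distances
    float3 tFar  = max(t0, t1); // per-component exit distances
    tMin = max(max(tNear.x, tNear.y), tNear.z);
    tMax = min(min(tFar.x, tFar.y), tFar.z);
    return tMax >= tMin && tMax >= 0.0f; // slabs overlap somewhere in front of the origin
}
The per-component min/max pair replaces the biggerThan0/lessThan0 selection logic and maps directly to cheap GPU min/max instructions.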

Gradient of a function in OpenCL

I'm playing around a bit with OpenCL and I have a problem which can be simplified as follows.
I'm sure this is a common problem, but I cannot find many references or examples that would show me how this is usually done.
Suppose, for example, you have a function (written in C-style syntax):
float function(float x1, float x2, float x3, float x4, float x5)
{
return sin(x1) + x1*cos(x2) + x3*exp(-x3) + x4 + x5;
}
I can also implement the gradient of this function as
void functionGradient(float x1, float x2, float x3, float x4, float x5, float gradient[])
{
gradient[0] = cos(x1) + cos(x2);
gradient[1] = -sin(x2);
gradient[2] = exp(-x3) - x3*exp(-x3);
gradient[3] = 1.0f;
gradient[4] = 1.0f;
}
Now I was thinking of implementing an OpenCL C kernel that does the same thing, because I want to speed this up. The only way I have in mind is to assign each work unit one component of the gradient, but then I'd need a bunch of if statements in the code to figure out which work unit is computing which component, which isn't good in general because of divergence.
So here is the question: how is such a problem tackled in general? I'm aware of gradient descent implementations on GPUs, e.g. machine learning with backpropagation, so I wonder what is generally done to avoid divergence in the code.
Follow up from suggestion
I'm thinking of a possible SIMD-compatible implementation as follows:
/*
Pseudo OpenCL-C code
here weight is a 5x5 array containing weights in {0,1} masking the relevant
computation
*/
__kernel void functionGradient(float x1, float x2, float x3, float x4, float x5, __global float* weight, __global float* gradient)
{
size_t threadId = get_global_id(0);
gradient[threadId] =
weight[5*threadId]*(cos(x1) + cos(x2)) +
weight[5*threadId + 1]*(-sin(x2)) +
weight[5*threadId + 2]*(exp(-x3) - x3*exp(-x3)) +
weight[5*threadId + 3] + weight[5*threadId + 4];
barrier(CLK_GLOBAL_MEM_FENCE);
}
If your gradient function only has 5 components, it does not make sense to parallelize it in such a way that one thread does one component. As you mentioned, GPU parallelization does not work if the mathematical structure of each component is different (multiple instructions, multiple data, MIMD).
If you needed to compute the 5-dimensional gradient at 100k different coordinates, however, then each thread would compute all 5 components for one coordinate and the parallelization would work efficiently.
In the backpropagation example, you have one gradient function with thousands of dimensions. In that case you would indeed parallelize the gradient function itself, such that one thread computes one component of the gradient. However, in that case all gradient components have the same mathematical structure (with different weighting factors in global memory), so branching is not required. Each gradient component is the same equation with different numbers (single instruction, multiple data, SIMD). GPUs are designed to only handle SIMD; this is also why they are so energy efficient (~30 TFLOPs @ 300 W) compared to CPUs (which can do MIMD, ~2-3 TFLOPs @ 150 W).
Finally, note that backpropagation / neural nets are specifically designed to be SIMD. Not every new algorithm you come across can be parallelized in this manner.
Coming back to your 5-dimensional gradient example: there are ways to make it SIMD-compatible without branching, specifically bit masking. You compute 2 cosines (for component 1 you express the sine through a cosine) and one exponent, and add all the terms up with a factor in front of each. The terms you don't need are multiplied by a factor of 0. Lastly, the factors are functions of the component ID. However, as mentioned above, this only makes sense if you have many thousands to millions of dimensions.
Edit: here is the SIMD-compatible version with bit masking:
kernel void functionGradient(const float x1, const float x2, const float x3, const float x4, const float x5, global float* gradient) {
    const uint gid = get_global_id(0);
    const float cosx1 = cos(x1);
    const float cosx2 = cos(x2+(gid==1)*1.5707964f); // cos(x2+pi/2) == -sin(x2) for component 1, plain cos(x2) otherwise
    const float expmx3 = exp(-x3);
    gradient[gid] = (gid==0)*cosx1 + (gid<=1)*cosx2 + (gid==2)*(expmx3-x3*expmx3) + (gid>=3);
}
Note that there is no additional global/local memory access and all the (mutually exclusive) weighting factors are functions of the global ID. Each thread computes exactly the same thing (2 cos, 1 exp and a few multiplications/additions) without any branching. Trigonometric functions / divisions take much more time than multiplications/additions, so as few of them as possible should be used, by pre-calculating terms.

Ray-triangle intersection

I saw that Fast Minimum Storage Ray/Triangle Intersection by Möller and Trumbore is frequently recommended.
The thing is, I don't mind pre-computing and storing any amount of data, as long as it speeds up the intersection.
So my question is: not caring about memory, what are the fastest methods of doing ray-triangle intersection?
Edit: I won't move the triangles, i.e. it is a static scene.
As others have mentioned, the most effective way to speed things up is to use an acceleration structure to reduce the number of ray-triangle intersections needed. That said, you still want your ray-triangle intersections to be fast. If you're happy to precompute stuff, you can try the following:
Convert your ray lines and your triangle edges to Plücker coordinates. This allows you to determine whether your ray line passes through a triangle at 6 multiply/adds per edge. You will still need to compare your ray start and end points against the triangle plane (at 4 multiply/adds per point) to make sure it actually hits the triangle.
The worst-case runtime expense is 26 multiply/adds total. Also, note that you only need to compute the ray/edge sign once per ray/edge combination, so if you're evaluating a mesh, you may be able to use each edge evaluation twice.
Also, these numbers assume everything is being done in homogeneous coordinates. You may be able to reduce the number of multiplications some by normalizing things ahead of time.
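For illustration, here is a sketch of the edge test described above (my own formulation of the standard Plücker side test, written in HLSL-style syntax; the ray-endpoint vs. triangle-plane comparison mentioned above is not shown):
struct PluckerLine
{
    float3 d; // direction part: q - p
    float3 m; // moment part: cross(p, q)
};

// Line through the points p and q (for a ray: p = origin, q = origin + direction).
PluckerLine makeLine(float3 p, float3 q)
{
    PluckerLine l;
    l.d = q - p;
    l.m = cross(p, q);
    return l;
}

// "Permuted inner product" of two lines: 6 multiplies and 5 adds per edge.
// Its sign tells on which side one line passes the other.
float permutedInnerProduct(PluckerLine a, PluckerLine b)
{
    return dot(a.d, b.m) + dot(a.m, b.d);
}

// The ray line crosses the triangle's interior when all three edge tests have
// the same sign (edges taken in consistent winding, e.g. v0->v1, v1->v2, v2->v0).
// The three edge lines can be precomputed and stored per triangle.
bool rayLineCrossesTriangle(PluckerLine ray, PluckerLine e0, PluckerLine e1, PluckerLine e2)
{
    float s0 = permutedInnerProduct(ray, e0);
    float s1 = permutedInnerProduct(ray, e1);
    float s2 = permutedInnerProduct(ray, e2);
    return (s0 >= 0 && s1 >= 0 && s2 >= 0) || (s0 <= 0 && s1 <= 0 && s2 <= 0);
}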
I have done a lot of benchmarks, and I can confidently say that the fastest (published) method is the one invented by Havel and Herout and presented in their paper Yet Faster Ray-Triangle Intersection (Using SSE4). Even without using SSE it is about twice as fast as Möller and Trumbore's algorithm.
My C implementation of Havel-Herout:
typedef struct {
vec3 n0; float d0;
vec3 n1; float d1;
vec3 n2; float d2;
} isect_hh_data;
void
isect_hh_pre(vec3 v0, vec3 v1, vec3 v2, isect_hh_data *D) {
vec3 e1 = v3_sub(v1, v0);
vec3 e2 = v3_sub(v2, v0);
D->n0 = v3_cross(e1, e2);
D->d0 = v3_dot(D->n0, v0);
float inv_denom = 1 / v3_dot(D->n0, D->n0);
D->n1 = v3_scale(v3_cross(e2, D->n0), inv_denom);
D->d1 = -v3_dot(D->n1, v0);
D->n2 = v3_scale(v3_cross(D->n0, e1), inv_denom);
D->d2 = -v3_dot(D->n2, v0);
}
inline bool
isect_hh(vec3 o, vec3 d, float *t, vec2 *uv, isect_hh_data *D) {
float det = v3_dot(D->n0, d);
float dett = D->d0 - v3_dot(o, D->n0);
vec3 wr = v3_add(v3_scale(o, det), v3_scale(d, dett));
uv->x = v3_dot(wr, D->n1) + det * D->d1;
uv->y = v3_dot(wr, D->n2) + det * D->d2;
float tmpdet0 = det - uv->x - uv->y;
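/* int_or_float is not shown in the answer; presumably it is a small punning
   union along the lines of: union { int i; float f; }, used to read each
   float's sign bit. The sign bits of (det - u - v), u and v (all still scaled
   by det) are combined below, and the hit is rejected unless all three share
   the same sign. */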
int pdet0 = ((int_or_float)tmpdet0).i;
int pdetu = ((int_or_float)uv->x).i;
int pdetv = ((int_or_float)uv->y).i;
pdet0 = pdet0 ^ pdetu;
pdet0 = pdet0 | (pdetu ^ pdetv);
if (pdet0 & 0x80000000)
return false;
float rdet = 1 / det;
uv->x *= rdet;
uv->y *= rdet;
*t = dett * rdet;
return *t >= ISECT_NEAR && *t <= ISECT_FAR;
}
One suggestion could be to implement an octree (http://en.wikipedia.org/wiki/Octree) to partition your 3D space into very fine blocks. The finer the partitioning, the more memory is required, but the more accurate the tree gets.
You still need to check ray/triangle intersections, but the idea is that the tree can tell you when you can skip the ray/triangle intersection, because the ray is guaranteed not to hit the triangle.
However, if you start moving your triangle around, you need to update the Octree, and then I'm not sure it's going to save you anything.
Found this article by Dan Sunday:
Based on a count of the operations done up to the first rejection test, this algorithm is a bit less efficient than the MT (Möller & Trumbore) algorithm, [...]. However, the MT algorithm uses two cross products whereas our algorithm uses only one, and the one we use computes the normal vector of the triangle's plane, which is needed to compute the line parameter rI. But, when the normal vectors have been precomputed and stored for all triangles in a scene (which is often the case), our algorithm would not have to compute this cross product at all. But, in this case, the MT algorithm would still compute two cross products, and be less efficient than our algorithm.
http://geomalgorithms.com/a06-_intersect-2.html

OpenCL for-loop doing strange things

I'm currently implementing terrain generation in OpenCL using layered octaves of noise and I've stumbled upon this problem:
float multinoise2d(float2 position, float scale, int octaves, float persistence)
{
float result = 0.0f;
float sample = 0.0f;
float coefficient = 1.0f;
for(int i = 0; i < octaves; i++){
// get a sample of a simple signed perlin noise
sample = sgnoise2d(position/scale);
if(i > 0){
// Here is the problem:
// Implementation A, this works correctly.
coefficient = pown(persistence, i);
// Implementation B, using this only the first
// noise octave is visible in the terrain.
coefficient = persistence;
persistence = persistence*persistence;
}
result += coefficient * sample;
scale /= 2.0f;
}
return result;
}
Does OpenCL parallelize for-loops, leading to synchronization issues here or am I missing something else?
Any help is appreciated!
The problem with your code is in the lines
coefficient = persistence;
persistence = persistence*persistence;
They should be changed to the single line
coefficient = coefficient * persistence;
With that change, the coefficient grows by just one factor of persistence per iteration, exactly like the first implementation:
pow(persistence, 1); pow(persistence, 2); pow(persistence, 3) ....
The second implementation as written instead goes
pow(persistence, 1); pow(persistence, 2); pow(persistence, 4); pow(persistence, 8) ......
so the exponent doubles every iteration. "persistence" soon runs out of float range (underflowing to zero for persistence < 1, overflowing for persistence > 1), and the higher octaves effectively disappear from your result.
EDIT
Two more things
Accumulation (implementation 2) is not a good idea, especially with real numbers and with algorithms that require accuracy. You might be losing a small fraction of your information every time you accumulate into "persistence" (e.g. due to rounding). Prefer direct calculation (the 1st implementation) over accumulation whenever you can. (Plus, the direct calculation has no loop-carried dependency, so it is the form that is readily parallelizable.)
If you are working with AMD OpenCL pay attention to the pow() functions. I have had problems with those on multiple machines on multiple occasions. The functions seem to hang sometimes for no reason. Just FYI.
I'm assuming this is some kind of utility method that is called in your CL kernel. Vivek is correct in his comment above: OpenCL does not parallelize your code for you. You have to leverage OpenCL's facilities for dividing your problem into data-parallel chunks.
Also, I don't see a potential synchronization issue in the above code. All of your variables are in work-item private memory space.
