"Intrinsics" possible on GPU on OpenGL? - performance

I had this idea for something "intrinsic-like" on OpenGL, but googling around brought no results.
So basically I have a Compute Shader for calculating the Mandelbrot set (each thread does one pixel). Part of my main-function in GLSL looks like this:
float XR, XI, XR2, XI2, CR, CI;
uint i;
CR = float(minX + gl_GlobalInvocationID.x * (maxX - minX) / ResX);
CI = float(minY + gl_GlobalInvocationID.y * (maxY - minY) / ResY);
XR = 0;
XI = 0;
for (i = 0; i < MaxIter; i++)
{
    XR2 = XR * XR;
    XI2 = XI * XI;
    XI = 2 * XR * XI + CI;
    XR = XR2 - XI2 + CR;
    if ((XR * XR + XI * XI) > 4.0)
    {
        break;
    }
}
So my thought was to use vec4s instead of floats, doing 4 calculations/pixels at once and hopefully getting a 4x speed boost (analogous to "real" CPU intrinsics). But my code seems to run MUCH slower than the float version. There are still some mistakes in there (if anyone would still like to see the code, please say so), but I don't think they are what slows down the code. Before I try around for ages, can anybody tell me right away whether this endeavour is futile?

CPUs and GPUs work quite differently.
CPUs need explicit vectorization in the machine code, either coded manually by the programmer (through what you call "CPU intrinsics") or generated automatically by the compiler.
GPUs, on the other hand, vectorize by means of running multiple invocations of your shader (aka kernel) on their cores in parallel.
AFAIK, on modern GPUs, additional vectorization within a thread is neither needed nor supported: instead of manufacturing a single core that can add 4 floats per clock (for example), it's more beneficial to have four times as many simpler cores, each able to add a single float per clock. This way you still get the same peak FLOPS for the entire chip, while keeping the circuitry fully utilized even when the individual shader code cannot be vectorized. The thing is that most code will, by necessity, have at least some scalar computations in it.
The bottom line is: it's likely that your scalar code already squeezes about as much out of the GPU as is possible for this specific task.
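To make the "one invocation per pixel" point concrete, here is a minimal C++ sketch of the scalar work a single invocation performs, mirroring the loop in the question. The function name `mandelbrot_iterations` and its parameters are my own and not part of the original shader; on the GPU, many such scalar invocations run side by side, which is where the parallelism comes from.

```cpp
// Scalar escape-time iteration for one pixel (i.e. one shader invocation).
// cr/ci are the pixel's coordinates in the complex plane; max_iter plays the
// role of MaxIter in the question. Name and signature are hypothetical.
unsigned mandelbrot_iterations(float cr, float ci, unsigned max_iter) {
    float xr = 0.0f, xi = 0.0f;
    unsigned i = 0;
    for (; i < max_iter; ++i) {
        float xr2 = xr * xr;
        float xi2 = xi * xi;
        xi = 2.0f * xr * xi + ci;
        xr = xr2 - xi2 + cr;
        if (xr * xr + xi * xi > 4.0f) break;  // point escaped
    }
    return i;  // equals max_iter when the point never escaped
}
```

A point inside the set such as (0, 0) runs the full `max_iter` iterations, while a point far outside such as (2, 2) escapes on the first one.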


How inefficient is my ray-box-intersection algorithm?

I am experimenting a little bit with shaders and with calculating a ray-box collision, which I do the following way:
inline bool hitsCube(in Ray ray, in Cube cube,
                     out float tMin, out float tMax,
                     out float3 signMin, out float3 signMax)
{
    float3 biggerThan0 = ray.odir > 0; // ray.odir = (1.0/ray.dir)
    float3 lessThan0 = 1.0f - biggerThan0;
    float3 tMinXYZ = cube.center + biggerThan0 * cube.minSize + lessThan0 * cube.maxSize;
    float3 tMaxXZY = cube.center + biggerThan0 * cube.maxSize + lessThan0 * cube.minSize;
    float3 rMinXYZ = (tMinXYZ - ray.origin) * ray.odir;
    float3 rMaxXYZ = (tMaxXZY - ray.origin) * ray.odir;
    float minV = max(rMinXYZ.x, max(rMinXYZ.y, rMinXYZ.z));
    float maxV = min(rMaxXYZ.x, min(rMaxXYZ.y, rMaxXYZ.z));
    tMin = minV;
    tMax = maxV;
    signMin = (rMinXYZ == minV) * lessThan0; // important calculation for another algorithm, but no context provided here
    signMax = (rMaxXYZ == maxV) * lessThan0;
    return maxV > minV * (minV + maxV >= 0); // last multiplication makes sure the origin of the ray is outside the cube
}
Considering this function could be called inside an HLSL shader many, many times (for some pixels, let's say at least 200-300 times): is my implementation of the collision logic inefficient?
Not really an easily answerable "question", and hard to say without knowing everything else that's going on around it, but here are a few random thoughts:
a) if you're really interested in knowing what this code would look like on the GPU, I'd suggest "porting" it to a CUDA kernel, then using CUDA to generate PTX and SASS for a modern GPU (say, sm75 for Turing or sm86 for Ampere); then compare two or three variants of that in the SASS output.
b) the "converting logic to multiplications" might give you less than you think - if the logic isn't too complicated there's a good chance you might end up with a few predicates and not much warp divergence at all, so it might not be too bad. The only way to tell is to look at the PTX and/or SASS output, see 'a'.
c) your formulation of tMinXYZ/tMaxXYZ is (IMHO) unnecessarily complicated: just express it with min/max operations, which are really cheap on GPUs. Also see the respective chapter "ray/box intersection" in the Ray Tracing Gems II book (which is free to download). It's also more numerically stable, btw.
d) re "lags... is my logic inefficient" - actual assembly "efficiency" will rarely have such gigantic effects; usually the culprit for noticeable "lags" is either memory stalls (hard to guess what's going on), or something going horribly wrong for other reasons (see next bullet).
e) just a hunch: I would check rays where some of the direction components are 0. In this case you're dividing by 0 (never a good idea), and in particular if this gets multiplied with 0.f (which in your case can happen) you'll get NaNs, and since "comparison with NaN is always false" you may end up with cases where your traversal logic always goes down instead of skipping. Not the same as "efficiency" of your logic, but something to look out for. A good fix is to always change each ray.dir component that's 0.f to 1e-6f or so.

OpenCL crash when calling finish()

I am writing an OpenCL app on a Mac using C++, and it crashes in certain cases depending on the work size.
The program crashes due to a SIGABRT.
Is there any way to get more information about the error?
Why is SIGABRT being raised? Can I catch it?
I realize that this program is a doozie, however I will try to explain it in case anyone would like to take a stab at it.
Through debugging I discovered that the cause of the SIGABRT was one of the kernels timing out.
The program is a tile-based 3D renderer. It is an OpenCL implementation of this algorithm: https://github.com/ssloy/tinyrenderer
The screen is divided into 8x8 tiles. One of the kernels (the tiler) computes which polygons overlap each tile, storing the results in a data structure called tilePolys. A subsequent kernel (the rasterizer), which runs one work item per tile, iterates over the list of polys occupying the tile and rasterizes them.
The tiler writes to an integer buffer which is a list of lists of polygon indices. Each list is of a fixed size (polysPerTile + 1 for the count) where the first element is the count and the subsequent polysPerTile elements are indices of polygons in the tile. There is one such list per tile.
For some reason in certain cases the tiler writes a very large poly count (13172746) to one of the tile's lists in tilePolys. This causes the rasterizer to loop for a long time and time out.
The strange thing is that the index to which the large count is written is never accessed by the tiler.
The code for the tiler kernel is below:
// this kernel is executed once per polygon
// it computes which tiles are occupied by the polygon and adds the index of the polygon to the list for that tile
kernel void tiler(
// number of polygons
ulong nTris,
// width of screen
int width,
// height of screen
int height,
// number of tiles in x direction
int tilesX,
// number of tiles in y direction
int tilesY,
// number of pixels per tile (tiles are square)
int tileSize,
// size of the polygon list for each tile
int polysPerTile,
// 4x4 matrix representing the viewport
global const float4* viewport,
// vertex positions
global const float* vertices,
// indices of vertices
global const int* indices,
// array of array-lists of polygons per tile
// structure of list is an int representing the number of polygons covering that tile,
// followed by [polysPerTile] integers representing the indices of the polygons in that tile
// there are [tilesX*tilesY] such arraylists
volatile global int* tilePolys)
{
    size_t faceInd = get_global_id(0);
    // compute vertex position in viewport space
    float3 vs[3];
    for(int i = 0; i < 3; i++) {
        // indices are vertex/uv/normal
        int vertInd = indices[faceInd*9+i*3];
        float4 vertHomo = (float4)(vertices[vertInd*4], vertices[vertInd*4+1], vertices[vertInd*4+2], vertices[vertInd*4+3]);
        vertHomo = vec4_mul_mat4(vertHomo, viewport);
        vs[i] = vertHomo.xyz / vertHomo.w;
    }
    float2 bboxmin = (float2)(INFINITY,INFINITY);
    float2 bboxmax = (float2)(-INFINITY,-INFINITY);
    // size of screen
    float2 clampCoords = (float2)(width-1, height-1);
    // compute bounding box of triangle in screen space
    for (int i=0; i<3; i++) {
        for (int j=0; j<2; j++) {
            bboxmin[j] = max(0.f, min(bboxmin[j], vs[i][j]));
            bboxmax[j] = min(clampCoords[j], max(bboxmax[j], vs[i][j]));
        }
    }
    // transform bounding box to tile space
    int2 tilebboxmin = (int2)(bboxmin[0] / tileSize, bboxmin[1] / tileSize);
    int2 tilebboxmax = (int2)(bboxmax[0] / tileSize, bboxmax[1] / tileSize);
    // loop over all tiles in bounding box
    for(int x = tilebboxmin[0]; x <= tilebboxmax[0]; x++) {
        for(int y = tilebboxmin[1]; y <= tilebboxmax[1]; y++) {
            // get index of tile
            int tileInd = y * tilesX + x;
            // get start index of polygon list for this tile
            int counterInd = tileInd * (polysPerTile + 1);
            // get current number of polygons in list
            int numPolys = atomic_inc(&tilePolys[counterInd]);
            // if list is full, skip tile
            if(numPolys >= polysPerTile) {
                // decrement the count because we will not add to the list
                atomic_dec(&tilePolys[counterInd]);
            } else {
                // otherwise add the poly to the list
                // the index is the offset + numPolys + 1 as tilePolys[counterInd] holds the poly count
                int ind = counterInd + numPolys + 1;
                tilePolys[ind] = (int)(faceInd);
            }
        }
    }
}
My theories are that either:
I have incorrectly implemented the atomic functions for reading and incrementing the count
I am using an incorrect number format causing garbage to be written into tilePolys
One of my other kernels is inadvertently writing into the tilePolys buffer
I do not think it is the last one though because if instead of writing faceInd to tilePolys, I write a constant value, the large poly count disappears.
tilePolys[counterInd+numPolys+1] = (int)(faceInd); // this is the problem line
tilePolys[counterInd+numPolys+1] = (int)(5); // this fixes the issue
It looks like your kernel is crashing on the GPU itself. You can't really get any extra diagnostics about that directly, at least not on macOS. You'll need to start narrowing down the problem. Some suggestions:
As the crash is currently happening in clFinish() you don't know what asynchronous command is causing the crash. Try switching all your enqueue calls to blocking mode. This should cause it to crash in the call that's actually going wrong.
Check return/error codes on all OpenCL API calls. Sometimes, ignoring an error from an earlier call can cause problems in a later call which relies on earlier results. For example, if creating a buffer fails, passing the result of that buffer creation as a kernel argument will cause problems when trying to run the kernel.
The most likely reason for the crash is that your OpenCL kernel is accessing memory out of bounds or is otherwise misusing pointers. Re-check any array index calculations.
Check if the problem occurs with smaller work batches. Scale up from one workgroup (or work item if not using groups) and see if it only occurs beyond a certain work size. This may give you a clue about buffer sizes and array indices that might be causing the crash.
Systematically comment out parts of your kernel. If the crash goes away if you comment out a specific piece of code, there's a good chance the problem is in that code.
If you've narrowed the problem down to a small area of code but can't work out where it's coming from, start recording diagnostic output to check that variables have the values you're expecting.
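As one way to act on the "re-check any array index calculations" suggestion, the list-of-lists layout described in the question can be mirrored on the host and exercised with explicit bounds checks. The following C++ sketch is only an illustration under that assumption: `TileLists`, `try_append` and `count` are hypothetical names, and `std::atomic` stands in for the OpenCL `atomic_inc`/`atomic_dec` built-ins.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Host-side mirror of the tilePolys layout: for every tile, one count slot
// followed by polysPerTile polygon-index slots. All names are hypothetical.
struct TileLists {
    int tiles;
    int polysPerTile;
    std::vector<std::atomic<int>> data;

    TileLists(int tiles_, int polysPerTile_)
        : tiles(tiles_), polysPerTile(polysPerTile_),
          data(static_cast<std::size_t>(tiles_) * (polysPerTile_ + 1)) {
        for (auto& slot : data) slot.store(0);  // zero counts and slots explicitly
    }

    // Returns false instead of writing when the tile index is out of range
    // or the tile's list is already full.
    bool try_append(int tileInd, int polyInd) {
        if (tileInd < 0 || tileInd >= tiles) return false;
        const std::size_t counterInd =
            static_cast<std::size_t>(tileInd) * (polysPerTile + 1);
        const int numPolys = data[counterInd].fetch_add(1);
        if (numPolys >= polysPerTile) {
            data[counterInd].fetch_sub(1);  // list full: undo the increment
            return false;
        }
        data[counterInd + numPolys + 1].store(polyInd);
        return true;
    }

    int count(int tileInd) const {
        return data[static_cast<std::size_t>(tileInd) * (polysPerTile + 1)].load();
    }
};
```

Feeding such a mirror the same tile indices the kernel computes (for example, read back through a debug buffer) would show whether any index falls outside `tilesX*tilesY` before it is ever used as an offset.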
Without seeing any code, I can't give you any more specific advice than that.
Note that OpenCL is deprecated on macOS, so if you're specifically targeting that platform and don't need to support Linux, Windows, etc. I recommend learning Metal Compute instead. Apple has made it clear that this is the GPU programming platform they want to support, and the tooling for it is already much better than their OpenCL tooling ever was.
I suspect Apple will eventually stop implementing OpenCL support when they release a Mac with a new type of GPU, so even if you're targeting the Mac as well as other platforms, you will probably need to switch to Metal on the Mac somewhere down the line anyway. As of macOS 10.14, the minimum system requirements of the OS already include a Metal-capable GPU, so you only need OpenCL as a fallback if you wish to support all Mac models able to run 10.13 or an even older OS version.

Dot product vs Direct vector components sum performance in shaders

I'm writing Cg shaders for advanced lighting calculations in a Unity-based game. Sometimes I need to sum all components of a vector. There are two ways to do it:
Just write something like:
float sum = v.x + v.y + v.z;
Or do something like:
float sum = dot(v,float3(1,1,1));
I am really curious about which one is faster and which is better code style.
It's obvious that if we asked the same question for CPU calculations, the first, simple way would be much better, because:
a) There is no need to allocate another float3(1,1,1) vector
b) There is no need to multiply every component of the original vector "v" by 1.
But since we do it in shader code, which runs on the GPU, I believe there is some great hardware optimization for the dot product function, and maybe the allocation of float3(1,1,1) will be translated into no allocation at all.
float4 _someVector;
void surf (Input IN, inout SurfaceOutputStandard o){
    float sum = _someVector.x + _someVector.y + _someVector.z + _someVector.w;
    // VS
    float sum2 = dot(_someVector, float4(1,1,1,1));
}
Check this link.
Vec3 Dot has a cost of 3 cycles, while Scalar Add has a cost of 1.
Thus, on almost all platforms (AMD and NVIDIA):
float sum = v.x + v.y + v.z; has a cost of 2
float sum = dot(v,float3(1,1,1)); has a cost of 3
The first implementation should be faster.
Implementation of the Dot product in cg: https://developer.download.nvidia.com/cg/dot.html
IMHO the difference is immeasurable in 98% of cases, but the first one should be faster, because multiplication is a "more expensive" operation.
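For reference, the two variants can be written out in C++ to make the operation counts behind those cost figures visible. This is only a sketch: `Float3` and the function names are hypothetical, and whether a given shader compiler folds away the multiplications by 1 is not something stated above.

```cpp
struct Float3 { float x, y, z; };  // hypothetical stand-in for float3

// Direct sum: two additions.
inline float sum_components(Float3 v) { return v.x + v.y + v.z; }

// Dot product as written: three multiplications and two additions.
inline float dot(Float3 a, Float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Sum expressed through the dot product with a vector of ones.
inline float sum_via_dot(Float3 v) { return dot(v, Float3{1.0f, 1.0f, 1.0f}); }
```

Both functions return the same value for the same input; the only difference is the extra multiplications in the second one.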

OpenCL for-loop doing strange things

I'm currently implementing terrain generation in OpenCL using layered octaves of noise and I've stumbled upon this problem:
float multinoise2d(float2 position, float scale, int octaves, float persistence)
{
    float result = 0.0f;
    float sample = 0.0f;
    float coefficient = 1.0f;
    for(int i = 0; i < octaves; i++){
        // get a sample of a simple signed perlin noise
        sample = sgnoise2d(position/scale);
        if(i > 0){
            // Here is the problem:
            // Implementation A, this works correctly.
            coefficient = pown(persistence, i);
            // Implementation B, using this only the first
            // noise octave is visible in the terrain.
            coefficient = persistence;
            persistence = persistence*persistence;
        }
        result += coefficient * sample;
        scale /= 2.0f;
    }
    return result;
}
Does OpenCL parallelize for-loops, leading to synchronization issues here or am I missing something else?
Any help is appreciated!
The problem with your code is in the lines
coefficient = persistence;
persistence = persistence*persistence;
They should be changed to
coefficient = coefficient * persistence;
With that change (as with Implementation A), the coefficient grows by one factor of persistence on every iteration:
pow(persistence, 1); pow(persistence, 2); pow(persistence, 3); ...
Implementation B as posted, however, goes
pow(persistence, 1); pow(persistence, 2); pow(persistence, 4); pow(persistence, 8); ...
so "persistence" soon leaves the usable range of a float: it underflows toward zero when persistence < 1 (or overflows when persistence > 1), and the later octaves contribute nothing useful to your answer.
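The two sequences are easy to compare side by side. The C++ sketch below reproduces only the coefficient logic from the question (the noise sampling is left out); the function names are mine.

```cpp
#include <vector>

// Coefficients from the corrected running product: 1, p, p^2, p^3, ...
std::vector<float> coefficients_fixed(float persistence, int octaves) {
    std::vector<float> out;
    float coefficient = 1.0f;
    for (int i = 0; i < octaves; ++i) {
        if (i > 0) coefficient = coefficient * persistence;
        out.push_back(coefficient);
    }
    return out;
}

// Coefficients from Implementation B as posted: 1, p, p^2, p^4, p^8, ...
std::vector<float> coefficients_impl_b(float persistence, int octaves) {
    std::vector<float> out;
    float coefficient = 1.0f;
    for (int i = 0; i < octaves; ++i) {
        if (i > 0) {
            coefficient = persistence;
            persistence = persistence * persistence;  // squares every iteration
        }
        out.push_back(coefficient);
    }
    return out;
}
```

With persistence = 0.5 and four octaves, the corrected version yields 1, 0.5, 0.25, 0.125, while Implementation B yields 1, 0.5, 0.25, 0.0625 and falls off much faster from there on.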
Two more things
Accumulation (implementation 2) is not a good idea, especially with real numbers and with algorithms that require accuracy. You might lose a small fraction of your information every time you accumulate on "persistence" (e.g. due to rounding). Prefer direct calculation (1st implementation) over accumulation whenever you can. (Plus, unlike the serial accumulation, the 1st implementation has no dependency between iterations and is readily parallelizable.)
If you are working with AMD OpenCL pay attention to the pow() functions. I have had problems with those on multiple machines on multiple occasions. The functions seem to hang sometimes for no reason. Just FYI.
I'm assuming this is some kind of utility method that is called in your CL kernel. Vivek is correct in his comment above: OpenCL does not parallelize your code for you. You have to leverage OpenCL's facilities for dividing your problem into data-parallel chunks.
Also, I don't see a potential synchronization issue in the above code. All of your variables are in work-item private memory space.

Algorithm to control acceleration until a position is reached

I have a point that moves (in one dimension), and I need it to move smoothly. So I think that its velocity has to be a continuous function, and I need to control the acceleration and then calculate its velocity and position.
The algorithm isn't obvious to me, but I guess this must be a common problem; I just can't find the solution.
The final destination of the object may change while it's moving and the movement needs to be smooth anyway.
I guess that a naive implementation would produce bouncing, and I need to avoid that.
This is a perfect candidate for using a "critically damped spring".
Conceptually you attach the point to the target point with a spring, or piece of elastic. The spring is damped so that you get no 'bouncing'. You can control how fast the system reacts by changing a constant called the "SpringConstant". This is essentially how strong the piece of elastic is.
Basically you apply two forces to the position, then integrate this over time. The first force is that applied by the spring, Fs = SpringConstant * DistanceToTarget. The second is the damping force, Fd = -CurrentVelocity * 2 * sqrt( SpringConstant ).
The CurrentVelocity forms part of the state of the system, and can be initialised to zero.
In each step, you multiply the sum of these two forces by the time step. This gives you the change of the value of the CurrentVelocity. Multiply this by the time step again and it will give you the displacement.
We add this to the actual position of the point.
In C++ code:
float CriticallyDampedSpring( float a_Target,
                              float a_Current,
                              float & a_Velocity,
                              float a_TimeStep )
{
    float currentToTarget = a_Target - a_Current;
    float springForce = currentToTarget * SPRING_CONSTANT;
    float dampingForce = -a_Velocity * 2 * sqrt( SPRING_CONSTANT );
    float force = springForce + dampingForce;
    a_Velocity += force * a_TimeStep;
    float displacement = a_Velocity * a_TimeStep;
    return a_Current + displacement;
}
In the systems I was working with, a value of around 5 was a good starting point for experimenting with the spring constant. Setting it too high results in too fast a reaction; setting it too low makes the point react too slowly.
Note, you might be best to make a class that keeps the velocity state rather than have to pass it into the function over and over.
I hope this is helpful, good luck :)
EDIT: In case it's useful for others, it's easy to apply this to 2 or 3 dimensions. In this case you can just apply the CriticallyDampedSpring independently once for each dimension. Depending on the motion you want you might find it better to work in polar coordinates (for 2D), or spherical coordinates (for 3D).
I'd do something like Alex Deem's answer for trajectory planning, but with limits on force and velocity:
In pseudocode:
xtarget: target position
vtarget: target velocity*
x: object position
v: object velocity
dt: timestep
F = Ki * (xtarget-x) + Kp * (vtarget-v);
F = clipMagnitude(F, Fmax);
v = v + F * dt;
v = clipMagnitude(v, vmax);
x = x + v * dt;
clipMagnitude(y, ymax):
    r = magnitude(y) / ymax
    if (r <= 1)
        return y;
    return y * (1/r);
where Ki and Kp are tuning constants, Fmax and vmax are maximum force and velocity. This should work for 1-D, 2-D, or 3-D situations (magnitude(y) = abs(y) in 1-D, otherwise use vector magnitude).
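For the 1-D case, the pseudocode above translates fairly directly into C++. This is a sketch rather than a tuned controller: the `State` struct and the function names are my own, and the gains and limits are left entirely to the caller.

```cpp
#include <cmath>

// 1-D version of clipMagnitude: limit |y| to y_max while keeping the sign.
inline float clip_magnitude(float y, float y_max) {
    const float r = std::fabs(y) / y_max;
    return (r <= 1.0f) ? y : y / r;
}

struct State { float x; float v; };  // position and velocity (hypothetical type)

// One time step of the force- and velocity-limited controller.
// ki weights the position error and kp the velocity error, as in the pseudocode.
inline State step(State s, float x_target, float v_target,
                  float ki, float kp, float f_max, float v_max, float dt) {
    float f = ki * (x_target - s.x) + kp * (v_target - s.v);
    f = clip_magnitude(f, f_max);
    s.v = clip_magnitude(s.v + f * dt, v_max);
    s.x = s.x + s.v * dt;
    return s;
}
```

Calling `step` once per frame with the current target keeps the velocity continuous even when the target changes mid-motion, because only the force term reacts to the new target.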
It's not quite clear exactly what you're after, but I'm going to assume the following:
There is some maximum acceleration;
You want the object to have stopped moving when it reaches the destination;
Unlike velocity, you do not require acceleration to be continuous.
Let A be the maximum acceleration (by which I mean the acceleration is always between -A and A).
The equation you want is
v_f^2 = v_i^2 + 2 a d
where v_f = 0 is the final velocity, v_i is the initial (current) velocity, and d is the distance to the destination (when you switch from acceleration A to acceleration -A -- that is, from speeding up to slowing down; here I'm assuming d is positive).
Setting v_f = 0 and a = -A and solving for d gives
d = v_i^2 / (2A)
as the stopping distance (the negatives cancel).
If the current distance remaining is greater than d, speed up as quickly as possible. Otherwise, begin slowing down.
Let's say you update the object's position every t_step seconds. Then:
new_position = old_position + old_velocity * t_step + (1/2) * a * t_step^2
new_velocity = old_velocity + a * t_step
If the destination is between new_position and old_position (i.e., the object reached its destination in between updates), simply set new_position = destination.
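Put together, one update of this scheme might look like the following C++ sketch. The `Motion` struct and the function name are hypothetical. The sketch also handles a destination on either side of the object, and it zeroes the velocity when snapping to the destination, which is an assumption of mine on top of the description above.

```cpp
#include <cmath>

struct Motion { float position; float velocity; };  // hypothetical state type

// One update: accelerate toward the destination at max_accel (A) until the
// remaining distance drops to the stopping distance v^2 / (2A), then brake.
inline Motion bang_bang_step(Motion m, float destination,
                             float max_accel, float t_step) {
    const float remaining = destination - m.position;
    const float toward = (remaining >= 0.0f) ? 1.0f : -1.0f;
    const float stopping = m.velocity * m.velocity / (2.0f * max_accel);
    // Brake only when already moving toward the destination and close enough.
    const bool moving_toward = m.velocity * toward > 0.0f;
    const float a = (moving_toward && std::fabs(remaining) <= stopping)
                        ? -toward * max_accel
                        : toward * max_accel;

    Motion next;
    next.position = m.position + m.velocity * t_step + 0.5f * a * t_step * t_step;
    next.velocity = m.velocity + a * t_step;

    // Destination lies between the old and new position: snap to it.
    if ((m.position - destination) * (next.position - destination) <= 0.0f) {
        next.position = destination;
        next.velocity = 0.0f;  // assumption: come to rest at the destination
    }
    return next;
}
```

Because the acceleration jumps between A and -A, the velocity stays continuous while the acceleration does not, which matches the assumptions listed at the start of this answer.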
You need an easing formula, which you would call at a set interval, passing in the time elapsed, start point, end point and duration you want the animation to be.
Doing time-based calculations will account for slow clients and other random hiccups. Since it calculates on the time elapsed vs. the time in which it has to complete, it will account for slow intervals between calls when returning how far along your point should be in the animation.
The jquery.easing plugin has a ton of easing functions you can look at.
I've found it best to pass in 0 and 1 as my start and end point, since it will return a floating point between the two, you can easily apply it to the real value you are modifying using multiplication.
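As an illustration of the 0-to-1 approach, here is a small C++ sketch using a quadratic ease-in-out curve. The curve is a common choice that I picked for the example, not code taken from jquery.easing, and all function names are hypothetical.

```cpp
#include <algorithm>

// Fraction of the animation that should be complete, clamped to [0, 1].
inline float progress(float elapsed, float duration) {
    return std::min(1.0f, std::max(0.0f, elapsed / duration));
}

// Quadratic ease-in-out on a normalized input t in [0, 1]:
// slow start, fast middle, slow end.
inline float ease_in_out_quad(float t) {
    if (t < 0.5f) return 2.0f * t * t;
    const float u = -2.0f * t + 2.0f;
    return 1.0f - (u * u) / 2.0f;
}

// Apply the eased fraction to the real value range.
inline float animate(float start, float end, float elapsed, float duration) {
    return start + (end - start) * ease_in_out_quad(progress(elapsed, duration));
}
```

Because `progress` depends only on the elapsed time, a late or skipped update simply lands further along the curve instead of slowing the animation down.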
