GLSL shader algorithm optimization - performance

Is there any way to optimize the following algorithm to make it faster, even if it is just a small speed increase?
const mat3 factor = mat3(1.0, 1.0, 1.0, 2.112, 1.4, 0.0, 0.0, 2.18, -2.21);
vec3 calculate(in vec2 coord)
{
vec3 sample = vec3(texture2D(texture_a, coord).r,
                   texture2D(texture_b, coord).ra - 0.5);
return (factor * sample) * 2.15;
}

The only significant optimization I can think of is to pack texture_a and texture_b into a single three-channel texture, if you can. That saves you one of the two texture lookups, which are most likely to be the bottleneck here.
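A minimal sketch of what that could look like, assuming a hypothetical packed texture texture_ab whose .r channel stores texture_a's red channel and whose .gb channels store texture_b's .ra:

uniform sampler2D texture_ab; // hypothetical packed texture: .r = texture_a.r, .gb = texture_b.ra

vec3 calculate(in vec2 coord)
{
    // One lookup instead of two; the -0.5 offset still applies to the last two channels.
    vec3 sample = texture2D(texture_ab, coord).rgb - vec3(0.0, 0.5, 0.5);
    return (factor * sample) * 2.15;
}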

@Thomas' answer is the most helpful, since texture lookups are the most expensive part, so use his solution if it is possible in your application. If you already sample those textures somewhere else, better to pass the values in as parameters to avoid duplicate lookups.
Otherwise I don't know if it can be optimized that much, but here are some straightforward things that come to mind.
Compiler optimizations:
Add the const keyword to the coord parameter, and if possible to sample too.
Add the f suffix to each float literal.
Maybe manually expand the matrix multiplication:
I don't know if it's faster, because I don't know how the matrix multiplication is implemented, but since the constant factor matrix contains many ones and zeros it can maybe be written out by hand.
vec3 calculate(const in vec2 coord)
{
    // not 100% sure if this initialization is possible
    const vec3 sample = vec3(texture2D(texture_a, coord).r,
                             texture2D(texture_b, coord).ra - 0.5f);
    vec3 result = vec3(sample.y);
    result.x += sample.x + sample.z;
    result.y += 2.112f * sample.x;
    result.z *= 2.18f;
    result.z -= 2.21f * sample.z;
    return result;
}

Related

Rendering to custom FrameBuffer using same texture both as input and output

Some fragment shaders on ShaderToy (e.g. the fluid dynamics one at https://www.shadertoy.com/view/4tGfDW ) use the same buffer as both input and output. But when I try to do this in my C/C++ code it does not work (it renders strange checkerboard artifacts, like inconsistent video memory). To work around this issue I have to use two different FrameBuffers A and B and flip the textures (first render A to B, then render B back to A).
I understand that OpenGL does not allow using the same texture both as input and output (?) due to memory consistency issues.
But isn't there a more elegant solution than using two FrameBuffers? E.g. using some lock, or a temporary cache (some synchronization flag which takes care of this)?
EDIT - Details to answer the comment/question:
OpenGL (depending the GL version) has some very specific rules of what
can and can't be done when the same texture is used as render target
and sampler input. If your use case can be implemented within this set
of requirements or not is not clear, as you have not explained what
exactly you need or want to do here.
Basically I want to implement a fluid dynamics solver (e.g. the one from the ShaderToy linked above) as well as other partial differential equation solvers. That means each pixel's output depends on some convolution mask (derivative, Laplacian, average) of the neighboring pixels. There may also be some movement (advection), which means reading values from distant pixels.
Currently I have realized that the artifacts appear mostly when I read/write pixels which are in different places - i.e. it is non-local (e.g. pixel[100,100] depends on pixel[10,10]).
Example of a simple fluid solver from Shadertoy:
vec4 solveFluid(sampler2D smp, vec2 uv, vec2 w, float time, vec3 mouse, vec3 lastMouse)
{
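// Note: dt, USE_VORTICITY_CONFINEMENT and VORTICITY_AMOUNT are constants/defines
// declared elsewhere in the linked Shadertoy; they are not part of this snippet.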
const float K = 0.2;
const float v = 0.55;
vec4 data = textureLod(smp, uv, 0.0);
vec4 tr = textureLod(smp, uv + vec2(w.x , 0), 0.0);
vec4 tl = textureLod(smp, uv - vec2(w.x , 0), 0.0);
vec4 tu = textureLod(smp, uv + vec2(0 , w.y), 0.0);
vec4 td = textureLod(smp, uv - vec2(0 , w.y), 0.0);
vec3 dx = (tr.xyz - tl.xyz)*0.5;
vec3 dy = (tu.xyz - td.xyz)*0.5;
vec2 densDif = vec2(dx.z ,dy.z);
data.z -= dt*dot(vec3(densDif, dx.x + dy.y) ,data.xyz); //density
vec2 laplacian = tu.xy + td.xy + tr.xy + tl.xy - 4.0*data.xy;
vec2 viscForce = vec2(v)*laplacian;
data.xyw = textureLod(smp, uv - dt*data.xy*w, 0.).xyw; //advection
vec2 newForce = vec2(0);
data.xy += dt*(viscForce.xy - K/dt*densDif + newForce); //update velocity
data.xy = max(vec2(0), abs(data.xy)-1e-4)*sign(data.xy); //linear velocity decay
#ifdef USE_VORTICITY_CONFINEMENT
data.w = (tr.y - tl.y - tu.x + td.x);
vec2 vort = vec2(abs(tu.w) - abs(td.w), abs(tl.w) - abs(tr.w));
vort *= VORTICITY_AMOUNT/length(vort + 1e-9)*data.w;
data.xy += vort;
#endif
data.y *= smoothstep(.5,.48,abs(uv.y-0.5)); //Boundaries
data = clamp(data, vec4(vec2(-10), 0.5 , -10.), vec4(vec2(10), 3.0 , 10.));
return data;
}
Currently I have realized that the artifacts appear mostly when I read/write pixels which are in different places - i.e. it is non-local (e.g. pixel[100,100] depends on pixel[10,10])
Yes, this is never going to work on GPUs, as there are no particular guarantees on the order of individual fragment shader invocations whatsoever. So whether the invocation writing to pixel [100,100] will see the results of the invocation writing to [10,10] or the original data is totally random. As per the spec, you are getting undefined values when reading in such a concurrent read/write scenario, so theoretically you could even get neither one nor the other, but see partial writes or totally different values (although that's not likely to occur on real-world hardware).
And order guarantees on such a scale simply do not make sense within the render pipeline, so there is also no practical means of synchronization you could manually add to solve this issue.
To work around this issue I have to use two different FrameBuffers A and B and flip the textures (first render A to B, then render B back to A)
Yes, the ping-pong approach is what you should do for this use case. And honestly, it should not incur any significant performance penalty in this scenario anyway, as you seem to write to each output pixel once per pass, so you don't need an additional copy of the "untouched" pixels. All it costs is the additional memory.

Shader - Unexpected behaviour when dividing with a high value

I have this line:
gl_FragColor = vec4(worldPos.x / maxX, worldPos.z / maxZ, 1.0, 1.0);
Where worldPos.x and worldPos.z go from 0 to 19900. maxX and maxZ are float uniforms. It works as expected when maxX and maxZ are set to 5000.0 (a gradient to white, and above 5000 it's all white), but when maxX and maxZ are set to 19900.0 it all turns blue. Why is that, and how do I get around it? Hardcoding the values doesn't make a difference, i.e.:
gl_FragColor = vec4(worldPos.x / 5000.0, worldPos.z / 5000.0, 1.0, 1.0);
works as expected while:
gl_FragColor = vec4(worldPos.x / 19900.0, worldPos.z / 19900.0, 1.0, 1.0);
makes it all blue. This only happens on some devices and not on others.
Update:
Adding the highp modifier (as suggested by Michael below) solved it for one device, but when testing on another it didn't make any difference. Then I tried to do the division on the CPU (also suggested by Michael) like this:
in java, before passing it as uniform:
float maxX = 1.0f / 19900.0f;
float maxZ = 1.0f / 19900.0f;
program.setUniformf(maxXUniform, maxX);
program.setUniformf(maxZUniform, maxZ);
in shader:
uniform float maxX;
uniform float maxZ;
...
gl_FragColor = vec4(worldPos.x * maxX, worldPos.z * maxZ, 1.0, 1.0);
...
Final solution:
This still didn't cut it. Now the values were too small, so when passed in to the shader they turned into 0 due to the low float precision. Then I tried multiplying by 100 before passing them in, and multiplying by 0.01 inside the shader.
in java:
float maxX = 100.0f / 19900.0f;
float maxZ = 100.0f / 19900.0f;
program.setUniformf(maxXUniform, maxX);
program.setUniformf(maxZUniform, maxZ);
in shader:
uniform float maxX;
uniform float maxZ;
...
gl_FragColor = vec4(worldPos.x * 0.01 * maxX, worldPos.z * 0.01 * maxZ, 1.0, 1.0);
...
And that solved the problem. Now the highp modifier isn't needed. Maybe it isn't the prettiest solution, but it's efficient and robust.
I guess you're running OpenGL ES? Well, the floating point precision sucks on many, usually quite old, devices. I had similar issues on several occasions when implementing cascaded shadow mapping in shaders for mobile hardware.
Make sure you use the highp qualifier for those variables. (Note: that might not solve the issue, but it is worth a try.)
Another possible solution: don't perform the division in the shader. That's quite a heavy operation for many old and weak implementations anyway. Try to avoid division, sqrt() and pow(). Run a shader profiler and you will be surprised to find out how heavy those ops are! (The iOS emulator on Mac has a nice shader profiler.) Try to pass the results directly as uniforms. I am not sure that would be a problem in your case, as I can't see any of these variables bound to per-fragment execution.
And if it still doesn't help, then usually there is nothing you can do about it. That's the old hardware/GLSL implementation issue. But I am sure that if you calculate that on the CPU and upload the results as uniforms, it should solve the issue.
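For reference, a minimal sketch of the highp suggestion (using the maxX/maxZ uniforms from the question); the GL_FRAGMENT_PRECISION_HIGH macro guards against ES 2.0 devices whose fragment shaders don't support highp at all:

#ifdef GL_FRAGMENT_PRECISION_HIGH
precision highp float;   // all floats default to high precision where the device supports it
#else
precision mediump float; // fall back on devices without highp support in fragment shaders
#endif

uniform float maxX;      // picks up the default precision declared above
uniform float maxZ;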

Why do people use sqrt(dot(distanceVector, distanceVector)) over OpenGL's distance function?

When using ShaderToy I often see people using something like:
vec2 uv = fragCoord / iResolution.xy;
vec2 centerPoint = vec2(0.5);
vec2 distanceVector = uv - centerPoint;
float dist = sqrt(dot(distanceVector, distanceVector));
over OpenGL's distance function:
vec2 uv = fragCoord / iResolution.xy;
vec2 centerPoint = vec2(0.5);
float dist = distance(uv, centerPoint);
I'm just curious why this is (my guess is that it has something to do with speed or support for distance).
I loosely understand that if the arguments are the same, the square root of a dot product equals the length of the vector: the distance?
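Writing out my understanding (a small sketch using the uv and centerPoint variables from above), all three forms should compute the same value:

vec2 v = uv - centerPoint;
float d1 = sqrt(dot(v, v));            // sqrt of the dot product of v with itself
float d2 = length(v);                  // length(v) == sqrt(dot(v, v))
float d3 = distance(uv, centerPoint);  // distance(a, b) == length(a - b)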
Both do essentially the same thing; people often choose the sqrt option for one of two reasons:
1. They don't know about / don't remember the distance function.
2. They trust themselves and their own math more than the built-in function (avoiding potential OpenGL implementation problems).
Sometimes it is done to optimize for early exits, as with light volumes for example:
float distSquared( vec3 A, vec3 B )
{
    vec3 C = A - B;
    return dot( C, C );
}

// Early escape when the distance between the fragment and the light
// is smaller than the light volume/sphere threshold.
//float dist = length(lights[i].Position - FragPos);
//if(dist < lights[i].Radius)
// Let's optimize by skipping the expensive square root calculation,
// doing it only when it's actually necessary.
float dist = distSquared( lights[i].Position, FragPos );
if( dist < lights[i].Radius * lights[i].Radius )
{
    // Do expensive calculations.
    ...
}
If you need the distance later on, simply:
dist = sqrt( dist );
EDIT: Another example.
Another use case that I've recently learnt: suppose you have two positions, vec3 posOne and vec3 posTwo, and you want the distance to each of them. The naive way would be to compute them independently: float distanceOne = distance( posOne, otherPos ) and float distanceTwo = distance( posTwo, otherPos ). But you want to exploit SIMD! So you do posOne -= otherPos; posTwo -= otherPos; and you are ready to compute both Euclidean distances via SIMD: vec2 SIMDDistance = vec2( dot( posOne, posOne ), dot( posTwo, posTwo ) ); and you can then use SIMD for the square root: SIMDDistance = sqrt( SIMDDistance ); where the distance to posOne is in the .x component of the SIMDDistance variable and the .y component contains the distance to posTwo.
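A minimal sketch of that idea as a helper function (the names follow the description above; otherPos is just the reference point you are measuring from):

// Returns the distances from otherPos to posOne (.x) and posTwo (.y),
// sharing a single vectorized sqrt instead of two scalar ones.
vec2 simdDistance(vec3 posOne, vec3 posTwo, vec3 otherPos)
{
    posOne -= otherPos;
    posTwo -= otherPos;
    return sqrt(vec2(dot(posOne, posOne), dot(posTwo, posTwo)));
}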
Using dot gives you a quick way to experiment with quadratic/linear function of distance.
According to The Book of Shaders, distance() and length() use the square root (sqrt()) internally. Using sqrt(), and all the functions that depend on it, can be expensive. Just use dot() if possible!
I guess sqrt() is a mathematical computation, but dot() is a vector computation, which the GPU is good at.
What I often do is the following (example):
vec3 vToLight = light.pos - cam.pos;
float lengthSqToLight = dot(vToLight, vToLight);
if (lengthSqToLight > 0.0 && lengthSqToLight <= maxLengthSq) {
float lengthToLight = sqrt(lengthSqToLight); // length is often needed later
...
vec3 vToLightNormalized = vToLight / lengthToLight; // avoid normalize() => avoids second sqrt
...
// later use lengthToLight
}

Clustering objects in GPU

My algorithm for clustering is simple, and it goes like this:
The first object is grouped with all other objects whose distance from it is lower than X.
Then we move on to the second object; if it is not already included in the first group, we run the same algorithm on the other objects that are not included in the first group,
and so on...
I'm trying to do this algorithm on the GPU using the fragment shader.
First I put all the locations into an RGBA float texture, setting for each pixel its location (x,y) - z and w are free for now. Then I draw into a result texture, doing my calculations in the shader. In the end I read back the pixels of the result texture and continue in my code.
I have tried many variations of the code, and multi-pass draws to perform my algorithm, but I'm not happy with the time performance.
The question is:
Is there a way to do a single pass over the texture to do what I want (a single draw phase)?
My latest attempt is this algorithm - my fragment shader:
precision highp float;
uniform sampler2D locs;
varying vec2 coord;
uniform float clusterDistance;
const float textureSize = 64.;
void main()
{
// Getting my location
vec4 currData = texture2D(locs, coord);
float offsetPix = 1./textureSize/2.;
vec2 coordIdx = (coord - offsetPix) * textureSize;
// Getting the index of my location
float myIdx = coordIdx.y * textureSize + coordIdx.x;
int clusterIdx = 0;
float clusterNum = 0.;
// Running over all the other locations before mine and finding the first object close to me
for (float i=0.;i<textureSize*textureSize;++i)
{
clusterNum = i +1.;
// Which means that we didn't find any close object before me, so we stop
if (i == myIdx)
{
break;
}
else
{
vec2 pntLoc = vec2(mod(i, textureSize), floor(i/textureSize)) / textureSize+offsetPix;
vec4 pnt = texture2D(locs, pntLoc);
if (distance(currData.xy, pnt.xy) <= clusterDistance)
{
break;
}
}
}
// Print the result
gl_FragColor = vec4(currData.x, currData.y, clusterNum, 1.);
}
But the problem here is that the result can cause chain clustering. For example,
if our data is {0,0}, {4,0}, {8,0}, and the max distance to group is 4, then the first is close to the second, and the third is close to the second but not to the first. According to my algorithm, the third returns the index of the second, although the second is out of the picture because it is already grouped with the first object, and the first is the reference object for distances.
Is it possible to read from the result texture while writing to it?
That would solve my problem, because then I could check the z value of the result when comparing distances.
No, you cannot read and write to a texture in the same pass (with standard WebGL and I think not at all in the way you intend).
Your algorithm seems rather serial in nature, not well suited for GPU/SIMD execution, but I may be misinterpreting your intent. Remember that the GPU may run a shader program for multiple data points (fragments/pixels in this case) at once, having no clue about the results of the others.
You also can't break out of a for loop on a SIMD architecture. The for loop will just keep iterating although the changes will not be written for fragments that broke out of it. In other words there is no speed benefit. It's a different story if the break condition evaluates to the same value for all fragments.
You might want to look at other ways of clustering, like k-means.
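For illustration, here is a sketch of how a k-means style assignment step maps onto a fragment shader much more naturally than the serial grouping above: every fragment only reads its own location and the current centroids, so invocation order does not matter. The centroids uniform, the cluster count and the result layout are assumptions for this sketch, not part of your code:

precision highp float;
uniform sampler2D locs;                 // same location texture as in the question
varying vec2 coord;
const int NUM_CLUSTERS = 8;             // assumed fixed number of clusters
uniform vec2 centroids[NUM_CLUSTERS];   // current centroids, updated on the CPU between passes

void main()
{
    vec2 myLoc = texture2D(locs, coord).xy;
    float bestDist = 1e9;
    float bestIdx = 0.;
    // Each fragment independently picks the nearest centroid.
    for (int i = 0; i < NUM_CLUSTERS; ++i)
    {
        float d = distance(myLoc, centroids[i]);
        if (d < bestDist)
        {
            bestDist = d;
            bestIdx = float(i);
        }
    }
    // Write the location and the assigned cluster index; the CPU (or another pass)
    // recomputes the centroids and repeats until the assignments stop changing.
    gl_FragColor = vec4(myLoc, bestIdx, 1.);
}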

Numeric Stability with Summed Area Tables in Shadow Mapping

I'm having issues with loss of precision in my SAVSM setup.
When you see the light moving around, the effect is very striking; there is a lot of noise, with fragments flickering black and white all the time. This can be somewhat lessened by using the min variance (thus ignoring anything below a certain threshold), but then we get even worse effects with incorrect falloff (see my other post).
I'm using GLSL 1.2 because I'm on a Mac, so I don't have access to the modf function in order to split the precision across two channels as described in GPU Gems 3, Chapter 8.
I'm using GL_RGBA32F_ARB textures with a framebuffer object and ping-ponging two textures to generate a summed area table, which I use with the VSM algorithm.
Moments / Depth Shader to create the basis for the tables
varying vec4 v_position;
varying float tDepth;
float g_DistributeFactor = 1024.0;
void main()
{
// Is this linear depth? I would say yes but one can't be utterly sure.
// Could try a divide by the far plane?
float depth = v_position.z / v_position.w ;
depth = depth * 0.5 + 0.5; //Don't forget to move away from unit cube ([-1,1]) to [0,1] coordinate system
vec2 moments = vec2(depth, depth * depth);
// Adjusting moments (this is sort of bias per pixel) using derivative
float dx = dFdx(depth);
float dy = dFdy(depth);
moments.y += 0.25 * (dx*dx+dy*dy);
// Subtract 0.5 off now so we can get this into our summed area table calc
//moments -= 0.5;
// Split the moments into rg and ba for EVEN MORE PRECISION
// float FactorInv = 1.0 / g_DistributeFactor;
// gl_FragColor = vec4(floor(moments.x) * FactorInv, fract(moments.x ) * g_DistributeFactor,
// floor(moments.y) * FactorInv, fract(moments.y) * g_DistributeFactor);
gl_FragColor = vec4(moments,0.0,0.0);
}
The summed area tables do seem to be working. I know this because I have a function that converts back from the summed table to the original depth map, and the two images look pretty much the same. I'm also using the -0.5/+0.5 trick to get some more precision, but it doesn't seem to be helping.
My question is this: given that I'm on a Mac which only has GLSL 1.2, how can I split the precision over two channels? If I could use those extra channels for space in the summed table then maybe that would work? I've seen some approaches that use modf, but that isn't available to me.
Also, people have suggested 32-bit integer buffers, but I don't think I have support for those on my MacBook Pro.
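For what it's worth, modf(x, i) can be emulated in GLSL 1.2 with floor() and a subtraction, so the GPU Gems style split doesn't strictly require modf. A minimal sketch of the idea (the factor parameter corresponds to the g_DistributeFactor constant in the shader above; the pack/unpack pair is an illustration, not the exact GPU Gems code):

// Split a moment into a coarse part and a rescaled remainder so that each
// channel only has to represent a smaller range.
vec2 packMoment(float m, float factor)
{
    float coarse = floor(m * factor) / factor; // "integer" part at 1/factor resolution
    float fine = m - coarse;                   // remainder, what modf's fractional output would give
    return vec2(coarse, fine * factor);        // rescale the remainder to use the channel's full range
}

float unpackMoment(vec2 parts, float factor)
{
    return parts.x + parts.y / factor;
}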
