How to improve FPS and overcome memory bandwidth by random access on textures? - opengl-es

In my virtual reality program I am heavily bound by memory bandwidth:
#version 320 es
precision lowp float;
const int n_pool = 30;
layout(local_size_x = 8, local_size_y = 16, local_size_z = 1) in;
layout(rgba8, binding = 0) writeonly uniform lowp image2D image;
layout(rgba8, binding = 1) readonly uniform lowp image2DArray pool;
uniform mat3 RT[n_pool]; // <- this is a rotation-translation matrix
void main() {
    uint u = gl_GlobalInvocationID.y;
    uint v = gl_GlobalInvocationID.x;
    vec4 Ir = imageLoad(pool, ivec3(u,v,29));
    float cost = 1.0/0.0;
    for (int j = 0; j < 16; j++) {
        float C = 0.0;
        for (int i = 0; i < n_pool; i++) {
            vec3 w = RT[i]*vec3(u,v,j);
            C += length(imageLoad(pool, ivec3(w[0],w[1],i)) - Ir);
        }
        cost = C < cost ? C : cost;
    }
    imageStore(image, ivec2(u,v), vec4(cost, cost, cost, 1.0));
}
You can see that I have a lot of random accesses on a TEXTURE_2D_ARRAY (width = 320, height = 240, layers = 30). However, the access is not so random, because it will be in the proximity of u,v.
Here are my thoughts:
Use another texture format instead of RGBA floats (RGBA unsigned bytes, maybe?).
Shared memory is too small to store even one grayscale image.
Change the loop order. Strangely, the current ordering is faster, although the other one should have better caching behaviour.
Resize the work groups to fit the textures better.
Use compressed images. This is unlikely to give a performance boost in practice, although in theory it should help with bandwidth.
What are your thoughts?

Do you have any actual data which shows that the issue you have is texture bandwidth, or is that just an assumption?
I can see a number of issues which mean that may well not actually be your problem. For example:
vec3 w = RT[i]*vec3(u,v,j);
... you have a mat3 array load inside your inner loop, so on most architectures I know you are probably uniform-fetch bound, not texture bound. This should cache well in the GPU data cache, but it is probably still being refetched on every loop iteration, which smells a lot more expensive than a single imageLoad() unless your texture format is exceptionally wide ...
If you are using fp16 or fp32 RGBA texture inputs, then narrower 8-bit unorm formats are always going to be faster (fp32 is particularly expensive).
For the following:
cost = C < cost ? C : cost;
... it's probably more reliable in terms of code generation to use the built-in min() function.

Moving from the conventional pixel pipeline to a compute shader brought a 3x speedup.
Using compressed formats increased FPS by 5%.
In this simplified version I didn't show that I was creating many temporary vectors on the fly (in both the outer and the inner loop). Removing the mat3/vec4/vec3 creation within the loops brought a 2x speedup. It is very surprising to me that creating vectors in loops is that expensive.
Now I am well within real-time and have fulfilled my goal...


Performance depending on index type

I was playing around with "drawing" millions of triangles and found something interesting: switching the index type from VK_INDEX_TYPE_UINT32 to VK_INDEX_TYPE_UINT16 increased the number of triangles drawn per second by about 1.5 times! Why is the difference in speed so large?
I use indirect indexed instanced drawing (so many i's): 25 vertices, 138 indices (46 triangles), 2^21 ≈ 2M instances (I am too lazy to find out where to disable vSync), 1 draw per frame, for a total of 96,468,992 triangles per frame. To get the clearest results I look away from the triangles (discarding rasterisation gives pretty much the same performance).
I have very simple vertex shader:
layout(set = 0, binding = 0) uniform A
{
    mat4 cam;
};
layout(location = 0)in vec3 inPosition;//
layout(location = 1)in vec4 inColor; //Color and position are de-interleaved
layout(location = 2)in vec3 inGlob; //
layout(location = 3)in vec4 inQuat; //data per instance, interleaved
layout(location = 0)out vec4 fragColor;
vec3 vecXquat(const vec3 v, const vec4 q)
{// function rotating vector by quaternion
    return v + 2.0f * cross(q.xyz,
        cross(q.xyz, v)
        + q.w * v);
}
void main(){
    gl_Position = vec4(vecXquat(inPosition, inQuat)+inGlob, 1.0f)*cam;
    fragColor = inColor;
}
and pass-through fragment shader.
The results:
~1950MTris/s with 32bit indices
~2850MTris/s with 16bit indices
GPU - GTX1050Ti
Since your shaders are so simple, your rendering performance will likely be dominated by factors that would otherwise be more trivial, like vertex data transfer rate.
138 indices have to be read by the GPU for each instance. With 2^21 (about 2 million) instances and 32-bit indices, that is roughly 1.16 GB of index data alone that the GPU has to read per frame. With 16-bit indices, the amount of data is halved. And with half as much data, there's a better chance that the index data fits entirely in the vertex pulling cache.
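The index-traffic estimate can be checked with a few lines of C++. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes the worst case in which every instance re-reads the full index list from memory with no help from any cache.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of index data fetched per frame if every instance re-reads the whole
// index list. This is an upper bound: a real GPU may serve part of it from cache.
std::uint64_t indexBytesPerFrame(std::uint64_t indicesPerInstance,
                                 std::uint64_t instanceCount,
                                 std::uint64_t bytesPerIndex) {
    assert(bytesPerIndex == 2 || bytesPerIndex == 4); // UINT16 or UINT32
    return indicesPerInstance * instanceCount * bytesPerIndex;
}

// Triangles per frame for a triangle list (3 indices per triangle).
std::uint64_t trianglesPerFrame(std::uint64_t indicesPerInstance,
                                std::uint64_t instanceCount) {
    assert(indicesPerInstance % 3 == 0);
    return (indicesPerInstance / 3) * instanceCount;
}
```

Calling these with the figures from the question (138 indices, `1ull << 21` instances) reproduces the triangle count quoted there and gives the per-frame index traffic for 4-byte and 2-byte indices.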

Rendering to custom FrameBuffer using same texture both as input and output

Some fragment shaders on ShaderToy (e.g. fluid dynamics) use the same buffer as both input and output. But when I try to do this in my C/C++ code it does not work (it renders strange checkerboard artifacts, as if video memory were inconsistent). To workaround this issue I have to use two different FrameBuffers A,B and flip textures ( first render A to B then render B back to A )
I understand that OpenGL does not allow using the same texture both as input and output (?) due to memory consistency issues.
But isn't there a more elegant solution than using two FrameBuffers? E.g. using some lock or temporary cache (I don't know, some synchronization flag which takes care of this)?
EDIT - Details to answer the comment/question:
OpenGL (depending on the GL version) has some very specific rules of what can and can't be done when the same texture is used as render target and sampler input. Whether your use case can be implemented within this set of requirements is not clear, as you have not explained what exactly you need or want to do here.
Basically I want to implement a fluid-dynamics solver (e.g. the one from ShaderToy linked above) as well as other partial differential equation solvers. That means each output pixel depends on some convolution mask (derivative, laplacian, average) of neighboring pixels. There may also be some movement (advection), which means reading values from distant pixels.
Currently I have realized that the artifacts appear mostly when I read and write pixels at different places, i.e. when the access is non-local (e.g. pixel[100,100] depends on pixel[10,10]).
Example of simple Fluid-Solver from Shadertoy:
vec4 solveFluid(sampler2D smp, vec2 uv, vec2 w, float time, vec3 mouse, vec3 lastMouse)
{
    const float K = 0.2;
    const float v = 0.55;
    vec4 data = textureLod(smp, uv, 0.0);
    vec4 tr = textureLod(smp, uv + vec2(w.x , 0), 0.0);
    vec4 tl = textureLod(smp, uv - vec2(w.x , 0), 0.0);
    vec4 tu = textureLod(smp, uv + vec2(0 , w.y), 0.0);
    vec4 td = textureLod(smp, uv - vec2(0 , w.y), 0.0);
    vec3 dx = (tr.xyz - tl.xyz)*0.5;
    vec3 dy = (tu.xyz - td.xyz)*0.5;
    vec2 densDif = vec2(dx.z ,dy.z);
    data.z -= dt*dot(vec3(densDif, dx.x + dy.y), data.xyz); //density
    vec2 laplacian = tu.xy + td.xy + tr.xy + tl.xy - 4.0*data.xy;
    vec2 viscForce = vec2(v)*laplacian;
    data.xyw = textureLod(smp, uv - dt*data.xy*w, 0.).xyw; //advection
    vec2 newForce = vec2(0);
    data.xy += dt*(viscForce.xy - K/dt*densDif + newForce); //update velocity
    data.xy = max(vec2(0), abs(data.xy)-1e-4)*sign(data.xy); //linear velocity decay
    data.w = (tr.y - tl.y - tu.x + td.x);
    vec2 vort = vec2(abs(tu.w) - abs(td.w), abs(tl.w) - abs(tr.w));
    vort *= VORTICITY_AMOUNT/length(vort + 1e-9)*data.w;
    data.xy += vort;
    data.y *= smoothstep(.5,.48,abs(uv.y-0.5)); //Boundaries
    data = clamp(data, vec4(vec2(-10), 0.5 , -10.), vec4(vec2(10), 3.0 , 10.));
    return data;
}
Yes, this is never going to work on GPUs, as there are no guarantees whatsoever on the order of individual fragment shader invocations. Whether the invocation writing to pixel [100,100] sees the result of the invocation writing to [10,10] or the original data is totally random. As per the spec, you get undefined values when reading in such a concurrent read/write scenario, so theoretically you might get neither one nor the other, but see partial writes or totally different values (although that's not likely to occur on real-world hardware).
Ordering guarantees on such a scale simply do not make sense within the render pipeline, so there is also no practical means of synchronization you can add manually to solve this issue.
To workaround this issue I have to use two different FrameBuffers A,B and flip textures ( first render A to B then render B back to A )
Yes, the ping-pong approach is what you should use for this case. Honestly, it should not incur any significant performance penalty in that scenario, as you seem to write to each output pixel exactly once, so you don't need an additional copy of "untouched" pixels. All it costs is the additional memory.
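The structure of a ping-pong update can be sketched on the CPU in plain C++. This is an illustration under stated assumptions, not OpenGL code: the two `std::vector<float>` buffers stand in for the two textures, `smoothStep` is a made-up 1D stencil standing in for the fluid shader, and `std::swap` stands in for swapping which texture is bound as input and which framebuffer is the render target.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One Jacobi-style smoothing step: every output cell reads only from `src`
// and writes only to `dst`, so the result does not depend on evaluation order.
void smoothStep(const std::vector<float>& src, std::vector<float>& dst) {
    assert(src.size() == dst.size());
    const std::size_t n = src.size();
    for (std::size_t i = 0; i < n; ++i) {
        const float left  = src[i == 0 ? 0 : i - 1];     // clamp at the borders
        const float right = src[i + 1 == n ? i : i + 1];
        dst[i] = 0.25f * left + 0.5f * src[i] + 0.25f * right;
    }
}

// Runs `steps` iterations by alternating the roles of two buffers (A -> B, B -> A).
std::vector<float> runPingPong(std::vector<float> a, int steps) {
    std::vector<float> b(a.size());
    for (int s = 0; s < steps; ++s) {
        smoothStep(a, b);
        std::swap(a, b); // O(1): swaps buffer ownership, copies no cells
    }
    return a; // after each swap, `a` holds the latest result
}
```

Because every cell is fully rewritten each step, nothing has to be copied between the buffers; the only overhead is the second allocation, which matches the point about memory cost made above.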

Efficiently Transforming from Spherical Coordinates to Cartesian Coordinates using Eigen

I need to transform the coordinates from spherical to Cartesian space using the Eigen C++ Library. The following code serves the purpose.
const int size = 1000;
Eigen::Array<std::pair<float, float>, Eigen::Dynamic, 1> direction(size);
for(int i=0; i<direction.size();i++)
{
    direction(i).first = (i+10)%360; // some value for this example (denoting the azimuth angle)
    direction(i).second = (i+20)%360; // some value for this example (denoting the elevation angle)
}
SSPL::MatrixX<T1> transformedMatrix(3, direction.size());
for(int i=0; i<transformedMatrix.cols(); i++)
{
    const T1 azimuthAngle = direction(i).first*M_PI/180; //converting to radians
    const T1 elevationAngle = direction(i).second*M_PI/180; //converting to radians
    transformedMatrix(0,i) = std::cos(azimuthAngle)*std::cos(elevationAngle);
    transformedMatrix(1,i) = std::sin(azimuthAngle)*std::cos(elevationAngle);
    transformedMatrix(2,i) = std::sin(elevationAngle);
}
I would like to know whether a better implementation is possible to improve the speed.
I know that Eigen has supporting functions for Geometrical transformations. But I am yet to see a clear example to implement the same.
Is it also possible to vectorize the code to improve the performance?
You could at least use the vectorized versions of sine/cosine:
void dir2vector2(Eigen::Matrix3Xf& out, const Eigen::Array2Xf& in){
    Eigen::Array2Xf sine = sin(in * (M_PI/180));
    Eigen::Array2Xf cosi = cos(in * (M_PI/180));
    out.resize(3, in.cols());
    out << cosi.row(0) * cosi.row(1),
           sine.row(0) * cosi.row(1),
           sine.row(1);
}
There would still be a lot of optimization potential, e.g., calculating both sine and cosine of the same angle could share a lot of computation. And it is technically not necessary to store sine and cosi explicitly into temporaries (but Eigen is currently not able to automatically re-use common subexpressions).
Also, the multiplication at the end could be vectorized better if you store your input and output in row-major format (though the Eigen comma-initializer currently does not seem to work well with vectorization).
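For reference, the per-angle math can also be written as a plain C++ loop without Eigen. This is only a baseline sketch: `DirectionDeg` and `toCartesian` are hypothetical names, and it follows the same convention as the question (azimuth and elevation in degrees, unit-length output). Each sine and cosine is evaluated once per angle and reused; whether the compiler fuses the sine/cosine pair or vectorizes the loop depends on the toolchain and flags, so measure before relying on it.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct DirectionDeg { float azimuth; float elevation; }; // hypothetical input type

// Converts (azimuth, elevation) in degrees to unit vectors, evaluating each
// sine and cosine exactly once per angle.
std::vector<std::array<float, 3>> toCartesian(const std::vector<DirectionDeg>& dirs) {
    constexpr float kDegToRad = 3.14159265358979323846f / 180.0f;
    std::vector<std::array<float, 3>> out(dirs.size());
    for (std::size_t i = 0; i < dirs.size(); ++i) {
        const float az = dirs[i].azimuth * kDegToRad;
        const float el = dirs[i].elevation * kDegToRad;
        const float ca = std::cos(az), sa = std::sin(az);
        const float ce = std::cos(el), se = std::sin(el);
        out[i] = {ca * ce, sa * ce, se};
    }
    return out;
}
```

A loop like this is mainly useful as a correctness reference when comparing against a vectorized version.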

How can I iterate with a loop over a sampler2D

I have some data encoded in a 2k by 2k floating-point texture. The data are longitude, latitude, time, and date as R, G, B, A. They are all normalized, but for now that is not a problem; I can de-normalize them later if I want to.
What I need now is to iterate through the whole texture and find which longitude and latitude belong to the current fragment coordinate. I assume that the whole atlas has normalized coordinates and that it maps onto the whole OpenGL context. Besides the coordinates, I will filter the data by time and date, but that is an if condition that is easy to do. Because the pixel coordinates I have will not map exactly onto a stored coordinate, I will use a small delta value to work around that for now, and I will use that delta value to precompute other points that are close to that coordinate.
Now I get driver crashes on an iGPU (it is probably out of memory or something similar), even if I only add something inside the two nested for loops, or even if I use a discard.
The code I have now is this.
NOTE: f_time is the filter for the time; for now I have a slider so that I can interact with the values.
precision mediump float;
precision mediump int;
const int maxTextureSize = 2048;
varying vec2 v_texCoord;
uniform sampler2D u_texture;
uniform float f_time;
uniform ivec2 textureDimensions;
void main(void) {
    float delta = 0.001;// now bigger delta just to make it work then we tune it
    // compute 1 pixel in texture coordinates.
    vec2 onePixel = vec2(1.0, 1.0) / float(textureDimensions.x);
    vec2 position = ( gl_FragCoord.xy / float(textureDimensions.x) );
    vec4 color = texture2D(u_texture, v_texCoord);
    vec4 outColor = vec4(0.0);
    float dist_x = distance( color.r, gl_FragCoord.x);
    float dist_y = distance( color.g, gl_FragCoord.y);
    //float dist_x = distance( color.g, gl_PointCoord.s);
    //float dist_y = distance( color.b, gl_PointCoord.t);
    for(int i = 0; i < maxTextureSize; i++){
        if(i < textureDimensions.x ){
            for(int j = 0; j < maxTextureSize ; j++){
                if(j < textureDimensions.y ){
                    // Where i am stuck now how to get the texture coordinate and test it with fragment shader
                    // the precomputation
                    vec4 pixel= texture2D(u_texture,vec2(i,j));
                    if(pixel.r > f_time){
                        outColor = vec4(1.0, 1.0, 1.0, 1.0);
                        // for now just break, no delta calculation to sum this point with others so that
                        // we will have an approximation of other points into that pixel
                    }
                }
            }
        }
    }
    // this works
    if(color.t > f_time){
        //gl_FragColor = color;//;vec4(1.0, 1.0, 1.0, 1.0);
    }
    gl_FragColor = outColor;
}
What you are trying to do is simply not feasible.
You are trying to access a texture up to four million times, all within a single fragment shader invocation.
The way modern GPUs usually detect infinite loop conditions is by seeing how long your shader runs, and then killing it if it has run for "too long", the length of which is usually sufficiently generous. Your code, which does up to 4 million texture accesses, will almost certainly trigger this condition.
Which typically leads to a GPU reset.
Generally speaking, the way you would find the position in a texture which is associated with some fragment is to do so directly. That is, create a 1:1 correspondence between screen fragment locations (gl_FragCoord) and texels in the texture. That way, your texture does not need to contain X/Y coordinates, and each fragment shader can access the data meant for that specific invocation.
What you're trying to do seems to be to pass a large table (four million elements) to the GPU, and then have the GPU process it. The ordering of values is (generally) irrelevant; any value could potentially modify any pixel. Some pixels don't have values applied to them, while others may have multiple values applied.
This is serial programmer thinking, not parallel thinking. The way you'd code that on the CPU is to walk each element in the table, look at where it goes, and build the results for each pixel.
In a parallel algorithm, you don't work that way. Each invocation needs to be able to instantly find the data in the table that applies to it. You should never be doing some kind of search through a table for your data. Especially not a linear search.
You need to think of this from the perspective of your fragment shader.
In your data table, for each position on the screen, there is a list of data values that apply to that screen position. Correct? What you need to do is make that list directly available to each fragment shader invocation. And since each fragment's list is not constant in size, you will need to use a linked list rather than a fixed-size array.
To do this, you build a texture the size of your render target. Each texel in the texture specifies the location in the data table of the first element that this fragment needs to process. This provides every fragment shader invocation with the location of its first element. Since some fragment shaders may have no data applied to them, you need to set aside some special texture coordinate value to represent "none".
The data in the data table consists of your time and date, but rather than "longitude/latitude", it has the texture coordinate of the next texel in the texture that applies for that fragment shader. This is how you make a linked list in shaders. Each location in the data table specifies the next location to be processed.
If that location was the last data to be processed, then the location will be the "none" value from before.
You should also be using a buffer texture or an SSBO to hold your data table, rather than a 2D texture. It would make things much easier.
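The head-pointer-plus-linked-list layout described above can be sketched on the CPU in C++. Everything here is illustrative: `PixelLists`, `Entry`, `addEntry` and `countNewerThan` are hypothetical names, plain integer indices stand in for the texture coordinates, and `-1` plays the role of the "none" value. On the GPU, `head` would live in a texture the size of the render target and `table` in a buffer texture or SSBO.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::int32_t kNone = -1; // sentinel: "no (more) entries"

struct Entry {           // hypothetical data-table record
    float time;
    float date;
    std::int32_t next;   // index of the next entry for the same pixel, or kNone
};

struct PixelLists {
    std::vector<std::int32_t> head; // one slot per pixel: first entry, or kNone
    std::vector<Entry> table;       // all entries, in insertion order
};

// CPU-side build: push each record onto the front of its pixel's list.
void addEntry(PixelLists& lists, std::int32_t pixel, float time, float date) {
    assert(pixel >= 0 && pixel < static_cast<std::int32_t>(lists.head.size()));
    lists.table.push_back({time, date, lists.head[pixel]});
    lists.head[pixel] = static_cast<std::int32_t>(lists.table.size()) - 1;
}

// What one fragment invocation would do: walk only its own list.
int countNewerThan(const PixelLists& lists, std::int32_t pixel, float minTime) {
    int count = 0;
    for (std::int32_t i = lists.head[pixel]; i != kNone; i = lists.table[i].next) {
        if (lists.table[i].time > minTime) ++count;
    }
    return count;
}
```

The work per pixel is proportional to the number of entries that actually belong to that pixel, rather than to the size of the whole table, which is the point of the restructuring.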

Clustering objects in GPU

My clustering algorithm is simple, and it goes like this.
The first object is grouped with all other objects whose distance to it is lower than X.
Then we go to the second object. If it is not included in the first group, we run the same algorithm on the other objects that are not included in the first group,
and so on...
I'm trying to run this algorithm on the GPU using the fragment shader.
First I put all the locations into an RGBA float texture, setting the location (x,y) for each pixel; z and w are free for now. Then I draw my calculations to a result texture using the shader. In the end I will read the pixels of the result texture and continue in my code.
I have tried many variations of the code, and multi-phase draws to perform my algorithm, but I'm not happy with the performance.
The question is:
Is there a way to do this with one run over the texture (a single draw phase)?
My latest try is this algorithm - My fragment shader
precision highp float;
uniform sampler2D locs;
varying vec2 coord;
uniform float clusterDistance;
const float textureSize = 64.;
void main()
{
    // Getting my location
    vec4 currData = texture2D(locs, coord);
    float offsetPix = 1./textureSize/2.;
    vec2 coordIdx = (coord - offsetPix) * textureSize;
    // Getting the index of my location
    float myIdx = coordIdx.y * textureSize + coordIdx.x;
    int clusterIdx = 0;
    float clusterNum = 0.;
    // Running over all the other locations until me and finding the first close object to me
    for (float i=0.;i<textureSize*textureSize;++i)
    {
        clusterNum = i +1.;
        // Which mean that we didn't find any closed object to me so we stop
        if (i == myIdx)
            break;
        vec2 pntLoc = vec2(mod(i, textureSize), floor(i/textureSize)) / textureSize+offsetPix;
        vec4 pnt = texture2D(locs, pntLoc);
        if (distance(currData.xy, pnt.xy) <= clusterDistance)
            break;
    }
    // Print the result
    gl_FragColor = vec4(currData.x, currData.y, clusterNum, 1.);
}
But the problem here is that the result can cause chain clustering. For example,
if our data is {0,0}, {4,0}, {8,0} and the maximum grouping distance is 4, then the first is close to the second, and the third is close to the second but not to the first. My algorithm returns the index of the second for the third object, although the second is out of the picture because it was already grouped by the first object, and the first is the reference object for distances.
Is it possible to read from the result texture while writing to it?
It would solve my problem, because then I could check the z value of the result when comparing distances.
No, you cannot read and write to a texture in the same pass (with standard WebGL and I think not at all in the way you intend).
Your algorithm seems rather serial in nature, not well suited for GPU/SIMD execution, but I may misinterpret your intent. Remember that the GPU may run a shader program for multiple data-points (fragments/pixels in this case) at once, having no clue about the results of others.
You also can't break out of a for loop on a SIMD architecture. The for loop will just keep iterating although the changes will not be written for fragments that broke out of it. In other words there is no speed benefit. It's a different story if the break condition evaluates to the same value for all fragments.
You might want to look at other ways of clustering, like k-means.
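To see why the algorithm from the question is serial, it helps to write it as a plain CPU loop. The sketch below is one reading of the description in the question (`leaderCluster` and `Point` are names made up here): each decision consults `leader[]`, which was written by earlier iterations, and that is exactly the information a fragment shader invocation cannot see within a single pass.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { float x; float y; };

// Greedy "leader" clustering: each unassigned point becomes a leader and
// absorbs every still-unassigned point within maxDist of it.
// Returns, for each point, the index of its leader.
std::vector<int> leaderCluster(const std::vector<Point>& pts, float maxDist) {
    assert(maxDist >= 0.0f);
    std::vector<int> leader(pts.size(), -1); // -1 = not assigned yet
    for (std::size_t i = 0; i < pts.size(); ++i) {
        if (leader[i] != -1) continue;       // already absorbed by an earlier leader
        leader[i] = static_cast<int>(i);
        for (std::size_t j = i + 1; j < pts.size(); ++j) {
            if (leader[j] != -1) continue;
            if (std::hypot(pts[j].x - pts[i].x, pts[j].y - pts[i].y) <= maxDist) {
                leader[j] = static_cast<int>(i);
            }
        }
    }
    return leader;
}
```

For the example data {0,0}, {4,0}, {8,0} with a maximum distance of 4, the first two points share the first point as leader and the third point leads its own group, which is the result the question expects and the single-pass shader cannot produce.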
