Clustering objects in GPU - algorithm

My clustering algorithm is simple, and it goes like this.
The first object is grouped with all other objects whose distance from it is lower than X.
Then we go to the second object; if it is not already included in the first group, we run the same step on the remaining objects that are not included in the first group,
and so on...
I'm trying to implement this algorithm on the GPU using a fragment shader.
First I put all the locations into an RGBA float texture, setting the location (x, y) for each pixel; z and w are free for now. Then I draw my calculations to a result texture using the shader. In the end I read the pixels of the result texture and continue in my own code.
I tried many variations of the code, including multi-pass drawing, but I'm not happy with the performance.
The question is:
Is there a way to perform this in one run over the texture (a single draw pass)?
My latest attempt is this fragment shader:
precision highp float;
uniform sampler2D locs;
varying vec2 coord;
uniform float clusterDistance;
const float textureSize = 64.;
void main()
{
    // Getting my location
    vec4 currData = texture2D(locs, coord);
    float offsetPix = 1./textureSize/2.;
    vec2 coordIdx = (coord - offsetPix) * textureSize;
    // Getting the index of my location
    float myIdx = coordIdx.y * textureSize + coordIdx.x;
    int clusterIdx = 0;
    float clusterNum = 0.;
    // Running over all the locations up to mine and finding the first object close to me
    for (float i=0.;i<textureSize*textureSize;++i)
    {
        clusterNum = i +1.;
        // Reaching my own index means that no earlier object is close to me, so we stop
        if (i == myIdx)
            break;
        vec2 pntLoc = vec2(mod(i, textureSize), floor(i/textureSize)) / textureSize+offsetPix;
        vec4 pnt = texture2D(locs, pntLoc);
        if (distance(currData.xy, pnt.xy) <= clusterDistance)
            break;
    }
    // Print the result
    gl_FragColor = vec4(currData.x, currData.y, clusterNum, 1.);
}
But the problem is that this result can produce chain clustering. For example, if our data is {0,0}, {4,0}, {8,0} and the maximum grouping distance is 4, then the first object is close to the second, and the third is close to the second but not to the first. My shader returns the index of the second object for the third one, although the second is already out of the picture: it was grouped by the first object, and the first object is the reference for distances.
Is it possible to read from the result texture while writing to it?
It would solve my problem, because then I could check the z value of the result when comparing distances.

No, you cannot read from and write to the same texture in the same pass (not with standard WebGL, and I think not at all in the way you intend).
Your algorithm seems rather serial in nature and not well suited for GPU/SIMD execution, but I may be misinterpreting your intent. Remember that the GPU may run the shader program for many data points (fragments/pixels in this case) at once, with no invocation having any clue about the results of the others.
Breaking out of a for loop also gains you nothing on a SIMD architecture. The group keeps iterating as long as any of its fragments is still in the loop; changes are just no longer written for the fragments that already broke out. In other words, there is no speed benefit, unless the break condition evaluates to the same value for all fragments.
You might want to look at other ways of clustering, like k-means.
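To make the serial dependency concrete, here is a small CPU-side sketch in Python (not shader code) that compares the greedy clustering described in the question with what the single-pass shader computes. The function names are made up for illustration, and both functions return the 0-based index of the reference object rather than the shader's clusterNum.

```python
from math import dist

def greedy_leaders(points, max_dist):
    """Serial reference: each still-unassigned point becomes a leader and
    claims every still-unassigned point within max_dist of it."""
    leader_of = [None] * len(points)
    for i, p in enumerate(points):
        if leader_of[i] is not None:
            continue  # already claimed by an earlier leader
        for j in range(i, len(points)):
            if leader_of[j] is None and dist(p, points[j]) <= max_dist:
                leader_of[j] = i
    return leader_of

def first_close_point(points, max_dist):
    """What one independent pass can compute: for each point, the first
    point (up to and including itself) that lies within max_dist."""
    result = []
    for i, p in enumerate(points):
        for j in range(i + 1):
            if j == i or dist(p, points[j]) <= max_dist:
                result.append(j)
                break
    return result

points = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0)]
print(greedy_leaders(points, 4.0))     # [0, 0, 2]
print(first_close_point(points, 4.0))  # [0, 0, 1]
```

The third point gets a different answer because the greedy version needs to know that the second point was already claimed, which is exactly the information an independent fragment invocation does not have.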


Rendering to custom FrameBuffer using same texture both as input and output

Some fragment shaders on ShaderToy (e.g. fluid dynamics) use the same buffer as both input and output. But when I try to do this in my C/C++ code it does not work (it renders strange checkerboard artifacts, like inconsistent visual memory). To work around this issue I have to use two different FrameBuffers A and B and flip textures (first render A to B, then render B back to A).
I understand that OpenGL does not allow using the same texture as both input and output (?) due to memory consistency issues.
But isn't there a more elegant solution than using two FrameBuffers? E.g. using some lock or temporary cache (I don't know, some synchronization flag which takes care of this)?
EDIT - Details to answer the comment/question:
"OpenGL (depending on the GL version) has some very specific rules about what can and can't be done when the same texture is used as render target and sampler input. Whether your use case can be implemented within this set of requirements is not clear, as you have not explained what exactly you need or want to do here."
Basically, I want to implement a fluid-dynamics solver (e.g. the one from ShaderToy linked above) as well as other partial differential equation solvers. That means each output pixel depends on some convolution mask (derivative, Laplacian, average) of neighboring pixels. There may also be some movement (advection), which means reading values from distant pixels.
Currently I realized the artifacts appear mostly when I read and write pixels at different places, i.e. when the access is non-local (e.g. pixel[100,100] depends on pixel[10,10]).
Example of a simple fluid solver from Shadertoy:
// dt and VORTICITY_AMOUNT are defined elsewhere in the original shader
vec4 solveFluid(sampler2D smp, vec2 uv, vec2 w, float time, vec3 mouse, vec3 lastMouse)
{
    const float K = 0.2;
    const float v = 0.55;
    vec4 data = textureLod(smp, uv, 0.0);
    vec4 tr = textureLod(smp, uv + vec2(w.x , 0), 0.0);
    vec4 tl = textureLod(smp, uv - vec2(w.x , 0), 0.0);
    vec4 tu = textureLod(smp, uv + vec2(0 , w.y), 0.0);
    vec4 td = textureLod(smp, uv - vec2(0 , w.y), 0.0);
    vec3 dx = (tr.xyz - tl.xyz)*0.5; // central difference, right - left
    vec3 dy = (tu.xyz - td.xyz)*0.5; // central difference, up - down
    vec2 densDif = vec2(dx.z ,dy.z);
    data.z -= dt*dot(vec3(densDif, dx.x + dy.y) ,data.xyz); //density
    vec2 laplacian = tu.xy + td.xy + tr.xy + tl.xy - 4.0*data.xy;
    vec2 viscForce = vec2(v)*laplacian;
    data.xyw = textureLod(smp, uv - dt*data.xy*w, 0.).xyw; //advection
    vec2 newForce = vec2(0);
    data.xy += dt*(viscForce.xy - K/dt*densDif + newForce); //update velocity
    data.xy = max(vec2(0), abs(data.xy)-1e-4)*sign(data.xy); //linear velocity decay
    data.w = (tr.y - tl.y - tu.x + td.x);
    vec2 vort = vec2(abs(tu.w) - abs(td.w), abs(tl.w) - abs(tr.w));
    vort *= VORTICITY_AMOUNT/length(vort + 1e-9)*data.w;
    data.xy += vort;
    data.y *= smoothstep(.5,.48,abs(uv.y-0.5)); //Boundaries
    data = clamp(data, vec4(vec2(-10), 0.5 , -10.), vec4(vec2(10), 3.0 , 10.));
    return data;
}
"Currently I realized the artifacts appear mostly when I read and write pixels at different places, i.e. when the access is non-local (e.g. pixel[100,100] depends on pixel[10,10])."
Yes, this is never going to work on GPUs, as there are no guarantees whatsoever on the order of individual fragment shader invocations. So whether the invocation writing to pixel [100,100] sees the result of the invocation writing to [10,10] or the original data is totally random. As per the spec, you get undefined values when reading in such a concurrent read/write scenario, so theoretically you might not even get one or the other, but see partial writes or totally different values (although that's not likely to occur on real-world hardware).
Order guarantees on such a scale simply do not make sense within the render pipeline, so there is also no practical means of synchronization you could add manually to solve this issue.
"To work around this issue I have to use two different FrameBuffers A and B and flip textures (first render A to B, then render B back to A)."
Yes, the ping-pong approach is what you should do for this use case. And honestly, it should not incur any significant performance penalty in that scenario, as you seem to write to each output pixel exactly once, so you don't need an additional copy of "untouched" pixels. All it costs is the additional memory.
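As a rough CPU-side illustration of why ping-pong works and in-place updates do not, here is a minimal Python sketch with a 1D averaging stencil standing in for the fluid solver. The function names and the stencil are assumptions made for the example, and the in-place variant uses one fixed update order, whereas a GPU gives no order at all.

```python
def blur_step(src, dst):
    """One pass: every output cell reads only from src and writes only to dst."""
    n = len(src)
    for i in range(n):
        dst[i] = (src[(i - 1) % n] + src[i] + src[(i + 1) % n]) / 3.0

def simulate_ping_pong(initial, steps):
    a = list(initial)        # buffer A, read in the first pass
    b = [0.0] * len(a)       # buffer B, written in the first pass
    for _ in range(steps):
        blur_step(a, b)
        a, b = b, a          # swap roles; nothing is copied
    return a

def simulate_in_place(initial, steps):
    buf = list(initial)
    for _ in range(steps):
        blur_step(buf, buf)  # reads see a mix of old and new values
    return buf

start = [0.0, 0.0, 3.0, 0.0, 0.0]
print(simulate_ping_pong(start, 1))  # [0.0, 1.0, 1.0, 1.0, 0.0]
print(simulate_in_place(start, 1))   # a different, order-dependent result
```

The swap only exchanges which buffer is read and which is written, which mirrors flipping the texture bound as input and the one attached to the framebuffer.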

Metal emulate geometry shaders using compute shaders

I'm trying to implement voxel cone tracing in Metal. One of the steps in the algorithm is to voxelize the geometry using a geometry shader. Metal does not have geometry shaders so I was looking into emulating them using a compute shader. I pass in my vertex buffer into the compute shader, do what a geometry shader would normally do, and write the result to an output buffer. I also add a draw command to an indirect buffer. I use the output buffer as the vertex buffer for my vertex shader. This works fine, but I need twice as much memory for my vertices, one for the vertex buffer and one for the output buffer. Is there any way to directly pass the output of the compute shader to the vertex shader without storing it in an intermediate buffer? I don't need to save the contents of the output buffer of the compute shader. I just need to give the results to the vertex shader.
Is this possible? Thanks
Essentially, I'm trying to emulate the following shader from glsl:
#version 450
layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;
layout(location = 0) in vec3 in_position[];
layout(location = 1) in vec3 in_normal[];
layout(location = 2) in vec2 in_uv[];
layout(location = 0) out vec3 out_position;
layout(location = 1) out vec3 out_normal;
layout(location = 2) out vec2 out_uv;
void main()
{
    // Absolute face normal: its largest component picks the projection axis
    vec3 p = abs(cross(in_position[1] - in_position[0], in_position[2] - in_position[0]));
    for (uint i = 0; i < 3; ++i)
    {
        out_position = in_position[i];
        out_normal = in_normal[i];
        out_uv = in_uv[i];
        if (p.z > p.x && p.z > p.y)
            gl_Position = vec4(out_position.x, out_position.y, 0, 1);
        else if (p.x > p.y && p.x > p.z)
            gl_Position = vec4(out_position.y, out_position.z, 0, 1);
        else
            gl_Position = vec4(out_position.x, out_position.z, 0, 1);
        EmitVertex();
    }
    EndPrimitive();
}
For each triangle, I need to output a triangle with vertices at these new positions instead. The triangle vertices come from a vertex buffer and are drawn using an index buffer. I also plan on adding code that will do conservative rasterization (just increasing the size of the triangle by a little bit), but it's not shown here. Currently, in the Metal compute shader, I use the index buffer to fetch each vertex, run the same code as in the geometry shader above, and write the new vertex to another buffer, which I then use to draw.
Here's a very speculative possibility depending on exactly what your geometry shader needs to do.
I'm thinking you can do it sort of "backwards" with just a vertex shader and no separate compute shader, at the cost of redundant work on the GPU. You would do a draw as if you had a buffer of all of the output vertices of the output primitives of the geometry shader. You would not actually have that on hand, though. You would construct a vertex shader that would calculate them in flight.
So, in the app code, calculate the number of output primitives and therefore the number of output vertices that would be produced for a given count of input primitives. Do a draw of the output primitive type with that many vertices.
You would not provide a buffer with the output vertex data as input to this draw.
You would provide the original index buffer and original vertex buffer as inputs to the vertex shader for that draw. The shader would calculate from the vertex ID which output primitive it's for, and which vertex of that primitive (e.g. for a triangle, vid / 3 and vid % 3, respectively). From the output primitive ID, it would calculate which input primitive would have generated it in the original geometry shader.
The shader would look up the indices for that input primitive from the index buffer and then the vertex data from the vertex buffer. (This would be sensitive to the distinction between a triangle list vs. triangle strip, for example.) It would apply any pre-geometry-shader vertex shading to that data. Then it would do the part of the geometry computation that contributes to the identified vertex of the identified output primitive. Once it has calculated the output vertex data, you can apply any post-geometry-shader vertex shading(?) that you want. The result is what it would return.
If the geometry shader can produce a variable number of output primitives per input primitive, well, at least you have a maximum number. So, you can draw the maximum potential count of vertices for the maximum potential count of output primitives. The vertex shader can do the computations necessary to figure out if the geometry shader would have, in fact, produced that primitive. If not, the vertex shader can arrange for the whole primitive to be clipped away, either by positioning it outside of the frustum or using a [[clip_distance]] property of the output vertex data.
This avoids ever storing the generated primitives in a buffer. However, it causes the GPU to do some of the pre-geometry-shader vertex shader and geometry shader calculations repeatedly. It will be parallelized, of course, but may still be slower than what you're doing now. Also, it may defeat some optimizations around fetching indices and vertex data that may be possible with more normal vertex shaders.
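The vertex-ID arithmetic behind this idea can be sanity-checked on the CPU. The following Python sketch is only an illustration: the function name is made up, and it assumes an indexed triangle list (three indices per triangle), which may not match your actual index buffer layout.

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def emulate_vertex(vid, indices, positions):
    """Recompute, for one output vertex ID, what the geometry shader would
    have emitted: the 2D position projected along the dominant normal axis."""
    triangle_id = vid // 3          # which input triangle this vertex belongs to
    corner = vid % 3                # which corner of that triangle
    tri = indices[3 * triangle_id: 3 * triangle_id + 3]  # triangle list assumed
    p0, p1, p2 = (positions[i] for i in tri)
    e1 = tuple(b - a for a, b in zip(p0, p1))
    e2 = tuple(b - a for a, b in zip(p0, p2))
    nx, ny, nz = (abs(c) for c in cross(e1, e2))
    x, y, z = positions[tri[corner]]
    if nz > nx and nz > ny:
        return (x, y)
    elif nx > ny and nx > nz:
        return (y, z)
    return (x, z)

positions = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
indices = [0, 1, 2,   # triangle in the xy plane
           0, 2, 3]   # triangle in the yz plane
print([emulate_vertex(vid, indices, positions) for vid in range(6)])
```

Every corner recomputes the face normal of its whole triangle, which is the redundant work mentioned above.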
Here's an example conversion of your geometry shader:
#include <metal_stdlib>
using namespace metal;
struct VertexIn {
    // maybe need packed types here depending on your vertex buffer layout
    // can't use [[attribute(n)]] for these because Metal isn't doing the vertex lookup for us
    float3 position;
    float3 normal;
    float2 uv;
};
struct VertexOut {
    float3 position;
    float3 normal;
    float2 uv;
    float4 new_position [[position]];
};
vertex VertexOut foo(uint vid [[vertex_id]],
                     device const uint *indexes [[buffer(0)]],
                     device const VertexIn *vertexes [[buffer(1)]])
{
    VertexOut out;
    const uint triangle_id = vid / 3;
    const uint vertex_of_triangle = vid % 3;
    // indexes is for a triangle strip even though this shader is invoked for a triangle list.
    const uint index[3] = { indexes[triangle_id], indexes[triangle_id + 1], indexes[triangle_id + 2] };
    const VertexIn v[3] = { vertexes[index[0]], vertexes[index[1]], vertexes[index[2]] };
    float3 p = abs(cross(v[1].position - v[0].position, v[2].position - v[0].position));
    out.position = v[vertex_of_triangle].position;
    out.normal = v[vertex_of_triangle].normal;
    out.uv = v[vertex_of_triangle].uv;
    if (p.z > p.x && p.z > p.y)
        out.new_position = float4(out.position.x, out.position.y, 0, 1);
    else if (p.x > p.y && p.x > p.z)
        out.new_position = float4(out.position.y, out.position.z, 0, 1);
    else
        out.new_position = float4(out.position.x, out.position.z, 0, 1);
    return out;
}
Unfortunately there is no way to do this (and other things) in Metal, without going into unneeded complications.
The API lacks critical features that are common in Vulkan, OpenGL and DirectX...

How can I iterate with a loop in sampler2D

I have some data encoded in a 2k by 2k floating point texture. The data are longitude, latitude, time, and date as R, G, B, A. They are all normalized, but for now that is not a problem; I can de-normalize them later if I want to.
What I need now is to iterate through the whole texture and find which longitude and latitude belong at the current fragment coordinate. I assume that the whole atlas has normalized coordinates and maps onto the whole OpenGL context. Besides coordinates, I will filter the data by time and date, but that is an if condition that is easy to write. Because the pixel coordinates I have will not map exactly onto a stored coordinate, I will use a small delta value to fix that issue for now, and I will use that delta value to precompute other points that are close to that coordinate.
Right now I get driver crashes on the iGPU (it should be out of memory or something similar) even if I just add something inside the 2 nested for loops, or even if I use a discard.
The code I have now is this.
NOTE: f_time is the filter for the time; for now I have a slider so that I have some interaction with the values.
precision mediump float;
precision mediump int;
const int maxTextureSize = 2048;
varying vec2 v_texCoord;
uniform sampler2D u_texture;
uniform float f_time;
uniform ivec2 textureDimensions;
void main(void) {
    float delta = 0.001;// now bigger delta just to make it work then we tune it
    // compute 1 pixel in texture coordinates.
    vec2 onePixel = vec2(1.0, 1.0) / float(textureDimensions.x);
    vec2 position = ( gl_FragCoord.xy / float(textureDimensions.x) );
    vec4 color = texture2D(u_texture, v_texCoord);
    vec4 outColor = vec4(0.0);
    float dist_x = distance( color.r, gl_FragCoord.x);
    float dist_y = distance( color.g, gl_FragCoord.y);
    //float dist_x = distance( color.g, gl_PointCoord.s);
    //float dist_y = distance( color.b, gl_PointCoord.t);
    for(int i = 0; i < maxTextureSize; i++){
        if(i < textureDimensions.x ){
            for(int j = 0; j < maxTextureSize ; j++){
                if(j < textureDimensions.y ){
                    // Where i am stuck now how to get the texture coordinate and test it with fragment shader
                    // the precomputation
                    vec4 pixel= texture2D(u_texture,vec2(i,j));
                    if(pixel.r > f_time){
                        outColor = vec4(1.0, 1.0, 1.0, 1.0);
                        // for now just break, no delta calculation to sum this point with others so that
                        // we will have an approximation of other points into that pixel
                        break;
                    }
                }
            }
        }
    }
    // this works
    if(color.t > f_time){
        //gl_FragColor = color;//;vec4(1.0, 1.0, 1.0, 1.0);
        gl_FragColor = outColor;
    }
}
What you are trying to do is simply not feasible.
You are trying to access a texture up to four million times, all within a single fragment shader invocation.
The way modern GPUs usually detect infinite loop conditions is by seeing how long your shader runs, and then killing it if it has run for "too long", the length of which is usually sufficiently generous. Your code, which does up to 4 million texture accesses, will almost certainly trigger this condition, which typically leads to a GPU reset.
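To put a rough number on it, here is the arithmetic as a few lines of Python. The 1920x1080 render target is purely an assumed example size; the question does not say how large the viewport is.

```python
texture_side = 2048
reads_per_fragment = texture_side * texture_side  # one full scan of the texture

# Hypothetical render target size, chosen only for illustration
frame_width, frame_height = 1920, 1080
reads_per_frame = reads_per_fragment * frame_width * frame_height

print(reads_per_fragment)          # 4194304
print(f"{reads_per_frame:.2e}")    # on the order of 10^12 reads per frame
```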
Generally speaking, the way you would find the position in a texture which is associated with some fragment is to do so directly. That is, create a 1:1 correspondence between screen fragment locations (gl_FragCoord) and texels in the texture. That way, your texture does not need to contain X/Y coordinates, and each fragment shader can access the data meant for that specific invocation.
What you're trying to do seems to be to pass a large table (four million elements) to the GPU, and then have the GPU process it. The ordering of values is (generally) irrelevant; any value could potentially modify any pixel. Some pixels don't have values applied to them, while others may have multiple values applied.
This is serial programmer thinking, not parallel thinking. The way you'd code that on the CPU is to walk each element in the table, look at where it goes, and build the results for each pixel.
In a parallel algorithm, you don't work that way. Each invocation needs to be able to instantly find the data in the table that applies to it. You should never be doing some kind of search through a table for your data. Especially not a linear search.
You need to think of this from the perspective of your fragment shader.
In your data table, for each position on the screen, there is a list of data values that apply to that screen position. Correct? What you need to do is make that list directly available to each fragment shader invocation. And since each fragment's list is not constant in size, you will need to use a linked list rather than a fixed-size array.
To do this, you build a texture the size of your render target. Each texel in the texture specifies the location in the data table of the first element that this fragment needs to process. This provides every fragment shader invocation with the location of its first element. Since some fragment shaders may have no data applied to them, you need to set aside some special texture coordinate value to represent "none".
The data in the data table consists of your time and date, but rather than "longitude/latitude", it has the texture coordinate of the next texel in the texture that applies for that fragment shader. This is how you make a linked list in shaders. Each location in the data table specifies the next location to be processed.
If that location was the last data to be processed, then the location will be the "none" value from before.
You should also be using a buffer texture or an SSBO to hold your data table, rather than a 2D texture. It would make things much easier.
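A CPU-side sketch of that layout in Python, as a minimal illustration rather than a finished design: the names `head`, `nodes` and `NONE` are invented for the example, and on the GPU `head` would be the render-target-sized texture while `nodes` would live in the buffer texture or SSBO.

```python
NONE = -1  # reserved value meaning "no (further) element"

def build_pixel_lists(width, height, records):
    """records: iterable of (x, y, payload), with x, y as pixel coordinates.
    Returns head[y][x] (index of the first node for that pixel, or NONE)
    and nodes[k] = (payload, index_of_next_node)."""
    head = [[NONE] * width for _ in range(height)]
    nodes = []
    for x, y, payload in records:
        nodes.append((payload, head[y][x]))  # new node points at the old head
        head[y][x] = len(nodes) - 1          # and becomes the new head
    return head, nodes

def payloads_for_pixel(head, nodes, x, y):
    """What one fragment invocation would do: follow only its own list."""
    out = []
    k = head[y][x]
    while k != NONE:
        payload, k = nodes[k]
        out.append(payload)
    return out

records = [(1, 0, "a"), (0, 1, "b"), (1, 0, "c")]
head, nodes = build_pixel_lists(2, 2, records)
print(payloads_for_pixel(head, nodes, 1, 0))  # ['c', 'a']
print(payloads_for_pixel(head, nodes, 0, 0))  # []
```

Each lookup touches only the elements that belong to that pixel, so the cost per fragment depends on the length of its own list and not on the size of the table.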

OpenGL artifacts using subroutine

I draw my scene using the glDrawElements function.
Since I want a single draw call to draw the complete scene,
I need a shader that switches between "materials".
I decided to use subroutines for the materials. Here is my fragment shader.
#version 440
layout(location = 0) flat in uvec2 inID_ShaderData;
layout(location = 1) in vec4 inPosition;
layout(location = 2) in vec2 inUV;
subroutine void shaderType(void);
subroutine uniform shaderType shaders[2];
uniform sampler2D texture0;
layout(location = 0) out vec4 outColor;
layout(location = 1) flat out uint outID;
subroutine(shaderType) void shader_flatColor(void)
{
    outColor = vec4(1,0,0,1); // test red color
    // outColor = unpackUnorm4x8(inID_ShaderData.y & 0x00ffffff); // this should be here normally
}
subroutine(shaderType) void shader_flatTexture(void)
{
    outColor = vec4(0,0,1,1); // test blue color
    // outColor = texture(texture0, inUV); // this should be here normally
}
void main()
{
    uint shader = (inID_ShaderData.y >> 24) & 0xff; // extract subroutine index from attributes
    shaders[ shader ](); // call subroutine - not working, makes artifacts
    /* calling subroutine this way works ok
    if (shader == 0) shaders[ 0 ]();
    if (shader == 1) shaders[ 1 ]();
    */
    outID = inID_ShaderData.x;
    if (outID == -1) // this condition never happens
        outColor = texture(texture0, inUV); // needed here to not to optimize out texture0, needed in subroutine
}
Question 1:
When using shaders[ shader ](); there are pixel artifacts on the quads drawn.
When using the IFs, it works OK. Is it a driver bug, or am I doing something wrong?
How can this be achieved without IFs, using subroutines?
(I have a Radeon 7850 on Windows 8 64 bit)
Question 2:
In the second subroutine I want to use a texture. But if I don't use the sampler variable
in main(), the compiler "does not see it" in the subroutine, and the CPU-side glUniform
call fails.
Is there a way to do this right, without cheating the compiler, e.g. with never-happening conditions?
P.S.: Sorry, I cannot post an image of the artifacts, but what should be red squares are
red squares with random blue pixels on some 5% of the area, mostly in corners.
Blue squares have red pixels.
You simply cannot do that. Switching subroutines is not possible on a per-attribute basis. The GLSL spec has this to say about your attempt:
Subroutine variables may be declared as explicitly-sized arrays, which
can be indexed only with dynamically uniform expressions.
Thinking about how GPUs work, this restriction does make total sense. There simply is no separate control flow for every single shader invocation, but only for much larger groups.
When you use the if approach, you are using constant indices, which of course are also dynamically uniform. You could of course try to branch based on your input attribute, but you should be aware that forcing non-uniform control flow that way might totally ruin performance. In the worst case, this will not be significantly more efficient than executing all the subroutine functions all the time, storing the results in some array, and selecting the final result from that array using your index.
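The two strategies can be compared on the CPU with a toy Python sketch. It only mirrors the selection logic with the test colors from the question; the function names are made up, and it does not model GPU divergence or subroutine uniforms in any way.

```python
def shader_flat_color():
    return (1.0, 0.0, 0.0, 1.0)  # test red, as in the question

def shader_flat_texture():
    return (0.0, 0.0, 1.0, 1.0)  # test blue, as in the question

SHADERS = [shader_flat_color, shader_flat_texture]

def shade_with_constant_indices(material):
    """Mirrors `if (shader == 0) shaders[0](); if (shader == 1) shaders[1]();`
    where every array index is a constant."""
    color = None
    if material == 0:
        color = SHADERS[0]()
    if material == 1:
        color = SHADERS[1]()
    return color

def shade_compute_all_then_select(material):
    """Worst-case equivalent: run every material, keep one result."""
    results = [shader() for shader in SHADERS]
    return results[material]

for material in (0, 1):
    print(shade_with_constant_indices(material) ==
          shade_compute_all_then_select(material))  # True both times
```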

GLSL shader algorithm optimization

Is there any way to optimize the following algorithm to make it faster, even if it is just a small speed increase?
const mat3 factor = mat3(1.0, 1.0, 1.0, 2.112, 1.4, 0.0, 0.0, 2.18, -2.21);
vec3 calculate(in vec2 coord)
{
    vec3 sample = vec3(texture2D(texture_a, coord).r,
                       texture2D(texture_b, coord).ra - 0.5);
    return (factor * sample) * 2.15;
}
The only significant optimization I can think of is to pack texture_a and texture_b into a single three-channel texture, if you can. That saves you one of the two texture lookups, which are most likely to be the bottleneck here.
@Thomas's answer is the most helpful, since texture lookups are the most expensive part, if his solution is possible in your application. If you already sample those textures somewhere else, better pass the values in as parameters to avoid duplicate lookups.
Otherwise I don't know if it can be optimized that much, but here are some straightforward things that come to my mind.
Compiler optimizations:
Add the const keyword to the coord parameter and, if possible, to sample too.
Add the f suffix to each float literal (desktop GLSL 1.20 and later accepts it; GLSL ES 1.00 does not).
Maybe write out the matrix multiplication manually.
I don't know if it's faster, because I don't know how the matrix multiplication is implemented, but since the constant factor matrix contains many ones and zeros, the product can perhaps be written out by hand.
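vec3 calculate(const in vec2 coord)
{
    // not const: a texture lookup is not a constant expression, and older GLSL
    // versions require a constant initializer for const variables
    vec3 sample = vec3(texture2D(texture_a, coord).r,
                       texture2D(texture_b, coord).ra - 0.5f);
    // factor * sample written out by hand (mat3 is column-major); zero entries are skipped
    vec3 result = vec3(sample.x);
    result.x += 2.112f * sample.y;
    result.y += 1.4f * sample.y + 2.18f * sample.z;
    result.z -= 2.21f * sample.z;
    return result * 2.15f;
}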
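A hand-written expansion like this is easy to get wrong, so it may be worth checking it against the plain matrix product on the CPU first. The Python sketch below assumes the intended operation is the matrix-vector product factor * sample, with mat3 filled column by column as GLSL does; the helper names are made up for the check.

```python
# Columns of the GLSL constant mat3(1.0, 1.0, 1.0, 2.112, 1.4, 0.0, 0.0, 2.18, -2.21)
FACTOR_COLUMNS = [(1.0, 1.0, 1.0), (2.112, 1.4, 0.0), (0.0, 2.18, -2.21)]

def mat3_times_vec3(columns, v):
    """GLSL-style M * v: the sum of each column scaled by one component of v."""
    return tuple(sum(columns[c][r] * v[c] for c in range(3)) for r in range(3))

def calculate_reference(sample):
    return tuple(2.15 * c for c in mat3_times_vec3(FACTOR_COLUMNS, sample))

def calculate_expanded(sample):
    """Manual expansion that skips the zero entries of the matrix."""
    x, y, z = sample
    return ((x + 2.112 * y) * 2.15,
            (x + 1.4 * y + 2.18 * z) * 2.15,
            (x - 2.21 * z) * 2.15)

sample = (0.5, -0.25, 0.125)
ref = calculate_reference(sample)
exp = calculate_expanded(sample)
print(all(abs(a - b) < 1e-12 for a, b in zip(ref, exp)))  # True
```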
