Performance depending on index type

I was playing around with "drawing" millions of triangles and found something interesting: switching the index type from VK_INDEX_TYPE_UINT32 to VK_INDEX_TYPE_UINT16 increased the number of triangles drawn per second by 1.5×! How can the difference in speed be so large?
I use indirect indexed instanced (so much i) drawing: 25 vertices, 138 indices (46 triangles), 2^21 ≈ 2M instances (I am too lazy to find where to disable vSync), 1 draw per frame; 96,468,992 triangles per frame in total. To get the clearest results I look away from the triangles (discarding rasterisation gives pretty much the same performance).
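For reference, those per-draw parameters would correspond to an indirect command like this (a sketch; VkDrawIndexedIndirectCommand is the standard Vulkan struct, the values are the question's figures, and the zero offsets are assumptions):
VkDrawIndexedIndirectCommand cmd = {
    .indexCount    = 138,      /* 46 triangles per instance */
    .instanceCount = 1u << 21, /* ~2M instances */
    .firstIndex    = 0,        /* offsets assumed zero for illustration */
    .vertexOffset  = 0,
    .firstInstance = 0,
};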
I have a very simple vertex shader:
layout(set = 0, binding = 0) uniform A
{
    mat4 cam;
};

layout(location = 0) in vec3 inPosition; //
layout(location = 1) in vec4 inColor;    // color and position are de-interleaved
layout(location = 2) in vec3 inGlob;     //
layout(location = 3) in vec4 inQuat;     // per-instance data, interleaved

layout(location = 0) out vec4 fragColor;

vec3 vecXquat(const vec3 v, const vec4 q)
{   // rotate a vector by a quaternion
    return v + 2.0f * cross(q.xyz, cross(q.xyz, v) + q.w * v);
}

void main()
{
    gl_Position = vec4(vecXquat(inPosition, inQuat) + inGlob, 1.0f) * cam;
    fragColor = inColor;
}
and a pass-through fragment shader.
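(The fragment shader isn't shown in the question; a minimal sketch of what such a pass-through shader would look like, with illustrative names:)
#version 450
layout(location = 0) in vec4 fragColor;
layout(location = 0) out vec4 outColor;

void main()
{
    outColor = fragColor;
}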
Primitives - VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST
The results:
~1950MTris/s with 32bit indices
~2850MTris/s with 16bit indices
GPU - GTX1050Ti

Since your shaders are so simple, your rendering performance will likely be dominated by factors that are otherwise negligible, like the vertex/index data transfer rate.
138 indices have to be read by the GPU for each instance. With 2 million instances, that's roughly 1.1GB of index data per frame that the GPU has to read with 32-bit indices. With 16-bit indices, that amount is of course halved. And with half as much data, there's a better chance that the index data manages to fit entirely in the vertex-pulling cache.
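A quick back-of-the-envelope check of those figures (a sketch in C, using the numbers from the question):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t instances = 1u << 21; /* 2,097,152 instances */
    const uint64_t indices   = 138;      /* indices per instance */
    /* index bytes the GPU must read per frame */
    printf("u32: %llu bytes\n", (unsigned long long)(instances * indices * 4)); /* 1,157,627,904 (~1.1GB) */
    printf("u16: %llu bytes\n", (unsigned long long)(instances * indices * 2)); /*   578,813,952 (~0.58GB) */
    return 0;
}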

Related

How to optimize texture fetching in openGL-ES

I want to draw a map in a shader, which is composed of five textures. One holds the ratios of the other textures (CC_Texture0: 512x512), and the other four are different elements (u_tex_r, u_tex_g, u_tex_b, u_tex_a: 256x256).
My shader is:
// CC_Texture0 (the 512x512 ratio map) is declared and bound by cocos2d
varying vec4 v_fragmentColor;
#ifdef GL_ES
varying highp vec2 v_uv0;
#else
varying vec2 v_uv0;
#endif
//uniform vec2 repeated_r;
//uniform vec2 repeated_g;
//uniform vec2 repeated_b;
//uniform vec2 repeated_a;
uniform vec4 u_repeat_1;
uniform vec4 u_repeat_2;
uniform sampler2D u_tex_r;
uniform sampler2D u_tex_g;
uniform sampler2D u_tex_b;
uniform sampler2D u_tex_a;

void main()
{
    // float literal (20.0): GLSL ES has no implicit int-to-float conversion
    vec2 tiledUv = v_uv0 * 20.0;
    // renamed from `mix`, which shadows the built-in function of the same name
    vec4 mask = texture2D(CC_Texture0, v_uv0);
    vec4 r = texture2D(u_tex_r, tiledUv);
    vec4 g = texture2D(u_tex_g, tiledUv);
    vec4 b = texture2D(u_tex_b, tiledUv);
    vec4 a = texture2D(u_tex_a, tiledUv);
    gl_FragColor = vec4((r * mask.r + g * mask.g + b * mask.b + a * mask.a).rgb, 1.0);
    //gl_FragColor = vec4((r * mask.r + g * mask.g + b * mask.b).rgb, 1.0);
}
The device is a Samsung G5308W and the frame rate is only 50fps, even if I delete the scale. When I just draw CC_Texture0, the frame rate reaches 60fps. Why? Is it GPU memory bandwidth or the scaling? Can anybody help me improve it?
Clearly your shader is pretty simple in terms of computation, so I suspect that the thing limiting performance is the cost of fetching the texture data. This is not unusual on mobile: memory bandwidth is often the limiting factor for frame rate, and also one of the top contributors to energy use, affecting battery life and device heat.
Some suggestions:
You mentioned that removing the scale doesn't help performance (I presume you mean removing the * 20 when constructing the UVs). Even if it doesn't have a measurable impact on this device, I'd still recommend avoiding the dependent texture read: it will probably improve performance slightly here and may improve it a lot on some older devices. Add a second set of UVs which you calculate in the vertex shader and pass in as a varying (see the sketch at the end of this answer). If you need different scales for each texture, add four new varyings; varyings are very cheap. Don't be tempted to pack multiple UVs into a vec4, as that can itself cause dependent texture reads.
You mention the texture resolution but not the texture format. If you can lower the bits per pixel of these textures as much as possible, you will see a big impact on performance. Compressed textures (e.g. ETC1, which is 4bpp) are best, but even switching from 8888 to 565 or 4444 will help a lot.
Could you use a simpler shader in some cases? You haven't mentioned the context, but this has the feel of terrain texture splatting. In terrain rendering you often find that very few chunks of geometry actually use all of the tiled textures. If you can identify geometry which references fewer textures and use a more specialized fragment shader for it, you'll get a good performance boost. The trade-off is more complex code and potentially more draw calls and state changes, which might impact CPU time, so it can be a tricky balancing act.
Finally, the G5308W is a six-year-old device; if you can't hit 60fps on it, it isn't the end of the world.
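Here is a minimal sketch of the first suggestion, moving the UV scaling into the vertex shader; the attribute and uniform names (a_position, a_uv, u_mvp) are illustrative, not from the question:
// Vertex shader sketch: precompute the tiled UVs so the fragment shader's
// texture2D calls become non-dependent reads.
attribute vec4 a_position; // illustrative names, not from the question
attribute vec2 a_uv;
uniform mat4 u_mvp;
varying vec2 v_uv0;     // UVs for the ratio map (CC_Texture0)
varying vec2 v_uvTiled; // pre-scaled UVs for the four detail textures

void main()
{
    v_uv0 = a_uv;
    v_uvTiled = a_uv * 20.0; // the * 20 moved out of the fragment shader
    gl_Position = u_mvp * a_position;
}
In the fragment shader, sample u_tex_r/g/b/a with v_uvTiled instead of computing the scaled coordinates per fragment.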

GLSL: memory exhausted

I am working on a WebGL scene with ~100 different 2048x2048px textures. I'm rendering point primitives, and each point has a texture index and texture UV offsets that indicate the region of the given texture that should be used on the point.
Initially, I attempted to pass each point's texture index as a varying, then use that index to fetch from an array of sampler2Ds. However, this yielded an error that one may only index a sampler2D array with a "constant integer expression", so now I'm using a gnarly if/else chain to select each point's texture:
/**
 * The fragment shader's main() function must define `gl_FragColor`,
 * which describes the pixel color of each pixel on the screen.
 *
 * To do so, we can use uniforms passed into the shader and varyings
 * passed from the vertex shader.
 *
 * Attempting to read a varying not generated by the vertex shader will
 * throw a warning but won't prevent shader compiling.
 **/

// set float precision
precision highp float;

// repeat identifies the size of each image in an atlas
uniform vec2 repeat;

// textures contains an array of n textures
uniform sampler2D textures[42];

varying vec2 vUv;        // blueprint uv coords
varying vec2 vTexOffset; // instance uv offsets
varying float vTexture;  // texture index of each point

void main() {
    int textureIndex = int(floor(vTexture));
    vec2 uv = vec2(gl_PointCoord.x, 1.0 - gl_PointCoord.y);
    vec4 color = vec4(0.0); // declared once, so it is still in scope at the end
    // The block below is automatically generated
    if (textureIndex == 0) { color = texture2D(textures[0], uv * repeat + vTexOffset); }
    else if (textureIndex == 1) { color = texture2D(textures[1], uv * repeat + vTexOffset); }
    else if (textureIndex == 2) { color = texture2D(textures[2], uv * repeat + vTexOffset); }
    else if (textureIndex == 3) { color = texture2D(textures[3], uv * repeat + vTexOffset); }
    [ more lines of the same ... ]
    gl_FragColor = color;
}
If the number of textures is small, this works fine. But if the number of textures is large (e.g. 40), this approach throws:
ERROR: 0:58: '[' : memory exhausted
I've tried reading around on this error but still am not sure what it means. Have I surpassed the maximum RAM of the GPU? If anyone knows what this error means and/or what I can do to resolve the problem, I'd be grateful for any tips.
More details:
Total size of all textures to be loaded: 58MB
Browser: recent Chrome
Graphics card: AMD Radeon R9 M370X 2048 MB graphics (stock 2015 OSX card)
There is a limit on how many samplers a fragment shader can access. It can be queried via gl.getParameter(gl.MAX_TEXTURE_IMAGE_UNITS). It is guaranteed to be at least 8, and is typically 16 or 32.
To circumvent the limit, texture arrays are available in WebGL2; they also allow indexing layers with an arbitrary variable. In WebGL1 your only option is atlases, but since your textures are already 2048x2048, you can't make them any bigger.
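For illustration, a sketch of how the fragment shader above could look in WebGL2 (GLSL ES 3.00) with a sampler2DArray; the uniform and varying names follow the question:
#version 300 es
precision highp float;

// One sampler2DArray replaces `sampler2D textures[42]` and may be
// indexed with a non-constant expression.
uniform mediump sampler2DArray textures;
uniform vec2 repeat;
in vec2 vTexOffset;
in float vTexture;
out vec4 outColor;

void main() {
    vec2 uv = vec2(gl_PointCoord.x, 1.0 - gl_PointCoord.y);
    // the third coordinate selects the layer
    outColor = texture(textures, vec3(uv * repeat + vTexOffset, floor(vTexture)));
}
Host-side, the layers would be uploaded to a TEXTURE_2D_ARRAY target with gl.texImage3D / gl.texSubImage3D.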
If you don't want to require WebGL2, you will have to split your rendering into multiple draw calls with different textures bound.
Also consider that having 100 8-bit RGBA 2048x2048 textures uses up 1.6 gigabytes of VRAM. Texture compression via WEBGL_compressed_texture_s3tc can reduce that by 8x or 4x, depending on how much alpha precision you need.

How can I iterate with a loop over a sampler2D

I have some data encoded in a 2k-by-2k floating-point texture. The data are longitude, latitude, time, and date, stored as R,G,B,A. They are all normalized, but for now that is not a problem; I can de-normalize them later if I want to.
What I need now is to iterate through the whole texture and find which longitude and latitude belong to the current fragment coordinate. I assume that the whole atlas has normalized coordinates and maps onto the whole OpenGL context. Besides coordinates, I will filter the data by time and date, but that is an easy if condition. Because the pixel coordinates I have will not map exactly to a texel, I will use a small delta value to work around that for now, and I will use that delta value to pre-compute other points that are close to that coordinate.
Now I get driver crashes on an iGPU (it should be out of memory or something similar) even if I just add something inside the two nested for loops, or even if I use a discard.
The code I have now is this.
NOTE: f_time is the time filter; for now I control it with a slider so that I have some interaction with the values.
precision mediump float;
precision mediump int;

const int maxTextureSize = 2048;

varying vec2 v_texCoord;
uniform sampler2D u_texture;
uniform float f_time;
uniform ivec2 textureDimensions;

void main(void) {
    float delta = 0.001; // a bigger delta for now, just to make it work; tune later
    // compute 1 pixel in texture coordinates
    vec2 onePixel = vec2(1.0, 1.0) / float(textureDimensions.x);
    vec2 position = gl_FragCoord.xy / float(textureDimensions.x);
    vec4 color = texture2D(u_texture, v_texCoord);
    vec4 outColor = vec4(0.0);
    float dist_x = distance(color.r, gl_FragCoord.x);
    float dist_y = distance(color.g, gl_FragCoord.y);
    //float dist_x = distance( color.g, gl_PointCoord.s);
    //float dist_y = distance( color.b, gl_PointCoord.t);
    for (int i = 0; i < maxTextureSize; i++) {
        if (i >= textureDimensions.x) { // note `>=`: stop at the real width
            break;
        }
        for (int j = 0; j < maxTextureSize; j++) {
            if (j >= textureDimensions.y) {
                break;
            }
            // Where I am stuck now: how to get the texture coordinate and
            // test it against the fragment; the precomputation.
            // texture2D expects normalized float coordinates.
            vec4 pixel = texture2D(u_texture,
                                   vec2(float(i), float(j)) / vec2(textureDimensions));
            if (pixel.r > f_time) {
                outColor = vec4(1.0, 1.0, 1.0, 1.0);
                // for now just break; no delta calculation yet to sum this point
                // with others for an approximation of other points in that pixel
                break;
            }
        }
    }
    // this works
    if (color.t > f_time) {
        //gl_FragColor = color; //vec4(1.0, 1.0, 1.0, 1.0);
    }
    gl_FragColor = outColor;
}
What you are trying to do is simply not feasible.
You are trying to access a texture up to four million times, all within a single fragment shader invocation.
The way modern GPUs usually detect infinite loop conditions is by seeing how long your shader runs, and then killing it if it has run for "too long", the length of which is usually sufficiently generous. Your code, which does up to 4 million texture accesses, will almost certainly trigger this condition.
Which typically leads to a GPU reset.
Generally speaking, the way you would find the position in a texture which is associated with some fragment is to do so directly. That is, create a 1:1 correspondence between screen fragment locations (gl_FragCoord) and texels in the texture. That way, your texture does not need to contain X/Y coordinates, and each fragment shader can access the data meant for that specific invocation.
What you're trying to do seems to be to pass a large table (four million elements) to the GPU, and then have the GPU process it. The ordering of values is (generally) irrelevant; any value could potentially modify any pixel. Some pixels don't have values applied to them, while others may have multiple values applied.
This is serial programmer thinking, not parallel thinking. The way you'd code that on the CPU is to walk each element in the table, look at where it goes, and build the results for each pixel.
In a parallel algorithm, you don't work that way. Each invocation needs to be able to instantly find the data in the table that applies to it. You should never be doing some kind of search through a table for your data. Especially not a linear search.
You need to think of this from the perspective of your fragment shader.
In your data table, for each position on the screen, there is a list of data values that apply to that screen position. Correct? What you need to do is make that list directly available to each fragment shader invocation. And since each fragment's list is not constant in size, you will need to use a linked list rather than a fixed-size array.
To do this, you build a texture the size of your render target. Each texel in the texture specifies the location in the data table of the first element that this fragment needs to process. This provides every fragment shader invocation with the location of its first element. Since some fragment shaders may have no data applied to them, you need to set aside some special texture coordinate value to represent "none".
The data in the data table consists of your time and date, but rather than "longitude/latitude", it has the texture coordinate of the next texel in the texture that applies for that fragment shader. This is how you make a linked list in shaders. Each location in the data table specifies the next location to be processed.
If that location was the last data to be processed, then the location will be the "none" value from before.
You should also be using a buffer texture or an SSBO to hold your data table, rather than a 2D texture. It would make things much easier.
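To make the idea concrete, here is a minimal WebGL1-style traversal sketch using plain 2D textures (rather than the buffer texture or SSBO suggested above); the head-pointer texture (u_head), the data table's channel layout, and the negative-coordinate "none" sentinel are all assumptions for illustration:
precision mediump float;

uniform sampler2D u_head;  // screen-sized: table coord of each fragment's first entry
uniform sampler2D u_table; // data table: r = time, gb = coord of the next entry
uniform vec2 u_screenSize;
uniform float f_time;

const int MAX_STEPS = 64;  // ES2/WebGL1 loops need a compile-time bound

void main(void) {
    vec4 outColor = vec4(0.0);
    // start of this fragment's linked list; negative x means "none"
    vec2 next = texture2D(u_head, gl_FragCoord.xy / u_screenSize).xy;
    for (int i = 0; i < MAX_STEPS; i++) {
        if (next.x < 0.0) {
            break; // end of this pixel's list
        }
        vec4 entry = texture2D(u_table, next);
        if (entry.r > f_time) {
            outColor = vec4(1.0);
        }
        next = entry.gb; // follow the link to the next entry
    }
    gl_FragColor = outColor;
}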

Why is this OpenGL 3 vertex shader very slow?

I have the following vertex shader:
#version 150

in vec4 position;
in vec2 texture;
in int layer;

out vec2 pass_texture;
out float pass_layer;

uniform mat4 _modelToClipMatrix;
uniform float layerDepth[255];

void main (void)
{
    gl_Position = _modelToClipMatrix * vec4(position.xy, layerDepth[layer]/255, position.w);
    // gl_Position = _modelToClipMatrix*position;
    pass_layer = float(layer);
    pass_texture = texture;
}
When I use it the way it is here, my frame rate is about 7 FPS. If I use the second line (commented out) instead of the first, it jumps to about 50 FPS. It seems the array lookup is the big problem. Why is it so terribly slow? And how can I improve performance while keeping the functionality?
My hardware is an ATI Radeon HD 4670 256MB (2010 iMac).
My vertex structure looks like:
typedef struct
{
floatVector2 position; //2*4=8
uByteVector2 textureCoordinate; //2*1=2
GLubyte layer; //1
} PCBVertex;
and I set up the buffer in the following way:
glVertexAttribPointer((GLuint)positionAttribute, 2, GL_FLOAT, GL_FALSE, sizeof(PCBVertex), (const GLvoid *)offsetof(PCBVertex, position));
glVertexAttribPointer((GLuint)textureAttribute, 2, GL_UNSIGNED_BYTE, GL_FALSE, sizeof(PCBVertex), (const GLvoid *)offsetof(PCBVertex, textureCoordinate));
glVertexAttribIPointer(layerAttribute, 1, GL_UNSIGNED_BYTE, sizeof(PCBVertex), (const GLvoid *)offsetof(PCBVertex, layer));
Some background information:
I'm working on a drawing package. The user can draw on multiple layers. One layer is active at a time, and it is drawn front-most. The user can also "flip" the layers, as if looking from the other side. I figured it would be inefficient to update all vertices when the layer order changes, so I give each vertex a layer number and look up its current depth in the uniform (I only send x and y as position data). As a side note: the fragment shader uses the same layer number to determine the color, using a uniform array as well.
If you remove the line
gl_Position = _modelToClipMatrix*vec4(position.xy,layerDepth[layer]/255,position.w);
from your shader, the uniform float layerDepth[255]; becomes unused, which means the compiler will optimize it away.
Furthermore, the layer attribute is then unused too, so the layerAttribute location becomes -1, preventing any data transfers for that attrib pointer. In other words, the fast path isn't just skipping the array lookup; it also skips the per-vertex layer uploads, so the two measurements aren't directly comparable.
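A minimal sketch of what that means for the setup code; it reuses the names from the question and assumes the location is queried with glGetAttribLocation:
/* If the `layer` attribute was optimized away, its location is -1 and the
   attrib pointer must not be set up for it. */
GLint layerAttribute = glGetAttribLocation(program, "layer");
if (layerAttribute != -1) {
    glEnableVertexAttribArray((GLuint)layerAttribute);
    glVertexAttribIPointer((GLuint)layerAttribute, 1, GL_UNSIGNED_BYTE,
                           sizeof(PCBVertex),
                           (const GLvoid *)offsetof(PCBVertex, layer));
}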

How can a fragment shader use the color values of the previously rendered frame?

I am learning to use shaders in OpenGL ES.
As an example, here's my playground fragment shader, which takes the current video frame and makes it grayscale:
varying highp vec2 textureCoordinate;
uniform sampler2D videoFrame;

void main() {
    highp vec4 theColor = texture2D(videoFrame, textureCoordinate);
    highp float avrg = (theColor[0] + theColor[1] + theColor[2]) / 3.0;
    theColor[0] = avrg; // r
    theColor[1] = avrg; // g
    theColor[2] = avrg; // b
    gl_FragColor = theColor;
}
theColor represents the current pixel. It would be cool to also get access to the pixel at this same coordinate in the previous frame.
For the sake of curiosity, I would like to add or multiply the color of the current pixel with the color of the same pixel from the previous render frame.
How can I keep the previous frame's pixels around and pass them into my fragment shader in order to do something with them?
Note: It's OpenGL ES 2.0 on the iPhone.
You need to render the previous frame to a texture using a Framebuffer Object (FBO); then you can read that texture in your fragment shader.
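A minimal sketch of the usual ping-pong setup in the GL ES 2.0 C API (names and sizes are illustrative; completeness checks and cleanup omitted):
/* Two texture-backed FBOs: each frame renders into one while the other,
   holding the previous frame, is sampled by the fragment shader. */
GLuint tex[2], fbo[2];
glGenTextures(2, tex);
glGenFramebuffers(2, fbo);
for (int i = 0; i < 2; i++) {
    glBindTexture(GL_TEXTURE_2D, tex[i]);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[i]);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, tex[i], 0);
}
/* Per frame: bind fbo[cur] as the render target, bind tex[1 - cur] as the
   "previous frame" sampler, draw, then flip cur. */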
The dot intrinsic function that Damon refers to is a code implementation of the mathematical dot product; in GLSL it is the built-in function dot(). Mathematically, a dot product goes like this:
Given a vector a and a vector b, the dot product a · b produces a scalar result c:
c = a.x * b.x + a.y * b.y + a.z * b.z
Most modern graphics hardware (and CPUs, for that matter) can perform this kind of operation in one pass. In your particular case, you could compute your average easily with a dot product like so:
highp vec4 weights = vec4(1.0/3.0, 1.0/3.0, 1.0/3.0, 0.0); // fourth weight 0.0 so alpha is ignored
highp float avrg = dot(theColor, weights);
This multiplies each component of theColor by 1/3 (and the fourth component by 0) and then adds them together.
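Putting it together, the grayscale shader from the question could be written like this (a sketch; the behavior matches the original apart from using dot()):
varying highp vec2 textureCoordinate;
uniform sampler2D videoFrame;

void main() {
    highp vec4 theColor = texture2D(videoFrame, textureCoordinate);
    // one dot product replaces the three adds and the divide
    highp float avrg = dot(theColor, vec4(1.0/3.0, 1.0/3.0, 1.0/3.0, 0.0));
    gl_FragColor = vec4(avrg, avrg, avrg, theColor.a);
}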
