Optimizing texture fetches with higher mip levels - performance

Let's say I have a shader program in DirectX or OpenGL rendering a full-screen quad, and in the pixel/fragment shader I sample some huge textures at random texture coordinates. All samples within one shader invocation use the same texture coordinate, but the coordinate varies between invocations. These fetches cause a performance drop; I suspect that, given the size of the textures, the GPU texture cache is too small and is being used inefficiently.
Now I have a theoretical question: can I improve performance by using low-resolution mask textures (say 32x32) built by mipmapping the large textures, so that if the value in a mask texture at the given texture coordinate at some higher mip level does not pass a test, I can skip the texture fetches at the full-size level 0? Something like this in HLSL (GLSL code is pretty similar, but there is no [branch] attribute):
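float2 tc = calculateTexCoordinates();
// Cheap test against the coarse mip level first.
// (The sampler is named texSampler here because "sampler" is a reserved word in HLSL.)
bool performHeavyComputations = testValue(largeMipmappedTexture.SampleLevel(texSampler, tc, 5));
float result = 0;
[branch] // request a real dynamic branch instead of a flattened one
if (performHeavyComputations)
    result += largeMipmappedTexture.SampleLevel(texSampler, tc, 0).r;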
About 50% of the texels at mip level 5 will not pass the test, so a lot of shader invocations should not need to sample the full-size textures.
But I am introducing branching into the code. Could this branching hurt performance more than always sampling the full-size texture, even when that is not needed? Different GPUs may behave differently, and some may not even support branching; will they perform two fetches instead of one?
I can test this code on some machines later, but my question is theoretical.
And can you suggest other optimizations if this won't work properly?
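One thing worth checking before relying on the mask: ordinary mip generation averages texels, so a level-5 value can fail the test even though some level-0 texels underneath it would pass. A minimal CPU-side sketch of a conservative alternative follows, assuming the test is of the form "value above a threshold" so that a max reduction is the safe choice. The `Image` type and `reduceMax` are hypothetical names for illustration, not part of any API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-channel image stored row by row.
struct Image {
    int width;
    int height;
    std::vector<float> texels;
};

// Builds the next coarser level by taking the maximum of each 2x2 block
// instead of the average, so a coarse texel never hides a large fine texel.
// Assumes non-zero, even dimensions.
Image reduceMax(const Image& src) {
    assert(src.width > 0 && src.height > 0);
    assert(src.width % 2 == 0 && src.height % 2 == 0);
    Image dst{src.width / 2, src.height / 2, {}};
    dst.texels.resize(static_cast<std::size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            const int sx = 2 * x;
            const int sy = 2 * y;
            const float a = src.texels[sy * src.width + sx];
            const float b = src.texels[sy * src.width + sx + 1];
            const float c = src.texels[(sy + 1) * src.width + sx];
            const float d = src.texels[(sy + 1) * src.width + sx + 1];
            dst.texels[y * dst.width + x] = std::max(std::max(a, b), std::max(c, d));
        }
    }
    return dst;
}
```

Applying `reduceMax` five times to level 0 would give a mask that plays the role of mip level 5 in the shader above; the result would have to be uploaded as its own texture or as an explicit mip level rather than generated automatically.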


Performance trade off of using a rectangular shaped texture vs. square shaped texture in Three.js material? Specifically a .basis texture

I'm wondering what the trade off is between using a texture that's 128x512 vs. a texture that's 512x512.
The texture is a skateboard deck (naturally rectangular) so I initially made the texture have an aspect ratio that made the deck appear correctly.
I'd like to use a .basis texture and read "Transcoding to PVRTC1 (for iOS) requires square power-of-two textures." on the Three.js BasisTextureLoader documentation.
So I'm trying to weigh the loading time + performance trade off between using the 128x512 as a JPG or PNG vs. a 512x512 basis texture.
My best guess is that the 128x512 would take up less memory because it has fewer texels, but I've also read that the GPU likes square textures and that basis is much more GPU-optimized, so I'm torn between the two routes.
Any knowledge of the performance trade-offs between these two options would be highly appreciated, especially an explanation of the benefits of basis textures in general.
Three.js only really needs power-of-two textures when you're asking the texture's .minFilter to perform mip-mapping. In this case, the GPU will make several copies of the texture, each at half the resolution of the previous one (512, 256, 128, 64, etc.), which is why it asks for a power-of-two. The default value does perform mip-mapping; you can see alternative .minFilter values in this page under "Minification Filters". Nearest and Linear do not require P.O.T. textures, but you'll get pixelation artifacts when the texture is scaled down.
In WebGL, you can use a 512x128 without problems, since both dimensions are a power-of-two. The performance trade-off is that you save a bunch of pixels that would have been stretched-out duplicates anyway.
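To put a number on "save a bunch of pixels", here is a small sketch that counts the texels in a full mip chain for a given base size. `mipChainTexels` is a hypothetical helper, and the count says nothing about bytes until you pick a format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Total texel count of a full mip chain, from the base level down to 1x1.
// Each level halves both dimensions, and a dimension never drops below 1.
std::uint64_t mipChainTexels(int width, int height) {
    assert(width > 0 && height > 0);
    std::uint64_t total = 0;
    while (true) {
        total += static_cast<std::uint64_t>(width) * height;
        if (width == 1 && height == 1) break;
        width = std::max(1, width / 2);
        height = std::max(1, height / 2);
    }
    return total;
}
```

For the sizes in the question, `mipChainTexels(128, 512)` gives 87383 texels and `mipChainTexels(512, 512)` gives 349525, so the square version holds about four times as many. With an uncompressed RGBA8 upload that is 4 bytes per texel; a GPU-compressed format transcoded from .basis stores fewer bits per texel, so the byte comparison between the two options depends on the target format.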

Summed area table in GLSL and GPU fragment shader execution

I am trying to compute the integral image (aka summed area table) of a texture I have in the GPU memory (a camera capture), the goal being to compute the adaptive threshold of said image. I'm using OpenGL ES 2.0, and still learning :).
I did a test with a simple Gaussian blur shader (vertical/horizontal passes), which works fine, but I need a much larger, variable averaging area for it to give satisfactory results.
I did implement a version of that algorithm on CPU before, but I'm a bit confused on how to implement that on a GPU.
I tried to do a (completely incorrect) test with just something like this for every fragment :
#version 100
#extension GL_OES_EGL_image_external : require
precision highp float;
uniform sampler2D u_Texture; // The input texture.
varying lowp vec2 v_TexCoordinate; // Interpolated texture coordinate per fragment.
uniform vec2 u_PixelDelta; // Pixel delta
void main()
{
    // get neighboring pixels values
    float center = texture2D(u_Texture, v_TexCoordinate).r;
    float a = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, 0.0)).r;
    float b = texture2D(u_Texture, v_TexCoordinate + vec2(0.0, u_PixelDelta.y * 1.0)).r;
    float c = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, u_PixelDelta.y * 1.0)).r;
    // compute value
    float pixValue = center + a + b - c;
    // Result stores value (R) and original gray value (G)
    gl_FragColor = vec4(pixValue, center, center, 1.0);
}
And then another shader to get the area that I want and then get the average. This is obviously wrong as there's multiple execution units operating at the same time.
I know that the common way of computing a prefix sum on a GPU is to do it in two passes (vertical/horizontal, as discussed in this thread or here), but isn't there a problem, since each cell has a data dependency on the previous (top or left) one?
I can't seem to understand the order in which the multiple execution units on a GPU will process the different fragments, and how a two-pass filter can solve that issue. As an example, if I have some values like this :
2 1 5
0 3 2
4 4 7
The two pass should give (first columns then rows):
2 1  5       2  3  8
2 4  7  ->   2  6 13
6 8 14       6 14 28
How can I be sure that, as an example, the value [0;2] will be computed as 6 (2 + 4) and not 4 (0 + 4, if the 0 hasn't been computed yet) ?
Also, as I understand that fragments are not pixels (If I'm not mistaken), would the values I store back in one of my texture in the first pass be the same in another pass if I use the exact same coordinates passed from the vertex shader, or will they be interpolated in some way ?
Tommy and Bartvbl address your questions about a summed-area table, but your core problem of an adaptive threshold may not need that.
As part of my open source GPUImage framework, I've done some experimentation with optimizing blurs over large radii using OpenGL ES. Generally, increasing blur radii leads to a significant increase in texture sampling and calculations per pixel, with an accompanying slowdown.
However, I found that for most blur operations you can apply a surprisingly effective optimization to cap the number of blur samples. If you downsample the image before blurring, blur at a smaller pixel radius (radius / downsampling factor), and then linearly upsample, you can arrive at a blurred image that is the equivalent of one blurred at a much larger pixel radius. In my tests, these downsampled, blurred, and then upsampled images look almost identical to the ones blurred based on the original image resolution. In fact, precision limits can lead to larger-radii blurs done at a native resolution breaking down in image quality past a certain size, where the downsampled ones maintain the proper image quality.
By adjusting the downsampling factor to keep the downsampled blur radius constant, you can achieve near constant-time blurring speeds in the face of increasing blur radii. For an adaptive threshold, the image quality should be good enough to use for your comparisons.
I use this approach in the Gaussian and box blurs within the latest version of the above-linked framework, so if you're running on Mac, iOS, or Linux, you can evaluate the results by trying out one of the sample applications. I have an adaptive threshold operation based on a box blur that uses this optimization, so you can see if the results there are what you want.
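As a rough illustration of the "radius / downsampling factor" bookkeeping described above, here is a minimal sketch. It is not the framework's actual logic; `planBlur`, `BlurPlan` and the `maxRadius` cap are assumptions made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// How much to shrink the image, and the blur radius to use on the shrunken image.
struct BlurPlan {
    int downsample;       // 1 means "blur at native resolution"
    float reducedRadius;  // radius / downsample
};

// Picks the smallest integer downsampling factor that keeps the radius
// actually blurred at or below maxRadius (an assumed, caller-chosen cap).
BlurPlan planBlur(float radius, float maxRadius) {
    assert(radius > 0.0f && maxRadius > 0.0f);
    const int factor = std::max(1, static_cast<int>(std::ceil(radius / maxRadius)));
    return {factor, radius / static_cast<float>(factor)};
}
```

For example, a requested radius of 40 pixels with a cap of 8 would be handled by downsampling by 5, blurring at radius 8, and linearly upsampling the result.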
As per the above, it's not going to be fantastic on a GPU. But assuming the cost of shunting data between the GPU and CPU is more troubling, it may still be worth persevering.
The most obvious prima facie solution is to split horizontal/vertical as discussed. Use an additive blending mode, create a quad that draws the whole source image, then e.g. for the horizontal step on a bitmap of width n issue a call that requests the quad be drawn n times, the 0th time at x = 0, the mth time at x = m. Then ping-pong via an FBO, switching the target buffer of the horizontal draw into the source texture for the vertical.
Memory accesses are probably O(n^2) (i.e. you'll probably cache quite well, but that's hardly a complete relief), so it's a fairly poor solution. You could improve it by divide and conquer, doing the same thing in bands: e.g. for the vertical step, independently sum within bands of 8 rows, after which the only error in each band is that it fails to include the totals from the final row of the band above. So perform a second pass to propagate those.
However an issue with accumulating in the frame buffer is clamping to avoid overflow — if you're expecting a value greater than 255 anywhere in the integral image then you're out of luck because the additive blending will clamp and GL_RG32I et al don't reach ES prior to 3.0.
The best solution I can think of to that, without using any vendor-specific extensions, is to split up the bits of your source image and combine channels after the fact. Supposing your source image were 4 bit and your image less than 256 pixels in both directions, you'd put one bit each in the R, G, B and A channels, perform the normal additive step, then run a quick recombine shader as value = A + (B*2) + (G*4) + (R*8). If your texture is larger or smaller in size or bit depth then scale up or down accordingly.
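To make the bit-splitting arithmetic concrete, here is a CPU-side sketch that emulates it: each 4-bit value contributes one bit per channel, the channels are accumulated with saturation at 255 (as additive blending into an 8-bit target would do), and the recombine formula from above rebuilds the sum. All names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// One accumulator pixel with 8-bit channels, stored as {R, G, B, A}.
using Rgba = std::array<int, 4>;

// Puts one bit of a 4-bit value in each channel: R = bit 3 down to A = bit 0.
Rgba splitBits(int value) {
    assert(value >= 0 && value < 16);
    return {(value >> 3) & 1, (value >> 2) & 1, (value >> 1) & 1, value & 1};
}

// Emulates additive blending into an 8-bit target: each channel saturates at 255.
Rgba blendAdd(const Rgba& dst, const Rgba& src) {
    Rgba out{};
    for (int c = 0; c < 4; ++c) out[c] = std::min(255, dst[c] + src[c]);
    return out;
}

// The recombine step: value = A + (B*2) + (G*4) + (R*8).
int recombine(const Rgba& p) {
    return p[3] + p[2] * 2 + p[1] * 4 + p[0] * 8;
}

// Sums a run of 4-bit values through the split channels.
int sumViaChannels(const std::vector<int>& values) {
    Rgba acc{};
    for (int v : values) acc = blendAdd(acc, splitBits(v));
    return recombine(acc);
}
```

Summing 200 pixels of value 15 this way yields 3000, well past 255, because no single channel ever exceeds 200. Once more than 255 one-bits land in the same channel the clamp bites again, which is where the "less than 256 pixels in both directions" condition comes from.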
(platform specific observation: if you're on iOS then you've hopefully already got a CVOpenGLESTextureCache in the loop, which means you have CPU and GPU access to the same texture store, so you might well prefer to kick this step off to GCD. iOS is amongst the platforms supporting EXT_shader_framebuffer_fetch; if you have access to that then you can write any old blend function you like and at least ditch the combination step. Also you're guaranteed that preceding geometry has completed before you draw so if each strip writes its totals where it should and also to the line below then you can perform the ideal two-pixel-strips solution with no intermediate buffers or state changes)
What you attempt to do cannot be done in a fragment shader. GPUs are by nature very different from CPUs in that they execute their instructions in parallel, in massive numbers at the same time. Because of this, OpenGL does not make any guarantees about execution order, because the hardware physically doesn't allow it to.
So there is not really any defined order other than "whatever the GPU thread block scheduler decides".
Fragments are pixels, sorta-kinda. They are pixels that potentially end up on screen. If one triangle ends up in front of another, the previously calculated colour value is discarded. This happens regardless of whatever colour was stored at that pixel in the colour buffer previously.
As for creating the summed area table on the GPU, I think you may first want to look at GLSL "Compute Shaders", which are specifically made for this sort of thing.
I think you may be able to get this to work by creating a single thread for each row of pixels in the table, then have every thread "lag behind" by 1 pixel compared to the previous row.
In pseudocode:
int row_id = thread_id()
for column_index in (image.cols + image.rows):
    int my_current_column_id = column_index - row_id
    if my_current_column_id >= 0 and my_current_column_id < image.cols:
        // calculate sums
The catch of this method is that all threads should be guaranteed to execute their instructions simultaneously without getting ahead of one another. This is guaranteed in CUDA, but I'm not sure whether it is in OpenGL compute shaders. It may be a starting point for you, though.
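To check that the staggered schedule really never reads a value before it is written, here is a sequential CPU simulation of it. This is only a sketch: the inner loop stands in for the threads running in lockstep, and `summedAreaWavefront` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

using Grid = std::vector<std::vector<int>>;

// Simulates the diagonal schedule: at each step the worker for row r handles
// column (step - r), so everything it reads was written in an earlier step.
Grid summedAreaWavefront(const Grid& image) {
    assert(!image.empty() && !image[0].empty());
    const int rows = static_cast<int>(image.size());
    const int cols = static_cast<int>(image[0].size());
    Grid sat(rows, std::vector<int>(cols, 0));
    for (int step = 0; step < rows + cols - 1; ++step) {
        for (int r = 0; r < rows; ++r) {  // one iteration per simulated thread
            const int c = step - r;
            if (c < 0 || c >= cols) continue;
            const int up = r > 0 ? sat[r - 1][c] : 0;                    // written at step - 1
            const int left = c > 0 ? sat[r][c - 1] : 0;                  // written at step - 1
            const int diag = (r > 0 && c > 0) ? sat[r - 1][c - 1] : 0;   // written at step - 2
            sat[r][c] = image[r][c] + up + left - diag;
        }
    }
    return sat;
}
```

On the 3x3 example from the question this produces the rows 2 3 8, 2 6 13 and 6 14 28. On real hardware the step-to-step ordering would need an explicit barrier or equivalent guarantee, which is exactly the catch mentioned above.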
It may look surprising to a beginner, but prefix sum (and hence SAT) calculation is well suited to parallelization. While the Hensley algorithm is the most intuitive to understand (and has been implemented in OpenGL), more work-efficient parallel methods are available; see CUDA scan. The paper by Sengupta discusses a parallel method with reduce and down-sweep phases that seems to be the state of the art in efficiency. These are valuable materials, but they do not go into OpenGL shader implementations in detail. The closest document is the presentation you have found (it refers to the Hensley publication), since it has some shader snippets. The job is doable entirely in a fragment shader with FBO ping-pong. Note that the FBO and its texture need a high-precision internal format; GL_RGB32F would be best, but I am not sure whether it is supported in OpenGL ES 2.0.
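The ping-pong idea can be sketched on the CPU as a recursive-doubling scan, which is, as far as I understand it, the kind of scheme the Hensley approach builds on. Each pass reads only the previous pass's buffer and writes only the new one, so the order in which elements are processed within a pass does not matter. Function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One ping-pong pass: out[i] = in[i] + in[i - offset], or just in[i] when
// there is no element that far to the left. Reads only `in`, writes only
// `out`, like a fragment shader sampling one texture while rendering to another.
std::vector<int> doublingPass(const std::vector<int>& in, std::size_t offset) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] + (i >= offset ? in[i - offset] : 0);
    return out;
}

// Inclusive prefix sum using about log2(n) passes with offsets 1, 2, 4, ...
std::vector<int> prefixSum(std::vector<int> data) {
    for (std::size_t offset = 1; offset < data.size(); offset *= 2)
        data = doublingPass(data, offset);
    return data;
}
```

Running `prefixSum` over every row and then over every column of the result gives the summed-area table. On the GPU each call to `doublingPass` would correspond to one full-screen draw with the source and target textures swapped afterwards.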

Is it better to use a single texture or multiple textures for a YUV image

This question is for OpenGL ES 2.0 (on Android) but may be more general to OpenGL.
Ultimately all performance questions are implementation-dependent, but if anyone can answer this question in general or based on their experience that would be helpful. I'm writing some test code as well.
I have a YUV (12bpp) image I'm loading into a texture and color-converting in my fragment shader. Everything works fine but I'd like to see where I can improve performance (in terms of frames per second).
Currently I'm actually loading three textures for each image - one for the Y component (of type GL_LUMINANCE), one for the U component (of type GL_LUMINANCE and of course 1/4 the size of the Y component), and one for the V component (of type GL_LUMINANCE and of course 1/4 the size of the Y component).
Assuming I can get the YUV pixels in any arrangement (e.g. the U and V in separate planes or interspersed), would it be better to consolidate the three textures into only two or only one? Obviously it's the same number of bytes to push to the GPU no matter how you do it, but maybe with fewer textures there would be less overhead. At the very least, it would use fewer texture units. My ideas:
If the U and V pixels were interspersed with each other, I could load them in a single texture of type GL_LUMINANCE_ALPHA which has two components.
I could load the entire YUV image as a single texture (of type GL_LUMINANCE but 3/2 the size of the image) and then in the fragment shader I could call texture2D() three times on the same texture, doing a bit of arithmetic to figure out the correct co-ordinates to pass to texture2D for the Y, U and V components.
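For the single-texture idea, the "bit of arithmetic" might look like the sketch below. It assumes a planar layout (Y plane, then U, then V, as in I420) copied byte for byte into a texture that is `width` texels wide and `height * 3 / 2` texels tall; the struct and function names are hypothetical.

```cpp
#include <cassert>

struct Texel { int x; int y; };  // column and row inside the packed texture

struct PlaneCoords { Texel y; Texel u; Texel v; };

// Converts a linear byte offset into a texel position for a given row width.
Texel fromOffset(int offset, int width) {
    return {offset % width, offset / width};
}

// Where the Y, U and V samples for image pixel (x, y) live in the packed texture.
// Assumes even image dimensions and 2x2 chroma subsampling.
PlaneCoords locate(int x, int y, int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    assert(x >= 0 && x < width && y >= 0 && y < height);
    const int chromaWidth = width / 2;
    const int chromaSize = chromaWidth * (height / 2);
    const int chromaIndex = (y / 2) * chromaWidth + (x / 2);
    const int uStart = width * height;      // U plane follows the Y plane
    const int vStart = uStart + chromaSize; // V plane follows the U plane
    return {fromOffset(y * width + x, width),
            fromOffset(uStart + chromaIndex, width),
            fromOffset(vStart + chromaIndex, width)};
}
```

Note that with this layout two chroma rows share one texture row, so the shader would likely need GL_NEAREST sampling for U and V; linear filtering would blend unrelated samples. That is one argument for the interleaved GL_LUMINANCE_ALPHA option instead.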
I would combine the data into as few textures as possible. Fewer textures is usually a better option for a few reasons.
Fewer state changes to setup the draw call.
The fewer texture fetches in a fragment shader the better.
Less upload time.
I understand some of these are focused on more specific hardware, but the principles apply to most Mobile graphics architectures.
Best Practices for Working with Texture Data
Optimize OpenGL for Tegra
Optimizing performance of a heavy fragment shader
"Binding to a texture takes time for OpenGL ES to process. Apps that reduce the number of changes they make to OpenGL ES state perform better. "
"In my experience mobile GPU performance is roughly proportional to the number of texture2D calls." "There are two texture loads, so the minimum cycle count for the texture sub-unit is two." (Tegra has a texture unit which has to run a cycle for each texture read)
"making calls to the glTexSubImage and glCopyTexSubImage functions particularly expensive" - upload operations must stall the pipeline until textures are uploaded. It is faster to batch these into a single upload than block a bunch of separate times.

OpenGL - Fast Textured Quads?

I am trying to display as many textured quads as possible at random positions in 3D space. In my experience so far, I cannot display even a couple of thousand of them without the fps dropping significantly below 30 (my camera movement script becomes laggy).
Right now I am following an ancient tutorial. After initializing OpenGL:
glClearColor(0, 0, 0, 0);
I set the viewpoint and perspective:
Then I load some textures:
glGenTextures(TEXTURE_COUNT, &texture[0]);
for (int i...){
glBindTexture(GL_TEXTURE_2D, texture[i]);
And finally I draw my GL_QUADS using:
glBindTexture(GL_TEXTURE_2D, q);
glTexCoord2f(d, e); glVertex3f(x1, y1, z1);
glTexCoord2f(f, g); glVertex3f(x2, y2, z2);
glTexCoord2f(h, k); glVertex3f(x3, y3, z3);
glTexCoord2f(m, n); glVertex3f(x4, y4, z4);
I find all that code very self-explanatory. Unfortunately that way of doing things is deprecated, as far as I know. I read some vague things about PBO and vertexArrays on the internet, but I did not find any tutorial on how to use them. I don't even know if these objects are suited to what I am trying to do here (a billion quads on the screen without lag). Perhaps anyone here could give me a definitive suggestion, of what I should use to achieve the result? And if you happen to have one more minute of spare time, could you give me a short summary of how these functions are used (just as I did with the deprecated ones above)?
Perhaps anyone here could give me a definitive suggestion, of what I should use to achieve the result?
What is "the result"? You have not explained very well what exactly it is that you're trying to accomplish. All you've said is that you're trying to draw a lot of textured quads. What are you trying to do with those textured quads?
For example, you seem to be creating the same texture, with the same width and height, given the same pixel data. But you store these in different texture objects. OpenGL does not know that they contain the same data. Therefore, you spend a lot of time swapping textures needlessly when you render quads.
If you're just randomly drawing them to test performance, then the question is meaningless. Such tests are pointless, because they are entirely artificial. They test only this artificial scenario where you're changing textures every time you render a quad.
Without knowing what you are trying to ultimately render, the only thing I can do is give general performance advice. In order (ie: do the first before you do the later ones):
Stop changing textures for every quad. You can package multiple images together in the same texture, then render all of the quads that use that texture at once, with only one glBindTexture call. The texture coordinates of the quad specifies which image within the texture that it uses.
Stop using glTranslate to position each individual quad. You can use it to position groups of quads, but you should do the math yourself to compute the quad's vertex positions. Once those glTranslate calls are gone, you can put multiple quads within the space of a single glBegin/glEnd pair.
Assuming that your quads are static (fixed position in model space), consider using a buffer object to store and render with your quad data.
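For the first point, the texture coordinates that select one image inside a packed texture can be computed along these lines. This is a sketch that assumes the atlas is a regular grid of equally sized images numbered row by row; `atlasTile` and `UvRect` are hypothetical names.

```cpp
#include <cassert>

// Texture-coordinate rectangle of one image inside the atlas.
struct UvRect { float u0, v0, u1, v1; };

// Tile `index` in an atlas of `columns` x `rows` equally sized images.
UvRect atlasTile(int index, int columns, int rows) {
    assert(columns > 0 && rows > 0);
    assert(index >= 0 && index < columns * rows);
    const float tileW = 1.0f / static_cast<float>(columns);
    const float tileH = 1.0f / static_cast<float>(rows);
    const int col = index % columns;
    const int row = index / columns;
    return {col * tileW, row * tileH, (col + 1) * tileW, (row + 1) * tileH};
}
```

The four corners of a quad would then use (u0, v0), (u1, v0), (u1, v1) and (u0, v1) in place of the per-texture coordinates, and every quad that shares the atlas can be drawn after a single glBindTexture call.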
I read some vague things about PBO and vertexArrays on the internet, but i did not find any tutorial on how to use them.
Did you try the OpenGL Wiki, which has a pretty good list of tutorials (as well as general information on OpenGL)? In the interest of full disclosure, I did write one of them.
I heard, in modern games milliards of polygons are rendered in real time
Actually it's in the millions. I presume you're German: "Milliarde" translates to "billion" in English.
Right now I am following an ancient tutorial.
This is your main problem. Contemporary OpenGL applications don't use ancient rendering methods. You're using immediate mode, which means that you're going through several function calls just to submit a single vertex. This is highly inefficient. Modern applications, like games, can reach such high triangle counts because they don't waste CPU time on that many function calls and don't waste CPU→GPU bandwidth on the data stream.
To reach such high counts of triangles rendered in real time, you must place all the geometry data in the "fast memory", i.e. in the RAM on the graphics card. The technique OpenGL offers for this is called "Vertex Buffer Objects". Using a VBO you can draw large batches of geometry with a single drawing call (glDrawArrays, glDrawElements and their relatives).
After getting the geometry out of the way, you must be nice to the GPU. GPUs don't like it if you switch textures or shaders often. Switching a texture invalidates the contents of the cache(s); switching a shader means stalling the GPU pipeline, and worse, it means invalidating the execution path prediction statistics (the GPU keeps statistics on which execution paths of a shader are most likely to be executed and which memory access patterns it exhibits; these are used to iteratively optimize shader execution).
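As a sketch of what "large batches" can mean on the CPU side, the following collects many quads into one vertex array and one index array (two triangles per quad), ready to be uploaded once. The `Vertex` layout and `QuadBatch` are assumptions for illustration, not a prescribed format.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed vertex layout: position followed by texture coordinates.
struct Vertex { float x, y, z, u, v; };

struct QuadBatch {
    std::vector<Vertex> vertices;        // 4 per quad
    std::vector<std::uint32_t> indices;  // 6 per quad (two triangles)

    // Appends one quad given its four corners in order around the perimeter.
    void addQuad(const Vertex& a, const Vertex& b, const Vertex& c, const Vertex& d) {
        assert(vertices.size() % 4 == 0);
        const std::uint32_t base = static_cast<std::uint32_t>(vertices.size());
        vertices.insert(vertices.end(), {a, b, c, d});
        for (std::uint32_t i : {0u, 1u, 2u, 2u, 3u, 0u}) indices.push_back(base + i);
    }
};
```

After filling the batch, the two vectors would be uploaded with glBufferData (to GL_ARRAY_BUFFER and GL_ELEMENT_ARRAY_BUFFER respectively) and drawn with a single glDrawElements(GL_TRIANGLES, ...) call using GL_UNSIGNED_INT indices. Triangles are used here because GL_QUADS is not available in core-profile contexts.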

Why is a Sprite Batcher faster?

I am reading Beginning Android Games (Mario Zechner) at the moment.
While reading about 2D games with OpenGL ES 1.0 the author introduces the concept of the SpriteBatcher that takes for each sprite it shall render the coordinates and an angle. The SpriteBatcher then calculates the final coordinates of the sprite rectangle and puts that into a single big buffer.
In the render method the SpriteBatcher sets the state for all the sprites once (texture, blending, vertex buffer, texture coordinates buffer). All sprites use the same texture but not the same texture coordinates.
The advantages of this behavior are:
The rendering pipeline does not stall, since there are no state changes while rendering all the sprites.
There are fewer OpenGL calls (= less JNI overhead).
But I see a major disadvantage:
For rotation the CPU has to calculate the sine and cosine and perform 16 multiplications for each sprite. As far as I know calculating sine and cosine is very expensive and slow.
But the SpriteBatcher approach is a lot faster than using lots of glRotate/glTranslate calls to render the sprites one by one.
Finally my questions:
Why is it faster? Are OpenGL state changes really that expensive?
The GPU is optimized for vector multiplications and rotations, while the CPU is not. Why doesn't that matter?
Would one use a SpriteBatcher on a desktop with a dedicated GFX-card?
Is there a point where the SpriteBatcher becomes inefficient?
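For reference, the per-sprite CPU work described above might look roughly like this. It is a sketch rather than the book's actual code; `appendSprite` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Appends the four rotated corners of one sprite (as x, y pairs) to a shared
// buffer: one sine, one cosine and 4 corners * 4 multiplications per sprite.
void appendSprite(std::vector<float>& buffer, float cx, float cy,
                  float width, float height, float angleRadians) {
    assert(width >= 0.0f && height >= 0.0f);
    const float c = std::cos(angleRadians);
    const float s = std::sin(angleRadians);
    const float hw = width * 0.5f;
    const float hh = height * 0.5f;
    const float corners[4][2] = {{-hw, -hh}, {hw, -hh}, {hw, hh}, {-hw, hh}};
    for (const auto& corner : corners) {
        // Standard 2D rotation about the sprite centre, then translation.
        buffer.push_back(cx + corner[0] * c - corner[1] * s);
        buffer.push_back(cy + corner[0] * s + corner[1] * c);
    }
}
```

Calling this once per sprite fills a single buffer that can be submitted with one draw call, which is where the saving in state changes and JNI crossings comes from.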
But I see a major disadvantage:
For rotation the CPU has to calculate the sine and cosine and perform 16 multiplications for each sprite. As far as I know calculating sine and cosine is very expensive and slow.
Actually sin and cos are quite fast; on modern architectures they take 1 clock cycle to execute, if the pipeline has not been stalled before. However, if each sprite is rotated individually and an ordinary frustum perspective projection is used, the author of this code doesn't know his linear algebra.
The whole task can be simplified a lot if one recalls that the modelview matrix maps linear local/world coordinates to eye space. The rotation is in the upper left 3×3 submatrix, whose columns form the local base vectors. By taking the inverse of this submatrix you're given exactly the vectors you need as the sprite base, to map the plane into eye space. If only rotations (and maybe scaling) are applied, the inverse of the upper left 3×3 is the transpose; so by using the rows of the upper left 3×3 as the sprite base you get that effect without doing any trigonometry at all:
/* populates the currently bound VBO with sprite geometry */
/* (vec3 is assumed to be a sprite record with x, y, z and scale members) */
void populate_sprites_VBO(std::vector<vec3> sprite_positions)
{
    GLfloat mv[16];
    GLfloat sprite_left[3];
    GLfloat sprite_up[3];
    glGetFloatv(GL_MODELVIEW_MATRIX, mv);
    /* mv is column-major: row 0 is mv[0], mv[4], mv[8]; row 1 is mv[1], mv[5], mv[9] */
    for(int i=0; i<3; i++) {
        sprite_left[i] = mv[i*4];
        sprite_up[i] = mv[i*4 + 1];
    }
    std::vector<GLfloat> sprite_geom;
    for(std::vector<vec3>::iterator sprite=sprite_positions.begin(), end=sprite_positions.end();
        sprite != end;
        sprite++ ){
        /* four corners: bottom-left, bottom-right, top-right, top-left */
        sprite_geom.push_back(sprite->x + (-sprite_left[0] - sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + (-sprite_left[1] - sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + (-sprite_left[2] - sprite_up[2])*sprite->scale);
        sprite_geom.push_back(sprite->x + ( sprite_left[0] - sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + ( sprite_left[1] - sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + ( sprite_left[2] - sprite_up[2])*sprite->scale);
        sprite_geom.push_back(sprite->x + ( sprite_left[0] + sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + ( sprite_left[1] + sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + ( sprite_left[2] + sprite_up[2])*sprite->scale);
        sprite_geom.push_back(sprite->x + (-sprite_left[0] + sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + (-sprite_left[1] + sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + (-sprite_left[2] + sprite_up[2])*sprite->scale);
    }
    /* upload the generated geometry; the original call was cut off, so the
       GL_STREAM_DRAW usage hint is an assumption */
    glBufferData(GL_ARRAY_BUFFER,
        sprite_geom.size() * sizeof(GLfloat), sprite_geom.data(),
        GL_STREAM_DRAW);
}
If shaders are available, then instead of rebuilding the sprite data on CPU each frame, one could use the geometry shader or the vertex shader. A geometry shader would take a vector of position, scale, texture, etc. and emit the quads. Using a vertex shader you'd send a lot of [-1,1] quads, where each vertex would carry the center position of the sprite it belongs to as an additional vec3 attribute.
Finally my questions:
Why is it faster? Are OpenGL state changes really that expensive?
Some state changes are extremely expensive, and you'll want to avoid those wherever possible. Switching textures is very expensive; switching shaders is mildly expensive.
The GPU is optimized for vector multiplications and rotations, while the CPU is not. Why doesn't that matter?
This is not the difference between GPU and CPU. Where a GPU differs from a CPU is that it performs the same sequence of operations on a huge chunk of records in parallel (e.g. each pixel of the framebuffer being rendered to). A CPU, on the other hand, runs the program one record at a time.
But CPUs do vector operations just as well, if not even better than GPUs. Especially where precision matters CPUs are still preferred over GPUs. MMX, SSE and 3DNow! are vector math instruction sets.
Would one use a SpriteBatcher on a desktop with a dedicated GFX-card?
Probably not in this form, since today one has geometry and vertex shaders available, liberating the CPU for other things. But more importantly this saves bandwidth between CPU and GPU. Bandwidth is the tighter bottleneck, processing power is not the number one problem these days (of course one never has enough processing power).
Is there a point where the SpriteBatcher becomes inefficient?
Yes, namely the CPU → GPU transfer bottleneck. Today one uses geometry shaders and instancing to do this kind of thing, really fast.
I don't know about SpriteBatcher, but looking at the information you provided here are my thoughts:
It is faster because it uses fewer state changes and, more importantly, fewer draw calls. Mobile platforms have especially strict constraints on the number of draw calls per frame.
That doesn't matter because, probably, they are using the CPU for rotations. I personally see no reason not to use the GPU for that, which would be way faster and eliminate the bandwidth load.
I guess it would still be a good optimization considering point 1.
I can think of two extreme cases: when there are too few sprites, or when the compound texture (containing all rotated sprites) grows too big (mobile devices have lower size limits).
