OpenGL ES 3.x: How to (performantly) render blended triangles front-to-back with alpha-blending and early-reject occluded fragments?

I recently found out that one can render alpha-blended primitives correctly not just back-to-front but also front-to-back (http://hacksoflife.blogspot.com/2010/02/alpha-blending-back-to-front-front-to.html) by blending with GL_ONE_MINUS_DST_ALPHA, GL_ONE, premultiplying the fragment's alpha in the fragment shader, and clearing the destination alpha to zero before rendering.
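A minimal sketch of that setup, assuming a current GLES 3.x context whose color buffer has an alpha channel (the helper name is made up):

    #include <GLES3/gl3.h>

    /* Assumes a current GLES 3.x context and a color buffer with alpha.
     * The fragment shader outputs premultiplied color:
     *   outColor = vec4(c.rgb * c.a, c.a);
     */
    static void setup_front_to_back_blending(void)
    {
        /* Destination alpha accumulates coverage, so it must start at 0. */
        glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
        glClear(GL_COLOR_BUFFER_BIT);

        glEnable(GL_BLEND);
        /* dst = src * (1 - dst.a) + dst * 1 */
        glBlendFunc(GL_ONE_MINUS_DST_ALPHA, GL_ONE);
    }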
It occurred to me that it would then be great if one could combine this with EITHER early-z rejection OR some kind of early "destination-alpha testing" in order to discard fragments that won't contribute to the final pixel color.
When rendering with front-to-back alpha-blending, a fragment can be skipped if the destination-alpha at this location already contains the value 1.0.
I prototyped this using GL_EXT_shader_framebuffer_fetch to read the destination alpha at the start of the fragment shader and manually discard the fragment if that value is above a certain threshold. That works, but it actually made things slower on my test hardware (Snapdragon XR2) - so I wonder:
whether it's somehow possible to prevent the fragment shader from executing at all if the destination alpha is already above a certain threshold?
alternatively, whether it would be possible to write to the depth buffer only for fragments that are completely opaque, leaving the current depth buffer value unchanged for all fragments with an alpha value of less than 1 (but still depth-testing every fragment); that should allow the hardware to use early-z rejection for occluded fragments. So,
Is this possible somehow (i.e. use depth testing, but update the depth buffer value only for opaque fragments and leave it unchanged for others)?
Bottom line: this would reduce the overdraw of alpha-blended sprites to only those fragments that actually contribute to the final pixel color, and I wonder whether there is a performant way of doing this.
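For context, a rough sketch of the kind of fragment shader the framebuffer-fetch approach above implies (assuming ESSL 3.00 and GL_EXT_shader_framebuffer_fetch; uTex, vUV and the 0.999 threshold are placeholders):

    /* Fragment shader used with the blend state above; the extension lets
     * the output be declared inout so the current destination can be read. */
    static const char *front_to_back_fs =
        "#version 300 es\n"
        "#extension GL_EXT_shader_framebuffer_fetch : require\n"
        "precision mediump float;\n"
        "uniform sampler2D uTex;\n"
        "in vec2 vUV;\n"
        "inout vec4 fragColor;\n"
        "void main() {\n"
        "    if (fragColor.a >= 0.999) discard;   // pixel already fully covered\n"
        "    vec4 c = texture(uTex, vUV);\n"
        "    fragColor = vec4(c.rgb * c.a, c.a);  // premultiplied output\n"
        "}\n";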

For number 2, I think you could modify gl_FragDepth in the fragment shader to achieve something close, but doing so would disable early-z rejection, so it wouldn't really help.
I think one viable way to reduce overdraw would be to create a tool to generate a mesh for each sprite which aims to cover a decent proportion of the opaque part of the sprite without using too many verts. I imagine for a typical sprite, even just a well placed quad could cover 80%+.
You'd render the generated opaque geometry of your sprites with depth write enabled, and do a second pass the ordinary way with depth testing enabled to cover the transparent parts.
You would massively reduce overdraw, but significantly increase the complexity of your code and the number of verts rendered. You would double your draw calls, but if you're atlasing and using texture arrays, you might be doubling from 1 to 2 draw calls, which is fine. I've never tried it, so I can't say whether it's worth all the effort involved.
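A sketch of what those two passes could look like at the API level (the draw helpers are hypothetical; only the state changes matter here):

    #include <GLES3/gl3.h>

    /* Hypothetical draw helpers - only the render state is the point here. */
    extern void draw_opaque_sprite_cores(void); /* generated "mostly opaque" meshes */
    extern void draw_sprites_as_usual(void);    /* ordinary textured sprite quads  */

    static void render_sprites_two_pass(void)
    {
        /* Pass 1: opaque cores, depth-tested and writing depth. */
        glDisable(GL_BLEND);
        glEnable(GL_DEPTH_TEST);
        glDepthMask(GL_TRUE);
        draw_opaque_sprite_cores();

        /* Pass 2: the full sprites with blending, tested against the cores'
         * depth but not writing depth themselves. */
        glEnable(GL_BLEND);
        glDepthMask(GL_FALSE);
        draw_sprites_as_usual();
    }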

Related

Minimum steps to implement depth-only pass

I have an existing OpenGL ES 3.1 application that renders a scene to an FBO with color and depth/stencil attachments. It uses the usual methods for drawing (glBindBuffer, glDrawArrays, glBlend*, glStencil*, etc.). My task is now to create a depth-only pass that fills the depth attachment with the same values as the main pass.
My question is: what is the minimum number of steps necessary to achieve this and avoid the GPU doing superfluous work (unnecessary shader invocations etc.)? Is deactivating the color attachment enough, or do I also have to set null shaders, disable blending, etc.?
I assume you need this before the main pass runs, otherwise you would just keep the main pass depth.
Preflight
Create specialized buffers which contain only the mesh data needed to compute position (which are deinterleaved from all non-position data).
Create specialized vertex shaders (which compute only the output position).
Link programs with the simplest valid fragment shader.
Rendering
Render the depth-only pass using the specialized buffers and shaders, masking out all color writes.
Render the main pass with the full buffers and shaders.
Options
At step (2) above it might be beneficial to load the depth-only pass results as the starting depth for the main pass. This will give you better early-ZS test accuracy, at the expense of a readback of the depth values. Most mobile GPUs have hidden surface removal, so this isn't always a net gain - it depends on your content, target GPU, and how good your front-to-back draw order is.
You probably want to use the specialized buffers (position data interleaved in one buffer region, non-position interleaved in a second) for the main draw, as many GPUs will optimize out the non-position calculations if the primitive is culled.
The specialized buffers and optimized shaders can also be used for shadow mapping, and other such depth-only techniques.
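A minimal sketch of the two passes described above (the program handles and draw helpers are hypothetical; assumes the FBO is already bound and cleared):

    #include <GLES3/gl3.h>

    /* Hypothetical handles/helpers; the FBO, buffers and programs already exist. */
    extern GLuint depth_only_program;  /* position-only VS + trivial FS */
    extern GLuint main_program;
    extern void draw_scene_positions_only(void);
    extern void draw_scene_full(void);

    static void render_frame(void)
    {
        /* Depth-only pass: depth writes on, all color writes masked out. */
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        glEnable(GL_DEPTH_TEST);
        glDepthMask(GL_TRUE);
        glDepthFunc(GL_LESS);
        glUseProgram(depth_only_program);
        draw_scene_positions_only();

        /* Main pass: color writes back on, test against the depth laid down
         * by the first pass without writing it again. */
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        glDepthMask(GL_FALSE);
        glDepthFunc(GL_LEQUAL);
        glUseProgram(main_program);
        draw_scene_full();
    }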

OpenGL ES: Is it more efficient to use glClear or use glDrawArrays with primitives that render over used frames?

For example, if I have several figures rendered over a black background, is it more efficient to call glClear(GL_COLOR_BUFFER_BIT) each frame, or to render black triangles over the artifacts from the previous frame? I would think that rendering black triangles over the artifacts would be faster, since fewer pixels need to be changed than when clearing the entire screen. However, I have also read online that some drivers and graphics hardware perform optimizations when using glClear(GL_COLOR_BUFFER_BIT) that cannot occur when rendering black triangles instead.
A couple of things you neglected will probably change how you look at this:
Painting over the top with triangles doesn't play well with depth testing. So you have to disable testing first.
Painting over the top with triangles has to be strictly ordered with the previous result, while glClear breaks data dependencies, allowing out-of-order execution to start drawing the next frame as long as there's enough memory for both framebuffers to exist independently.
Double-buffering, to avoid tearing effects, uses multiple framebuffers anyway. So to use the result of the prior frame as a starting point for the new one, not only do you have to wait for the prior frame to finish rendering, you also have to copy its framebuffer into the buffer for the new frame. Which requires touching every pixel.
In a single buffering scenario, drawing over the top of the previous frame requires not only finishing rendering, but finishing streaming it to the display device. Which leaves you only the blanking interval to render the entire new frame. That almost certainly will prevent achieving maximum frame rate.
Generally, clearing touches every pixel just as copying does, but it doesn't require flushing the pipeline. So it's much better to start each frame with glClear.
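In other words, the cheap path is simply a clear at the start of the frame (a minimal sketch, assuming color and depth both need resetting):

    #include <GLES2/gl2.h>

    /* Clearing at the start of the frame lets the driver discard the old
     * contents instead of preserving or copying them. */
    static void begin_frame(void)
    {
        glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    }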

WebGL: Framebuffers and textures with one, one-byte channel?

I'm generating blurred drop shadows in WebGL by drawing the object to be blurred onto an off-screen framebuffer/texture, then applying a few passes of a filter to it (back and forth between two off-screen framebuffers), then copying the result to the final output.
However, I'm just dropping the RGB channels, overwriting them with the desired color of the drop shadow (usually black) while maintaining the alpha channel. It seems like I could probably get better performance by just having my off-screen framebuffers be a single (alpha) channel.
Is there a way to do that, and would it actually help?
Also, is there a better way to apply multiple passes of a filter than just alternating between two frame buffers and using the previous frame buffer's bound texture as the input?
Assuming WebGL follows GLES then per the spec (Page 91):
"The name of the color buffer of an application-created framebuffer object is COLOR_ATTACHMENT0 ... Color buffers consist of R, G, B, and, optionally, A unsigned integer values."
So you can't attach only to A, or only to any single colour channel.
Options to explore:
Use colorMask to disable writing to R, G and B (see the sketch at the end of this answer). Depending on the data layout your GPU uses internally, that could effectively achieve exactly what you want, or it could have no effect whatsoever.
Is there a way you could render to the depth channel instead of to the alpha channel?
Reducing memory bandwidth is often helpful but if it's not a bottleneck then you could end up prematurely optimising.
To avoid excessive per-frame ping-ponging, you'd normally try to reformulate your shader so that it does the work of all the stages in one pass. Otherwise, consider whether there's any better-than-linear way to combine multiple passes: instead of knowing only how to get from stage n to stage n+1, can you go from stage n to stage 2n? Or even just n+2?
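For the colorMask option above, the sketch is tiny; shown here as the equivalent GLES calls (WebGL's colorMask takes booleans but behaves the same way):

    #include <GLES2/gl2.h>

    /* While rendering/blurring the shadow mask, only let the alpha channel
     * be written; the RGB channels of the offscreen target are ignored. */
    static void begin_shadow_mask_pass(void)
    {
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_TRUE);
    }

    /* Restore full color writes before compositing the final output. */
    static void end_shadow_mask_pass(void)
    {
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    }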

Efficiently rendering a transparent terrain in OpenGL

I'm writing an OpenGL program that visualizes caves, so when I visualize the surface terrain I'd like to make it transparent, so you can see the caves below. I'm assuming I can normalize the data from a Digital Elevation Model into a grid aligned to the X/Z axes with regular spacing, and render each grid cell as two triangles. With an aligned grid I could avoid the cost of sorting when applying the painter's algorithm (to ensure proper transparency effects); instead I could just render the cells row by row, starting with the farthest row and the farthest cell of each row.
That's all well and good, but my question for OpenGL experts is, how could I draw the terrain most efficiently (and in a way that could scale to high resolution terrains) using OpenGL? There must be a better way than calling glDrawElements() once for every grid cell. Here are some ways I'm thinking about doing it (they involve features I haven't tried yet, that's why I'm asking the experts):
glMultiDrawElements Idea
Put all the terrain coordinates in a vertex buffer
Put all the coordinate indices in an element buffer
To draw, write the starting indices of each cell into an array in the desired order and call glMultiDrawElements with that array.
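For illustration, that draw might look roughly like this, assuming 6 unsigned-int indices per cell in a single bound element buffer and a loader such as GLEW (draw_cells_sorted and order are made-up names):

    #include <GL/glew.h>  /* or any other GL function loader */
    #include <stdlib.h>

    /* order[] lists cell indices farthest-to-nearest; each cell owns 6
     * consecutive GLuint indices in the bound GL_ELEMENT_ARRAY_BUFFER. */
    static void draw_cells_sorted(const int *order, int cell_count)
    {
        GLsizei *counts       = malloc(cell_count * sizeof(GLsizei));
        const GLvoid **starts = malloc(cell_count * sizeof(const GLvoid *));

        for (int i = 0; i < cell_count; ++i) {
            counts[i] = 6;
            starts[i] = (const GLvoid *)(sizeof(GLuint) * 6 * (size_t)order[i]);
        }

        glMultiDrawElements(GL_TRIANGLES, counts, GL_UNSIGNED_INT,
                            (const GLvoid *const *)starts, cell_count);

        free(counts);
        free(starts);
    }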
This seems pretty good, but I was wondering if there was any way I could avoid transferring an array of indices to the graphics card every frame, so I came up with the following idea:
Uniform Buffer Idea
This seems like a backward way of using OpenGL, but just putting it out there...
Put the terrain coordinates in a 2D array in a uniform buffer
Put coordinate index offsets 0..5 in a vertex buffer (they would have to be floats, I know)
call glDrawArraysInstanced - each instance will be one grid cell
the vertex shader examines the position of the camera relative to the terrain and determines how to order the cells, mapping gl_InstanceID to the index of the first coordinate of the cell in the uniform buffer, and setting gl_Position to the coordinate at this index plus the index-offset attribute
I figure there might be shiny new OpenGL 4.0 features I'm not aware of that would be more elegant than either of these approaches. I'd appreciate any tips!
The glMultiDrawElements() approach sounds very reasonable. I would implement that first, and use it as a baseline you can compare to if you try more complex approaches.
Whether you have a chance to make it faster will depend on whether the processing of draw calls is an important bottleneck in your rendering. Unless the triangles you render are very small and/or your fragment shader is very simple, there's a good chance that you will be limited by fragment processing anyway. If you have profiling tools that allow you to collect data and identify bottlenecks, you can be much more targeted in your optimization efforts. Of course, there is always the low-tech approach: if making the window smaller improves your performance, chances are you're mostly fragment limited.
Back to your question: since you asked about shiny new GL4 features, another method you could check out is indirect rendering using glDrawElementsIndirect(). Beyond being more flexible, the main difference from glMultiDrawElements() is that the parameters used for each draw, like the start index in your case, can be sourced from a buffer. This might save a copy if you map that buffer and write the start indices directly into it. You could even combine it with persistent buffer mapping (look up GL_MAP_PERSISTENT_BIT) so that you don't have to map and unmap the buffer each time.
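A rough sketch of that combination, assuming GL 4.4 / ARB_buffer_storage for the persistent mapping and glMultiDrawElementsIndirect (GL 4.3) to submit everything at once - a loop over glDrawElementsIndirect() consumes the same command layout:

    #include <GL/glew.h>  /* or any other GL function loader */

    /* Command layout defined by the spec for {Multi}DrawElementsIndirect. */
    typedef struct {
        GLuint count;          /* indices per draw: 6 for one grid cell */
        GLuint instanceCount;  /* 1 */
        GLuint firstIndex;     /* offset into the element buffer, in indices */
        GLint  baseVertex;
        GLuint baseInstance;
    } DrawElementsIndirectCommand;

    static GLuint indirect_buf;
    static DrawElementsIndirectCommand *cmds;  /* persistent mapping */

    /* Created once at startup. */
    static void create_indirect_buffer(GLsizei max_cells)
    {
        const GLbitfield flags =
            GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

        glGenBuffers(1, &indirect_buf);
        glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirect_buf);
        glBufferStorage(GL_DRAW_INDIRECT_BUFFER,
                        max_cells * sizeof(DrawElementsIndirectCommand),
                        NULL, flags);
        cmds = glMapBufferRange(GL_DRAW_INDIRECT_BUFFER, 0,
                                max_cells * sizeof(DrawElementsIndirectCommand),
                                flags);
    }

    /* Each frame: write the sorted start indices straight into the mapping
     * and issue a single multi-draw. Real code should also fence (glFenceSync)
     * so the GPU isn't still reading last frame's commands when they are
     * overwritten. */
    static void draw_cells_indirect(const int *order, GLsizei cell_count)
    {
        for (GLsizei i = 0; i < cell_count; ++i) {
            DrawElementsIndirectCommand c = { 6, 1, 6u * (GLuint)order[i], 0, 0 };
            cmds[i] = c;
        }
        glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirect_buf);
        glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                    NULL, cell_count, 0);
    }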
Your uniform buffer idea sounds pretty interesting. I'm slightly skeptical that it will perform better, but that's just a feeling, and not based on any data or direct experience. So I think you absolutely should try it, and report back on how well it works!
Stretching the scope of your question some more, you could also look into approaches for order-independent transparency rendering if you haven't considered and rejected them already. For example alpha-to-coverage is very easy to implement, and almost free if you would be using MSAA anyway. It doesn't produce very high quality transparency effects based on my limited attempts, but it could be very attractive if it does the job for your use case. Another technique for order-independent transparency is depth peeling.
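For reference, alpha-to-coverage is literally one state change on top of an MSAA framebuffer - no sorting or blending required:

    #include <GL/glew.h>  /* or any other GL function loader */

    /* With a multisampled framebuffer, the fragment's alpha is converted into
     * a per-sample coverage mask, giving order-independent "transparency". */
    static void enable_alpha_to_coverage(void)
    {
        glEnable(GL_SAMPLE_ALPHA_TO_COVERAGE);
    }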
If some self promotion is acceptable, I wrote an overview of some transparency rendering methods in an earlier answer here: OpenGL ES2 Alpha test problems.

OpenGL tile rendering: most efficient way?

I am creating a tile-based 2D game as a way of learning basic "modern" OpenGL concepts. I'm using shaders with OpenGL 2.1, and am familiar with the rendering pipeline and how to actually draw geometry on-screen. What I'm wondering is the best way to organize a tilemap so it renders quickly and efficiently. I have thought of several potential methods:
1.) Store the quad representing a single tile (vertices and texture coordinates) in a VBO and render each tile with a separate draw* call, translating it to the correct position onscreen and using uniform2i to give the location in the texture atlas for that particular tile;
2.) Keep a VBO containing every tile onscreen (already-computed screen coordinates and texture atlas coordinates), using BufferSubData to update the tiles every frame but using a single draw* call;
3.) Keep VBOs containing static NxN "chunks" of tiles, drawing however many chunks of tiles are at least partially visible onscreen and translating them each into position.
I'd like to stay away from the last option if possible, unless rendering chunks of 64x64 is not too inefficient. Tiles are loaded into memory in blocks of that size, and even though only about 20x40 tiles are visible onscreen at a time, I would have to render up to four chunks at once. This method would also complicate my code in several other ways.
So, which of these is the most efficient way to render a screen of tiles? Are there any better methods?
You could do any one of these and they would probably be fine; what you're proposing to render is very, very simple.
#1 will definitely be worse in principle than the other options, because you would be drawing many extremely simple “models” rather than letting the GPU do a whole lot of batch work on one draw call. However, if you have only 20×40 = 800 tiles visible on screen at once, then this is a trivial amount of work for any modern CPU and GPU (unless you're doing some crazy fragment shader).
I recommend you go with whichever is simplest to program for you, so that you can continue work on your game. I imagine this would be #1, or possibly #2. If and when you find yourself with a performance problem, do whichever of #2 or #3 (64×64 sounds like a fine chunk size) lets you spend the least CPU time on your program's part of drawing (i.e. updating the buffer(s)).
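For what it's worth, here is a sketch of option #2 with an interleaved vertex layout (it assumes attribute locations 0 and 1 are bound to position and texture coordinates in your shader; the names are made up):

    #include <GL/glew.h>  /* or any other GL function loader */

    /* One interleaved vertex: screen position plus atlas coordinates. */
    typedef struct {
        float x, y;  /* screen position      */
        float u, v;  /* texture atlas coords */
    } TileVertex;

    /* Option #2: rebuild one dynamic VBO with every visible tile (6 vertices
     * per tile, two triangles) and draw the whole screen in a single call. */
    static void draw_visible_tiles(GLuint vbo, const TileVertex *verts,
                                   int tile_count)
    {
        const GLsizeiptr bytes = (GLsizeiptr)tile_count * 6 * sizeof(TileVertex);

        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW); /* orphan */
        glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, verts);

        glEnableVertexAttribArray(0);
        glEnableVertexAttribArray(1);
        glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, sizeof(TileVertex),
                              (const void *)0);
        glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(TileVertex),
                              (const void *)(2 * sizeof(float)));

        glDrawArrays(GL_TRIANGLES, 0, tile_count * 6);
    }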
I've been recently learning modern OpenGL myself, through OpenGL ES 2.0 on Android. The OpenGL ES 2.0 Programming Guide recommends an "array of structures", that is,
"Store vertex attributes together in a single buffer. The structure represents all attributes of a vertex and we have an array of these attributes per vertex."
While this may seem like it would consume a lot of space, it allows for efficient rendering using VBOs and flexibility in texture-mapping each tile. I recently rendered a 20x20 tile hex grid on a Droid 2 using interleaved arrays containing vertex, normal, color, and texture data, and so far things are running smoothly.

Resources