OpenGL ES, Z-Buffer, 2D sprites, discard, performance

I have a retro-looking 2D game with a lot of sprites (reminiscent of Sega's Super Scaler arcades) which do not use semi-transparency. I have thought about using the Z-buffer instead of sorting to simplify things. The problem is that, by default, writes are made to the Z-buffer even where alpha is zero, giving the effect illustrated here:
http://i.stack.imgur.com/ubLlp.png
Now, since I'm on OpenGL ES 2, I don't have alpha testing, so from what I understand my only option is to discard the pixel in the fragment shader when alpha is 0, so that it doesn't get written to the Z-buffer. But in terms of performance this is SO wrong: not only is the if slow, but the discard basically defeats the purpose, since it disables early depth testing and the result is way worse than doing the sorting in software.
if (val.a < 0.5) {
    discard;
}
Is there any other solution I could use which would not kill the performance? Do all 2D games sort sprites themselves and not use depth buffer?

It's a tradeoff really. If you let the z-buffer do the sorting and use discard in your shaders then it's more expensive on the GPU because of branching and late depth testing as you say.
If you do the depth sorting yourself, then you'll find it's harder to issue your draw calls in an optimal order (e.g. you'll keep having to change textures). Draw calls in GLES2 have a very significant CPU cost on lower-end devices, and the draw call count will probably go up.
If performance is a big concern, the second option is probably better if you do it in conjunction with a big effort on the texture atlasing front to minimize your draw call count. This can be particularly effective with low-resolution retro sprites, because you'll be able to fit a lot of sprites into each texture atlas. It isn't a clear winner by any stretch, and I can imagine that different games take different approaches.
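For what it's worth, here is a minimal sketch of the CPU-side sort (the Sprite struct and its field names are placeholders, not from your code): sorting back-to-front by depth, with ties broken by texture, keeps the draw order correct while letting runs of sprites that share an atlas page be merged into a single draw call.

#include <stdlib.h>
#include <GLES2/gl2.h>

/* Hypothetical sprite record; names are illustrative only. */
typedef struct {
    float  depth;     /* larger = further away                  */
    GLuint texture;   /* atlas page this sprite samples from    */
    /* ... position, UVs, etc. ... */
} Sprite;

/* Back-to-front by depth; ties broken by texture so that
   consecutive sprites on the same layer can share one draw call. */
static int compare_sprites(const void *a, const void *b)
{
    const Sprite *sa = (const Sprite *)a;
    const Sprite *sb = (const Sprite *)b;
    if (sa->depth > sb->depth) return -1;      /* further away first */
    if (sa->depth < sb->depth) return  1;
    if (sa->texture < sb->texture) return -1;  /* tie-break for batching */
    if (sa->texture > sb->texture) return  1;
    return 0;
}

void sort_sprites(Sprite *sprites, size_t count)
{
    qsort(sprites, count, sizeof(Sprite), compare_sprites);
}

After sorting, a new draw call is only needed when the bound texture changes between consecutive sprites, which is where the atlasing effort pays off.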
Also, you should take into account that the vast majority of target hardware is going to perform just fine whichever path you choose, and maybe you should just choose the one that is faster to implement and makes your code simpler (which is probably letting the z-buffer do the sorting).
If you fancy a technical challenge, I've often thought the best approach might be to divide your sprites into fully opaque sections and sections with transparency, and render the two parts as separate meshes (they won't be quads any more). You'd have to do a lot of preprocessing and draw a lot more triangles, but by doing some of the rendering with fully opaque geometry you can take advantage of the hidden-surface-removal hardware in all iOS devices and lots of Android devices. You should certainly be able to reduce your fill-rate burden this way, but at the cost of more draw calls, and it might add an unnecessarily high amount of complexity to your code and your tools.
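To make that idea concrete, here is a rough sketch of the render loop, under the assumption that preprocessing has already split the sprites into an opaque interior mesh and a fringe mesh with transparency (drawMesh, the mesh handles and the two programs are hypothetical names, not an existing API):

#include <GLES2/gl2.h>

/* Hypothetical handles and helper, assumed to be created elsewhere. */
extern GLuint opaqueProgram;   /* plain textured shader, no discard         */
extern GLuint fringeProgram;   /* same shader plus the alpha-discard branch */
extern int    opaqueMesh, fringeMesh;
extern void   drawMesh(int meshId);

void drawSprites(void)
{
    glDisable(GL_BLEND);
    glDepthMask(GL_TRUE);
    glEnable(GL_DEPTH_TEST);

    /* Pass 1: fully opaque interiors. No discard in the shader,
       so early-Z / TBDR hidden-surface removal stays effective. */
    glUseProgram(opaqueProgram);
    drawMesh(opaqueMesh);

    /* Pass 2: the cut-out fringes. This shader does use discard,
       but it now covers far fewer pixels than full sprite quads would. */
    glUseProgram(fringeProgram);
    drawMesh(fringeMesh);
}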

Related

OpenGL (ES): Can an implementation optimize fragments resulting from overdraw?

I wanted to come up with a crude way to "benchmark" the performance improvement from a tweak I made to a fragment shader (specifically, I wanted to measure the impact of removing the pow-based gamma computation for the resulting color from the fragment shader).
So I figured that if a frame takes 1 ms to render an opaque cube model using my shader, then if I call glDisable(GL_DEPTH_TEST) and loop my render call 100 times, the frame should take 100 ms to render.
I was wrong. Rendering it 100 times only results in about a 10x slowdown. Obviously, if the depth test were still enabled, most if not all of the fragments in the second and subsequent draw calls would not be shaded, because they would fail the depth test.
However, I must still be getting a lot of fragments culled even with the depth test off.
My question is whether my hardware (in this particular case an iPad 3 on iOS 6.1, i.e. a PowerVR SGX543MP4) is just being incredibly smart and is actually able to use the geometry of later draw calls to occlude and discard fragments from earlier geometry. If that is not what's happening, then I cannot explain the better-than-expected performance I am seeing. The question applies to all flavors of OpenGL and to desktop GPUs as well, though.
Edit: I think an easy way to "get around" this optimization might be glEnable(GL_BLEND) or something of that sort. I will try this and report back.
PowerVR hardware is based on tile-based deferred rendering. It does not begin drawing fragments until after it receives all of the geometry information for a tile on screen. This is a more advanced hidden-surface removal technique than z-buffering, and what you have actually discovered here is that enabling alpha blending breaks the hardware's ability to exploit this.
Alpha blending is very order-dependent, so rasterization and shading can no longer be deferred to the point where only the top-most geometry in a tile has to be drawn. Without alpha blending there is no data dependency on the order things are drawn in, so completely obscured geometry can be skipped before the expensive per-fragment operations occur. It is only when you start blending fragments that a truly order-dependent situation arises, and that completely destroys the hardware's ability to defer/cull fragment processing for hidden surfaces.
In all honesty, if you are trying to optimize for a platform based on PowerVR hardware, you should probably make this one of your goals. By that I mean: before optimizing shaders, first consider whether you are drawing things in an order, or with states, that hurt the PowerVR hardware's ability to do TBDR. As you have just discovered, blending is considerably more expensive on PowerVR hardware than on other hardware; the operation itself is no more complicated, it just prevents PVR hardware from working in the special way it was designed to.
I can confirm that only after adding both lines:
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
did the frame render time increase in a linear fashion in response to the repeated draw calls. Now back to my crude benchmarking.
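For reference, a minimal sketch of the crude benchmark itself, assuming a drawCube() helper that issues the actual draw call (a made-up name); glFinish() keeps the timing honest by waiting for the GPU to finish before the clock is read:

#include <stdio.h>
#include <time.h>
#include <GLES2/gl2.h>

extern void drawCube(void);   /* assumed: issues one draw call for the cube */

void benchmark_overdraw(int repeats)
{
    /* Force a truly order-dependent situation so the TBDR hardware
       cannot skip hidden fragments. */
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glDisable(GL_DEPTH_TEST);

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < repeats; ++i)
        drawCube();

    glFinish();               /* wait for the GPU before stopping the clock */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0
              + (t1.tv_nsec - t0.tv_nsec) / 1.0e6;
    printf("%d draws took %.2f ms\n", repeats, ms);
}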

Depth peeling without occlusion query

I want to implement depth peeling in WebGL, but the problem is that there are no occlusion queries, so I don't know how to tell when the peeling of the scene is finished.
Do you see another way to do it?
The usual approach is to limit the peeling to a fixed number of steps. This is sometimes even better than using occlusion queries, because too many layers of transparent structure become close to impossible to tell apart anyway. It often helps to know exactly what you are rendering, so you can make a good estimate of the number of layers you need to peel.
I recently implemented depth peeling in WebGL. There are a few limiting factors that make it kinda hard to do as many peels as there are layers, mainly the very limited number of texture units and the fact that you can only render to one target at a time, so you have to render color and depth separately. With 7 textures used I can do 4 peels, and that already takes 11 render passes per frame. To do more peels you would need more sophisticated merging of intermediate results, and I doubt you gain much from more peels.
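To make the fixed-step structure concrete, here is a rough sketch of the outer loop; the buffer management is hidden behind hypothetical helpers (bindPeelTarget, bindPreviousDepth, drawTransparentScene, compositeLayers), and the peel shader is assumed to discard any fragment that is not strictly behind the previously peeled depth:

#include <GLES2/gl2.h>

#define MAX_PEELS 4   /* fixed number of peels instead of occlusion queries */

/* Hypothetical helpers, assumed to exist elsewhere. */
extern void bindPeelTarget(int pass);        /* FBO holding this layer's color+depth */
extern void bindPreviousDepth(int pass);     /* depth texture of the previous peel   */
extern void drawTransparentScene(void);      /* uses the peel shader                 */
extern void compositeLayers(int layerCount); /* blends the layers back-to-front      */

void renderDepthPeeled(void)
{
    for (int pass = 0; pass < MAX_PEELS; ++pass) {
        bindPeelTarget(pass);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        bindPreviousDepth(pass);   /* on pass 0 this can be a "far plane" texture */
        drawTransparentScene();    /* fragment shader discards anything at or in
                                      front of the previously peeled depth        */
    }
    compositeLayers(MAX_PEELS);
}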

Rendering realistic electric lightning using OpenGL

I'm implementing a simple lightning effect for my 3D game, something like this:
http://www.krazydad.com/bestiary/bestiary_lightning.html
I'm using OpenGL ES 2.0. I'm pondering what the best-looking and most performance-efficient way to render this in a 3D environment is, though, as the lines making up the electric bolt need to look "solid" when viewed from any angle.
I was thinking of generating two planes for each line segment, in an X cross, to create the effect of line thickness; rendering with depth buffer writes disabled and some kind of additive blending mode; and texturing each line segment with an electric-looking texture that has an alpha channel.
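For illustration, a sketch of the geometry side of that idea (the Vec3 type, the helpers and the function name are mine, not from any particular engine): each segment from p0 to p1 is expanded into two quads rotated 90 degrees around the segment axis.

#include <math.h>

typedef struct { float x, y, z; } Vec3;

static Vec3 sub(Vec3 a, Vec3 b)   { Vec3 r = { a.x-b.x, a.y-b.y, a.z-b.z }; return r; }
static Vec3 add(Vec3 a, Vec3 b)   { Vec3 r = { a.x+b.x, a.y+b.y, a.z+b.z }; return r; }
static Vec3 cross(Vec3 a, Vec3 b) { Vec3 r = { a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x }; return r; }
static Vec3 scale(Vec3 a, float s){ Vec3 r = { a.x*s, a.y*s, a.z*s }; return r; }
static Vec3 norm(Vec3 a)          { float l = sqrtf(a.x*a.x+a.y*a.y+a.z*a.z); return scale(a, 1.0f/l); }

/* Writes 8 positions (two quads forming an X cross) for one bolt segment. */
void crossQuadsForSegment(Vec3 p0, Vec3 p1, float halfWidth, Vec3 out[8])
{
    Vec3 dir = norm(sub(p1, p0));
    Vec3 up  = fabsf(dir.y) < 0.99f ? (Vec3){0,1,0} : (Vec3){1,0,0}; /* avoid a parallel axis */
    Vec3 s1  = scale(norm(cross(dir, up)), halfWidth);
    Vec3 s2  = scale(norm(cross(dir, s1)), halfWidth);

    /* Quad 1 (usable as a triangle strip) */
    out[0] = sub(p0, s1); out[1] = add(p0, s1);
    out[2] = sub(p1, s1); out[3] = add(p1, s1);
    /* Quad 2, perpendicular to quad 1 */
    out[4] = sub(p0, s2); out[5] = add(p0, s2);
    out[6] = sub(p1, s2); out[7] = add(p1, s2);
}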
I'm a bit worried about the performance hit from generating the necessary triangle lists with this method though, as my game will potentially have a lot of lightning bolts on screen at the same time. But since the length and thickness of the lightning bolts will vary a lot, I doubt it would look good to simply use an animated 3D model of a lightning bolt, stretched and pointed at the right location, which was my initial idea.
I was thinking of an alternative approach where I render the lightning bolts as 2D lines between projected end points in a post-processing pass. That should work well, since the perspective effect in my case is negligible, but then it would be tricky to have the lines appear behind occluding objects.
Any good ideas on the best approach here?
Edit: I found this white paper from nVidia:
http://developer.download.nvidia.com/SDK/10/direct3d/Source/Lightning/doc/lightning_doc.pdf
It uses an approach of billboarding each line segment, then applying some filtering to smooth the resulting gaps and overlaps between billboards.
It seems to yield pretty good visual results; however, I am not too happy about the additional filtering pass, as the game is for mobile phones where such a step is quite costly. And, as it turns out, billboarding is quite CPU-expensive too, due to the additional matrix calculation overhead, which is slow on mobile devices.
I ended up doing something like the nVidia paper suggests, but to avoid the need for a post-processing step I used different textures for different branching angles, which avoids gaps and overlaps at the segment corners and turned out quite well. And to avoid the expensive billboard matrix calculations I instead drew the line segments using a more 2D approach, but calculated the depth value manually for each vertex in the segments. This yields both acceptable performance and visuals.
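In case it helps anyone, a rough sketch of that last approach, with the projection hidden behind a hypothetical projectToNdc() helper: the quad is offset in 2D, but each vertex keeps the depth of its own endpoint so the bolt is still occluded correctly by the scene (aspect-ratio correction and degenerate segments are ignored for brevity).

#include <math.h>

/* Post-projection endpoint: x,y in NDC, z is the depth we keep per vertex. */
typedef struct { float x, y, z; } NdcPoint;

/* Hypothetical helper, assumed to exist: applies the model-view-projection
   matrix and the perspective divide to a world-space bolt vertex. */
extern NdcPoint projectToNdc(const float worldPos[3]);

/* Builds a screen-facing quad for one segment without any per-segment
   billboard matrix; assumes the two endpoints project to distinct points. */
void screenSpaceSegment(const float a[3], const float b[3],
                        float halfWidthNdc, NdcPoint quad[4])
{
    NdcPoint p0 = projectToNdc(a);
    NdcPoint p1 = projectToNdc(b);

    /* 2D perpendicular to the projected segment direction. */
    float dx = p1.x - p0.x, dy = p1.y - p0.y;
    float len = sqrtf(dx*dx + dy*dy);
    float nx = -dy / len * halfWidthNdc;
    float ny =  dx / len * halfWidthNdc;

    quad[0] = (NdcPoint){ p0.x - nx, p0.y - ny, p0.z };
    quad[1] = (NdcPoint){ p0.x + nx, p0.y + ny, p0.z };
    quad[2] = (NdcPoint){ p1.x - nx, p1.y - ny, p1.z };
    quad[3] = (NdcPoint){ p1.x + nx, p1.y + ny, p1.z };
}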
An animated texture, possibly powered by a shader, is likely the fastest way to handle this.
Any geometry generation and rendering will limit the quality of the effect, and may take significantly more CPU time, memory bandwidth and draw calls.
Using a single animated texture on a quad, or a shader creating procedural lightning, will give constant speed and make the effect much simpler to implement. For that, this question may be of interest.

OpenGL - Will using multiple VBOs slow down rendering?

I am rendering some meshes (sometimes upwards of 500) and I wanted to know the best way to approach this. Would it be pointless to create 500 VBOs and then, if they pass the frustum and visibility tests, render them? Is there a more efficient way to do this? I am looking to maximize performance.
To answer your question: yes, many VBOs will slow things down. More polys will usually slow down the render, but more draw calls have a much greater impact. You want to minimize state changes and draws, as well as the number of buffers you have (and memory use).
I would suggest first looking at the buffers and figuring out how many you need. See if you can batch/instance geometry, merge static geometry into a single buffer, reuse buffers more efficiently, and so on.
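As a rough sketch of the "merge static geometry into a single buffer" idea (the MeshRange bookkeeping struct is made up and buffer creation is omitted), each mesh becomes an offset/count pair into one shared VBO/IBO, so the per-mesh cost is a single glDrawElements call rather than a buffer rebind:

#include <GLES2/gl2.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one mesh packed into the shared buffers. */
typedef struct {
    GLsizei indexCount;    /* number of indices for this mesh              */
    size_t  indexOffset;   /* byte offset of its indices in the shared IBO */
} MeshRange;

void drawVisibleMeshes(GLuint sharedVbo, GLuint sharedIbo,
                       const MeshRange *meshes, const int *visible, int count)
{
    /* Bind the shared buffers once for all meshes.
       Position attribute assumed bound at location 0, tightly packed. */
    glBindBuffer(GL_ARRAY_BUFFER, sharedVbo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, sharedIbo);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat), (const void *)0);
    glEnableVertexAttribArray(0);

    for (int i = 0; i < count; ++i) {
        if (!visible[i])       /* result of the frustum/visibility tests */
            continue;
        glDrawElements(GL_TRIANGLES, meshes[i].indexCount, GL_UNSIGNED_SHORT,
                       (const void *)meshes[i].indexOffset);
    }
}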
Once you've cut the buffers down to the minimum possible, you'll want to use several kinds of culling. Visibility culling, both against the frustum (perhaps with an octree) and by occlusion, can provide a significant performance boost. The main idea is to disqualify geometry as quickly and cheaply as possible, so you start with rough tests (the octree), then somewhat more detailed ones (perhaps an AABB and/or a simplified hull), then occlusion, and only then actually draw.
Here's a good article on frustum culling, which touches a bit on quadtrees (and by extension, octrees), with diagrams, explanations and some sample code.
OpenGL occlusion culling articles seem a bit less common, although this one from GPU Gems might be a good starting place.

Tackling alpha blend in OpenGL for better performance

Since blending is hitting the performance of our game, we tried several blending strategies for creating the "illusion" of blending. One of them is drawing a sprite only every odd frame, so the sprite is visible half of the time. The effect is quite good. (You need a decent frame rate, by the way, or the sprite will flicker noticeably.)
Despite that, I would like to know if there are any good insights out there on avoiding blending in order to improve overall performance without compromising the visual experience (too much).
Is it the actual blending that's killing your performance? (i.e. video memory bandwidth)
What games commonly do these days to handle lots of alpha-blended stuff (think large explosions that cover the whole screen): render it into a smaller texture (e.g. 2x2 or 4x4 smaller than the screen), and composite that back onto the main screen.
Doing that might require rendering depth buffer of opaque surfaces into that smaller texture as well, to properly handle intersections with opaque geometry. On some platforms (consoles) doing multisampling or depth buffer hackery might make that a very cheap operation; no such luck on regular PC though.
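A bare-bones sketch of that idea, assuming the quarter-resolution target and the drawOpaqueScene(), drawParticles() and drawFullscreenQuad() helpers have been set up elsewhere (all hypothetical names); handling of the depth buffer for intersections with opaque geometry is omitted:

#include <GLES2/gl2.h>

/* Hypothetical objects and helpers, assumed to be created elsewhere. */
extern GLuint lowResFbo;        /* FBO with a color texture at e.g. 1/4 size */
extern GLuint lowResColorTex;   /* its color attachment                      */
extern void   drawOpaqueScene(void);
extern void   drawParticles(void);                /* the alpha-blended stuff */
extern void   drawFullscreenQuad(GLuint texture); /* composites a texture    */

void renderFrame(int screenW, int screenH)
{
    /* 1. Normal opaque pass at full resolution. */
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    glViewport(0, 0, screenW, screenH);
    drawOpaqueScene();

    /* 2. Particles into the small off-screen buffer: far fewer pixels
          to blend, which is where the bandwidth goes. */
    glBindFramebuffer(GL_FRAMEBUFFER, lowResFbo);
    glViewport(0, 0, screenW / 4, screenH / 4);
    glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
    glClear(GL_COLOR_BUFFER_BIT);
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    drawParticles();

    /* 3. Composite the upscaled particle buffer over the main image. */
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    glViewport(0, 0, screenW, screenH);
    drawFullscreenQuad(lowResColorTex);
}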
See, for example, the GPU Gems 3 article High-Speed, Off-Screen Particles. Christer Ericson's blog post Optimizing the rendering of a particle system also gives an overview of a lot of optimization approaches.
Excellent article here about rendering particle systems quickly. It covers the smaller off-screen buffer technique and suggests quite a few other approaches.
You can read it here
It is not quite clear from your question what kind of use of blending is hurting your game's performance. Generally, blending itself is blazingly fast. If your problems are particle-system related, then what is most likely killing your framerate is the number and size of the particles drawn. In particular, lots of close-up (and therefore large) particles demand a lot of memory bandwidth and fill rate from the graphics card. I have implemented a particle system myself, and while I can render tons of particles in the distance, I feel the negative impact of, e.g., flying through smoke (which fills the entire screen because the viewer is in the middle of it) very strongly on weaker hardware.
