I wanted a crude way to "benchmark" the performance improvement of a tweak I made to a fragment shader (specifically, I wanted to measure the impact of removing the pow-based gamma computation applied to the resulting color).
I figured that if a frame took 1ms to render an opaque cube model using my shader, then calling glDisable(GL_DEPTH_TEST) and looping my render call 100 times would make the frame take 100ms to render.
I was wrong. Rendering it 100 times only results in about a 10x slowdown. Obviously if depth test is still enabled, most if not all of the fragments in the second and subsequent draw calls would not be computed because they would all fail the depth test.
However I must still be experiencing a lot of fragment culls even with depth test off.
My question is whether my hardware (in this case an iPad 3 on iOS 6.1, which has a PowerVR SGX543MP4) is just being incredibly smart and is actually able to use the geometry of later draw calls to occlude and discard fragments from the earlier geometry. If this is not what's happening, then I cannot explain the better-than-expected performance that I am seeing. The question applies to all flavors of OpenGL and to desktop GPUs as well, though.
Edit: I think an easy way to "get around" this optimization might be glEnable(GL_BLEND) or something of that sort. I will try this and report back.
PowerVR hardware is based on tile-based deferred rendering. It does not begin drawing fragments until after it receives all of the geometry information for a tile on screen. This is a more advanced hidden-surface removal technique than z-buffering, and what you have actually discovered here is that enabling alpha blending breaks the hardware's ability to exploit this.
Alpha blending is very order-dependent, and so no longer can rasterization and shading be deferred to the point where only the top-most geometry in a tile has to be drawn. Without alpha blending, since there is no data dependency on the order things are drawn in, completely obscured geometry can be skipped before expensive per-fragment operations occur. It is only when you start blending fragments that a true order-dependent situation arises and completely destroys the hardware's ability to defer/cull fragment processing for hidden surfaces.
In all honesty, if you are trying to optimize for a platform based on PowerVR hardware you should probably make preserving this behavior one of your goals. By that I mean: before optimizing shaders, first consider whether you are drawing things in an order and/or with states that hurt the PowerVR hardware's ability to do TBDR. As you have just discovered, blending is considerably more expensive on PowerVR hardware than on other hardware. The operation itself is no more complicated; it just prevents PVR hardware from working the special way it was designed to.
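To make the difference concrete, here is a rough Python model that counts fragment-shader invocations for one tile. It is only a sketch and not how the driver or hardware is actually implemented: the function name fragments_shaded_in_tile and the pixel counts are made up for illustration, and it assumes ideal hidden-surface removal for opaque geometry.

```python
def fragments_shaded_in_tile(layers, blending):
    """Count fragment-shader invocations for one tile.

    layers:   list of sets of covered pixel indices, in submission order.
    blending: True if every layer is alpha blended.
    """
    if blending:
        # Blending is order dependent, so every covered pixel of every
        # layer has to be shaded.
        return sum(len(layer) for layer in layers)
    # Opaque geometry: idealised deferred hidden-surface removal shades
    # each covered pixel exactly once, whatever the number of layers.
    covered = set()
    for layer in layers:
        covered |= layer
    return len(covered)


cube = set(range(1000))   # hypothetical: the cube covers 1000 pixels
frame = [cube] * 100      # the same draw call issued 100 times

print(fragments_shaded_in_tile(frame, blending=False))  # 1000
print(fragments_shaded_in_tile(frame, blending=True))   # 100000
```

In this idealised model the opaque case costs no extra fragment work at all, so the roughly 10x slowdown measured in the question would presumably come from work the model ignores, such as per-draw vertex processing and tiling.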
I can confirm that only after adding both lines:
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA,GL_ONE_MINUS_SRC_ALPHA);
did the frame render time increase in a linear fashion in response to the repeated draw calls. Now back to my crude benchmarking.
Related
I have a retro-looking 2D game with a lot of sprites (reminiscent of Sega's Super Scaler arcades) which do not use semi-transparency. I have thought about using the Z-Buffer over sorting to simplify things. Ok, but by default writes are done to the Z-buffer even though alpha is zero, giving the effect illustrated here:
http://i.stack.imgur.com/ubLlp.png
Now, since I'm in OpenGL ES 2, I don't have alpha testing, so from what I understand my only option is to discard the pixel in the fragment shader if alpha is 0 so that it doesn't get written to the Z-Buffer. But in terms of performance this is SO wrong: not only is the if slow, but the discard basically defeats the purpose since it disables early depth testing, and the result is way worse than doing it in software.
if (val.a < 0.5) {
    discard;
}
Is there any other solution I could use which would not kill the performance? Do all 2D games sort sprites themselves and not use depth buffer?
It's a tradeoff really. If you let the z-buffer do the sorting and use discard in your shaders then it's more expensive on the GPU because of branching and late depth testing as you say.
If you do the depth sorting yourself, then you'll find it's harder to issue your draw calls in an optimal order (e.g. you'll keep having to change texture). Draw calls on GLES2 have a very significant CPU hit on lower end devices and the count will probably go up.
If performance is a big concern, then the second option is probably better, provided you combine it with a big effort on the texture atlasing front to minimize your draw call count. This might be particularly effective if your sprites are low-resolution retro sprites, because you'll be able to fit a lot of sprites into each texture atlas. It isn't a clear winner by any stretch, and I can imagine that different games take different approaches.
Also, you should take into account that the vast majority of target hardware is going to perform just fine whichever path you choose, and maybe you should just choose the one that is faster to implement and makes your code simpler (which is probably letting the z-buffer do the sorting).
If you fancy a technical challenge, I've often thought the best approach might be to divide up your sprites into fully opaque sections and sections with transparency, and render the two parts as separate meshes (they won't be quads any more). You'd have to do a lot of preprocessing and draw a lot more triangles, but by rendering the fully opaque parts separately you can take advantage of the hidden-surface-removal tech in all iOS devices and lots of Android devices. By doing this you should certainly be able to reduce your fill rate burden, but at the cost of increased draw calls, and it might add an unnecessarily high amount of complexity to your code and your tools.
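As a rough illustration of the draw-call side of this tradeoff, here is a small Python sketch. The sprite list and texture names are hypothetical, and it assumes that consecutive sprites sharing a texture can be merged into a single draw call.

```python
from itertools import groupby


def count_draw_calls(sprites):
    """Count batches: consecutive sprites sharing a texture form one draw call.

    sprites: list of (depth, texture) tuples, already in submission order.
    """
    return sum(1 for _ in groupby(sprites, key=lambda sprite: sprite[1]))


# Hypothetical scene: a larger depth means farther from the camera.
sprites = [(5, "trees.png"), (1, "cars.png"), (4, "trees.png"),
           (3, "cars.png"), (2, "trees.png")]

# Option 1: let the z-buffer sort, so sprites can be grouped by texture.
by_texture = sorted(sprites, key=lambda sprite: sprite[1])
# Option 2: sort back to front yourself; the textures now alternate.
back_to_front = sorted(sprites, key=lambda sprite: sprite[0], reverse=True)
# Option 2 plus atlasing: every sprite lives in one shared atlas texture.
atlased = [(depth, "atlas.png") for depth, _ in back_to_front]

print(count_draw_calls(by_texture))     # 2
print(count_draw_calls(back_to_front))  # 4
print(count_draw_calls(atlased))        # 1
```

The counts only show how sort order interacts with texture changes; real numbers depend on how your renderer batches.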
I've built a scene graph based rendering system for my app's UI (http://audulus.com if you're curious). The app's UI is procedural, and a lot of it is animated. There are few pre-rendered images.
Currently, it caches unchanging groups of drawables in mipmapped textures. I use mipmaps because the UI is zoomable. Overall, this has been a big performance win, but there are several downsides:
1. Building the mipmaps (via glGenerateMipmap) takes time, reducing the frame rate when one part of the UI goes from animated to static.
2. Visual differences between texture-cached geometry and directly rendered geometry cause slight flickering. (I might be able to get around this by being more clever with my path rendering code, but it seems hard.)
3. Memory usage for all the textures. (I could dump the offscreen textures, but that exacerbates problem 1.)
A couple alternative approaches I've thought of:
1. Instead of texture caching, coalesce static paths into bigger paths. My paths are already VBO/VAO-based, but this could reduce the number of GL calls. (When turning off texture caching, my performance is mainly CPU-bound). Big win on memory usage. The primary problems with this approach are: complicating my path rendering shader (since it must handle paths with different attributes within one call to glDrawArrays), not handling the caching of other primitives (such as text), and more of a burden on the GPU than simply rendering a texture.
2. Still use textures, but avoid mip-mapping. As the UI is zoomed, textures could be resized (though this might have to be deferred since re-rendering the whole UI during zooming is too expensive). Delete textures for offscreen geometry. Downside of course is poor texture magnification/minification during UI zooming.
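For a sense of how much memory the mipmaps themselves account for, here is a small Python sketch that adds up a full mip chain. The 512x512 RGBA cache texture is an assumed example size, not a number from the app.

```python
def mip_chain_bytes(width, height, bytes_per_pixel=4):
    """Total size of a texture plus all of its mip levels down to 1x1."""
    total = 0
    while True:
        total += width * height * bytes_per_pixel
        if width == 1 and height == 1:
            break
        # Each mip level halves both dimensions, clamped at 1.
        width = max(1, width // 2)
        height = max(1, height // 2)
    return total


base = 512 * 512 * 4                  # level 0 only, RGBA8
full = mip_chain_bytes(512, 512)      # level 0 plus every smaller level

print(base)                   # 1048576
print(full)                   # 1398100
print(round(full / base, 3))  # 1.333
```

So dropping mipmaps saves roughly a quarter of each cached texture's footprint; most of the memory is still the base level.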
UPDATE
I tried (2). Resizing the textures is quite slow, so I prevent the UI from resizing them during zoom. This works reasonably well, but the magnification looks terrible when zooming starting small:
Note that some of the modules aren't texture cached because they are tagged as animating.
UPDATE 2
I'm beginning to work on approach 1, so I deactivated the texture caching.
Though I'm CPU bound, practically all my GPU-side load comes from my path anti-aliasing fragment shader. Here's what performance looks like with it on:
And with it off:
So further optimization of that will be a big win on the GPU side. I tried ditching it and going with 4x supersampling, but that looks like garbage, reminding me why I spent considerable time working on the path rendering shader.
Approach (1) is a big win. It has essentially the same performance as texture caching, but without any visual artifacts or much memory usage.
I'm implementing a simple lightning effect for my 3D game, something like this:
http://www.krazydad.com/bestiary/bestiary_lightning.html
I'm using OpenGL ES 2.0. I'm pondering what the best-looking and most efficient way to render this in a 3D environment is, though, as the lines making up the electric bolt need to look "solid" when viewed from any angle.
I was thinking of generating two planes for each line segment, arranged in an X cross to create an effect of line thickness, rendering them with depth buffer writes disabled and some kind of additive blending mode, and texturing each line segment with an electric-looking texture with an alpha channel.
I'm a bit worried about the performance hit from generating the necessary triangle lists using this method though, as my game will potentially have a lot of lightning bolts generated at the same time. But as the length and thickness of the lightning bolts will vary a lot, I doubt it would look good to simply use an animated 3D object of a lightning bolt, stretched and pointing to the right location, which was my initial idea.
I was thinking of an alternative approach where I render the lightning bolts using 2D lines between projected end points in a post processing pass. That should work well since the perspective effect in my case is negligible, except then it would be tricky to have the lines appear behind occluding objects.
Any good ideas on the best approach here?
Edit: I found this white paper from nVidia:
http://developer.download.nvidia.com/SDK/10/direct3d/Source/Lightning/doc/lightning_doc.pdf
It uses billboards for each line segment, then applies some filtering to smooth the resulting gaps and overlaps from each billboard.
This seems to yield pretty good visual results; however, I am not too happy about the additional filtering pass, as the game is for mobile phones where such a step is quite costly. And, as it turns out, billboarding is quite CPU expensive too, due to the additional matrix calculation overhead, which is slow on mobile devices.
I ended up doing something like what the nVidia paper suggested, but to avoid the need for a postprocessing step I used different kinds of textures for different kinds of branching angles, to avoid gaps and overlaps at the segment corners, which turned out quite well. And to avoid the expensive billboard matrix calculation I instead drew the line segments using a more 2D approach, but calculating the depth value manually for each vertex in the segments. This yields both acceptable performance and visuals.
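The "more 2D approach" could look something like the following Python sketch. This is only my guess at the general idea, not the poster's actual code: the function name segment_to_quad is invented, and it assumes the segment endpoints have already been projected to screen space with a depth value attached.

```python
import math


def segment_to_quad(p0, p1, thickness):
    """Expand a projected segment into four quad corners.

    p0, p1: (x, y, depth) endpoints already projected to screen space.
    Returns four (x, y, depth) vertices in triangle-strip order.
    """
    (x0, y0, z0), (x1, y1, z1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    if length == 0.0:
        raise ValueError("degenerate segment")
    # Unit vector perpendicular to the segment in screen space.
    nx, ny = -dy / length, dx / length
    h = thickness / 2.0
    # Each corner keeps the depth of the endpoint it was offset from,
    # so the depth test against occluders still works.
    return [
        (x0 + nx * h, y0 + ny * h, z0),
        (x0 - nx * h, y0 - ny * h, z0),
        (x1 + nx * h, y1 + ny * h, z1),
        (x1 - nx * h, y1 - ny * h, z1),
    ]


quad = segment_to_quad((0.0, 0.0, 0.5), (10.0, 0.0, 0.7), thickness=2.0)
for vertex in quad:
    print(vertex)
```

Because the offset is computed directly in screen space, no per-segment billboard matrix is needed.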
An animated texture, possibly powered by a shader, is likely the fastest way to handle this.
Any geometry generation and rendering will limit the quality of the effect, and may take significantly more CPU time, memory bandwidth and draw calls.
Using a single animated texture on a quad, or a shader creating procedural lightning, will give constant speed and make the effect much simpler to implement. For that, this question may be of interest.
This is a difficult question to search for on Google, since the term has another meaning in finance.
Of course, what I mean here is "Drawing" as in .. computer graphics.. not money..
I am interested in preventing overdrawing for both 3D Drawing and 2D Drawing.
(should I make them into two different questions?)
I realize that this might be a very broad question since I didn't specify which technology to use. If it is too broad, maybe some hints on some resources I can read up will be okay.
EDIT:
What I mean by overdrawing is:
when you draw too many objects, rendering a single frame will be very slow
when you draw more area than what you need, rendering a single frame will be very slow
It's quite a complex topic.
The first thing to consider is frustum culling. It filters out objects that are not in the camera's field of view, so you can simply skip them at the render stage.
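A minimal sketch of such a test in Python, using bounding spheres against six frustum planes. The plane values below are a hand-built example (camera at the origin looking down +z, 90 degree field of view), and the object names are hypothetical; a real engine would extract the planes from its view-projection matrix.

```python
def sphere_in_frustum(center, radius, planes):
    """Return False only if the sphere lies fully outside some plane.

    planes: (nx, ny, nz, d) tuples with unit normals pointing inward,
            so a point is inside a plane when n . p + d >= 0.
    """
    cx, cy, cz = center
    for nx, ny, nz, d in planes:
        if nx * cx + ny * cy + nz * cz + d < -radius:
            return False
    return True


S = 0.5 ** 0.5  # normalises the diagonal side-plane normals
FRUSTUM = [
    (0.0, 0.0, 1.0, -1.0),    # near:   z >= 1
    (0.0, 0.0, -1.0, 100.0),  # far:    z <= 100
    (S, 0.0, S, 0.0),         # left:   x >= -z
    (-S, 0.0, S, 0.0),        # right:  x <= z
    (0.0, S, S, 0.0),         # bottom: y >= -z
    (0.0, -S, S, 0.0),        # top:    y <= z
]

objects = {
    "crate": ((0.0, 0.0, 10.0), 1.0),    # straight ahead
    "tower": ((50.0, 0.0, 10.0), 1.0),   # far off to the right
    "behind": ((0.0, 0.0, -5.0), 1.0),   # behind the camera
}
visible = [name for name, (center, radius) in objects.items()
           if sphere_in_frustum(center, radius, FRUSTUM)]
print(visible)  # ['crate']
```

The test is conservative: a sphere near a frustum corner can pass even though it is just outside, which is harmless because it only costs a little extra drawing.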
The second thing is Z-sorting of the objects that are in view of the camera. It is better to render them from front to back, so that near objects write a "near value" to the depth buffer and the pixels of far objects are not drawn, since they will not pass the depth test. This saves your GPU's fill rate and pixel-shader work. Note, however, that if you have semitransparent objects in the scene, they should be drawn after the opaque ones, in back-to-front order, so that alpha blending produces correct results.
Both are achievable if you use some kind of space partitioning such as an octree or quadtree. Which is better depends on your game: a quadtree is better for big open spaces and an octree is better for indoor spaces with many levels.
And don't forget about simple back-face culling, which can be enabled with a single line in DirectX and OpenGL, to avoid drawing faces that are turned away from the camera.
Question is really too broad :o) Check out these "pointers" and ask more specifically.
Typical overdraw inhibitors are:
Z-buffer
Occlusion based techniques (various buffer techniques, HW occlusions, ...)
Stencil test
On a slightly higher logical level:
culling (usually by view frustum)
scene organization techniques (usually trees or tiling)
rough drawing front to back (this is obviously supporting technique :o)
EDIT: added stencil test, which indeed has interesting overdraw-prevention uses, especially when combining 2D and 3D.
Reduce the number of objects you consider for drawing based on distance and on position (i.e. reject those outside of the viewing frustum).
Also consider using some sort of object-based occlusion system to allow large objects to obscure small ones. However this may not be worth it unless you have a lot of large objects with fairly regular shapes. You can pre-process potentially visible sets for static objects in some cases.
Your API will typically reject polygons that are not facing the viewpoint also, since you typically don't want to draw the rear-face.
When it comes to actual rendering time, it's often helpful to render opaque objects from front-to-back, so that the depth-buffer tests end up rejecting entire polygons. This works for 2D too, if you have depth-buffering turned on.
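Here is a toy Python model of why the order matters. It assumes a plain depth buffer with early depth rejection, where a fragment is shaded only if it passes the test; the function name and the three full-screen layers are made up for illustration.

```python
def shaded_with_depth_test(draws):
    """Count fragments that pass a less-than depth test and get shaded.

    draws: list of (depth, pixels) in submission order; a smaller depth
           means nearer to the camera.
    """
    depth_buffer = {}
    shaded = 0
    for depth, pixels in draws:
        for pixel in pixels:
            if depth < depth_buffer.get(pixel, float("inf")):
                depth_buffer[pixel] = depth
                shaded += 1
    return shaded


screen = range(100)  # hypothetical 100-pixel screen
layers = [(1.0, screen), (2.0, screen), (3.0, screen)]

front_to_back = shaded_with_depth_test(layers)
back_to_front = shaded_with_depth_test(list(reversed(layers)))
print(front_to_back)  # 100
print(back_to_front)  # 300
```

Submitted front to back, only the nearest layer is shaded; submitted back to front, every layer is shaded and then overwritten.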
Remember that this is a performance optimisation problem. Most applications will not have a significant problem with overdraw. Use tools like Pix or NVIDIA PerfHUD to measure your problem before you spend resources on fixing it.
Since blending is hurting the performance of our game, we tried several strategies for creating the "illusion" of blending. One of them is drawing a sprite only on every odd frame, so that the sprite is visible half of the time. The effect is quite good. (You need a decent frame rate, by the way, or else your sprite will noticeably flicker.)
Despite that, I would like to know if there are any good insights out there on avoiding blending in order to improve overall performance without compromising the visual experience (too much).
Is it the actual blending that's killing your performance? (i.e. video memory bandwidth)
What games commonly do these days to handle lots of alpha-blended stuff (think large explosions that cover the whole screen): render it into a smaller texture (e.g. 2x2 or 4x4 times smaller than the screen), and composite that back onto the main screen.
Doing that might require rendering depth buffer of opaque surfaces into that smaller texture as well, to properly handle intersections with opaque geometry. On some platforms (consoles) doing multisampling or depth buffer hackery might make that a very cheap operation; no such luck on regular PC though.
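A back-of-the-envelope Python sketch of the fill-rate saving. The resolution and layer count are assumed numbers, the function name is invented, and the model ignores the cost of the depth downsample mentioned above.

```python
def blended_fragments(width, height, layers, downscale=1):
    """Estimate blended fragments for full-screen particle layers.

    downscale: 1 renders directly to the screen; 2 or 4 renders into a
               buffer that is that many times smaller in each dimension
               and then composites it once at full resolution.
    """
    low_w = -(-width // downscale)   # ceiling division
    low_h = -(-height // downscale)
    particle_cost = low_w * low_h * layers
    composite_cost = width * height if downscale > 1 else 0
    return particle_cost + composite_cost


# Hypothetical case: 20 overlapping full-screen smoke layers at 1280x720.
print(blended_fragments(1280, 720, 20))               # 18432000
print(blended_fragments(1280, 720, 20, downscale=2))  # 5529600
print(blended_fragments(1280, 720, 20, downscale=4))  # 2073600
```

The saving grows with the amount of overdraw; with only one or two layers the extra composite pass can cancel it out.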
See article from GPU Gems 3 for example: High-Speed, Off-Screen Particles. Christer Ericson's blog post overviews a lot of optimization approaches as well: Optimizing the rendering of a particle system
Excellent article here about rendering particle systems quickly. It covers the smaller off-screen buffer technique and suggests quite a few other approaches.
You can read it here
It is not quite clear from your question what kind of blending is hurting your game's performance. Generally, blending is blazingly fast. If your problems are related to a particle system, then what is most likely to kill the framerate is the number and size of the particles drawn. In particular, lots of close-up (and therefore large) particles require high memory bandwidth and fill rate from the graphics card. I have implemented a particle system myself, and while I can render tons of particles in the distance, on weaker hardware I very much feel the negative impact of e.g. flying through smoke (which fills the entire screen because the viewer is in the middle of it).