I've built a scene graph based rendering system for my app's UI (http://audulus.com if you're curious). The app's UI is procedural, and a lot of it is animated. There are few pre-rendered images.
Currently, it caches unchanging groups of drawables in mipmapped textures. I use mipmaps because the UI is zoomable. Overall, this has been a big performance win, but there are several downsides:
Building the mipmaps (via glGenerateMipmap) takes time, reducing the frame rate when one part of the UI goes from animated to static.
Visual differences between the texture-cached geometry and not, causing slight flickering. (Might be able to get around this by being more clever with my path rendering code, but it seems hard)
Memory usage for all the textures (I could dump the offscreen textures, but that exacerbates problem 1)
A couple alternative approaches I've thought of:
Instead of texture caching, coalesce static paths into bigger paths. My paths are already VBO/VAO-based, but this could reduce the number of GL calls. (When turning off texture caching, my performance is mainly CPU-bound). Big win on memory usage. The primary problems with this approach are: complicating my path rendering shader (since it must handle paths with different attributes within one call to glDrawArrays), not handling the caching of other primitives (such as text), and more of a burden on the GPU than simply rendering a texture.
Still use textures, but avoid mip-mapping. As the UI is zoomed, textures could be resized (though this might have to be deferred since re-rendering the whole UI during zooming is too expensive). Delete textures for offscreen geometry. Downside of course is poor texture magnification/minification during UI zooming.
UPDATE
I tried (2). Resizing the textures is quite slow, so I prevent the UI from resizing them during zoom. This works reasonably well, but the magnification looks terrible when zooming starting small:
Note that some of the modules aren't texture cached because they are tagged as animating.
UPDATE 2
I'm beginning to work on approach 1, so I deactivated the texture caching.
Though I'm CPU bound, practically all my GPU-side load comes from my path anti-aliasing fragment shader. Here's what performance looks like with it on:
And with it off:
So further optimization of that will be a big win on the GPU side. I tried ditching it and going with 4x supersampling, but that looks like garbage, reminding me why I spent considerable time working on the path rendering shader.
Approach (1) is a big win. It has essentially the same performance as texture caching, but without any visual artifacts or much memory usage.
Related
Can I bind a 2000x2000 texture to a color attachment in a FBO, and tell OpenGL to behave exactly as if the texture was smaller, let's say 1000x1000?
The point is, in my rendering cycle I need many (mostly small) intermediate textures to render to, but I need only 1 at a time. I am thinking that, rather than creating many smaller textures, I will have only 1 appropriately large, and I will bind it to an FBO at hand, tell OpenGL to render only to part of it, and save memory.
Or maybe I should be destroying/recreating those textures many times per frame? That would certainly save even more memory, but wouldn't that cause a noticeable slowdown?
Can I bind a 2000x2000 texture to a color attachment in a FBO, and
tell OpenGL to behave exactly as if the texture was smaller, let's say
1000x1000?
Yes, just set glViewport() to the region you want to render to, and remember to adjust glScissor() bounding regions if you are ever enabling scissor testing.
Or maybe I should be destroying/recreating those textures many times
per frame? That would certainly save even more memory, but wouldn't
that cause a noticeable slowdown?
Completely destroying and recreating a new texture object every frame will be slow because it will cause constant memory reallocation overhead, so definitely don't do that.
Having a pool of pre-allocated textures which you cycle though is fine though - that's a pretty common technique. You won't really save much in terms of memory storing a 2K*2K texture vs storing 4 separate 1K*1K textures - the total storage requirement is the same and the additional metadata overhead is tiny in comparison - so if keeping them separate is easier in terms of application logic I'd suggest doing that.
I have a sphere with texture of earth that I generate on the fly with the canvas element from an SVG file and manipulate it.
The texture size is 16384x8192 , and less than this - it's look blurry on close zoom.
But this is a huge texture size and causing memory problems... (But it's look very good when it is working)
I think a better approach would be to split the sphere into 32 separated textures, each in size of 2048x2048
A few questions:
How can I split the sphere and assign the right textures?
Is this approach better in terms of memory and performance from a single huge texture?
Is there a better solution?
Thanks
You could subdivide a cube, and cubemap this.
Instead of having one texture per face, you would have NxN textures. 32 doesn't sound like a good number, but 24 for example does, (6x2x2).
You will still use the same amount of memory. If the shape actually needs to be spherical you can further subdivide the segments and normalize the entire shape (spherify it).
You probably cant even use such a big texture anyway.
notice the top sphere (cubemap, ignore isocube):
Typically, that's not something you'd do programmatically, but in a 3D program like Blender or 3D max. It involves some trivial mesh separation, UV mapping and material assignment. One other approach that's worth experimenting with would be to have multiple materials but only one mesh - you'd still get (somewhat) progressive loading. BUT
Are you sure you'd be better off with "chunks" loading sequentially rather than one big texture taking a huge amount of time? Sure, it'll improve a bit in terms of timeouts and caching, but the tradeoff is having big chunks of your mesh be textureless, which is noticeable and unasthetic.
There are a few approaches that would mitigate your problem. First, it's important to understand that texture loading optimization techniques - while common in game engines - aren't really part of threejs or what it's built for. You'll never get the near-seamless LODs or GPU optimization techniques that you'll get with UE4 or Unity. Furthermore webGL - while having made many strides over the past decade - is not ideal for handling vast texture sizes, not at the GPU level (since it's based on OpenGL ES, suited primarily for mobile devices) and certainly not at the caching level - we're still dealing with broswers here. You won't find a lot of webGL work done with vast textures of the dimensions you refer to.
Having said that,
A. A loader will let you do other things while your textures are loading so your user isn't staring at an 'unfinished mesh'. It lets you be pretty clever with dynamic loading times and UX design. Additionally, take a look at this gist to give you an idea for what a progressive texture loader could look like. A much more involved technique, that's JPEG specific, can be found here but I wouldn't approach it unless you're comfortable with low-level graphics programming.
B. Threejs does have a basic implementation of LOD although I haven't tinkered with it myself and am not sure it's useful for textures; that said, the basic premise to inquire into is whether you can load progressively higher-resolution files on a per-need basis, just like Google Earth does it for example.
C. This is out of the scope of your question - but I'd look into what happens under the hood in Unity's webgl export (which is based on threejs), and what kind of clever tricks are being employed there for similar purposes.
Finally, does your project have to be in webgl? For something ambitious and demanding, sometimes "proper" openGL / DX makes much more sense.
I wanted to come up with a crude way to "benchmark" the performance improvement of a tweak I made to a fragment shader (to be specific, I wanted to test the performance impact of the removal of the computation of the gamma for the resulting color using pow in the fragment shader).
So I figured that if a frame was taking 1ms to render an opaque cube model using my shader that if I set glDisable(GL_DEPTH_TEST) and loop my render call 100 times, that the frame would take 100ms to render.
I was wrong. Rendering it 100 times only results in about a 10x slowdown. Obviously if depth test is still enabled, most if not all of the fragments in the second and subsequent draw calls would not be computed because they would all fail the depth test.
However I must still be experiencing a lot of fragment culls even with depth test off.
My question is about whether my hardware (in this particular situation it is an iPad3 on iOS6.1 that I am experiencing this on -- a PowerVR SGX543MP4) is just being incredibly smart and is actually able to use the geometry of later draw calls to occlude and discard fragments from the earlier geometry. If this is not what's happening, then I cannot explain the better-than-expected performance that I am seeing. The question applies to all flavors of OpenGL and desktop GPUs as well, though.
Edit: I think an easy way to "get around" this optimization might be glEnable(GL_BLEND) or something of that sort. I will try this and report back.
PowerVR hardware is based on tile-based deferred rendering. It does not begin drawing fragments until after it receives all of the geometry information for a tile on screen. This is a more advanced hidden-surface removal technique than z-buffering, and what you have actually discovered here is that enabling alpha blending breaks the hardware's ability to exploit this.
Alpha blending is very order-dependent, and so no longer can rasterization and shading be deferred to the point where only the top-most geometry in a tile has to be drawn. Without alpha blending, since there is no data dependency on the order things are drawn in, completely obscured geometry can be skipped before expensive per-fragment operations occur. It is only when you start blending fragments that a true order-dependent situation arises and completely destroys the hardware's ability to defer/cull fragment processing for hidden surfaces.
In all honesty, if you are trying to optimize for a platform based on PowerVR hardware you should probably make this one of your goals. By that, I mean, before optimizing shaders first consider whether you are drawing things in an order and/or with states that hurt the PowerVR hardware's ability to do TBDR. As you have just discovered, blending is considerably more expensive on PowerVR hardware than other hardware... the operation itself is no more complicated, it just prevents PVR hardware from working the special way it was designed to.
I can confirm that only after adding both lines:
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA,GL_ONE_MINUS_SRC_ALPHA);
did the frame render time increase in a linear fashion in response to the repeated draw calls. Now back to my crude benchmarking.
I'm trying to create a platformer game, and I am taking various sprite blocks, and piecing them together in order to draw the level. This requires drawing a large number of sprites on the screen every single frame. A good computer has no problem handling drawing all the sprites, but it starts to impact performance on older computers. Since this is NOT a big game, I want it to be able to run on almost any computer. Right now, I am using the following DirectX function to draw my sprites:
D3DXVECTOR3 center(0.0f, 0.0f, 0.0f);
D3DXVECTOR3 position(static_cast<float>(x), static_cast<float>(y), z);
(my LPD3DXSPRITE object)->Draw((sprite texture pointer), NULL, ¢er, &position, D3DCOLOR_ARGB(a, r, g, b));
Is there a more efficient way to draw these pictures on the screen? Is there a way that I can use less complex picture files (I'm using regular png's right now) to speed things up?
To sum it up: What is the most performance friendly way to draw sprites in DirectX? thanks!
The ID3DXSPRITE interface you are using is already pretty efficient. Make sure all your sprite draw calls happen in one batch if possible between the sprite begin and end calls. This allows the sprite interface to arrange the draws in the most efficient way.
For extra performance you can load multiple smaller textures in to one larger texture and use texture coordinates to get them out. This makes it so textures don't have to be swapped as frequently. See:
http://nexe.gamedev.net/directknowledge/default.asp?p=ID3DXSprite
The file type you are using for the textures does not matter as long as they are are preloaded into textures. Make sure you load them all in to textures once when the game/level is loading. Once you have loaded them in to textures it does not matter what format they were originally in.
If you still are not getting the performance you want, try using PIX to profile your application and find where the bottlenecks really are.
Edit:
This is too long to fit in a comment, so I will edit this post.
When I say swapping textures I mean binding them to a texture stage with SetTexture. Each time SetTexture is called there is a small performance hit as it changes the state of the texture stage. Normally this delay is fairly small, but can be bad if DirectX has to pull the texture from system memory to video memory.
ID3DXsprite will reorder the draws that are between begin and end calls for you. This means SetTexture will typically only be called once for each texture regardless of the order you draw them in.
It is often worth loading small textures into a large one. For example if it were possible to fit all small textures in to one large one, then the texture stage could just stay bound to that texture for all draws. Normally this will give a noticeable improvement, but testing is the only way to know for sure how much it will help. It would look terrible, but you could just throw in any large texture and pretend it is the combined one to test what performance difference there would be.
I agree with dschaeffer, but would like to add that if you are using a large number different textures, it may better to smush them together on a single (or few) larger textures and adjust the texture coordinates for different sprites accordingly. Texturing state changes cost a lot and this may speed things up on older systems.
Since having blends is hitting perfomance of our game, we tried several blending strategies for creating the "illusion" of blending. One of them is drawing a sprite every odd frame, resulting in the sprite being visible half of the time. The effect is quit good. (You'd need a proper frame rate by the way, else your sprite would be noticeably flickering)
Despite that, I would like to know if there are any good insights out there in avoiding blending in order to better the overal performance without compromising (too much) of the visual experience.
Is it the actual blending that's killing your performance? (i.e. video memory bandwidth)
What games commonly do these days to handle lots of alpha blended stuff (think large explosions that cover whole screen): render them into a smaller texture (e.g. 2x2 smaller or 4x4 smaller than screen), and composite them back onto the main screen.
Doing that might require rendering depth buffer of opaque surfaces into that smaller texture as well, to properly handle intersections with opaque geometry. On some platforms (consoles) doing multisampling or depth buffer hackery might make that a very cheap operation; no such luck on regular PC though.
See article from GPU Gems 3 for example: High-Speed, Off-Screen Particles. Christer Ericson's blog post overviews a lot of optimization approaches as well: Optimizing the rendering of a particle system
Excellent article here about rendering particle systems quickly. It covers the smaller off screen buffer technique and suggest quite a few other approaches.
You can read it here
It is not quite clear from your question what kind of application of blending hits your game's performance. Generally blending is blazingly fast. If your problems are particle system related, then what is most likely to kill framerate is the number and size of particles drawn. Particularly lots of close up (and therefore large) particles will require high memory bandwidth and fill rate of the graphics card. I have implemented a particle system myself, and while I can render tons of particles in the distance, I feel the negative impact of e.g. flying through smoke (that will fill the entire screen because the viewer is amidst of it) very much on weaker hardware.