graphics: best performance with floating-point accumulation images

I need to speed up some particle system eye candy I'm working on. The eye candy involves additive blending, accumulation, and trails and glow on the particles. At the moment I'm rendering by hand into a floating-point image buffer, converting to unsigned chars at the last minute, then uploading to an OpenGL texture. To simulate glow I'm rendering the same texture multiple times at different resolutions and different offsets. This is proving to be too slow, so I'm looking at changing something. The problem is that my dev hardware is an Intel GMA950 while the target machine has an Nvidia GeForce 8800, so it is difficult to profile OpenGL stuff at this stage.
I did some very unscientific profiling and found that most of the slowdown is coming from dealing with the float image: scaling all the pixels by a constant to fade them out, and converting the float image to unsigned chars and uploading to the graphics hardware. So, I'm looking at the following options for optimization:
Replace floats with uint32s in a fixed-point 16.16 configuration
Optimize float operations using SSE2 assembly (the image buffer is a 1024*768*3 array of floats)
Use the OpenGL Accumulation Buffer instead of the float array
Use OpenGL floating-point FBOs instead of the float array
Use OpenGL pixel/vertex shaders
Do you have any experience with any of these possibilities? Any thoughts or advice? Something else I haven't thought of?

The problem is simply the sheer amount of data you have to process.
Your float buffer is 9 megabytes in size, and you touch the data more than once. Most likely your rendering loop looks something like this:
Clear the buffer
Render something on it (uses reads and writes)
Convert to unsigned bytes
Upload to OpenGL
That's a lot of data to move around, and the cache can't help you much because the image is much larger than your cache. Let's assume you touch every pixel five times. If so, you move 45 MB of data in and out of the slow main memory per frame (at 60 frames per second that works out to roughly 2.7 GB/s). 45 MB does not sound like much data, but consider that almost every memory access will be a cache miss. The CPU will spend most of its time waiting for the data to arrive.
If you want to stay on the CPU to do the rendering there's not much you can do. Some ideas:
Using SSE non-temporal loads and stores may help, but they will complicate your task quite a bit (you have to align your reads and writes).
Try breaking up your rendering into tiles, e.g. do everything on smaller rectangles (256*256 or so). The idea is that you actually get a benefit from the cache: after you've cleared a tile, for example, the whole tile will still be in the cache, so rendering it and converting it to bytes will be a lot faster because there is no need to fetch the data from the relatively slow main memory again (see the sketch after this list).
Last resort: Reduce the resolution of your particle effect. This will give you a good bang for the buck at the cost of visual quality.
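To make the SSE and tiling ideas above concrete, here is a minimal sketch. The tile size, the per-tile helper functions, and the fade loop are all illustrative assumptions, not code from the question:

#include <emmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical per-tile stages, standing in for the full-image passes:
void clearTile(int x, int y, int size);
void renderParticlesInTile(int x, int y, int size);
void convertTileToBytes(int x, int y, int size);  // float -> unsigned char

// Fade a 16-byte-aligned float buffer in place, four floats at a time.
// 'count' is assumed to be a multiple of 4.
void fadeBufferSSE(float* pixels, std::size_t count, float fade)
{
    const __m128 k = _mm_set1_ps(fade);
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(pixels + i);          // aligned load
        _mm_store_ps(pixels + i, _mm_mul_ps(v, k));  // scaled write-back
    }
}

// Run every pass on one small tile before moving to the next, so each
// tile stays cache-resident between passes.
void renderFrameTiled(int width, int height)
{
    const int TILE = 256;
    for (int ty = 0; ty < height; ty += TILE)
        for (int tx = 0; tx < width; tx += TILE) {
            clearTile(tx, ty, TILE);
            renderParticlesInTile(tx, ty, TILE);
            convertTileToBytes(tx, ty, TILE);
        }
}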
The best solution is to move the rendering onto the graphics card. Render-to-texture functionality is standard these days. It's a bit tricky to get working with OpenGL because you have to decide which extension to use, but once you have it working performance is no longer an issue.
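For reference, a minimal render-to-texture setup using the EXT_framebuffer_object extension of that era might look like this sketch (the 1024x768 size matches the question; error checking is omitted):

GLuint fbo, tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 1024, 768, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

glGenFramebuffersEXT(1, &fbo);
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_2D, tex, 0);
// ... render the particles here; the result lands in 'tex' ...
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);  // back to the window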
Btw - do you really need floating-point render targets? If you can get away with 3 bytes per pixel you will see a nice performance improvement.

It's best to move the rendering calculation for massive particle systems like this over to the GPU, which has hardware optimized to do exactly this job as fast as possible.
Aaron is right: represent each individual particle with a sprite. You can calculate the movement of the sprites in space (e.g., accumulating their positions per frame) on the CPU using SSE2, but do all the additive blending and accumulation on the GPU via OpenGL. (Drawing sprites additively is easy enough.) You can handle your trails and blur either by doing it in shaders (the "pro" way), by rendering to an accumulation buffer and back, or by simply generating a bunch of additional sprites on the CPU to represent the trail and throwing them at the rasterizer.
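For what it's worth, additive sprite drawing in fixed-function OpenGL is just a blend-state change:

glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE);  // additive: fragments brighten what's beneath
// ... draw the particle quads here; overlapping sprites accumulate ...
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);  // back to normal alpha blending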

Try to replace the manual code with sprites: an OpenGL texture with an alpha of, say, 10%. Then draw lots of them on the screen (ten of them in the same place to get the full glow).

If by "manual" you mean that you are using the CPU to poke pixels, I think pretty much anything you can do where you draw textured polygons using OpenGL instead will represent a huge speedup.

Related

Rendering only to a part of a texture

Can I bind a 2000x2000 texture to a color attachment in an FBO, and tell OpenGL to behave exactly as if the texture were smaller, let's say 1000x1000?
The point is, in my rendering cycle I need many (mostly small) intermediate textures to render to, but only one at a time. I am thinking that, rather than creating many smaller textures, I could have just one appropriately large texture, bind it to the FBO at hand, tell OpenGL to render only to part of it, and save memory.
Or maybe I should be destroying/recreating those textures many times per frame? That would certainly save even more memory, but wouldn't that cause a noticeable slowdown?
Can I bind a 2000x2000 texture to a color attachment in an FBO, and tell OpenGL to behave exactly as if the texture were smaller, let's say 1000x1000?
Yes, just set glViewport() to the region you want to render to, and remember to adjust glScissor() bounding regions if you are ever enabling scissor testing.
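A sketch of that setup ('fbo' being the framebuffer object with the 2000x2000 attachment):

glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glViewport(0, 0, 1000, 1000);  // draw only into the lower-left 1000x1000
glScissor(0, 0, 1000, 1000);   // relevant only if GL_SCISSOR_TEST is enabled
// ... render; afterwards sample the texture with coordinates in [0, 0.5] ...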
Or maybe I should be destroying/recreating those textures many times per frame? That would certainly save even more memory, but wouldn't that cause a noticeable slowdown?
Completely destroying and recreating a new texture object every frame will be slow because it will cause constant memory reallocation overhead, so definitely don't do that.
Having a pool of pre-allocated textures which you cycle through is fine though - that's a pretty common technique. You won't really save much in terms of memory storing a 2K*2K texture vs storing 4 separate 1K*1K textures - the total storage requirement is the same and the additional metadata overhead is tiny in comparison - so if keeping them separate is easier in terms of application logic I'd suggest doing that.
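A sketch of such a pool (the count and sizes here are illustrative); everything is allocated once up front and handed out round-robin:

#include <vector>
#include <cstddef>
#include <GL/gl.h>

struct TexturePool {
    std::vector<GLuint> textures;
    std::size_t next = 0;

    void init(std::size_t count, GLsizei w, GLsizei h) {
        textures.resize(count);
        glGenTextures((GLsizei)count, textures.data());
        for (GLuint tex : textures) {
            glBindTexture(GL_TEXTURE_2D, tex);
            glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                         GL_RGBA, GL_UNSIGNED_BYTE, NULL);
        }
    }
    // Reuse the next texture instead of destroying/recreating anything.
    GLuint acquire() { return textures[next++ % textures.size()]; }
};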

How to get good performance on the gfx card with images larger than the max texture size?

At work, I work with very large images.
I currently do my rendering via SDL2.
The max texture size on the graphics card my machine uses is 8192x8192.
Because my data sets are larger than what will fit in a single texture, I split my image into multiple textures after it is loaded, and tile them.
However, I have found that this comes at a very steep cost. Rendering only 4 textures around 5K by 5K (pixels) each completely tanks the framerate!
Conventional wisdom tells me that the fewer texture swaps the better, but with such large images I've found myself between a rock and a hard place.
One thing I've considered is that perhaps if I were to chunk the images up into many small textures, I could take advantage of culling, which would hopefully be a net win. But there's a big problem with that approach - I need to be able to zoom out.
Another option would be to downscale the images. This seems promising, as the analysis I am doing on the images does not require the high resolution that the images provide.
I know that OpenGL has mipmapping, but I am inexperienced with OpenGL and am wary of diving into it for a work project. I am not aware of a good way to downscale the images within the confines of SDL2, and for reasons specific to the work I am doing, scaling the images down offline (before I load them) is not appealing.
What is the best approach for me to get the highest framerate in this situation?

DirectX9 - Efficiently Drawing Sprites

I'm trying to create a platformer game, and I am taking various sprite blocks, and piecing them together in order to draw the level. This requires drawing a large number of sprites on the screen every single frame. A good computer has no problem handling drawing all the sprites, but it starts to impact performance on older computers. Since this is NOT a big game, I want it to be able to run on almost any computer. Right now, I am using the following DirectX function to draw my sprites:
D3DXVECTOR3 center(0.0f, 0.0f, 0.0f);
D3DXVECTOR3 position(static_cast<float>(x), static_cast<float>(y), z);
(my LPD3DXSPRITE object)->Draw((sprite texture pointer), NULL, &center, &position, D3DCOLOR_ARGB(a, r, g, b));
Is there a more efficient way to draw these pictures on the screen? Is there a way that I can use less complex picture files (I'm using regular png's right now) to speed things up?
To sum it up: what is the most performance-friendly way to draw sprites in DirectX? Thanks!
The ID3DXSprite interface you are using is already pretty efficient. Make sure all your sprite draw calls happen in one batch between the sprite Begin and End calls if possible. This allows the sprite interface to arrange the draws in the most efficient way.
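For instance, one batch with texture sorting enabled might look like this ('blocks' and its fields are illustrative):

// One Begin/End pair around all draws; D3DXSPRITE_SORT_TEXTURE lets D3DX
// reorder them to minimize texture changes.
sprite->Begin(D3DXSPRITE_ALPHABLEND | D3DXSPRITE_SORT_TEXTURE);
for (size_t i = 0; i < blocks.size(); ++i)
    sprite->Draw(blocks[i].texture, NULL, &center, &blocks[i].position,
                 D3DCOLOR_ARGB(255, 255, 255, 255));
sprite->End();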
For extra performance you can load multiple smaller textures into one larger texture and use texture coordinates to get them out. This makes it so textures don't have to be swapped as frequently. See:
http://nexe.gamedev.net/directknowledge/default.asp?p=ID3DXSprite
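With an atlas you select a tile through the source RECT parameter of Draw; the 64x64 grid here is an assumed layout, not something from the question:

RECT src;
src.left   = tileX * 64;
src.top    = tileY * 64;
src.right  = src.left + 64;
src.bottom = src.top + 64;
// Same texture every call, so the texture stage never has to change.
sprite->Draw(atlasTexture, &src, &center, &position,
             D3DCOLOR_ARGB(255, 255, 255, 255));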
The file type you are using for the textures does not matter as long as they are preloaded into textures. Make sure you load them all into textures once when the game/level is loading. Once you have loaded them into textures it does not matter what format they were originally in.
If you still are not getting the performance you want, try using PIX to profile your application and find where the bottlenecks really are.
Edit:
This is too long to fit in a comment, so I will edit this post.
When I say swapping textures I mean binding them to a texture stage with SetTexture. Each time SetTexture is called there is a small performance hit as it changes the state of the texture stage. Normally this delay is fairly small, but can be bad if DirectX has to pull the texture from system memory to video memory.
ID3DXSprite will reorder the draws that are between Begin and End calls for you. This means SetTexture will typically only be called once per texture, regardless of the order you draw them in.
It is often worth loading small textures into a large one. For example if it were possible to fit all small textures in to one large one, then the texture stage could just stay bound to that texture for all draws. Normally this will give a noticeable improvement, but testing is the only way to know for sure how much it will help. It would look terrible, but you could just throw in any large texture and pretend it is the combined one to test what performance difference there would be.
I agree with dschaeffer, but would like to add that if you are using a large number of different textures, it may be better to pack them together into a single (or a few) larger textures and adjust the texture coordinates for the different sprites accordingly. Texture state changes cost a lot, and this may speed things up on older systems.

OpenGL performance on rendering "virtual gallery" (textures)

I have a considerable (120-240) number of 640x480 images that will be displayed as textured flat surfaces (4-vertex polygons) in a 3D environment. About 30-50% of them will be visible in a given frame. It is possible for them to overlap. Nothing else will be present in the environment.
The question is - will a modern and/or a few-years-old GPU (let's say a Radeon 9550) cope with that, and what frame rate can I expect? I aim for 20 FPS, but 30-40 would be nice. Would changing the resolution to 320x240 make that more likely?
I do not have any previous experience with performance issues of 3D graphics on modern GPUs, and unfortunately I must make a design choice. I don't want to waste time on doing something that couldn't have worked :-)
Assuming you have RGB textures, that would be 640*480*3*120 bytes = 105 MB minimum of texture data, which should fit in the VRAM of more recent graphics cards without swapping, so this won't be an issue. However, texture lookups might get a bit problematic, but this is hard for me to judge without trying. Given that you only need to process 50% of the 105 MB, that is about 50 MB (a very rough estimate), and targeting 20 FPS means 20 × 50 MB/s = about 1 GB/s. This should be possible to throughput even on older hardware.
Reading the specs of an older Radeon 9600 XT: it lists a peak fill rate of 2000 Mpixels/s, and if I'm not mistaken you require far less than 100 Mpixels/s. Peak memory bandwidth is specified as 9.6 GB/s, while you'd need about 1 GB/s (as explained above).
I would argue that this should be possible if done correctly - especially on current hardware it should be no problem at all.
Anyway, you should simply try it out: loading some random 120 textures and displaying them on 120 quads can be done in very few lines of code with hardly any effort.
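In immediate-mode OpenGL the whole test is roughly this (the grid placement is arbitrary, and 'textures' is assumed to hold the 120 texture IDs):

glEnable(GL_TEXTURE_2D);
for (int i = 0; i < 120; ++i) {
    glBindTexture(GL_TEXTURE_2D, textures[i]);
    glPushMatrix();
    glTranslatef((i % 12) * 1.1f, (i / 12) * 1.1f, 0.0f);  // lay out in a grid
    glBegin(GL_QUADS);
    glTexCoord2f(0, 0); glVertex2f(0, 0);
    glTexCoord2f(1, 0); glVertex2f(1, 0);
    glTexCoord2f(1, 1); glVertex2f(1, 1);
    glTexCoord2f(0, 1); glVertex2f(0, 1);
    glEnd();
    glPopMatrix();
}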
First of all, you should realize that the dimensions of textures should normally be powers of two, so if you can change them, something like 512x256 (for example) would be a better starting point.
From that, you can create MIPmaps of the original, which are simply versions of the original scaled down by powers of two, so if you started with 512x256, you'd then create versions at 256x128, 128x64, 64x32, 32x16, 16x8, 8x4, 4x2, 2x1 and 1x1. When you've done this, OpenGL can/will select the "right" one for the size it'll show up at in the final display. This generally reduces the work (and improves quality) in scaling the texture to the desired size.
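In the OpenGL of that generation, gluBuild2DMipmaps builds the whole chain in one call, and a mipmapped minification filter tells the driver to use it ('tex' and 'pixels' are illustrative):

glBindTexture(GL_TEXTURE_2D, tex);
gluBuild2DMipmaps(GL_TEXTURE_2D, GL_RGB, 512, 256,
                  GL_RGB, GL_UNSIGNED_BYTE, pixels);  // generates 256x128 ... 1x1
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                GL_LINEAR_MIPMAP_LINEAR);  // trilinear filtering across levels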
The obvious sticking point with that would be running out of texture memory. If memory serves, in the 9550 timeframe you could probably expect 256 MB of on-board memory, which would be about sufficient, but chances are pretty good that some of the textures would be in system RAM. That overflow would probably be fairly small though, so it probably won't be terribly difficult to maintain the kind of framerate you're hoping for. If you were to add a lot more textures, however, it would eventually become a problem. In that case, reducing the original size by 2 in each dimension (for example) would reduce your memory requirement by a factor of 4, which would make fitting them into memory a lot easier.

Tackling alpha blend in OpenGL for better performance

Since blending is hitting the performance of our game, we tried several blending strategies to create the "illusion" of blending. One of them is drawing a sprite only every odd frame, so the sprite is visible half of the time. The effect is quite good. (You'd need a decent frame rate, by the way, or the sprite will flicker noticeably.)
Despite that, I would like to know if there are any good insights out there on avoiding blending in order to improve overall performance without compromising (too much of) the visual experience.
Is it the actual blending that's killing your performance? (i.e. video memory bandwidth)
What games commonly do these days to handle lots of alpha-blended stuff (think large explosions that cover the whole screen): render it into a smaller texture (e.g. 2x2 or 4x4 times smaller than the screen), and composite that back onto the main screen.
Doing that might require rendering depth buffer of opaque surfaces into that smaller texture as well, to properly handle intersections with opaque geometry. On some platforms (consoles) doing multisampling or depth buffer hackery might make that a very cheap operation; no such luck on regular PC though.
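The technique boils down to two passes, roughly like this (the names are illustrative, and the half-size target is just an example):

// Pass 1: blend the particles into a smaller off-screen color target.
glBindFramebuffer(GL_FRAMEBUFFER, particleFbo);  // e.g. half-size in each dimension
glViewport(0, 0, screenW / 2, screenH / 2);
drawParticles();

// Pass 2: composite the (upsampled) result back over the main scene.
glBindFramebuffer(GL_FRAMEBUFFER, 0);
glViewport(0, 0, screenW, screenH);
drawFullScreenQuad(particleTex);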
See the article from GPU Gems 3, for example: High-Speed, Off-Screen Particles. Christer Ericson's blog post, Optimizing the rendering of a particle system, also gives an overview of a lot of optimization approaches.
Excellent article here about rendering particle systems quickly. It covers the smaller off-screen buffer technique and suggests quite a few other approaches.
You can read it here
It is not quite clear from your question what kind of blending is hitting your game's performance. Generally blending is blazingly fast. If your problems are particle system related, then what is most likely killing your framerate is the number and size of particles drawn. In particular, lots of close-up (and therefore large) particles will require high memory bandwidth and fill rate from the graphics card. I have implemented a particle system myself, and while I can render tons of particles in the distance, I feel the negative impact of, e.g., flying through smoke (which fills the entire screen because the viewer is in the midst of it) very strongly on weaker hardware.