I have a considerable number (120-240) of 640x480 images that will be displayed as textured flat surfaces (4-vertex polygons) in a 3D environment. About 30-50% of them will be visible in a given frame. It is possible for them to overlap. Nothing else will be present in the environment.
The question is - will a modern and/or few-years-old (let's say Radeon 9550) GPU cope with that, and what frame rate can I expect? I aim for 20 FPS, but 30-40 would be nice. Would changing the resolution to 320x240 make that more likely?
I do not have any previous experience with the performance of 3D graphics on modern GPUs, and unfortunately I must make a design choice. I don't want to waste time on something that could never have worked :-)
Assuming you have RGB textures, that would be 640*480*3*120 bytes = 105 MB minimum of texture data, which should fit in the VRAM of more recent graphics cards without swapping, so this won't be an issue. However, texture lookups might get a bit problematic, but this is hard for me to judge without trying. Given that you only need to process 50% of the 105 MB, that is about 50 MB (a very rough estimate), and targeting 20 FPS means 20*50 MB/sec = about 1 GB/sec. This should be possible to throughput even on older hardware.
Reading the specs of an older Radeon 9600 XT, it lists a peak fill-rate of 2000 Mpixels/sec, and if I'm not mistaken you require far less than 100 Mpixels/sec. Peak memory bandwidth is specified as 9.6 GB/s, while you'd need about 1 GB/s (as explained above).
I would argue that this should be possible, if done correctly - especially current hardware should have no problem at all.
Anyway, you should simply try it out: loading some random 120 textures and displaying them on 120 quads can be done in very few lines of code with hardly any effort.
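To show how little code that test needs, here is a minimal sketch using the fixed-function pipeline (era-appropriate for a Radeon 9550). loadImageRGB() is a hypothetical loader, and the quad placement is just a placeholder:

```cpp
#include <GL/gl.h>
#include <vector>

std::vector<unsigned char> loadImageRGB(int index); // hypothetical: returns 640*480*3 bytes

const int NUM_IMAGES = 120;
GLuint textures[NUM_IMAGES];

void initTextures() {
    glGenTextures(NUM_IMAGES, textures);
    for (int i = 0; i < NUM_IMAGES; ++i) {
        std::vector<unsigned char> pixels = loadImageRGB(i);
        glBindTexture(GL_TEXTURE_2D, textures[i]);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        // Note: 640x480 is not a power of two; hardware of that era may
        // require resizing to e.g. 512x512 first (see the next answer).
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, 640, 480, 0,
                     GL_RGB, GL_UNSIGNED_BYTE, pixels.data());
    }
}

void drawFrame() {
    glEnable(GL_TEXTURE_2D);
    for (int i = 0; i < NUM_IMAGES; ++i) {
        glBindTexture(GL_TEXTURE_2D, textures[i]);
        glBegin(GL_QUADS); // one flat textured quad per image; placement is a placeholder
        glTexCoord2f(0, 0); glVertex3f(-1.0f, -1.0f, -0.1f * i);
        glTexCoord2f(1, 0); glVertex3f( 1.0f, -1.0f, -0.1f * i);
        glTexCoord2f(1, 1); glVertex3f( 1.0f,  1.0f, -0.1f * i);
        glTexCoord2f(0, 1); glVertex3f(-1.0f,  1.0f, -0.1f * i);
        glEnd();
    }
}
```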
First of all, you should realize that the dimensions of textures should normally be powers of two, so if you can change them, something like 512x256 (for example) would be a better starting point.
From that, you can create MIPmaps of the original, which are simply versions of the original scaled down by powers of two, so if you started with 512x256, you'd then create versions at 256x128, 128x64, 64x32, 32x16, 16x8, 8x4, 4x2, 2x1 and 1x1. When you've done this, OpenGL can/will select the "right" one for the size it'll show up at in the final display. This generally reduces the work (and improves quality) in scaling the texture to the desired size.
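For reference, OpenGL can build the whole chain for you. A short sketch, assuming `tex` is a generated texture name and `pixels` holds the 512x256 RGB image (glGenerateMipmap needs GL 3.0+; older drivers can use gluBuild2DMipmaps from GLU instead):

```cpp
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, 512, 256, 0,
             GL_RGB, GL_UNSIGNED_BYTE, pixels);
glGenerateMipmap(GL_TEXTURE_2D); // builds 256x128 ... down to 1x1
// Tell the sampler to actually use the chain (trilinear filtering):
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
```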
The obvious sticking point with that would be running out of texture memory. If memory serves, in the 9550 timeframe you could probably expect 256 MB of on-board memory, which would be about sufficient, but chances are pretty good that some of the textures would be in system RAM. That overflow would probably be fairly small though, so it probably won't be terribly difficult to maintain the kind of framerate you're hoping for. If you were to add a lot more textures, however, it would eventually become a problem. In that case, reducing the original size by 2 in each dimension (for example) would reduce your memory requirement by a factor of 4, which would make fitting them into memory a lot easier.
I understand mipmapping pretty well. What I do not understand (on a hardware/driver level) is how mipmapping improves the performance of an application (at least this is often claimed). The driver does not know until the fragment shader is executed which mipmap level is going to be accessed, so all mipmap levels need to be present in VRAM anyway, or am I wrong?
What exactly is causing the performance improvement?
You are no doubt aware that each texel in the lower LODs of the mip-chain covers a higher percentage of the total texture image area, correct?
When you sample a texture at a distant location the hardware will use a lower LOD. When this happens, the sample neighborhood necessary to resolve minification becomes smaller, so fewer (uncached) fetches are necessary. It is all about the amount of memory that actually has to be fetched during texture sampling, and not the amount of memory occupied (assuming you are not running into texture thrashing).
I think this probably deserves a visual representation, so I will borrow the following diagram from the excellent series of tutorials at arcsynthesis.org.
On the left, you see what happens when you naïvely sample at a single LOD all of the time (this diagram is showing linear minification filtering, by the way) and on the right you see what happens with mipmapping. Not only does it improve image quality by more closely matching the fragment's effective size, but because the number of texels in lower mipmap LODs is smaller, they can be cached much more efficiently.
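To make the selection mechanism concrete, here is a simplified CPU sketch of the level computation the hardware performs per fragment (real GPUs derive the derivatives from neighboring fragments in a 2x2 quad; this is an illustration, not anyone's actual driver code):

```cpp
#include <algorithm>
#include <cmath>

// dudx etc. are the screen-space derivatives of the texture coordinates.
float mipLevel(float dudx, float dvdx, float dudy, float dvdy,
               float texW, float texH) {
    // Length of the fragment's footprint in texel space, per screen axis.
    float lenX = std::sqrt(dudx * texW * dudx * texW + dvdx * texH * dvdx * texH);
    float lenY = std::sqrt(dudy * texW * dudy * texW + dvdy * texH * dvdy * texH);
    float rho = std::max(lenX, lenY);       // texels covered per one-pixel step
    return std::max(0.0f, std::log2(rho));  // distant surface -> large rho -> coarser level
}
```

The key point: the level falls out of these derivatives per fragment, so while the chain occupies memory, the fetches themselves stay within a small neighborhood of the chosen level.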
Mipmaps are useful for at least two reasons:
visual quality - scenes look much better in the distance; there is more blur (which usually looks better than flickering pixels). Additionally, anisotropic filtering can be used, which improves visual quality a lot.
performance: since distant objects can use a smaller texture, the whole operation should be faster; sometimes the entire mip level can be placed in the texture cache. This is called cache coherency.
great discussion from arcsynthesis about performance
in general mipmaps need only about 33% more memory (the extra levels add 1/4 + 1/16 + 1/64 + ... ≈ 1/3 of the base size), so it is quite a low cost for better quality and a potential performance gain. Note that the real performance improvement should be measured for your particular scene structure.
see info here: http://www.tomshardware.com/reviews/ati,819-2.html
I've found that the maximum texture size my OpenGL implementation supports is 8192, but the image I'm working with is 16997x15931. As you can see in this link, I've completed the class COpenGLControl, customized it for my own use to work with a smaller 7697x7309 image, and activated different navigation tasks for it.
Render an outlined red rectangle on top a 2D texture in OpenGL
But now, in the last stages of the work, I've decided to change the part that applies the texture so that it can handle images bigger than 8192 pixels.
Questions:
Is this possible with my OpenGL?
Which concept should I study: mipmaps, multiple texturing?
Will it improve the performance of the code?
Right now my program uses 271 MB of RAM just for showing this smaller image (7697x7309), and I'm going to add a task to it (image-processing filtering) that, despite all my effort to optimize the code, uses 376 MB of RAM for the 7697x7309 image (that code is already written as a console application and will be combined with this project). So I think the final project would use up to 700 MB of RAM for images near the 7000x7000 size. Obviously, for the bigger image (16997x15931) the RAM usage will be a lot higher!
So I'm looking for a concept to handle images bigger than the maximum texture size and also to optimize the performance of the program.
More Questions:
What concept should I study in OpenGL to achieve the above goal?
Could you explain a little about the concept you suggest?
I've asked this question on Game Development too, but decided to repeat it here since it may get more viewers. As soon as I get an answer, I will delete the question from one of the sites, so don't worry about the cross-post.
I will try to sum up my comments for the original question.
know your actual OpenGL version: maybe you can load some modern extensions and work with an even more recent version of OpenGL.
if possible, you can take a look at sparse textures (megatextures): ARB_sparse_texture or AMD_sparse_texture
to reduce memory you can use some texture compression:
How to: load DDS files in OpenGL.
another simple idea: you can split the huge texture into four smaller textures (from 16k x 16k into four 8k x 8k) and render four quads (see the tiling sketch after this list).
maybe you can use OpenCL or CUDA to do the work?
regarding mipmaps: a mipmap chain is a set of progressively smaller versions of your input texture. Mipmaps improve performance and the final filtering quality, but you need about 33% more memory for a texture with a full mipmap chain. In your case they could be very helpful: for instance, when you look at a wall from a huge distance, you do not have to use the full (large) texture... a small version of it is enough. g-truc on mipmaps
In general there are a lot of options, but which is simplest and fastest to implement depends on your experience.
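As a concrete starting point for the splitting idea above, here is a rough sketch that cuts an image into tiles no larger than GL_MAX_TEXTURE_SIZE and creates one texture per tile. copyTile() is a hypothetical helper that extracts a sub-rectangle of pixels; non-power-of-two tile sizes need GL 2.0+:

```cpp
#include <GL/gl.h>
#include <algorithm>
#include <vector>

// Hypothetical helper: copies a w x h sub-rectangle starting at (x, y)
// out of an RGB source image that is srcW pixels wide.
std::vector<unsigned char> copyTile(const unsigned char* src, int srcW,
                                    int x, int y, int w, int h);

struct Tile { GLuint tex; int x0, y0, x1, y1; }; // texture + placement in image space

std::vector<Tile> buildTiles(const unsigned char* pixels, int imgW, int imgH) {
    GLint maxSize = 0;
    glGetIntegerv(GL_MAX_TEXTURE_SIZE, &maxSize); // 8192 on the hardware in question
    std::vector<Tile> tiles;
    for (int y = 0; y < imgH; y += maxSize) {
        for (int x = 0; x < imgW; x += maxSize) {
            int w = std::min<int>(maxSize, imgW - x);
            int h = std::min<int>(maxSize, imgH - y);
            std::vector<unsigned char> sub = copyTile(pixels, imgW, x, y, w, h);
            Tile t = { 0, x, y, x + w, y + h };
            glGenTextures(1, &t.tex);
            glBindTexture(GL_TEXTURE_2D, t.tex);
            glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
            glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, w, h, 0,
                         GL_RGB, GL_UNSIGNED_BYTE, sub.data());
            tiles.push_back(t);
        }
    }
    return tiles;
}
```

Each frame you would then draw one quad per tile at its (x0,y0)-(x1,y1) position.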
I. To switch between shader effects, which way is better?
1. using one big shader program with a uniform and an if/else clause in the shader to select the effect.
2. switching programs between calls.
II. Is it better to use one big texture or several small textures? And does uploading a texture cost much? How about binding a texture?
Well, it would probably be best to write some perf tests and try it, but in general:
Small shaders are faster than big ones.
1 texture is faster than many textures.
uploading textures is slow
binding textures is fast
switching programs is slow but usually much faster than combining 2 small programs into 1 big program.
Fragment shaders in particular get executed millions of times a frame. A 1920x1080 display has 2 million pixels, so even if there were no overdraw your shader would still be executed 2 million times per frame. For anything executed 2 million times a frame, or 120 million times a second if you're targeting 60 frames per second, smaller is going to be better.
As for textures, mips are faster than no mips because the GPU has a cache for textures, and if the pixels it needs next are near the ones it previously read, they'll likely already be in the cache. If they are far away, they won't be. That also means randomly reading from a texture is particularly slow. But most apps read fairly linearly through a texture.
Switching programs is slow enough that sorting models by which program they use, so that you draw all models that use program A first and then all models that use program B, is generally faster than drawing them in a random order. But there are other things that affect performance too. For example, if a large model obscures a small model, it's better to draw the large model first, since the small model will then fail the depth test (z-buffer) and will not have its fragment shader executed for any pixels. So it's a trade-off. All you can really do is test your particular application.
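A rough sketch of that sort-by-program idea (Model and drawModel() are hypothetical; assumes a loader such as GLEW provides the GL 2.0+ entry points):

```cpp
#include <GL/glew.h>
#include <algorithm>
#include <vector>

struct Model { GLuint program; /* buffers, textures, uniforms ... */ };
void drawModel(const Model& m); // hypothetical: binds buffers and issues the draw call

void drawAll(std::vector<Model>& models) {
    // Group draw calls so each program is bound at most once per frame.
    std::sort(models.begin(), models.end(),
              [](const Model& a, const Model& b) { return a.program < b.program; });
    GLuint current = 0;
    for (const Model& m : models) {
        if (m.program != current) {
            glUseProgram(m.program); // the expensive switch happens once per group
            current = m.program;
        }
        drawModel(m);
    }
}
```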
Also, it's important to test in the correct way.
http://updates.html5rocks.com/2012/07/How-to-measure-browser-graphics-performance
I need to speed up some particle system eye candy I'm working on. The eye candy involves additive blending, accumulation, and trails and glow on the particles. At the moment I'm rendering by hand into a floating point image buffer, converting to unsigned chars at the last minute then uploading to an OpenGL texture. To simulate glow I'm rendering the same texture multiple times at different resolutions and different offsets. This is proving to be too slow, so I'm looking at changing something. The problem is, my dev hardware is an Intel GMA950, but the target machine has an Nvidia GeForce 8800, so it is difficult to profile OpenGL stuff at this stage.
I did some very unscientific profiling and found that most of the slow down is coming from dealing with the float image: scaling all the pixels by a constant to fade them out, and converting the float image to unsigned chars and uploading to the graphics hardware. So, I'm looking at the following options for optimization:
Replace floats with uint32s in a 16.16 fixed-point configuration
Optimize float operations using SSE2 assembly (image buffer is a 1024*768*3 array of floats)
Use OpenGL Accumulation Buffer instead of float array
Use OpenGL floating-point FBOs instead of float array
Use OpenGL pixel/vertex shaders
Have you any experience with any of these possibilities? Any thoughts, advice? Something else I haven't thought of?
The problem is simply the sheer amount of data you have to process.
Your float buffer is 9 megabytes in size, and you touch the data more than once. Most likely your rendering loop looks somewhat like this:
Clear the buffer
Render something on it (uses reads and writes)
Convert to unsigned bytes
Upload to OpenGL
That's a lot of data to move around, and the cache can't help you much because the image is much larger than the cache. Let's assume you touch every pixel five times; if so, you move 45 MB of data in and out of slow main memory. 45 MB doesn't sound like much, but consider that almost every memory access will be a cache miss. The CPU will spend most of its time waiting for the data to arrive.
If you want to stay on the CPU to do the rendering there's not much you can do. Some ideas:
Using SSE non-temporal loads and stores may help, but they will complicate your task quite a bit (you have to align your reads and writes).
Try breaking up your rendering into tiles, e.g. do everything on smaller rectangles (256*256 or so); see the sketch after this list. The idea is that you actually get a benefit from the cache: after you've cleared a rectangle, for example, that entire tile will be in the cache, so rendering it and converting it to bytes will be a lot faster because there is no need to fetch the data from relatively slow main memory anymore.
Last resort: reduce the resolution of your particle effect. This will give you good bang for the buck at the cost of visual quality.
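Here is a rough sketch of the tiling idea from the list above, assuming the 1024*768*3 float layout from the question: both passes (fade, then convert to bytes) run over one 256x256 tile before moving to the next, so the tile stays in cache between the passes.

```cpp
// Buffer layout as in the question: 1024*768 pixels, 3 floats each.
// 1024 and 768 are both multiples of 256, so no edge handling is needed here.
const int W = 1024, H = 768, CH = 3, TILE = 256;

void fadeAndConvertTiled(float* img, unsigned char* out, float fade) {
    for (int ty = 0; ty < H; ty += TILE) {
        for (int tx = 0; tx < W; tx += TILE) {
            // Pass 1, this tile only: scale every channel to fade the trails.
            for (int y = ty; y < ty + TILE; ++y)
                for (int i = (y * W + tx) * CH; i < (y * W + tx + TILE) * CH; ++i)
                    img[i] *= fade;
            // Pass 2 over the same, still-cached tile: convert to bytes.
            for (int y = ty; y < ty + TILE; ++y)
                for (int i = (y * W + tx) * CH; i < (y * W + tx + TILE) * CH; ++i) {
                    float v = img[i] * 255.0f;
                    out[i] = (unsigned char)(v < 0.0f ? 0.0f : v > 255.0f ? 255.0f : v);
                }
        }
    }
}
```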
The best solution is to move the rendering onto the graphics card. Render-to-texture functionality is standard these days. It's a bit tricky to get working with OpenGL because you have to decide which extension to use, but once you have it working, performance is not an issue anymore.
Btw - do you really need floating-point render targets? If you can get away with 3 bytes per pixel, you will see a nice performance improvement.
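For the render-to-texture route, a minimal FBO sketch (GL 3.0 core names; older drivers expose the same thing via EXT_framebuffer_object, and 8800-class hardware supports half-float targets):

```cpp
#include <GL/glew.h> // assumes an extension loader provides the FBO entry points

GLuint fbo = 0, colorTex = 0;

void initRenderTarget() {
    glGenTextures(1, &colorTex);
    glBindTexture(GL_TEXTURE_2D, colorTex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16F, 1024, 768, 0,
                 GL_RGBA, GL_FLOAT, nullptr); // half-float color target
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, colorTex, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}

// Per frame: bind fbo, draw the particles into it, unbind, then bind
// colorTex and composite it to the screen - no CPU float buffer, no upload.
```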
It's best to move the rendering calculation for massive particle systems like this over to the GPU, which has hardware optimized to do exactly this job as fast as possible.
Aaron is right: represent each individual particle with a sprite. You can calculate the movement of the sprites in space (e.g., accumulate their positions per frame) on the CPU using SSE2, but do all the additive blending and accumulation on the GPU via OpenGL. (Drawing sprites additively is easy enough.) You can handle your trails and blur either by doing it in shaders (the "pro" way), by rendering to an accumulation buffer and back, or simply by generating a bunch of additional sprites on the CPU to represent the trail and throwing them at the rasterizer.
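For the additive part, a minimal state-setup sketch (particleTex is an assumed small glow texture):

```cpp
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE); // additive: dst += src * alpha
glDepthMask(GL_FALSE);             // don't write depth, so overlapping particles all contribute
glBindTexture(GL_TEXTURE_2D, particleTex); // assumed: a small radial glow texture
// ... draw one quad per particle ...
glDepthMask(GL_TRUE);
glDisable(GL_BLEND);
```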
Try replacing the manual code with sprites: an OpenGL texture with an alpha of, say, 10%. Then draw lots of them on the screen (ten in the same place to get the full glow).
If you by "manual" mean that you are using the CPU to poke pixels, I think pretty much anything you can do where you draw textured polygons using OpenGL instead will represent a huge speedup.
So here's the situation:
I have a CALayer that is the size of my screen, and I'm setting its contents property to a 2 MB JPEG that's roughly 3500 x 2000 pixels at 240 ppi.
I'd expect some slight overhead from using the CALayer, but my sample application (which does only what's described above) shows usage of about 33 MB RSIZE, 22 MB RPVT and 30 MB RSHRD. I've noticed that these numbers are much better when running the application as a 64-bit process than as a 32-bit process.
I'm doing everything I can think of in the real application that this example comes from, including resampling my CGImageRefs to only be the size of the layer, but this seems extraneous to me - shouldn't it be simpler?
Has anyone come across good methods to reduce the amount of memory CALayers and CGImageRefs use?
First, you're going to run into problems with an image that size in a plain CALayer, because you may hit the texture size limit of 2048 x 2048 (depending on your graphics card). Applications like this are what CATiledLayer is designed for. Bill Dudney has some code examples on his blog (a large PDF), as well as in the code that accompanies his book.
It isn't surprising to me that such a large image would take so much memory, given that it will be stored as an uncompressed bitmap in your CGImage. Aside from scaling the image to the resolution you need and tiling it with CATiledLayer, I can't think of much. Are you releasing the CGImageRef once you've assigned it to the contents of the CALayer? You won't need to hang onto it at that point.