I had a discussion with a friend about two questions regarding the performance of the OpenGL Rendering Pipeline, and we would like to ask for help in determining who is right.
I argued that the throughput scales linearly with the amount of pixels involved, and therefore rendering a 4k scene should take 4 times as long as rendering a 1080p scene. Then we discovered this resolution-fps comparison video [see 1], and the scaling does not seem to be linear. Could someone explain why this is the case?
I argued that rendering a 1080p scene and rendering every 1/4 pixel in a 4k scene should have the same performance, as in both cases the same number of pixels is drawn [see 2]. My friend argued that this is not the case, as adjacent pixel calculations can be done with one instruction. Is he right? And if so, could someone explain how this works in practice?
[1] Video
[2] Illustration
I argued that the throughput scales linearly with the amount of pixels involved, and therefore rendering a 4k scene should take 4 times as long as rendering a 1080p scene. Then we discovered this resolution-fps comparison video [see 1], and the scaling does not seem to be linear. Could someone explain why this is the case?
Remember: rendering happens in a pipeline. And rendering can only happen at the speed of the slowest part of that pipeline. Which part that is depends entirely on what you're rendering.
If you're shoving 2M triangles per frame at a GPU, and the GPU can only render 60M triangles per second, the highest framerate you will ever see is 30FPS. Your performance is bottlenecked on the vertex processing pipeline; the resolution you render to is irrelevant to the number of triangles in the scene.
Similarly, if you're rendering 5 triangles per frame, it doesn't matter what your resolution is; your GPU can chew that up in microseconds and will be sitting around waiting for more. Your performance is bottlenecked on how much you're sending.
Performance only scales linearly with resolution if you're bottlenecked on the parts of the rendering pipeline that resolution actually affects: rasterization, fragment processing, blending, etc. If those aren't your bottleneck, there's no guarantee that your performance will be impacted by increasing the resolution.
And it should be noted that modern high-performance GPUs require being forced to render a lot of stuff before they'll be bottlenecked on the fragment pipeline.
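If you want to find out which end of the pipeline you're actually bottlenecked on, one crude but effective test is to time the GPU work while varying the resolution and the triangle count independently and see which variable moves the number. Here is a minimal sketch using a timer query; it assumes an OpenGL 3.3+ context is already current (loaded here via GLEW), and drawScene() is a placeholder for whatever you render:

    // Minimal sketch: time the GPU work for one frame with a timer query (OpenGL 3.3+).
    // Assumes a context is already current; drawScene() is a placeholder for your rendering.
    #include <GL/glew.h>
    #include <cstdio>

    void drawScene(); // defined elsewhere: renders at whatever resolution you are testing

    void timeOneFrame()
    {
        GLuint query;
        glGenQueries(1, &query);

        glBeginQuery(GL_TIME_ELAPSED, query);
        drawScene();
        glEndQuery(GL_TIME_ELAPSED);

        GLuint64 ns = 0;
        glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns); // waits for the GPU to finish
        std::printf("GPU time: %.3f ms\n", ns / 1e6);

        glDeleteQueries(1, &query);
    }

If the measured time barely changes when you quadruple the pixel count but jumps when you quadruple the triangle count, you're vertex-bound, and vice versa.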
I argued that rendering a 1080p scene and rendering every 1/4 pixel in a 4k scene should have the same performance, as in both cases the same number of pixels is drawn [see 2]. My friend argued that this is not the case, as adjacent pixel calculations can be done with one instruction. Is he right?
That depends entirely on how you manage to cause the system to "render every 1/4 pixel in a 4k scene". Rasterizers generally don't go around skipping pixels. So how do you intend to make the GPU pull off this feat? With a stencil buffer?
Personally, I can't imagine a way to pull this off without breaking SIMD, but I won't say it's impossible.
And if so, could someone explain how this works in practice?
You're talking about the very essence of Single-Instruction, Multiple Data (SIMD).
When you render a triangle, you execute a fragment shader on every fragment generated by the rasterizer. But you're executing the same fragment shader program on each of them. Each FS that operates on a fragment uses the same source code. They have the same "Single-Instructions".
The only difference between them is really the data they start with. Each fragment contains the interpolated per-vertex values provided by vertex processing. So they have "Multiple" sets of "Data".
So if they're all going to be executing the same instructions over different initial values... why bother executing them separately? Just execute them using SIMD techniques. Each opcode is executed on different sets of data. So you only have one hardware "execution unit", but that unit can process 4 (or more) fragments at once.
This execution model is basically why GPUs work.
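To make the "same instructions, different data" idea concrete, here is a toy CPU-side analogy using SSE intrinsics. This is not how you program a GPU, and the interpolated inputs are made-up values, but it shows one instruction doing the same arithmetic for four fragments at once:

    // Toy illustration of SIMD: one instruction stream, four fragments' data at once.
    // This is a CPU analogy, not actual GPU code; the input values are invented.
    #include <immintrin.h>
    #include <cstdio>

    int main()
    {
        // Interpolated per-fragment inputs: light intensity for four fragments.
        __m128 intensity = _mm_set_ps(0.9f, 0.5f, 0.25f, 0.1f);
        __m128 baseColor = _mm_set1_ps(0.8f); // same "uniform" value for all four

        // The same "shader" instructions run on all four fragments in lockstep:
        __m128 lit    = _mm_mul_ps(baseColor, intensity);     // one MULPS = 4 multiplies
        __m128 result = _mm_add_ps(lit, _mm_set1_ps(0.05f));  // one ADDPS = 4 adds (ambient)

        float out[4];
        _mm_storeu_ps(out, result);
        std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    }

Real GPUs do the same thing with much wider groups (typically 32 or 64 fragments per instruction), which is why skipping arbitrary individual pixels doesn't buy you anything.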
Related
I wanted to come up with a crude way to "benchmark" the performance improvement of a tweak I made to a fragment shader (to be specific, I wanted to test the performance impact of removing the pow-based gamma computation for the resulting color in the fragment shader).
So I figured that if a frame took 1 ms to render an opaque cube model using my shader, then with glDisable(GL_DEPTH_TEST) set and my render call looped 100 times, the frame should take 100 ms to render.
I was wrong. Rendering it 100 times only results in about a 10x slowdown. Obviously, if the depth test were still enabled, most if not all of the fragments in the second and subsequent draw calls would not be computed, because they would all fail the depth test.
However, I must still be experiencing a lot of fragment culling even with the depth test off.
My question is about whether my hardware (in this particular situation an iPad 3 on iOS 6.1, i.e. a PowerVR SGX543MP4) is just being incredibly smart and is actually able to use the geometry of later draw calls to occlude and discard fragments from earlier geometry. If that is not what's happening, then I cannot explain the better-than-expected performance I am seeing. The question applies to all flavors of OpenGL and to desktop GPUs as well, though.
Edit: I think an easy way to "get around" this optimization might be glEnable(GL_BLEND) or something of that sort. I will try this and report back.
PowerVR hardware is based on tile-based deferred rendering. It does not begin drawing fragments until after it receives all of the geometry information for a tile on screen. This is a more advanced hidden-surface removal technique than z-buffering, and what you have actually discovered here is that enabling alpha blending breaks the hardware's ability to exploit this.
Alpha blending is very order-dependent, and so no longer can rasterization and shading be deferred to the point where only the top-most geometry in a tile has to be drawn. Without alpha blending, since there is no data dependency on the order things are drawn in, completely obscured geometry can be skipped before expensive per-fragment operations occur. It is only when you start blending fragments that a true order-dependent situation arises and completely destroys the hardware's ability to defer/cull fragment processing for hidden surfaces.
In all honesty, if you are trying to optimize for a platform based on PowerVR hardware you should probably make this one of your goals. By that, I mean, before optimizing shaders first consider whether you are drawing things in an order and/or with states that hurt the PowerVR hardware's ability to do TBDR. As you have just discovered, blending is considerably more expensive on PowerVR hardware than other hardware... the operation itself is no more complicated, it just prevents PVR hardware from working the special way it was designed to.
I can confirm that only after adding both lines:
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA,GL_ONE_MINUS_SRC_ALPHA);
did the frame render time increase in a linear fashion in response to the repeated draw calls. Now back to my crude benchmarking.
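For reference, here is a sketch of the crude benchmark loop with the blending state that defeats the deferred hidden-surface removal; it assumes an OpenGL ES 2.0 context on iOS, and drawCube() stands in for the original draw call:

    // Sketch of the crude benchmark with blending enabled so the TBDR cannot cull repeats.
    // Assumes an OpenGL ES 2.0 context on iOS; drawCube() stands in for the original draw call.
    #include <OpenGLES/ES2/gl.h>   // on other platforms: <GLES2/gl2.h>

    void drawCube(); // binds the program/buffers and issues the draw call

    void benchmarkFragmentCost()
    {
        glDisable(GL_DEPTH_TEST);                           // no early depth rejection
        glEnable(GL_BLEND);                                 // order-dependent output...
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);  // ...so hidden fragments can't be deferred away

        for (int i = 0; i < 100; ++i)
            drawCube();  // each iteration now pays the full fragment-shading cost
    }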
I understand mipmapping pretty well. What I do not understand (on a hardware/driver level) is how mipmapping improves the performance of an application (at least this is often claimed). The driver does not know which mipmap level is going to be accessed until the fragment shader is executed, so all mipmap levels need to be present in VRAM anyway, or am I wrong?
What exactly is causing the performance improvement?
You are no doubt aware that each texel in the lower LODs of the mip-chain covers a higher percentage of the total texture image area, correct?
When you sample a texture at a distant location the hardware will use a lower LOD. When this happens, the sample neighborhood necessary to resolve minification becomes smaller, so fewer (uncached) fetches are necessary. It is all about the amount of memory that actually has to be fetched during texture sampling, and not the amount of memory occupied (assuming you are not running into texture thrashing).
I think this probably deserves a visual representation, so I will borrow the following diagram from the excellent series of tutorials at arcsynthesis.org.
On the left, you see what happens when you naïvely sample at a single LOD all of the time (this diagram shows linear minification filtering, by the way), and on the right you see what happens with mipmapping. Not only does it improve image quality by more closely matching the fragment's effective size, but because the lower mipmap LODs contain fewer texels, they can be cached much more efficiently.
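For completeness, here is a minimal sketch of how that mipmapped sampling is requested in OpenGL (3.0+); once a mipmapped minification filter is set, the hardware picks the LOD per fragment on its own. The image loading is assumed to happen elsewhere:

    // Minimal sketch: create a texture whose sampling uses the mip chain (OpenGL 3.0+).
    // 'pixels' is assumed to be a complete width*height RGBA8 image loaded elsewhere.
    #include <GL/glew.h>

    GLuint makeMipmappedTexture(int width, int height, const unsigned char* pixels)
    {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, pixels);
        glGenerateMipmap(GL_TEXTURE_2D); // build the lower LODs from level 0

        // Trilinear minification: the sampler now reads from the smaller levels when far away,
        // which is exactly what keeps the fetch neighborhood small and cache-friendly.
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        return tex;
    }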
Mipmaps are useful for at least two reasons:
visual quality - scenes look much better in the distance, with more blur (which usually looks better than flickering pixels). Additionally, anisotropic filtering can be used, which improves visual quality a lot.
performance: since for distant objects we can use a smaller texture, the whole operation should be faster; sometimes a whole mip level can fit in the texture cache. This is called cache coherency.
great discussion from arcsynthesis about performance
In general, mipmaps need only about 33% more memory, so they are quite a low cost for better quality and a potential performance gain (a quick check of that 33% figure is sketched after the link below). Note that the real performance improvement should be measured for your particular scene structure.
see info here: http://www.tomshardware.com/reviews/ati,819-2.html
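The roughly 33% figure is just the geometric series 1/4 + 1/16 + 1/64 + ... ≈ 1/3; here is a quick sketch that verifies it for an arbitrary power-of-two texture:

    // Quick check of the ~33% mip memory overhead: each level is 1/4 the size of the previous one.
    #include <cstdio>

    int main()
    {
        const int base = 4096;                 // any power-of-two edge length works
        long long baseTexels = (long long)base * base;
        long long mipTexels  = 0;
        for (int edge = base / 2; edge >= 1; edge /= 2)
            mipTexels += (long long)edge * edge;

        std::printf("overhead: %.2f%%\n", 100.0 * mipTexels / baseTexels); // ~33.33%
    }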
I. To switch a shader effect, which way is better?
1. Using one big shader program, with a uniform and an if/else clause in the shader to select between the different effects.
2. Switching programs between draw calls.
II. Is it better to use one big texture or several small textures? And does uploading a texture cost much? How about binding a texture?
Well, it would probably be best to write some perf tests and try it, but in general:
Small shaders are faster than big ones.
One texture is faster than many textures.
Uploading textures is slow.
Binding textures is fast.
Switching programs is slow, but usually much faster than combining 2 small programs into 1 big program.
Fragment shaders in particular get executed millions of times a frame. A 1920x1080 display has about 2 million pixels, so even if there were no overdraw your shader would still be executed 2 million times per frame. For anything executed 2 million times a frame, or 120 million times a second if you're targeting 60 frames per second, smaller is going to be better.
As for textures, mips are faster than no mips because the GPU has a cache for textures, and if the pixels it needs next are near the ones it previously read, they'll likely already be in the cache. If they are far away, they won't be in the cache. That also means randomly reading from a texture is particularly slow. But most apps read fairly linearly through a texture.
Switching programs is slow enough that sorting models by which program they use, so that you draw all models that use program A first and then all models that use program B, is generally faster than drawing them in a random order. But there are other things that affect performance too. For example, if a large model is obscuring a small model, it's better to draw the large model first, since the small model will then fail the depth test (z-buffer) and will not have its fragment shader executed for any pixels. So it's a trade-off. All you can really do is test your particular application.
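As a sketch of the sort-by-program idea, assuming a made-up DrawItem record for whatever your engine stores per model and a GL 3.x-style draw path:

    // Sketch: sort draw calls by program to reduce glUseProgram switches.
    // DrawItem and its fields are placeholders for whatever your engine stores.
    #include <GL/glew.h>
    #include <algorithm>
    #include <vector>

    struct DrawItem {
        GLuint  program;  // shader program used by this model
        GLuint  vao;      // geometry
        GLsizei count;    // index count
    };

    void drawSorted(std::vector<DrawItem>& items)
    {
        std::sort(items.begin(), items.end(),
                  [](const DrawItem& a, const DrawItem& b) { return a.program < b.program; });

        GLuint current = 0;
        for (const DrawItem& it : items) {
            if (it.program != current) {      // switch programs only when we have to
                glUseProgram(it.program);
                current = it.program;
            }
            glBindVertexArray(it.vao);
            glDrawElements(GL_TRIANGLES, it.count, GL_UNSIGNED_INT, nullptr);
        }
    }

In a real engine you would usually sort by a combined key (program, then texture, then depth), but the principle is the same.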
Also, it's important to test in the correct way.
http://updates.html5rocks.com/2012/07/How-to-measure-browser-graphics-performance
I need to speed up some particle system eye candy I'm working on. The eye candy involves additive blending, accumulation, and trails and glow on the particles. At the moment I'm rendering by hand into a floating point image buffer, converting to unsigned chars at the last minute then uploading to an OpenGL texture. To simulate glow I'm rendering the same texture multiple times at different resolutions and different offsets. This is proving to be too slow, so I'm looking at changing something. The problem is, my dev hardware is an Intel GMA950, but the target machine has an Nvidia GeForce 8800, so it is difficult to profile OpenGL stuff at this stage.
I did some very unscientific profiling and found that most of the slow down is coming from dealing with the float image: scaling all the pixels by a constant to fade them out, and converting the float image to unsigned chars and uploading to the graphics hardware. So, I'm looking at the following options for optimization:
Replace floats with uint32's in a fixed point 16.16 configuration
Optimize float operations using SSE2 assembly (image buffer is a 1024*768*3 array of floats)
Use OpenGL Accumulation Buffer instead of float array
Use OpenGL floating-point FBO's instead of float array
Use OpenGL pixel/vertex shaders
Have you any experience with any of these possibilities? Any thoughts, advice? Something else I haven't thought of?
The problem is simply the sheer amount of data you have to process.
Your float buffer is 9 megabytes in size, and you touch the data more than once. Most likely your rendering loop looks somewhat like this:
Clear the buffer
Render something on it (uses reads and writes)
Convert to unsigned bytes
Upload to OpenGL
That's a lot of data that you move around, and the cache can't help you much because the image is much larger than your cache. Let's assume you touch every pixel five times; if so, you move 45 MB of data in and out of the slow main memory. 45 MB does not sound like much data, but consider that almost every memory access will be a cache miss. The CPU will spend most of the time waiting for the data to arrive.
If you want to stay on the CPU to do the rendering there's not much you can do. Some ideas:
Using SSE for non-temporal loads and stores may help, but it will complicate your task quite a bit (you have to align your reads and writes).
Try breaking up your rendering into tiles, e.g. do everything on smaller rectangles (256*256 or so); see the tile-loop sketch after this list. The idea behind this is that you actually get a real benefit from the cache: after you've cleared a tile, for example, it is entirely in the cache, and rendering it and converting it to bytes will be a lot faster because there is no need to fetch the data from the relatively slow main memory anymore.
Last resort: Reduce the resolution of your particle effect. This will give you a good bang for the buck at the cost of visual quality.
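Here is the tile-loop sketch referred to above; it only shows the fade pass and the float-to-byte conversion, with the 256x256 tile size as a guess you would have to tune, but the same structure applies to the other passes:

    // Do the fade pass and the float->byte conversion tile by tile so the second pass
    // finds the data still in cache. Buffer layout: 1024*768 pixels, 3 floats each, row-major.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    const int W = 1024, H = 768, TILE = 256;

    void fadeAndConvertTiled(std::vector<float>& img, std::vector<uint8_t>& out, float fade)
    {
        for (int ty = 0; ty < H; ty += TILE)
            for (int tx = 0; tx < W; tx += TILE)
            {
                int tw = std::min(TILE, W - tx), th = std::min(TILE, H - ty);

                // Pass 1: fade this tile (reads and writes floats).
                for (int y = ty; y < ty + th; ++y) {
                    int i = (y * W + tx) * 3, end = i + tw * 3;
                    for (; i < end; ++i) img[i] *= fade;
                }

                // Pass 2: convert the same tile to bytes while it is still cached.
                for (int y = ty; y < ty + th; ++y) {
                    int i = (y * W + tx) * 3, end = i + tw * 3;
                    for (; i < end; ++i) out[i] = (uint8_t)std::min(255.0f, img[i] * 255.0f);
                }
            }
    }

    int main()
    {
        std::vector<float> img(W * H * 3, 0.5f);
        std::vector<uint8_t> out(W * H * 3);
        fadeAndConvertTiled(img, out, 0.95f);
    }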
The best solution is to move the rendering onto the graphic card. Render to texture functionality is standard these days. It's a bit tricky to get it working with OpenGL because you have to decide which extension to use, but once you have it working the performance is not an issue anymore.
Btw - do you really need floating-point render targets? If you can get away with 3 bytes per pixel, you will see a nice performance improvement.
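As a starting point, here is a minimal render-to-texture setup. It uses the core OpenGL 3.0 framebuffer-object entry points (the EXT/ARB framebuffer_object extensions look nearly identical), and error checking is omitted:

    // Minimal render-to-texture setup (core since OpenGL 3.0). Error handling omitted.
    #include <GL/glew.h>

    GLuint colorTex, fbo;

    void createRenderTarget(int width, int height)
    {
        glGenTextures(1, &colorTex);
        glBindTexture(GL_TEXTURE_2D, colorTex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, nullptr);   // GL_RGBA16F if you really need float
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                               GL_TEXTURE_2D, colorTex, 0);

        // Draw calls issued while this FBO is bound render into colorTex,
        // which can then be sampled when compositing the glow passes.
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }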
It's best to move the rendering calculation for massive particle systems like this over to the GPU, which has hardware optimized to do exactly this job as fast as possible.
Aaron is right: represent each individual particle with a sprite. You can calculate the movement of the sprites in space (e.g., accumulate their positions per frame) on the CPU using SSE2, but do all the additive blending and accumulation on the GPU via OpenGL. (Drawing sprites additively is easy enough.) You can handle your trails and blur either by doing it in shaders (the "pro" way), rendering to an accumulation buffer and back, or simply generating a bunch of additional sprites on the CPU representing the trail and throwing them at the rasterizer.
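The blend state for drawing such sprites additively is just a couple of calls; a minimal sketch, with drawSprites() standing in for however you submit the textured quads:

    // Minimal additive-blending state for the glow/trail sprites (desktop GL or ES).
    // drawSprites() stands in for drawing the textured particle quads.
    #include <GL/glew.h>

    void drawSprites();

    void drawGlowPass()
    {
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE);  // additive: overlapping sprites brighten the result
        glDepthMask(GL_FALSE);              // particles usually should not write depth
        drawSprites();
        glDepthMask(GL_TRUE);
        glDisable(GL_BLEND);
    }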
Try to replace the manual code with sprites: An OpenGL texture with an alpha of, say, 10%. Then draw lots of them on the screen (ten of them in the same place to get the full glow).
If you by "manual" mean that you are using the CPU to poke pixels, I think pretty much anything you can do where you draw textured polygons using OpenGL instead will represent a huge speedup.
Since blending is hurting the performance of our game, we tried several strategies for creating the "illusion" of blending. One of them is drawing a sprite only every odd frame, so that the sprite is visible half of the time. The effect is quite good. (You'd need a decent frame rate, by the way, or the sprite would flicker noticeably.)
Despite that, I would like to know if there are any good insights out there on avoiding blending in order to improve the overall performance without compromising the visual experience (too much).
Is it the actual blending that's killing your performance? (i.e. video memory bandwidth)
What games commonly do these days to handle lots of alpha-blended stuff (think large explosions that cover the whole screen): render it into a smaller texture (e.g. 2x2 or 4x4 times smaller than the screen), and composite that back onto the main screen.
Doing that might require rendering depth buffer of opaque surfaces into that smaller texture as well, to properly handle intersections with opaque geometry. On some platforms (consoles) doing multisampling or depth buffer hackery might make that a very cheap operation; no such luck on regular PC though.
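A rough sketch of that idea follows (ignoring the depth-buffer handling, which the articles below cover); the quarter-resolution render target is assumed to have been created beforehand, and drawParticles()/compositeToScreen() are placeholders for your own code:

    // Rough sketch of the low-resolution off-screen particle pass (OpenGL 3.0+).
    // particleFBO/particleTex are assumed to be a quarter-resolution render target
    // created beforehand; drawParticles() and compositeToScreen() are placeholders.
    #include <GL/glew.h>

    extern GLuint particleFBO, particleTex;
    extern int screenW, screenH;

    void drawParticles();                   // all the expensive alpha-blended particles
    void compositeToScreen(GLuint texture); // fullscreen quad, blended over the main image

    void renderParticlesLowRes()
    {
        glBindFramebuffer(GL_FRAMEBUFFER, particleFBO);  // quarter-res target: 16x fewer pixels to fill
        glViewport(0, 0, screenW / 4, screenH / 4);
        glClearColor(0, 0, 0, 0);
        glClear(GL_COLOR_BUFFER_BIT);
        drawParticles();

        glBindFramebuffer(GL_FRAMEBUFFER, 0);            // back to the main framebuffer
        glViewport(0, 0, screenW, screenH);
        compositeToScreen(particleTex);                  // upscaled composite on top of the scene
    }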
See the article from GPU Gems 3 for an example: High-Speed, Off-Screen Particles. Christer Ericson's blog post also gives an overview of a lot of optimization approaches: Optimizing the rendering of a particle system.
Excellent article here about rendering particle systems quickly. It covers the smaller off-screen buffer technique and suggests quite a few other approaches.
You can read it here
It is not quite clear from your question what kind of use of blending hits your game's performance. Generally, blending itself is blazingly fast. If your problems are particle-system related, then what is most likely killing your framerate is the number and size of the particles drawn. In particular, lots of close-up (and therefore large) particles will require high memory bandwidth and fill rate from the graphics card. I have implemented a particle system myself, and while I can render tons of particles in the distance, I feel the negative impact of e.g. flying through smoke (which fills the entire screen because the viewer is in the middle of it) very much on weaker hardware.