GPU Context Switch - gpgpu

I have a program which first renders to a texture, then passes that texture to a compute shader for processing, and then renders the result to the screen via a textured full-screen quad.
I've read in NVIDIA's programming guide for compute shaders that every compute shader dispatch triggers a GPU device context switch, and that this should not be done too often.
I'm very confused right now. The way I see it, the GPU switches contexts twice in my rendering pipeline, right? Once during the dispatch call, and again when I render my full-screen quad normally.
If this is correct, then I can avoid one switch by reorganizing my code like this: first, render to a texture; second, do the processing in the compute shader; then, IN THE NEXT FRAME, render the result, then (still in the next frame) render all updates to the texture, do the processing in the compute shader, and so on. So basically at the start of every frame I render the results of the previous frame (the first frame being an exception). Then there will only be one context switch per frame, right?
But then the GPU will still have to do a context switch between frames, right? So both versions of my rendering pipeline have two context switches, and there would be no difference in performance. Am I correct?
Any help would be appreciated.

A context switch introduces a small hit, but in your case it would be pretty negligible, so you can safely switch between the compute and render pipelines several times in the same frame without worrying about it.
In a lot of modern games there are more than two switches in the same frame (the graphics pipeline for rendering, compute shaders for lighting, pixel shaders for FXAA...) and they still run fine.
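For illustration, here is a minimal sketch of the pipeline from the question with both switches happening inside one frame, assuming OpenGL 4.3 compute shaders; sceneFBO, sceneTex, outputTex, computeProg, quadProg, quadVAO, width and height are placeholder names for objects created elsewhere:
// 1. Render the scene into a texture.
glBindFramebuffer(GL_FRAMEBUFFER, sceneFBO);
glViewport(0, 0, width, height);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
// ... draw the scene ...

// 2. Process the texture in a compute shader (first pipeline switch).
glUseProgram(computeProg);
glBindImageTexture(0, sceneTex,  0, GL_FALSE, 0, GL_READ_ONLY,  GL_RGBA8);
glBindImageTexture(1, outputTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute((width + 15) / 16, (height + 15) / 16, 1); // assumes a 16x16 local workgroup
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT); // make the image writes visible to texture sampling

// 3. Draw the result with a fullscreen quad (second pipeline switch).
glBindFramebuffer(GL_FRAMEBUFFER, 0);
glUseProgram(quadProg);
glBindTexture(GL_TEXTURE_2D, outputTex);
glBindVertexArray(quadVAO);
glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);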

Related

Minimum steps to implement depth-only pass

I have an existing OpenGL ES 3.1 application that renders a scene to an FBO with a color and a depth/stencil attachment. It uses the usual methods for drawing (glBindBuffer, glDrawArrays, glBlend*, glStencil*, etc.). My task is now to create a depth-only pass that fills the depth attachment with the same values as the main pass.
My question is: what is the minimum number of steps necessary to achieve this while avoiding superfluous GPU work (unnecessary shader invocations, etc.)? Is deactivating the color attachment enough, or do I also have to set null shaders, disable blending, and so on?
I assume you need this before the main pass runs, otherwise you would just keep the main pass depth.
Preflight
1. Create specialized buffers which contain only the mesh data needed to compute position (deinterleaved from all non-position data).
2. Create specialized vertex shaders which compute only the output position.
3. Link programs with the simplest valid fragment shader.
Rendering
1. Render the depth-only pass using the specialized buffers and shaders, masking out all color writes (see the sketch after this list).
2. Render the main pass with the full buffers and shaders.
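A rough sketch of those two rendering steps, assuming OpenGL ES 3.1; sceneFBO, depthOnlyProg, mainProg, positionOnlyVAO and fullVAO are placeholder names, and which state you need to touch depends on what your main pass enables:
// Step 1: depth-only pass. Mask out all color writes, keep depth writes on.
glBindFramebuffer(GL_FRAMEBUFFER, sceneFBO);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glEnable(GL_DEPTH_TEST);
glDepthMask(GL_TRUE);                 // must be enabled before the clear as well
glDepthFunc(GL_LESS);
glClear(GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
glUseProgram(depthOnlyProg);          // position-only vertex shader + trivial fragment shader
glBindVertexArray(positionOnlyVAO);
// ... glDrawArrays / glDrawElements for each opaque mesh ...

// Step 2: main pass. Re-enable color writes; GL_LEQUAL lets fragments that match the
// depth laid down in step 1 pass the test, so only visible surfaces get shaded.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);                // the depth attachment already holds the final values
glDepthFunc(GL_LEQUAL);
glUseProgram(mainProg);
glBindVertexArray(fullVAO);
// ... draw the scene again with the full buffers and shaders ...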
Options
At step (2) above it might be beneficial to load the depth-only pass depth results as the starting depth for the main pass. This will give you better early-zs test accuracy, at the expense of the readback of the depth value. Most mobile GPUs have hidden surface removal, so this isn't always going to be a net gain - it depends on your content, target GPU, and how good your front-to-back draw order is.
You probably want to use the specialized buffers (position data interleaved in one buffer region, non-position data interleaved in a second) for the main draw as well, as many GPUs will optimize out the non-position calculations if the primitive is culled; a possible buffer layout is sketched at the end of this answer.
The specialized buffers and optimized shaders can also be used for shadow mapping, and other such depth-only techniques.
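To make the buffer split concrete, here is one possible layout, assuming a hypothetical vertex format of position + normal + UV; positions, normalsAndUVs and vertexCount stand in for your own mesh data, and the VAO names match the sketch above:
GLuint posBuf, attrBuf, positionOnlyVAO, fullVAO;
glGenBuffers(1, &posBuf);
glGenBuffers(1, &attrBuf);
glGenVertexArrays(1, &positionOnlyVAO);
glGenVertexArrays(1, &fullVAO);

// Positions alone in one buffer: 3 floats per vertex.
glBindBuffer(GL_ARRAY_BUFFER, posBuf);
glBufferData(GL_ARRAY_BUFFER, vertexCount * 3 * sizeof(float), positions, GL_STATIC_DRAW);

// All non-position attributes interleaved in a second buffer: normal (3) + UV (2).
glBindBuffer(GL_ARRAY_BUFFER, attrBuf);
glBufferData(GL_ARRAY_BUFFER, vertexCount * 5 * sizeof(float), normalsAndUVs, GL_STATIC_DRAW);

// Depth-only VAO: attribute 0 (position) only.
glBindVertexArray(positionOnlyVAO);
glBindBuffer(GL_ARRAY_BUFFER, posBuf);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
glEnableVertexAttribArray(0);

// Main VAO: position from posBuf, normal and UV from attrBuf.
glBindVertexArray(fullVAO);
glBindBuffer(GL_ARRAY_BUFFER, posBuf);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
glEnableVertexAttribArray(0);
glBindBuffer(GL_ARRAY_BUFFER, attrBuf);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 5 * sizeof(float), (void*)0);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 5 * sizeof(float), (void*)(3 * sizeof(float)));
glEnableVertexAttribArray(1);
glEnableVertexAttribArray(2);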

Rendering Pipeline -- Performance -- Scaling with regard to the number of pixels

I had a discussion with a friend about two questions regarding the performance of the OpenGL Rendering Pipeline, and we would like to ask for help in determining who is right.
I argued that throughput scales linearly with the number of pixels involved, and that rendering a 4K scene should therefore take four times as long as rendering a 1080p scene. Then we discovered this resolution-FPS comparison video [see 1], and the scaling does not seem to be linear. Could someone explain why this is the case?
I argued that rendering a 1080p scene and rendering only every fourth pixel of a 4K scene should have the same performance, as in both cases the same number of pixels is drawn [see 2]. My friend argued that this is not the case, as adjacent pixel calculations can be done with one instruction. Is he right? And if so, could someone explain how this works in practice?
[1] Video: the resolution-FPS comparison mentioned above (link not preserved).
[2] Illustration: diagram of the every-fourth-pixel argument (image not preserved).
I argued that throughput scales linearly with the number of pixels involved, and that rendering a 4K scene should therefore take four times as long as rendering a 1080p scene. Then we discovered this resolution-FPS comparison video [see 1], and the scaling does not seem to be linear. Could someone explain why this is the case?
Remember: rendering happens in a pipeline. And rendering can only happen at the speed of the slowest part of that pipeline. Which part that is depends entirely on what you're rendering.
If you're shoving 2M triangles per frame at a GPU, and the GPU can only render 60M triangles per second, the highest framerate you will ever see is 30FPS. Your performance is bottlenecked on the vertex processing pipeline; the resolution you render to is irrelevant to the number of triangles in the scene.
Similarly, if you're rendering 5 triangles per frame, it doesn't matter what your resolution is; your GPU can chew that up in microseconds and will be sitting around waiting for more. Your performance is bottlenecked on how much work you're sending.
Performance only scales linearly with resolution if you're bottlenecked on the parts of the rendering pipeline that resolution actually affects: rasterization, fragment processing, blending, etc. If those aren't your bottleneck, there's no guarantee that your performance will be impacted by increasing the resolution.
And it should be noted that modern high-performance GPUs need to be given a lot of work before they become bottlenecked on the fragment pipeline.
I argued that rendering a 1080p scene and rendering only every fourth pixel of a 4K scene should have the same performance, as in both cases the same number of pixels is drawn [see 2]. My friend argued that this is not the case, as adjacent pixel calculations can be done with one instruction. Is he right?
That depends entirely on how you manage to cause the system to "render only every fourth pixel of a 4K scene". Rasterizers generally don't go around skipping pixels. So how do you intend to make the GPU pull off this feat? With a stencil buffer?
Personally, I can't imagine a way to pull this off without breaking SIMD, but I won't say it's impossible.
And if so, could someone explain how this works in practice?
You're talking about the very essence of Single-Instruction, Multiple Data (SIMD).
When you render a triangle, you execute a fragment shader on every fragment generated by the rasterizer. But you're executing the same fragment shader program on each of them. Each FS that operates on a fragment uses the same source code. They have the same "Single-Instructions".
The only difference between them is really the data they start with. Each fragment contains the interpolated per-vertex values provided by vertex processing. So they have "Multiple" sets of "Data".
So if they're all going to be executing the same instructions over different initial values... why bother executing them separately? Just execute them using SIMD techniques. Each opcode is executed on different sets of data. So you only have one hardware "execution unit", but that unit can process 4 (or more) fragments at once.
This execution model is basically why GPUs work.
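As a toy CPU-side illustration of the idea (not real GPU code), here one "shading" routine processes a group of four fragments at once; only the per-fragment input data differs:
struct FragmentGroup4 {           // one SIMD group of 4 fragments
    float r[4], g[4], b[4];       // interpolated per-fragment inputs
};

void shadeGroup(FragmentGroup4& f, float brightness) {
    for (int lane = 0; lane < 4; ++lane) {   // conceptually ONE vector multiply,
        f.r[lane] *= brightness;             // applied to all four lanes at once;
        f.g[lane] *= brightness;             // on real hardware there is no loop,
        f.b[lane] *= brightness;             // the lanes execute in lockstep
    }
}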

OpenGL ES: Is it more efficient to use glClear or use glDrawArrays with primitives that render over used frames?

For example, if I have several figures rendered over a black background, is it more efficient to call glClear(GL_COLOR_BUFFER_BIT) each frame, or to render black triangles over the artifacts from the previous frame? I would think that rendering black triangles over the artifacts would be faster, since fewer pixels need to be changed than when clearing the entire screen. However, I have also read online that some drivers and graphics hardware perform optimizations when using glClear(GL_COLOR_BUFFER_BIT) that cannot occur when rendering black triangles instead.
A couple of things you neglected will probably change how you look at this:
Painting over the top with triangles doesn't play well with depth testing. So you have to disable testing first.
Painting over the top with triangles has to be strictly ordered with respect to the previous result, while glClear breaks data dependencies, allowing out-of-order execution to start drawing the next frame as long as there's enough memory for both framebuffers to exist independently.
Double-buffering, used to avoid tearing, involves multiple framebuffers anyway. So to use the result of the prior frame as the starting point for the new one, not only do you have to wait for the prior frame to finish rendering, you also have to copy its framebuffer into the buffer for the new frame, which requires touching every pixel.
In a single-buffering scenario, drawing over the top of the previous frame requires not only finishing the rendering, but also finishing streaming it to the display device, which leaves you only the blanking interval to render the entire new frame. That will almost certainly prevent you from reaching the maximum frame rate.
Generally, clearing touches every pixel just as copying does, but it doesn't require flushing the pipeline. So it's much better to start a frame with glClear.
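So the pattern this answer favors is simply to clear first and then draw; a minimal sketch, assuming a typical EGL-based setup (display and surface are placeholders):
glClearColor(0.0f, 0.0f, 0.0f, 1.0f);                 // the black background from the question
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);   // lets the driver discard the old contents
// ... draw this frame's figures ...
eglSwapBuffers(display, surface);                     // or the platform's swap call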

OpenGL (ES): Can an implementation optimize fragments resulting from overdraw?

I wanted to come up with a crude way to "benchmark" the performance improvement of a tweak I made to a fragment shader (specifically, I wanted to test the performance impact of removing a pow-based gamma computation for the resulting color in the fragment shader).
So I figured that if a frame took 1 ms to render an opaque cube model using my shader, then with glDisable(GL_DEPTH_TEST) set and my render call looped 100 times, the frame would take 100 ms to render.
I was wrong. Rendering it 100 times only results in about a 10x slowdown. Obviously, if the depth test were still enabled, most if not all of the fragments in the second and subsequent draw calls would not be computed, because they would all fail the depth test.
However, I must still be getting a lot of fragment culling even with the depth test off.
My question is whether my hardware (in this particular case an iPad 3 on iOS 6.1, i.e. a PowerVR SGX543MP4) is just being incredibly smart and is actually able to use the geometry of later draw calls to occlude and discard fragments from the earlier geometry. If this is not what's happening, then I cannot explain the better-than-expected performance I am seeing. The question applies to all flavors of OpenGL and desktop GPUs as well, though.
Edit: I think an easy way to "get around" this optimization might be glEnable(GL_BLEND) or something of that sort. I will try this and report back.
PowerVR hardware is based on tile-based deferred rendering. It does not begin drawing fragments until after it receives all of the geometry information for a tile on screen. This is a more advanced hidden-surface removal technique than z-buffering, and what you have actually discovered here is that enabling alpha blending breaks the hardware's ability to exploit this.
Alpha blending is very order-dependent, and so no longer can rasterization and shading be deferred to the point where only the top-most geometry in a tile has to be drawn. Without alpha blending, since there is no data dependency on the order things are drawn in, completely obscured geometry can be skipped before expensive per-fragment operations occur. It is only when you start blending fragments that a true order-dependent situation arises and completely destroys the hardware's ability to defer/cull fragment processing for hidden surfaces.
In all honesty, if you are trying to optimize for a platform based on PowerVR hardware you should probably make this one of your goals. By that, I mean, before optimizing shaders first consider whether you are drawing things in an order and/or with states that hurt the PowerVR hardware's ability to do TBDR. As you have just discovered, blending is considerably more expensive on PowerVR hardware than other hardware... the operation itself is no more complicated, it just prevents PVR hardware from working the special way it was designed to.
I can confirm that only after adding both lines:
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
did the frame render time increase in a linear fashion in response to the repeated draw calls. Now back to my crude benchmarking.
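For reference, the crude benchmark loop described above amounts to something like this; drawCube() stands in for the original draw call:
glDisable(GL_DEPTH_TEST);
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
for (int i = 0; i < 100; ++i) {
    // With blending on, the TBDR hardware can no longer discard hidden fragments,
    // so each overlapping draw now pays the full fragment-shading cost.
    drawCube();
}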

glTexSubImage2D extremely slow on Intel video card

My video card is a Mobile Intel 4 Series. I'm updating a texture with changing data every frame; here's my main loop:
for (;;) {
    Timer timer;
    glBindTexture(GL_TEXTURE_2D, tex);
    glBegin(GL_QUADS); /* ... draw textured quad ... */ glEnd();
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);
    swapBuffers();
    cout << timer.Elapsed();
}
Every iteration takes 120ms. However, inserting glFlush before glTexSubImage2D brings the iteration time to 2ms.
The issue is not in the pixel format. I've tried the pixel formats BGRA, RGBA and ABGR_EXT together with the pixel types UNSIGNED_BYTE, BYTE, UNSIGNED_INT_8_8_8_8 and UNSIGNED_INT_8_8_8_8_EXT. The texture's internal pixel format is RGBA.
The order of calls matters. Moving the texture upload before the quad drawing, for example, fixes the slowness.
I also tried this on a GeForce GT 420M card, and it works fast there. My real app does have performance problems on non-Intel cards that are fixed by glFlush calls, but I haven't distilled those into a test case yet.
Any ideas on how to debug this?
One issue is that glTexImage2D performs a full reinitialization of the texture object. If only the data changes, but the format remains the same, use glTexSubImage2D to speed things up (just a reminder).
The other issue is that, despite its name, immediate mode, i.e. glBegin(…) … glEnd(), is not synchronous: the drawing calls return long before the GPU is done drawing. Adding a glFinish() will synchronize, but so will any call that modifies data still required by queued operations. So in your case glTexImage2D (and glTexSubImage2D) must wait for the drawing to finish.
Usually it's best to do all volatile resource uploads either at the beginning of the drawing function, or during the SwapBuffers block in a separate thread through buffer objects. Buffer objects were introduced for exactly that reason: to allow asynchronous yet tight operation.
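As a sketch of the buffer-object route mentioned above: if the driver exposes pixel buffer objects, glTexSubImage2D can read from driver-owned memory instead of your pointer, so the call can return before the copy completes. pbo is assumed to be created once with glGenBuffers; the buffer-orphaning pattern shown is one common approach, not the only one:
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, 512 * 512 * 4, NULL, GL_STREAM_DRAW);  // orphan old storage
void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(dst, data, 512 * 512 * 4);                    // needs <cstring>
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

glBindTexture(GL_TEXTURE_2D, tex);
// With a PBO bound, the last argument is an offset into the buffer, not a client pointer.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, (void*)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);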
I assume you're actually using that texture for one or more of your quads?
Uploading textures is one of the most expensive operations possible. Since your texture data changes every frame, the upload is unavoidable, but you should try to do it when the texture isn't in use by shaders. Remember that glBegin(GL_QUADS); ... glEnd(); doesn't actually draw quads; it requests that the GPU render the quads. Until the rendering completes, the texture will be locked. Depending on the implementation, this might cause the texture upload to wait (a la glFlush), but it could also cause the upload to fail, in which case you've wasted megabytes of PCIe bandwidth and the driver has to retry.
It sounds like you already have a solution: upload all new textures at the beginning of the frame. So what's your question?
NOTE: Intel integrated graphics are horribly slow anyway.
When you make a draw call (glDrawElements, etc.), the driver simply adds the call to a command buffer and lets the GPU consume these commands when it can.
If this buffer had to be consumed entirely at glSwapBuffers, this would mean that the GPU would be idle after that, waiting for you to send new commands.
Drivers solve this by letting the GPU lag one frame behind. This is the first reason why glTexSubImage2D blocks: the driver waits until the GPU is no longer using the texture (in the previous frame) before beginning the transfer, so that you never get half-updated data.
The other reason is that glTexSubImage2D is synchronous: it also blocks for the duration of the transfer.
You can solve the first issue by keeping two textures: one for the current frame, one for the previous frame. Upload into the former, but draw with the latter.
You can solve the second issue by using a Pixel Buffer Object (a buffer object bound to GL_PIXEL_UNPACK_BUFFER), which allows asynchronous transfers.
In your case, I suspect that calling glTexSubImage2D just before SwapBuffers adds an extra synchronization in the driver, whereas drawing the quad just before SwapBuffers simply appends the command to the buffer. 120 ms is probably a driver bug, though: even an Intel GMA doesn't need 120 ms to upload a 512x512 texture.
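A minimal sketch of the two-texture scheme from the first point, built on the question's own loop; tex[0] and tex[1] are assumed to be two identically sized textures created at startup, and frameNumber is a placeholder counter:
int writeIdx = frameNumber & 1;   // texture receiving this frame's new data
int drawIdx  = writeIdx ^ 1;      // texture that was uploaded during the previous frame

glBindTexture(GL_TEXTURE_2D, tex[writeIdx]);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);

glBindTexture(GL_TEXTURE_2D, tex[drawIdx]);
glBegin(GL_QUADS); /* ... textured quad ... */ glEnd();
swapBuffers();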
