Let's say I have four content layers: A, B, C and D; each one represents one type of visual content.
Each layer does several sequential render calls (there are no interleaved render calls from the various layers).
Also, layers B and D need to be rendered to textures in order to apply visual effects. To reduce the memory footprint, I use a single FBO with a single texture.
So, at the moment I do:
Render A content;
Bind FBO > Render B content > Unbind FBO > Render texture (B content);
Render C;
Bind FBO > Render D content > Unbind FBO > Render texture (D content).
My main problem with this approach is that every time I bind/unbind the FBO, the default framebuffer is saved/restored to/from memory.
I cannot simply draw layers B and D to the FBO first, since I cannot change the rendering order of the layers.
Is there any better way to do this and avoid many saves/restores of the main framebuffer? Keep in mind that this is an example and the real case is more complex (more layers).
Being able to switch between render targets quickly is the primary original purpose of the FBO feature. It is generally faster than the older pbuffer approach because it does not have to deal with changing rendering contexts. Also, FBOs are not as dependent on EGL to allocate the rendering surfaces as pbuffers are. If the Adreno 200 cannot switch FBOs quickly, then that is an implementation problem specific to Adreno.
The driver will bring the contents of the FBO back from memory (a costly operation) unless you clear the FBO with glClear() before drawing. Clearing the FBO hints the driver to discard the current contents instead of restoring them from main memory. If your FBO has a depth buffer, make sure you include the depth bit as well when calling glClear().
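For illustration, a minimal sketch of the bind/clear/draw sequence (the FBO handle and the layer-drawing call are hypothetical names, not from the original post):

```cpp
// Hypothetical FBO handle, created elsewhere with color (+ depth) attachments.
glBindFramebuffer(GL_FRAMEBUFFER, fbo);

// Clearing all attachments right after binding hints the driver that the old
// contents are dead, so it can skip restoring them from main memory.
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

renderLayerB();  // hypothetical: issue the layer's draw calls

glBindFramebuffer(GL_FRAMEBUFFER, 0);  // back to the default framebuffer
```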
Since you are only using one FBO, you may not be able to get around it. Check whether your memory requirements are really so strict that you cannot afford a second FBO; with two, you could render B to the first FBO, D to the second, and then all your content layers together to the screen.
In my app I render my Direct3D content to the window, as recommended, using the swap chain in flip sequential mode (IDXGISwapChain1, DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL), in a window without a redirection bitmap (WS_EX_NOREDIRECTIONBITMAP).
However, for the purpose of smooth window resizing (without lag and flicker), I want to intercept live-resize enter/exit events and temporarily switch the rendering to an ID2D1HwndRenderTarget on the window's redirection surface, which seems to offer the smoothest possible resize.
The question is - how do I render the Direct3D content to an ID2D1HwndRenderTarget?
The problem is that the Direct3D rendering and the ID2D1HwndRenderTarget obviously belong to different devices, and I can't seem to find a way to make them interoperate.
What ideas come to mind:
A. Somehow assign ID2D1HwndRenderTarget to be the output frame buffer for 3D rendering. This would be the best case scenario, because it would minimize buffer copying.
1. Somehow obtain an ID3D11Texture2D or at least an IDXGISurface from the ID2D1HwndRenderTarget
2. Create a render target view with ID3D11Device::CreateRenderTargetView from the DXGI surface / Direct3D texture
3. Set the render target with ID3D11DeviceContext::OMSetRenderTargets
4. Render directly to the ID2D1HwndRenderTarget
The problem is in step 1. The ID2D1HwndRenderTarget interface and its base ID2D1RenderTarget are pretty sparse and don't seem to provide this capability. Is there a way to obtain a DXGI or D3D surface from the ID2D1HwndRenderTarget?
B. Create an off-screen Direct3D frame buffer texture, render the 3D content there and then copy it to the window render target.
1. Create an off-screen texture ID3D11Texture2D (ID3D11Device::CreateTexture2D)
2. Create an ID2D1Device using D2D1CreateDevice (from the same IDXGIDevice as my ID3D11Device)
3. Create an ID2D1DeviceContext using ID2D1Device::CreateDeviceContext
4. Create an ID2D1Bitmap using ID2D1DeviceContext::CreateBitmapFromDxgiSurface
5. Draw the bitmap using ID2D1HwndRenderTarget::DrawBitmap
Unfortunately, I get the error message "An operation failed because a device-dependent resource is associated with the wrong ID2D1Device (resource domain)". Obviously, the resource comes from a different ID2D1Device. But how do I draw a texture bitmap from one Direct2D device onto another?
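For reference, here is a condensed sketch of approach B as listed above (C++ with WRL ComPtr; all names are illustrative and error handling is omitted) - this is the setup that produces the resource-domain error:

```cpp
#include <d2d1_1.h>
#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Assumed to exist already: the D3D11 device, an off-screen render-target
// texture (BGRA, D3D11_BIND_RENDER_TARGET), and the window's HWND target.
void DrawOffscreenToHwndTarget(ComPtr<ID3D11Device> d3dDevice,
                               ComPtr<ID3D11Texture2D> offscreen,
                               ComPtr<ID2D1HwndRenderTarget> hwndTarget)
{
    ComPtr<IDXGIDevice> dxgiDevice;
    d3dDevice.As(&dxgiDevice);

    ComPtr<ID2D1Device> d2dDevice;
    D2D1CreateDevice(dxgiDevice.Get(), nullptr, &d2dDevice);

    ComPtr<ID2D1DeviceContext> d2dContext;
    d2dDevice->CreateDeviceContext(D2D1_DEVICE_CONTEXT_OPTIONS_NONE,
                                   &d2dContext);

    ComPtr<IDXGISurface> surface;
    offscreen.As(&surface);

    ComPtr<ID2D1Bitmap1> bitmap;
    d2dContext->CreateBitmapFromDxgiSurface(surface.Get(), nullptr, &bitmap);

    // Fails with D2DERR_WRONG_RESOURCE_DOMAIN: the bitmap belongs to
    // d2dDevice, while the HWND render target is backed by a different
    // internal Direct2D device.
    hwndTarget->BeginDraw();
    hwndTarget->DrawBitmap(bitmap.Get());
    hwndTarget->EndDraw();
}
```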
C. Out of desperation, I tried to map the content of my frame buffer to CPU memory using IDXGISurface::Map. However, this is tricky because it requires the D3D11_USAGE_STAGING and D3D11_CPU_ACCESS_READ flags when creating the texture, which then seems to make it impossible to use the texture as an output frame buffer (or am I missing something?). And this technique will most likely be very slow anyway, because it involves syncing between CPU and GPU and copying the whole texture at least twice.
Has anyone ever succeeded with the task of rendering 3D content to an ID2D1HwndRenderTarget? Please share your experience. Thanks!
I have an existing OpenGL ES 3.1 application that renders a scene to an FBO with color and depth/stencil attachments. It uses the usual methods for drawing (glBindBuffer, glDrawArrays, glBlend*, glStencil*, etc.). My task is now to create a depth-only pass that fills the depth attachment with the same values as the main pass.
My question is: what is the minimum number of steps necessary to achieve this and avoid the GPU doing superfluous work (unnecessary shader invocations etc.)? Is deactivating the color attachment enough, or do I also have to set null shaders, disable blending, etc.?
I assume you need this before the main pass runs, otherwise you would just keep the main pass depth.
Preflight
Create specialized buffers which contain only the mesh data needed to compute position (which are deinterleaved from all non-position data).
Create specialized vertex shaders which compute only the output position (see the sketch after this list).
Link programs with the simplest valid fragment shader.
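As an illustration, a minimal depth-only shader pair might look like this (GLSL ES shown as C++ string literals; the attribute and uniform names are assumptions, not from the original post):

```cpp
// Position-only vertex shader fed from the deinterleaved position stream.
static const char* kDepthOnlyVS = R"(#version 310 es
layout(location = 0) in vec4 a_position;   // position-only attribute stream
uniform mat4 u_mvp;                        // assumed model-view-projection
void main() {
    gl_Position = u_mvp * a_position;
}
)";

// Simplest valid fragment shader: a depth-only pass needs no color outputs.
static const char* kDepthOnlyFS = R"(#version 310 es
void main() {
}
)";
```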
Rendering
Render the depth-only pass using the specialized buffers and shaders, masking out all color writes (see the sketch after this list).
Render the main pass with the full buffers and shaders.
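A sketch of the state changes around the two passes, assuming a classic depth prepass (program and buffer names are hypothetical):

```cpp
// Depth-only pass: mask out all color writes; only depth is updated.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
glUseProgram(depthOnlyProgram);   // position-only program from the preflight
// ... bind the position-only buffers and draw the scene ...

// Main pass: re-enable color writes. With GL_EQUAL (or GL_LEQUAL) the
// fragments that survive are exactly those at the prepass depth values.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);            // depth buffer is already final
glDepthFunc(GL_EQUAL);
glUseProgram(mainProgram);
// ... bind the full buffers and draw the scene again ...
```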
Options
At step (2) above it might be beneficial to load the depth-only pass results as the starting depth for the main pass. This will give you better early-ZS test accuracy, at the expense of reading the depth values back. Most mobile GPUs have hidden surface removal, so this isn't always going to be a net gain - it depends on your content, target GPU, and how good your front-to-back draw order is.
You probably want to use the specialized buffers (position data interleaved in one buffer region, non-position interleaved in a second) for the main draw, as many GPUs will optimize out the non-position calculations if the primitive is culled.
The specialized buffers and optimized shaders can also be used for shadow mapping, and other such depth-only techniques.
I have a question about the general use of glInvalidateFramebuffer:
As far as I know, the purpose of glInvalidateFramebuffer is to "skip the store of framebuffer contents that are no longer needed". Its main purpose on tile-based GPUs is to get rid of depth and stencil contents if only the color is needed after rendering. I do not understand why this is necessary. As far as I know, if I render to an FBO then all of this data is stored in that FBO. Now if a subsequent draw uses only the color contents, or nothing from that FBO at all, why is the depth/stencil data accessed at all? It is supposedly stored somewhere and that eats bandwidth, but as far as I can tell it is already in the FBO's GPU memory as the result of the render, so when does that supposedly expensive additional store operation happen?
There are supposedly expensive preservation steps for FBO attachments, but why are those necessary if the data is already in GPU memory as a result of the render?
Framebuffers in a tile-based GPU exist in two places - the version stored in main memory (which is persistent), and the working copy inside the GPU tile buffer (which only exists for the duration of that tile's fragment shading). The content of the tile buffer is written back to the buffer in main memory at the end of shading for each tile.
The objective of tile-based shading is to keep as much of the state as possible inside that tile buffer, and to avoid writing it back to main memory if it's not needed. This is important because main memory DRAM accesses are phenomenally power hungry. Invalidation at the end of each render pass tells the graphics stack that those buffers don't need to be persisted, which means the write-back from the tile buffer to main memory can be avoided.
I've written a longer article on it here if you want more detail:
https://developer.arm.com/solutions/graphics/developer-guides/understanding-render-passes/single-view
For non-tile-based GPUs the main use case seems to be as a lower-cost alternative to a clear at the start of a render pass, if you don't actually care about the starting color. There is likely little benefit to using it at the end of the render pass (but it should not hurt either).
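A minimal usage sketch, assuming a GLES 3.0+ context and a bound FBO (hypothetical handle `fbo`) whose depth/stencil contents are not needed after the pass:

```cpp
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
// ... draw the scene ...

// Invalidate while the FBO is still bound, before switching away from it.
const GLenum discards[] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, discards);

// On a tile-based GPU the depth/stencil tile contents are now never written
// back to main memory; only the color attachment gets resolved.
glBindFramebuffer(GL_FRAMEBUFFER, 0);
```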
Can I bind a 2000x2000 texture to a color attachment in a FBO, and tell OpenGL to behave exactly as if the texture was smaller, let's say 1000x1000?
The point is, in my rendering cycle I need many (mostly small) intermediate textures to render to, but I only need one at a time. I am thinking that, rather than creating many smaller textures, I will have only one appropriately large texture, bind it to the FBO at hand, tell OpenGL to render only to part of it, and save memory.
Or maybe I should be destroying/recreating those textures many times per frame? That would certainly save even more memory, but wouldn't that cause a noticeable slowdown?
Can I bind a 2000x2000 texture to a color attachment in a FBO, and tell OpenGL to behave exactly as if the texture was smaller, let's say 1000x1000?
Yes - just set glViewport() to the region you want to render to, and remember to adjust the glScissor() bounding region if you ever enable scissor testing.
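A sketch, assuming the FBO with the 2000x2000 texture attached is already bound:

```cpp
glViewport(0, 0, 1000, 1000);   // render only into the lower-left quarter

// If scissor testing is enabled, clamp it to the same region so clears and
// draws can't touch the rest of the texture.
glEnable(GL_SCISSOR_TEST);
glScissor(0, 0, 1000, 1000);
glClear(GL_COLOR_BUFFER_BIT);
// ... draw; later, sample only the [0, 0.5] UV range of the texture ...
```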
Or maybe I should be destroying/recreating those textures many times per frame? That would certainly save even more memory, but wouldn't that cause a noticeable slowdown?
Completely destroying and recreating a new texture object every frame will be slow because it will cause constant memory reallocation overhead, so definitely don't do that.
Having a pool of pre-allocated textures which you cycle through is fine though - that's a pretty common technique. You won't really save much memory storing one 2K*2K texture vs four separate 1K*1K textures - the total storage requirement is the same and the additional metadata overhead is tiny in comparison - so if keeping them separate is easier in terms of application logic, I'd suggest doing that.
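For illustration, such a pool can be as simple as this round-robin sketch (a hypothetical structure; the GL textures are assumed to be created once at startup):

```cpp
#include <vector>
#include <GLES2/gl2.h>

struct TexturePool {
    std::vector<GLuint> textures;  // pre-allocated once, never recreated
    size_t next = 0;

    // Hand out textures round-robin; no per-frame allocation happens here.
    GLuint acquire() {
        GLuint t = textures[next];
        next = (next + 1) % textures.size();
        return t;
    }
};
```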
I'm generating blurred drop shadows in WebGL by drawing the object to be blurred onto an off-screen framebuffer/texture, then applying a few passes of a filter to it (back and forth between two off-screen framebuffers), then copying the result to the final output.
However, I'm just dropping the RGB channels, overwriting them with the desired color of the drop shadow (usually black) while maintaining the alpha channel. It seems like I could probably get better performance by just having my off-screen framebuffers be a single (alpha) channel.
Is there a way to do that, and would it actually help?
Also, is there a better way to apply multiple passes of a filter than just alternating between two frame buffers and using the previous frame buffer's bound texture as the input?
Assuming WebGL follows GLES, then per the spec (page 91):
The name of the color buffer of an application-created framebuffer object is COLOR_ATTACHMENT0 ... Color buffers consist of R, G, B, and, optionally, A unsigned integer values.
So you can't attach only to A, or only to any single colour channel.
Options to explore:
Use colorMask to disable writing to R, G and B (a sketch follows this list). Depending on what data layout your GPU uses internally, you can imagine that this could effectively achieve exactly what you want, or possibly have no effect whatsoever.
Is there a way you could render to the depth channel instead of to the alpha channel?
Reducing memory bandwidth is often helpful, but if it's not a bottleneck then you could end up prematurely optimising.
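A sketch of the colorMask option, written with GLES-style calls; the WebGL equivalent is gl.colorMask(false, false, false, true):

```cpp
// Suppress R, G and B writes so only alpha is produced by the blur passes.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_TRUE);
// ... run the filter passes into the off-screen framebuffers ...
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);  // restore for the final draw
```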
To avoid excessive per-frame ping-ponging you'd normally try to rework your shader so that it does the work of all the stages in one pass. Otherwise, consider whether there's any better-than-linear way to combine multiple passes: instead of knowing only how to get from stage n to stage n+1, can you go from stage n to stage 2n? Or even just to n+2?
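For completeness, the usual two-buffer ping-pong loop looks like this sketch (GLES-style calls; the fbo/tex arrays and the full-screen draw are hypothetical):

```cpp
#include <utility>  // std::swap

int src = 0, dst = 1;
for (int pass = 0; pass < numPasses; ++pass) {
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[dst]);  // render into the other FBO
    glBindTexture(GL_TEXTURE_2D, tex[src]);       // previous result as input
    // ... draw a full-screen quad with the filter shader ...
    std::swap(src, dst);
}
// tex[src] now holds the final filtered result.
```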