About the meaning of glInvalidateFramebuffer - opengl-es

I have a question about the general use of glInvalidateFramebuffer:
As far as I know, the purpose of glInvalidateFramebuffer is to "skip the store of framebuffer contents that are no longer needed". Its main use on tile-based GPUs is to discard depth and stencil contents if only color is needed after rendering. I do not understand why this is necessary. As far as I know, if I render to an FBO then all of this data is stored in that FBO. If a subsequent draw uses only the color contents, or does nothing with that FBO at all, why is the depth/stencil data accessed at all? It is supposedly stored somewhere and that eats bandwidth, but as far as I can tell it is already in the FBO's GPU memory as a result of the render, so when does that supposedly expensive additional store operation happen?
There are supposedly expensive preservation steps for FBO attachments, but why are those necessary if the data is already in GPU memory as a result of the render?
Regards

Framebuffers in a tile-based GPU exist in two places - the version stored in main memory (which is persistent), and the working copy inside the GPU tile buffer (which only exists for the duration of that tile's fragment shading). The content of the tile buffer is written back to the buffer in main memory at the end of shading for each tile.
The objective of tile-based shading is to keep as much of the state as possible inside that tile buffer, and to avoid writing it back to main memory when it's not needed. This is important because main memory DRAM accesses are phenomenally power hungry. Invalidation at the end of each render pass tells the graphics stack that those buffers don't need to be persisted, so the write-back from the tile buffer to main memory can be avoided.
I've written a longer article on it here if you want more detail:
https://developer.arm.com/solutions/graphics/developer-guides/understanding-render-passes/single-view
For non-tile-based GPUs the main use case seems to be using it as a lower cost version of a clear at the start of a render pass if you don't actually care about the starting color. It's likely there is little benefit to using it at the end of the render pass (but it should not hurt either).
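To put a rough number on the write-back that invalidation avoids on a tile-based GPU, here is a small sketch. The function names are made up for illustration, and it assumes an uncompressed attachment (real drivers may compress, so treat the result as an upper bound rather than a measurement).

```c
#include <stdint.h>

/* Bytes written from the tile buffer to main memory per frame for one
 * attachment, assuming an uncompressed surface (hypothetical helper). */
uint64_t writeback_bytes_per_frame(uint32_t width, uint32_t height,
                                   uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* The same figure scaled to bytes per second at a given frame rate. */
uint64_t writeback_bytes_per_second(uint32_t width, uint32_t height,
                                    uint32_t bytes_per_pixel, uint32_t fps)
{
    return writeback_bytes_per_frame(width, height, bytes_per_pixel) * fps;
}

/*
 * The write-back is skipped by invalidating after the last draw to the FBO,
 * while it is still bound (OpenGL ES 3.0):
 *
 *   const GLenum discard[] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };
 *   glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, discard);
 */
```

For a 1920x1080 attachment at 4 bytes per pixel (for example a packed 24-bit depth / 8-bit stencil format), this comes to 8,294,400 bytes per frame, or roughly 498 MB/s at 60 frames per second.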

Related

Vulkan/OpenGL subpasses that fetch more than single fragment

So, Vulkan introduced subpasses, and OpenGL ES implements similar behaviour with ARM_shader_framebuffer_fetch.
In the past, I have used framebuffer_fetch successfully for tonemapping post-effect shaders.
Back then the limitation was that one could only read the contents of the framebuffer at the location of the currently rendered fragment.
Now, what I wonder is whether there is by now any way in Vulkan (or even OpenGL ES) to read from multiple locations (for example to implement a blur kernel) without tiled hardware having to store the result to RAM and load it back.
In theory I guess it should be possible: the first pass would just need to render slightly larger than the blur subpass, based on kernel size (so for example, if the kernel size was 4 pixels, then the resolved tile would need to be 4 pixels smaller than the in-tile buffer size), and some pixels would have to be rendered redundantly (where tiles overlap).
Now, is there a way to do that?
I seem to recall having seen some Vulkan feature related to subpasses that would allow defining the support size (which sounds like what I’m looking for now), but I can’t recall where I saw it.
So my questions:
1. With Vulkan on a mobile tiled renderer architecture, is it possible to forward-render some geometry and then render a full-screen blur over it, all within a single in-tile pass (without the hardware having to store the result of the intermediate pass to RAM first and then load the texture from RAM when blurring)? If so, how?
2. If the answer to 1 is yes, can it also be done in OpenGL ES?
Short answer, no. Vulkan subpasses still have the 1:1 fragment-to-pixel association requirements.
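Purely as arithmetic on the scheme the question imagines (which Vulkan subpasses do not provide), the sketch below estimates how much redundant shading an overlapping-tile approach would cost. It assumes square tiles and a blur that needs `radius` pixels of apron on every side; both function names are hypothetical.

```c
#include <stdint.h>

/* Side of the square region inside a tile whose pixels have a complete
 * neighbourhood of `radius` pixels available on every side.
 * Returns 0 when the apron would swallow the whole tile. */
uint32_t usable_tile_side(uint32_t tile_side, uint32_t radius)
{
    return (tile_side > 2 * radius) ? tile_side - 2 * radius : 0;
}

/* Pixels shaded in the first pass per pixel that survives to the blur
 * output; values above 1.0 are redundant work. Returns 0.0 when no
 * pixel survives. */
double overdraw_factor(uint32_t tile_side, uint32_t radius)
{
    uint32_t usable = usable_tile_side(tile_side, radius);
    if (usable == 0)
        return 0.0;
    return ((double)tile_side * tile_side) / ((double)usable * usable);
}
```

With an assumed 32x32 tile and a radius of 4 pixels, only a 24x24 region would be resolved, so the first pass would shade about 1.78 times as many pixels as end up on screen.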

Rendering only to a part of a texture

Can I bind a 2000x2000 texture to a color attachment in a FBO, and tell OpenGL to behave exactly as if the texture was smaller, let's say 1000x1000?
The point is, in my rendering cycle I need many (mostly small) intermediate textures to render to, but I need only 1 at a time. I am thinking that, rather than creating many smaller textures, I will have only 1 appropriately large, and I will bind it to an FBO at hand, tell OpenGL to render only to part of it, and save memory.
Or maybe I should be destroying/recreating those textures many times per frame? That would certainly save even more memory, but wouldn't that cause a noticeable slowdown?
> Can I bind a 2000x2000 texture to a color attachment in a FBO, and tell OpenGL to behave exactly as if the texture was smaller, let's say 1000x1000?
Yes, just set glViewport() to the region you want to render to, and remember to adjust the glScissor() region as well if you ever enable scissor testing.
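As a minimal sketch of picking such a region, the helper below computes the rectangle of one cell when a square texture is divided into a grid. The `Rect` type and `atlas_cell` function are hypothetical names, and the grid layout is just one possible way to carve up the texture.

```c
#include <stdint.h>

/* Rectangle in pixels, origin at the lower-left corner as glViewport expects. */
typedef struct { int32_t x, y, width, height; } Rect;

/* Hypothetical helper: rectangle of cell `index` when a square texture of
 * `tex_size` pixels is split into a grid of `cell_size` cells, row-major. */
Rect atlas_cell(int32_t tex_size, int32_t cell_size, int32_t index)
{
    int32_t per_row = tex_size / cell_size;
    Rect r = { (index % per_row) * cell_size,
               (index / per_row) * cell_size,
               cell_size, cell_size };
    return r;
}

/* With a GL context current and the FBO bound, the rectangle would be
 * applied as:
 *   glViewport(r.x, r.y, r.width, r.height);
 *   glScissor(r.x, r.y, r.width, r.height);  // used only if GL_SCISSOR_TEST is enabled
 */
```

One thing to keep in mind: glClear() is not limited by the viewport, only by the scissor box, so clearing just one cell needs the scissor test enabled.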
> Or maybe I should be destroying/recreating those textures many times per frame? That would certainly save even more memory, but wouldn't that cause a noticeable slowdown?
Completely destroying and recreating a new texture object every frame will be slow because it will cause constant memory reallocation overhead, so definitely don't do that.
Having a pool of pre-allocated textures which you cycle through is fine, though; that's a pretty common technique. You won't really save much in terms of memory by storing one 2K*2K texture rather than 4 separate 1K*1K textures: the total storage requirement is the same, and the additional metadata overhead is tiny in comparison. So if keeping them separate is easier in terms of application logic, I'd suggest doing that.
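The pool idea could be sketched roughly as follows. The `TexturePool` type, `pool_acquire` and the pool size of 4 are illustrative assumptions; the texture names are assumed to have been created once at startup (for example with glGenTextures and glTexImage2D), which is not shown here.

```c
#include <stddef.h>

#define POOL_SIZE 4  /* illustrative; size it to the most targets alive at once */

/* Hypothetical pool of texture names created once at startup and
 * reused every frame instead of being destroyed and recreated. */
typedef struct {
    unsigned int textures[POOL_SIZE];
    size_t next;
} TexturePool;

/* Hand out the next texture in round-robin order; nothing is allocated here. */
unsigned int pool_acquire(TexturePool *pool)
{
    unsigned int tex = pool->textures[pool->next];
    pool->next = (pool->next + 1) % POOL_SIZE;
    return tex;
}
```

Round-robin is the simplest policy; an application that keeps some intermediate results alive longer than others would need explicit acquire/release bookkeeping instead.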

Texture vs storage for FBO depth buffer

Assuming the device supports the GL_OES_depth_texture extension, is there any difference in terms of performance or memory consumption between attaching a renderbuffer storage and attaching a texture to an FBO?
Your post is tagged with OpenGLES 2.0 which most likely means you're talking about mobile.
Many Android mobile GPUs and all iOS GPUs are based on Tile-Based Deferred Renderers. In this design, the rendering is all done to small (e.g. 32x32) tiles using special fast on-chip memory. In a typical rendering pass, with correct calls to glClear and glDiscardFramebufferEXT, there's no need for the device to ever copy the depth buffer out of the on-chip memory into main memory.
However, if you're using a depth texture, then this copy is unavoidable. The cost of transferring a screen-sized depth texture from on-chip memory into a texture is significant. However, I'd expect the rendering costs of your draw calls during the render pass to be unaffected.
In terms of memory usage, it's a bit more speculative. It's possible that a clever driver might not need to allocate any memory at all for a depth buffer on a TBDR GPU if you're not using a depth texture and you're using glClear and glDiscardFramebufferEXT correctly because at no point does your depth buffer have to be backed by any storage. Whether drivers actually do that is internal to the driver's implementation and you would have to ask the driver authors (Apple/Imagination Technologies/ARM, etc).
Finally, it may be the case that the depth buffer format has to undergo some reconfiguration to be usable as a depth texture which could mean it uses more memory and affect efficiency. I think that's unlikely though.
TLDR: Don't use a depth texture unless you actually need to, but if you do need one, then I don't think it will impact your rendering performance too much. The main cost is in the bandwidth of copying the depth data about.

Fastest method for blitting from a pixel buffer into a device context

Good evening,
I have several 32-bit images in memory buffers that I wish to "blit" to a device context, quickly. Speed is an issue here because the buffer will be manipulated constantly and need to be blitted to the DC repeatedly.
The color depth of the buffer is 32-bits, so it is already in the DIB-expected format of SetDIBits(). However, this is rather cumbersome since the bitmap target of SetDIBits() cannot be already selected into the DC prior to the operation. So I will need to constantly swap out the DC's bitmap, call SetDIBits(), swap the bitmap back into the DC, and then blit the DC to the Window's DC. To me, that just seems like a LOT of workload on the CPU and too much branching in the Windows API; way too much for optimal performance.
I would be interested in using DirectX if it didn't force me to use Device Contexts for 2D operations, or uploading textures to video memory before displaying them, because the contents of the image are constantly changing.
My question is simple (despite the long writeup). What would be the fastest way for me to blit an image from a pixel buffer in memory onto the screen? Direct access to the pixel buffer of a DC would be great, but I know that's not going to happen.
Thanks for reading my long writeup.
The CreateDIBSection API creates a DIB that applications can write to directly. This allows you to update the bitmap continuously (either by memcpy or by writing to it directly).
See MSDN article for further details.
Access to the bitmap must be synchronized: call the GdiFlush function before writing to the bits yourself, so that any GDI drawing already queued for the bitmap has completed.
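The sketch below covers only the portable part of this: how pixels are laid out in the memory that CreateDIBSection hands back for a 32-bit BI_RGB bitmap. The helper names are invented for illustration, and the buffer is assumed to be top-down (which requires a negative biHeight in the BITMAPINFOHEADER). The Win32 calls themselves are not shown.

```c
#include <stdint.h>
#include <stddef.h>

/* Bytes per scanline of a DIB: rows are padded to a multiple of 4 bytes. */
size_t dib_stride(uint32_t width, uint32_t bits_per_pixel)
{
    return (((size_t)width * bits_per_pixel + 31) / 32) * 4;
}

/* Pack a pixel for a 32-bit BI_RGB DIB: 0x00RRGGBB as a little-endian DWORD
 * (bytes in memory: blue, green, red, unused). */
uint32_t dib_pixel(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | b;
}

/* Write one pixel into a top-down 32-bit buffer. At 32 bits per pixel the
 * stride is exactly width * 4, so rows can be indexed as plain DWORDs. */
void dib_put(uint32_t *bits, uint32_t width, uint32_t x, uint32_t y,
             uint32_t px)
{
    bits[(size_t)y * width + x] = px;
}
```

Once the frame has been written, a single BitBlt from the memory DC holding the DIB section to the window DC puts it on screen.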

Is it possible to use pointers to write directly (low level) onto a window without using Bitblt?

I have written an anaglyph filter that mixes two images into one stereographic image. It is a fast routine that works with one pixel at a time.
Right now I'm using pointers to output each calculated pixel to a memory bitmap, then Bitblt that whole image onto the window.
This seems redundant to me. I'd rather copy each pixel directly to the screen, since my anaglyph routine is quite fast. Is it possible to bypass Bitblt and simply have the pointer point directly to wherever Bitblt would copy it to?
I'm sure it's possible, but you really really really don't want to do this. It's much more efficient to draw the entire pattern at once.
You can't draw directly to the screen from Windows because the graphics card memory isn't necessarily mapped in any sane order.
Blitting to the screen is amazingly fast.
Remember that you don't blit after each pixel, only when you want a new result to be shown. Even then, there's no point doing this faster than your screen's refresh rate, probably 60 Hz.
You are looking for something like glMapBuffer in OpenGL, but with direct access to the screen.
But writing to GPU memory pixel by pixel is the slowest operation you can do. PCI works faster if you send big streams of data. Also, there are many issues if you both write and read data. The pixel layout is also important (see the NVIDIA docs about fast texture transfers). BitBlt will do it for you in a driver-optimised way.
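The "compose everything in memory, then present once" pattern can be sketched as below. The question does not say which formula its filter uses, so a simple red/cyan mix is assumed here (red from the left image, green and blue from the right), with pixels stored as 0x00RRGGBB; the function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed red/cyan mix: red channel from the left eye, green and blue
 * from the right eye. Pixels are 0x00RRGGBB. */
uint32_t anaglyph_pixel(uint32_t left, uint32_t right)
{
    return (left & 0x00FF0000u) | (right & 0x0000FFFFu);
}

/* Compose the whole frame in system memory; the caller then presents it
 * with a single blit rather than touching the screen per pixel. */
void anaglyph_frame(const uint32_t *left, const uint32_t *right,
                    uint32_t *out, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i)
        out[i] = anaglyph_pixel(left[i], right[i]);
}
```

The per-pixel work stays in ordinary memory, and the transfer to the display happens as one large copy, which is the case BitBlt and the driver are optimised for.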
