Texture vs storage for FBO depth buffer - opengl-es

Assuming the device supports the GL_OES_depth_texture extension, is there any difference in terms of performance or memory consumption between attaching a renderbuffer (storage) or a texture to an FBO?

Your post is tagged with OpenGL ES 2.0, which most likely means you're talking about mobile.
Many Android mobile GPUs and all iOS GPUs are tile-based deferred renderers - in this design, rendering is done into small (e.g. 32x32) tiles using special fast on-chip memory. In a typical rendering pass, with correct calls to glClear and glDiscardFramebufferEXT, the device never needs to copy the depth buffer out of the on-chip memory into storage.
If you're using a depth texture, however, that copy is unavoidable. The cost of transferring a screen-sized depth texture from on-chip memory into a texture is significant. That said, I'd expect the rendering cost of your draw calls during the render pass to be unaffected.
In terms of memory usage, it's a bit more speculative. It's possible that a clever driver might not need to allocate any memory at all for a depth buffer on a TBDR GPU if you're not using a depth texture and you're using glClear and glDiscardFramebufferEXT correctly because at no point does your depth buffer have to be backed by any storage. Whether drivers actually do that is internal to the driver's implementation and you would have to ask the driver authors (Apple/Imagination Technologies/ARM, etc).
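For reference, here is a minimal sketch of the pattern referred to above (OpenGL ES 2.0 with the EXT_discard_framebuffer extension; fbo and drawScene are placeholder names):

    /* Clear at the start so the GPU never has to load old contents into the tile. */
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    drawScene();  /* your draw calls */

    /* Tell the driver the depth contents are not needed after the pass,
     * so they never have to be written back out of on-chip memory. */
    const GLenum discards[] = { GL_DEPTH_ATTACHMENT };
    glDiscardFramebufferEXT(GL_FRAMEBUFFER, 1, discards);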
Finally, it may be the case that the depth buffer format has to undergo some reconfiguration to be usable as a depth texture which could mean it uses more memory and affect efficiency. I think that's unlikely though.
TLDR: Don't use a depth texture unless you actually need to, but if you do need one, then I don't think it will impact your rendering performance too much. The main cost is in the bandwidth of copying the depth data about.

Related

About the meaning of glInvalidateFramebuffer

I have a question about the general use of glInvalidateFramebuffer:
As far as I know, the purpose of glInvalidateFramebuffer is to "skip the store of framebuffer contents that are no longer needed". Its main purpose on tile-based GPUs is to get rid of the depth and stencil contents if only the color is needed after rendering. I do not understand why this is necessary. As far as I know, if I render to an FBO, all of this data is stored in that FBO. Now if I use only the color contents, or nothing at all, from that FBO in a subsequent draw, why is the depth/stencil data accessed at all? It is supposedly stored somewhere and that eats bandwidth, but as far as I can tell it is already in the FBO's GPU memory as the result of the render, so when does that supposedly expensive additional store operation happen?
There are supposedly expensive preservation steps for FBO attachments, but why are those necessary if the data is already in GPU memory as a result of the render?
Regards
Framebuffers in a tile-based GPU exist in two places - the version stored in main memory (which is persistent), and the working copy inside the GPU tile buffer (which only exists for the duration of that tile's fragment shading). The content of the tile buffer is written back to the buffer in main memory at the end of shading for each tile.
The objective of tile-based shading is to keep as much of the state as possible inside that tile buffer, and avoid writing it back to main memory if it's not needed. This is important because main memory DRAM accesses are phenomenally power hungry. Invalidation at the end of each render pass tells the graphics stack that those buffers don't need to be persisted, which means the write-back from the tile buffer to main memory can be avoided.
I've written a longer article on it here if you want more detail:
https://developer.arm.com/solutions/graphics/developer-guides/understanding-render-passes/single-view
For non-tile-based GPUs the main use case seems to be using it as a lower-cost version of a clear at the start of a render pass if you don't actually care about the starting color. There is likely little benefit to using it at the end of the render pass (but it should not hurt either).
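As an illustration, a minimal sketch of the end-of-pass invalidation described above (OpenGL ES 3.0 or desktop GL 4.3+; fbo and drawScene are placeholder names):

    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);

    drawScene();

    /* Only the color attachment will be written back to main memory;
     * depth and stencil stay in the tile buffer and are then dropped. */
    const GLenum toInvalidate[] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };
    glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, toInvalidate);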

How do I see the GPU's bottleneck in a complex algorithm?

I'm using GLSL fragment shaders for GPGPU calculations (I have my reasons).
In Nsight I see that I'm doing 1600 draw calls per frame.
There could be 3 bottlenecks:
Fillrate
Just too many drawcalls
GPU stalls due to my GPU->CPU downloads and CPU->GPU uploads
How do I find which one it is?
If my algorithm were simple (e.g. a Gaussian blur or something), I could force the viewport of each draw call to be 1x1, and depending on the speed change, I could rule out a fillrate problem.
In my case, though, that would require changing the entire algorithm.
Since you mention the Nvidia Nsight tool, you could try following the procedure explained in the Nvidia blog post below.
It explains how to read and understand hardware performance counters to identify performance bottlenecks.
The Peak-Performance-Percentage Analysis Method for Optimizing Any GPU Workload :
https://devblogs.nvidia.com/the-peak-performance-analysis-method-for-optimizing-any-gpu-workload/
Instead of trying to find the one bottleneck, consider changing the way you calculate.
I'm using GLSL fragment shaders for GPGPU calculations (I have my reasons).
I am not sure what your OpenGL version is, but using a compute shader instead of a fragment shader would solve the issue.
In Nsight I see that I'm doing 1600 draw calls per frame.
Do you mean actual OpenGL draw calls? That must be one of the reasons for sure. With fragment shaders you have to draw something into FBOs just to run the calculation on the GPU; that is the big difference between a compute shader and a fragment shader. Draw calls always slow the program down, whereas compute dispatches avoid that per-draw overhead.
An architectural advantage of compute shaders for image processing is that they skip the ROP (render output unit) step. It's very likely that writes from pixel shaders go through all the regular blending hardware even if you don't use it.
If you have to use fragment shaders somehow, then:
try to reduce the number of draw calls;
find a way to store the data being calculated.
It would be like using render textures as memory: if you need to change vertices using RTTs, you would load textures holding position, velocity, or whatever else you need to change the vertices or their attributes such as normal/color.
To find the actual cause, it is best to use the CPU and GPU profilers available for your chipset and OS.
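For illustration, a minimal sketch of replacing a full-screen FBO draw with a compute dispatch (assumes OpenGL 4.3+; computeProgram, inputTex, outputTex, width and height are placeholder names):

    glUseProgram(computeProgram);

    /* Source as a sampled texture, destination as an image - no FBO, no draw call. */
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, inputTex);
    glBindImageTexture(0, outputTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);

    /* One 16x16 work group per tile of the output. */
    glDispatchCompute((width + 15) / 16, (height + 15) / 16, 1);

    /* Make the image writes visible before the next pass reads them. */
    glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT | GL_TEXTURE_FETCH_BARRIER_BIT);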

Optimized GPU to CPU data transfer

I'm a bit out of my depth here (best way to be me thinks), but I am poking around looking for an optimization that could reduce GPU to CPU data transfer for my application.
I have an application that performs some modifications to vertex data in the GPU. Occasionally the CPU has to read back parts of the modified vertex data and then compute some parameters which then get passed back into the GPU shader via uniforms, forming a loop.
It takes too long to transfer all the vertex data back to the CPU and then sift through it on the CPU (millions of points), so I have a "hack" in place to reduce the workload to something usable, although not optimal.
What I do:
CPU: read image
CPU: generate 1 vertex per pixel, Z based on colour information/filter etc
CPU: transfer all vertex data to GPU
GPU: transform feedback used to update GL_POINT vertex coords in realtime based on some uniform parameters set from the CPU.
When I wish to read only a rectangular "section", I use glMapBufferRange to map the entire rows that comprise the desired rect (bad diagram alert): picture the image/set of vertices in the GPU as a grid, with the desired rect in red and the remainder of each touched row in blue. My "hack" involves having to read all the blue and red vertices, because I can only specify one continuous range of data to read back.
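For concreteness, a minimal sketch of that row-span mapping, assuming one tightly packed vertex per pixel (Vertex, vbo, imgWidth, the rect variables and process are placeholder names):

    GLsizeiptr rowBytes = (GLsizeiptr)imgWidth * sizeof(Vertex);
    GLintptr   offset   = (GLintptr)rectY * rowBytes;      /* first touched row        */
    GLsizeiptr length   = (GLsizeiptr)rectH * rowBytes;    /* whole rows: red AND blue */

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    Vertex *rows = (Vertex *)glMapBufferRange(GL_ARRAY_BUFFER, offset, length,
                                              GL_MAP_READ_BIT);
    for (int y = 0; y < rectH; ++y)
        for (int x = 0; x < rectW; ++x)
            process(rows[y * imgWidth + rectX + x]);        /* only the red cells       */
    glUnmapBuffer(GL_ARRAY_BUFFER);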
Does anyone know a clever way to efficiently get at the red, without the blue? (without having to issue a series of glMapBufferRange calls)
EDIT-
The use case is that I render the image into a 3D world as GL_POINTS, coloured and offset in Z by an amount based on the colour info (and sized etc. according to distance). Then the user can modify the vertex Z data with a mouse-cursor brush. The logic behind some of the brush-application code needs to know the Zs of the area under the mouse (the brush circle), e.g. min/max/average etc., so that the CPU can control the shader's modification of the data by setting a series of uniforms that feed into the shader. So for example the user can say: I want all points under the cursor to be set to the average value. This could all probably be done entirely on the GPU, but the idea is that once I get the CPU-GPU "loop" optimised as far as I reasonably can, I can then expand the min/max/avg stuff to do interesting things on the CPU that would (probably) be cumbersome to do entirely on the GPU.
Cheers!
Laythe
To get any data from the GPU to the CPU you need to map the GPU memory in any case, which means the OpenGL implementation will have to use something like mmap under the hood. I have checked the implementation of that for both x86 and ARM, and it looks like it is page-aligned, so you cannot map less than one contiguous page of GPU memory at a time. Even if you could request to map just the red areas, you would quite likely get the blue ones as well (depending on your page size and pixel data size).
Solution 1
Just use glReadPixels, as this allows you to select a window of the framebuffer. I assume a GPU vendor like Intel would optimize the driver so that it maps as few pages as possible, but this is not guaranteed, and in some cases you may need to map two pages just for two pixels.
Solution 2
Create a compute shader or use several glCopyBufferSubData calls to copy your region of interest into a contiguous buffer in GPU memory. If you know the height and width you want, you can then un-mangle and get a 2D buffer back on the CPU side.
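A minimal sketch of the copy-then-map approach, assuming one tightly packed vertex per pixel (Vertex, vbo, stagingBuf, imgWidth and the rect variables are placeholder names):

    /* Pack each row of the region of interest into a small contiguous GPU buffer. */
    glBindBuffer(GL_COPY_READ_BUFFER,  vbo);         /* full vertex buffer               */
    glBindBuffer(GL_COPY_WRITE_BUFFER, stagingBuf);  /* sized for rectW * rectH vertices */

    GLsizeiptr rowBytes = (GLsizeiptr)rectW * sizeof(Vertex);
    for (int y = 0; y < rectH; ++y) {
        GLintptr src = ((GLintptr)(rectY + y) * imgWidth + rectX) * sizeof(Vertex);
        glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                            src, (GLintptr)y * rowBytes, rowBytes);
    }

    /* One contiguous map that covers only the red vertices. */
    Vertex *rect = (Vertex *)glMapBufferRange(GL_COPY_WRITE_BUFFER, 0,
                                              (GLsizeiptr)rectH * rowBytes,
                                              GL_MAP_READ_BIT);
    /* ... read rect ... */
    glUnmapBuffer(GL_COPY_WRITE_BUFFER);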
Which of the above solutions works better depends on your hardware and driver implementation. If GPU->CPU is the bottleneck and GPU->GPU is fast, then the second solution may work well, however you would have to experiment.
Solution 3
As suggested in the comments, do everything on the GPU. This heavily depends on whether the work parallelizes well, but if the copying of memory is too slow for you, then you don't have much other choice.
I suppose you are asking because you cannot do all the work in shaders, right?
If you render to a framebuffer object and bind it as GL_READ_FRAMEBUFFER, you can read a rectangular block of it back with glReadPixels.
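For example, something along these lines (fbo, the rect variables and pixels are placeholder names):

    glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
    glReadBuffer(GL_COLOR_ATTACHMENT0);
    glPixelStorei(GL_PACK_ALIGNMENT, 1);   /* avoid row-padding surprises */
    glReadPixels(rectX, rectY, rectW, rectH, GL_RGBA, GL_UNSIGNED_BYTE, pixels);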

OpenGL/DirectX: How does Mipmapping improve performance?

I understand mipmapping pretty well. What I do not understand (on a hardware/driver level) is how mipmapping improves the performance of an application (at least this is often claimed). The driver does not know until the fragment shader is executed which mipmap level is going to be accessed, so all mipmap levels need to be present in VRAM anyway, or am I wrong?
What exactly is causing the performance improvement?
You are no doubt aware that each texel in the lower LODs of the mip-chain covers a higher percentage of the total texture image area, correct?
When you sample a texture at a distant location the hardware will use a lower LOD. When this happens, the sample neighborhood necessary to resolve minification becomes smaller, so fewer (uncached) fetches are necessary. It is all about the amount of memory that actually has to be fetched during texture sampling, and not the amount of memory occupied (assuming you are not running into texture thrashing).
I think this probably deserves a visual representation, so I will borrow the following diagram from the excellent series of tutorials at arcsynthesis.org.
On the left, you see what happens when you naïvely sample at a single LOD all of the time (this diagram shows linear minification filtering, by the way), and on the right you see what happens with mipmapping. Not only does it improve image quality by more closely matching the fragment's effective size, but because there are fewer texels in the lower mipmap LODs, they can be cached much more efficiently.
Mipmaps are useful at least for two reasons:
visual quality - scenes look much better in the distance; there is more blur, which usually looks better than flickering pixels. Additionally, anisotropic filtering can be used, which improves visual quality a lot.
performance - since for distant objects we can use smaller textures, the whole operation should be faster; sometimes a whole mip level can fit in the texture cache. This improves cache coherency.
great discussion from arcsynthesis about performance
In general mipmaps need only about 33% more memory, so it is a low cost for better quality and a potential performance gain. Note that the real performance improvement should be measured for your particular scene structure.
see info here: http://www.tomshardware.com/reviews/ati,819-2.html
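For completeness, a minimal sketch of enabling mipmapped (trilinear) minification, which is what produces the cache-friendly sampling described above (width, height and pixels are placeholder names):

    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    glGenerateMipmap(GL_TEXTURE_2D);   /* builds the full mip chain */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);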

graphics: best performance with floating point accumulation images

I need to speed up some particle system eye candy I'm working on. The eye candy involves additive blending, accumulation, and trails and glow on the particles. At the moment I'm rendering by hand into a floating point image buffer, converting to unsigned chars at the last minute then uploading to an OpenGL texture. To simulate glow I'm rendering the same texture multiple times at different resolutions and different offsets. This is proving to be too slow, so I'm looking at changing something. The problem is, my dev hardware is an Intel GMA950, but the target machine has an Nvidia GeForce 8800, so it is difficult to profile OpenGL stuff at this stage.
I did some very unscientific profiling and found that most of the slow down is coming from dealing with the float image: scaling all the pixels by a constant to fade them out, and converting the float image to unsigned chars and uploading to the graphics hardware. So, I'm looking at the following options for optimization:
Replace floats with uint32's in a fixed point 16.16 configuration
Optimize float operations using SSE2 assembly (image buffer is a 1024*768*3 array of floats)
Use OpenGL Accumulation Buffer instead of float array
Use OpenGL floating-point FBO's instead of float array
Use OpenGL pixel/vertex shaders
Have you any experience with any of these possibilities? Any thoughts, advice? Something else I haven't thought of?
The problem is simply the sheer amount of data you have to process.
Your float buffer is 9 megabytes in size, and you touch the data more than once. Most likely your rendering loop looks somewhat like this:
Clear the buffer
Render something on it (uses reads and writes)
Convert to unsigned bytes
Upload to OpenGL
That's a lot of data to move around, and the cache can't help you much because the image is much larger than the cache. Let's assume you touch every pixel five times. If so, you move 45 MB of data in and out of slow main memory. 45 MB does not sound like much data, but consider that almost every memory access will be a cache miss. The CPU will spend most of its time waiting for the data to arrive.
If you want to stay on the CPU to do the rendering there's not much you can do. Some ideas:
Using SSE for non-temporal loads and stores may help, but it will complicate your task quite a bit (you have to align your reads and writes).
Try breaking your rendering up into tiles, e.g. do everything in smaller rectangles (256*256 or so). The idea behind this is that you actually get a benefit from the cache: after you've cleared your rectangle, for example, the entire block will be in the cache, so rendering it and converting it to bytes will be a lot faster because there is no need to fetch the data from the relatively slow main memory any more (see the sketch after this list).
Last resort: Reduce the resolution of your particle effect. This will give you a good bang for the buck at the cost of visual quality.
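A rough sketch of the tiling idea from the second point above, assuming the 1024*768*3 float buffer from the question (the clear/render/convert helpers are hypothetical):

    enum { W = 1024, H = 768, TILE = 256 };

    for (int ty = 0; ty < H; ty += TILE) {
        for (int tx = 0; tx < W; tx += TILE) {
            int tw = (tx + TILE > W) ? W - tx : TILE;
            int th = (ty + TILE > H) ? H - ty : TILE;
            clear_tile(buffer, tx, ty, tw, th);              /* this block is now cached */
            render_particles_tile(buffer, tx, ty, tw, th);
            convert_tile_to_bytes(buffer, bytes, tx, ty, tw, th);
        }
    }
    /* 'bytes' can then be uploaded to the texture in one go. */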
The best solution is to move the rendering onto the graphics card. Render-to-texture functionality is standard these days. It's a bit tricky to get it working with OpenGL because you have to decide which extension to use, but once you have it working, performance is no longer an issue.
Btw - do you really need floating-point render targets? If you can get away with 3 bytes per pixel you will see a nice performance improvement.
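If you go the render-to-texture route, a minimal sketch of a floating-point colour attachment looks roughly like this (assumes floating-point texture and FBO support, which the target GeForce 8800 has; width and height are placeholder names):

    GLuint tex, fbo;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16F, width, height, 0,
                 GL_RGBA, GL_FLOAT, NULL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);
    /* Particles are then blended into this FBO and the result is sampled as a
     * normal texture - no CPU-side float buffer or unsigned-char conversion. */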
It's best to move the rendering calculation for massive particle systems like this over to the GPU, which has hardware optimized to do exactly this job as fast as possible.
Aaron is right: represent each individual particle with a sprite. You can calculate the movement of the sprites in space (eg, accumulate their position per frame) on the CPU using SSE2, but do all the additive blending and accumulation on the GPU via OpenGL. (Drawing sprites additively is easy enough.) You can handle your trails and blur either by doing it in shaders (the "pro" way), rendering to an accumulation buffer and back, or simply generate a bunch of additional sprites on the CPU representing the trail and throw them at the rasterizer.
Try to replace the manual code with sprites: an OpenGL texture with an alpha of, say, 10%. Then draw lots of them on the screen (ten of them in the same place to get the full glow).
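A minimal sketch of that sprite approach (drawGlowSprite stands in for drawing one textured quad; the names are placeholders):

    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE);   /* additive blending */
    glDepthMask(GL_FALSE);               /* glow sprites shouldn't write depth */

    for (int i = 0; i < 10; ++i)         /* ~10 overlapping low-alpha sprites = full glow */
        drawGlowSprite(x, y, size);

    glDepthMask(GL_TRUE);
    glDisable(GL_BLEND);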
If by "manual" you mean that you are using the CPU to poke pixels, then I think pretty much anything you can do that draws textured polygons with OpenGL instead will represent a huge speedup.
