Delayed reading from generated texture in OpenGL - performance

I use a shader program to generate some data in a texture in OpenGL, and I want to read the data back in from OpenGL to use it on the CPU. Normally, of course, reading texture data involves flushing the pipeline, so that the data is actually there and ready, with obvious consequences for performance.
However, I don't actually need the data immediately, and could just as well wait until it's ready and then read it. Is there any way to do this? I guess I could perhaps wait until I'm swapping buffers anyway and read the data then, but would that cause any performance issues of its own (due to having to flush twice or something)? Is there any other way to do it?

Asynchronous image data transfers can be done with Pixel Buffer Objects. The idea is that you create a PBO and initiate the texture readback into it, and the GL will do the transfer asynchronously. It only has to force a sync if you try to access the PBO before the transfer is completed.
You can further combine this with a fence sync object created via glFenceSync() and query whether the transfer has completed before trying to map or read back the PBO; if it hasn't, do something else on the CPU instead of wasting time waiting.
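A minimal sketch of that pattern in C (desktop GL, current context assumed; the function names, the RGBA8 format and the texture/size parameters are placeholders, not anything from the question):

    #include <GL/glew.h>   /* or whichever loader you use */
    #include <string.h>

    static GLuint pbo;
    static GLsync fence;

    /* Kick off an asynchronous readback of an RGBA8 texture into a PBO and
       drop a fence right after it so completion can be polled later. */
    void start_readback(GLuint tex, int w, int h)
    {
        if (!pbo) {
            glGenBuffers(1, &pbo);
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
            glBufferData(GL_PIXEL_PACK_BUFFER, w * h * 4, NULL, GL_STREAM_READ);
        } else {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
        }

        /* With a PIXEL_PACK buffer bound, the last argument is an offset into
           the PBO rather than a client pointer, so this returns immediately. */
        glBindTexture(GL_TEXTURE_2D, tex);
        glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, (void *)0);

        fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

    /* Returns 1 and fills `out` once the transfer has finished, 0 otherwise. */
    int try_read_result(void *out, size_t size)
    {
        GLint status = GL_UNSIGNALED;
        glGetSynciv(fence, GL_SYNC_STATUS, sizeof(status), NULL, &status);
        if (status != GL_SIGNALED)
            return 0;                    /* not ready yet - do other CPU work */

        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
        void *p = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, size, GL_MAP_READ_BIT);
        if (p) {
            memcpy(out, p, size);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        glDeleteSync(fence);
        return 1;
    }

Calling start_readback right after the texture is generated and polling try_read_result on later frames (or at swap time) keeps the CPU from ever blocking on the transfer.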

Related

How to detect all GPU/CPU transfers in pytorch?

I have a large pytorch project. How do I check if the project is performing any unexpected transfers of data from the GPU to the CPU or back?
As I understand it, GPU/CPU transfers are very costly for performance, and they can easily happen by accident if you are not very careful with your code. For example, calling .item() on a tensor on the GPU will cause that tensor to be transferred back to CPU, while blocking CPU execution.
Since there are a lot of ways to do this by accident, I need a reliable way to find all the places in my code where data is transferred, so that I can prevent costly performance losses.
Ideally, I would be able to specify a section of code with a Context Manager that says "GPU/CPU transfers are allowed inside this block only. Any attempt to transfer data between GPU and CPU outside of this block causes an error."

glBufferSubData is very slow on many Android devices

I have allocated about 2M of GL buffer storage to share, and I update the vertex and index data with glBufferSubData. It works fine on my iOS devices, but when I test it on my Android devices it is very, very slow.
I found some notes on the official reference page:
https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glBufferSubData.xhtml
what does "that rendering must drain from the pipeline before the data store can be updated" really mean?
The performance difference you're seeing is likely not simply an iOS/Android difference but will be very specific to both your usage of the API and the implementation of glBufferSubData in the driver. Without seeing more code, or knowing what performance metrics you're gathering, it's hard to comment further.
what does "that rendering must drain from the pipeline before the data
store can be updated" really mean?
The idea here is that whilst the OpenGL API gives the illusion that each command is executed to completion before continuing, in fact, drawing is generally batched up and done asynchronously in the background. The problem here is that glBufferSubData is potentially adding a synchronisation point, which will mean that the driver will have to stall until all previous rendering using that buffer has completed before continuing.
Consider the following example. In a good case, we might have something like this:
glBufferSubData into buffer 1 with ABCDE
Draw call using buffer 1
glBufferSubData into buffer 2 with FGHIJ
Draw call using buffer 2
Swap buffers <----- Synchronisation point, the driver must wait for rendering to finish before swapping the buffers
However if you're overwriting the same buffer, you will get this instead.
glBufferSubData into buffer 1 with ABCDE
Draw call using buffer 1
glBufferSubData into buffer 1, overwriting with FGHIJ <----- Synchronisation point, as the driver must ensure that the buffer has finished being used by first draw call before modifying the data
Draw call using updated buffer 1
Swap buffers <----- Synchronisation point, the driver must wait for rendering to finish before swapping the buffers
As you can see, you can potentially end up with a second synchronisation point. However as mentioned before, this is somewhat driver specific. For example some drivers might be able to detect the case where the section of the buffer you're updating isn't in use by the previous draw call, whilst others might not. Something of this nature is probably what's causing the performance difference you're seeing.
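Translating the two cases above into rough GL ES calls (a sketch only: the vertex attribute setup and the actual vertex data are omitted, and the buffer handles and sizes are placeholders):

    #include <GLES3/gl3.h>

    /* Good case: each update goes into a different buffer, so no pending
       draw call is still using the buffer being written. */
    void frame_good(GLuint buf1, GLuint buf2,
                    const void *abcde, const void *fghij, GLsizeiptr size)
    {
        glBindBuffer(GL_ARRAY_BUFFER, buf1);
        glBufferSubData(GL_ARRAY_BUFFER, 0, size, abcde);
        glDrawArrays(GL_TRIANGLES, 0, 3);            /* draw using buffer 1 */

        glBindBuffer(GL_ARRAY_BUFFER, buf2);
        glBufferSubData(GL_ARRAY_BUFFER, 0, size, fghij);
        glDrawArrays(GL_TRIANGLES, 0, 3);            /* draw using buffer 2 */
    }

    /* Bad case: the second glBufferSubData overwrites buffer 1 while the
       first draw may still reference it, so the driver may have to stall
       (or make a hidden copy) before it can update the data store. */
    void frame_bad(GLuint buf1,
                   const void *abcde, const void *fghij, GLsizeiptr size)
    {
        glBindBuffer(GL_ARRAY_BUFFER, buf1);
        glBufferSubData(GL_ARRAY_BUFFER, 0, size, abcde);
        glDrawArrays(GL_TRIANGLES, 0, 3);

        glBufferSubData(GL_ARRAY_BUFFER, 0, size, fghij);  /* <-- sync point */
        glDrawArrays(GL_TRIANGLES, 0, 3);
    }

Rotating through two or more buffers (or re-specifying the store with glBufferData to "orphan" it) is the usual way to keep the driver from having to wait.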

Process THREE.Texture in webworkers

I would like to process THREE.Texture in web workers to help with post-processing effects such as bloom. For instance, for the bloom effect, I would first draw the scene to a THREE.Texture object and then handle the blur in the web worker. What would be the most efficient way to pass the THREE.Texture data to the worker and create a new THREE.Texture from the data obtained from the worker? Since I would ideally do that 60 times per second, I need a fast and memory-friendly way to do it (memory-friendly: not creating new objects in a loop but rather re-using existing objects).
I'm aware that canvas2DContext.getImageData may be helpful, but that's probably not the best way, since I'd be drawing to a canvas 60 times per second and that would slow things down.
Thanks!
PS: I should specify that in this approach, I don't intend to wait for the worker to finish processing the texture before rendering the final result. Since most of the objects are static, I don't think that would be a big deal anyway. I want to test it to see how it goes for the dynamic objects, though.
Passing a GPU-based texture to a web worker would not speed anything up; in fact, it would be significantly slower.
It's extremely slow to transfer memory from the GPU to the CPU (and CPU to GPU as well) relative to doing everything on the GPU. The only way to pass the contents of a texture to a worker is to ask WebGL to copy it from the GPU to the CPU (using gl.readPixels, or whatever three.js's wrapper for gl.readPixels is) and then transfer the result to the worker. Then, in the worker, all you could do is a slow CPU-based blur (slower than it would have been on the GPU), after which you'd have to transfer it back to the main thread only to upload it again via gl.texImage2D (or by telling three.js to do it for you), which is also a slow operation, copying the data from the CPU back to the GPU.
The fast way to apply a blur is to do it on the GPU.
Further, there is no way to share WebGL resources between the main thread and a worker, nor is there likely to be any time soon. Even if you could share the resource and ask the GPU to do the blur from the worker, it would save no time, since for the most part GPUs don't run different operations in parallel (they are not generically multi-process like CPUs), so all you'd end up doing is asking the GPU to do the same amount of work.

glDrawArrays - guaranteed to copy?

Following scenario:
I have a buffer of vertices in system memory whose address I submit to glDrawArrays via glVertexAttribPointer every frame. The rendering API is OpenGL ES 3.0.
Now my question:
Can I assume, that glDrawArrays will create a full copy of the buffer on every draw call? Or is it possible that it will draw from the buffer directly if I'm on a shared memory platform?
Regards
For all practical purposes you need to treat it as "it will draw from the buffer directly".
When setting the pointer you do not tell OpenGL anything about the size of the buffer, so at that point nothing is recorded except a pointer (or rather an integer value, since the same call is used with GPU vertex buffers).
The data size is only determined when you issue the draw call and say how many vertices are used from the buffer. At that point I would not expect OpenGL to copy the data into its internal memory, but even if it does, it is more like a temporary cache, nothing more. The data is not reused across render calls and must be read again each time.
You need to keep the data alive in your own memory and it must remain accessible. If the memory is no longer owned, the draw call may render garbage, or even crash if you no longer have access to the memory you pointed at. So, for instance, setting the pointer to data allocated on the stack of a function that has since returned is generally a no-no.
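To make the ownership point concrete, here is a small sketch in C (OpenGL ES 3.0, current context and default vertex array object assumed; attribute location 0 and the triangle data are made up for illustration):

    #include <GLES3/gl3.h>

    /* WRONG: the array lives on this function's stack, so the pointer that
       glVertexAttribPointer recorded dangles as soon as the function returns.
       A later glDrawArrays may read garbage or crash. */
    void setup_attrib_wrong(void)
    {
        GLfloat verts[] = { -0.5f, -0.5f,  0.5f, -0.5f,  0.0f, 0.5f };
        glBindBuffer(GL_ARRAY_BUFFER, 0);        /* client-side vertex array */
        glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, verts);
        glEnableVertexAttribArray(0);
    }

    /* OK: the data outlives the draw call because it has static storage.
       A heap allocation freed only after drawing works just as well. */
    static const GLfloat persistent_verts[] = {
        -0.5f, -0.5f,  0.5f, -0.5f,  0.0f, 0.5f
    };

    void setup_attrib_ok(void)
    {
        glBindBuffer(GL_ARRAY_BUFFER, 0);
        glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, persistent_verts);
        glEnableVertexAttribArray(0);
    }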
Can I assume that glDrawArrays will create a full copy of the buffer on every draw call?
Graphics drivers try very hard not to copy bulk data buffers - it's horribly slow and energy intensive. The entire point of using buffers rather than client-side arrays is that they can be uploaded to the graphics server once and subsequently just referenced by the draw call without needing a copy or (on desktop) a transfer over PCIe into graphics RAM.
Rendering will behave as if the buffer has been copied (e.g. the output must reflect the state of the buffer at the point the draw call was made, even if it is subsequently modified). However, in most cases no copy is actually needed.
A badly written application can force copies to be taken; for example, if you modify a buffer (e.g. calling glBufferSubData) immediately after submitting a draw using it, the driver may need to create a new version of the buffer as the original data is likely still referenced by the draw you just queued. Well written applications try to pipeline their resource updates so this doesn't happen because it is normally fatal for application rendering performance ...
See https://community.arm.com/graphics/b/blog/posts/mali-performance-6-efficiently-updating-dynamic-resources for a longer explanation.

why are draw calls expensive?

Assuming the texture, vertex, and shader data are already on the graphics card, you don't need to send much data to the card: a few bytes to identify the data, presumably a 4x4 matrix, and some assorted other parameters.
So where is all of the overhead coming from? Do the operations require a handshake of some sort with the GPU?
Why is sending a single mesh containing a bunch of small models, calculated on the CPU, often faster than sending the vertex IDs and transformation matrices? (The second option looks like it should send less data, unless the models are smaller than a 4x4 matrix.)
First of all, I'm assuming that with "draw calls", you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).
Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.
The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.
Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry to the GPU as you could have.
There are some real costs to making draw calls: each one requires setting up a bunch of state (which set of vertices to use, which shader to use, and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating the calls that set state).
But the main cost of draw calls only applies if each call submits too little data, since this will make you CPU-bound and stop you from utilizing the hardware fully.
Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.
You should read the nVidia presentation Batch Batch Batch!; it's fairly old but covers exactly this topic.
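To illustrate the batching point, here is a hypothetical C sketch (current GL context assumed; model_loc is the location of a per-object matrix uniform, and how the per-instance matrices reach the vertex shader in the batched version is left out):

    #include <GL/glew.h>

    /* Many small draw calls: one uniform update plus one glDrawArrays per
       object, so the CPU-side submission cost quickly dominates. */
    void draw_unbatched(GLint model_loc, const GLfloat (*matrices)[16],
                        int object_count, int verts_per_object)
    {
        for (int i = 0; i < object_count; ++i) {
            glUniformMatrix4fv(model_loc, 1, GL_FALSE, matrices[i]);
            glDrawArrays(GL_TRIANGLES, i * verts_per_object, verts_per_object);
        }
    }

    /* One instanced draw call: the per-object matrices live in GPU memory
       (e.g. an instanced attribute or a uniform buffer), so the CPU submits
       a single command no matter how many objects there are. */
    void draw_batched(int object_count, int verts_per_object)
    {
        glDrawArraysInstanced(GL_TRIANGLES, 0, verts_per_object, object_count);
    }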
Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive -- both because it implies the actual work is now being performed, and because it can incur a switch from user to kernel mode on the chip (and back again), which is not that cheap.
Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.
More relevant is this blog post (and the follow-up posts here and here), which provide a good overview of the logical, low-level operational process of the GPU.
But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.
Short answer: the driver buffers some or all of the actual work until you call draw. This will show up as a relatively predictable amount of time spent in the draw call, depending on how much state has changed.
This is done for a few reasons:
to avoid doing unnecessary work: if you (unnecessarily) set the same state multiple times before drawing, the driver can avoid doing the expensive work each time this occurs. This is actually a fairly common occurrence in a large codebase, say a production game engine.
to be able to reconcile what internally are interdependent states instead of processing them immediately with incomplete information
Alternate answer(s):
The buffer the driver uses to store rendering commands is full and the app is effectively waiting for the GPU to process some of the earlier work. This will typically show up as extremely large chunks of time blocking in a random draw call within a frame.
The number of frames that the driver is allowed to buffer up has been reached and the app is waiting on the GPU to process one of them. This will typically show up as a large chunk of time blocking in the first draw call within a frame, or on Present at the end of the previous frame.
