glBufferSubData is very slow on many Android devices - opengl-es

I have allocated about 2M of GL buffers to share, and I update the vertex and index data with glBufferSubData. This works fine on my iOS devices, but when I test it on my Android devices it is very, very slow.
I have found some notes from the official website:
https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glBufferSubData.xhtml
What does "that rendering must drain from the pipeline before the data store can be updated" really mean?

The performance difference you're seeing is likely not simply an iOS/Android difference but will be very specific to both your usage of the API and the implementation of glBufferSubData in the driver. Without seeing more code, or knowing what performance metrics you're gathering, it's hard to comment further.
What does "that rendering must drain from the pipeline before the data store can be updated" really mean?
The idea here is that whilst the OpenGL API gives the illusion that each command is executed to completion before continuing, in fact, drawing is generally batched up and done asynchronously in the background. The problem here is that glBufferSubData is potentially adding a synchronisation point, which will mean that the driver will have to stall until all previous rendering using that buffer has completed before continuing.
Consider the following example. In a good case, we might have something like this:
glBufferSubData into buffer 1 with ABCDE
Draw call using buffer 1
glBufferSubData into buffer 2 with FGHIJ
Draw call using buffer 2
Swap buffers <----- Synchronisation point, the driver must wait for rendering to finish before swapping the buffers
However if you're overwriting the same buffer, you will get this instead.
glBufferSubData into buffer 1 with ABCDE
Draw call using buffer 1
glBufferSubData into buffer 1, overwriting with FGHIJ <----- Synchronisation point, as the driver must ensure that the buffer has finished being used by first draw call before modifying the data
Draw call using updated buffer 1
Swap buffers <----- Synchronisation point, the driver must wait for rendering to finish before swapping the buffers
As you can see, you can potentially end up with a second synchronisation point. However as mentioned before, this is somewhat driver specific. For example some drivers might be able to detect the case where the section of the buffer you're updating isn't in use by the previous draw call, whilst others might not. Something of this nature is probably what's causing the performance difference you're seeing.
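One common way to sidestep the second synchronisation point is to rotate through several buffers, so that a buffer is only rewritten once the draws reading from it have had time to finish. The sketch below is a minimal CPU-side model of that bookkeeping and makes no GL calls. BufferRing, its method names, and the frame-numbering convention (frames start at 1, and 0 means "nothing yet") are assumptions made for illustration. In real code the "last completed frame" would have to come from something like a fence, and each slot would map to one buffer object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping for a ring of GPU buffers (no GL calls here).
// Each slot stands in for one buffer object.
class BufferRing {
public:
    explicit BufferRing(std::size_t slots) : lastUse_(slots, 0) {
        assert(slots > 0);
    }

    // Round-robin choice of slot for a given frame number (frames start at 1).
    std::size_t slotFor(std::uint64_t frame) const {
        return static_cast<std::size_t>(frame % lastUse_.size());
    }

    // True if writing to the slot now could touch data that a frame the GPU
    // has not finished yet may still be reading. lastCompletedFrame == 0
    // means the GPU has not finished any frame.
    bool wouldStall(std::size_t slot, std::uint64_t lastCompletedFrame) const {
        assert(slot < lastUse_.size());
        return lastUse_[slot] > lastCompletedFrame;
    }

    // Record that draws submitted in `frame` read from this slot.
    void markUsed(std::size_t slot, std::uint64_t frame) {
        assert(slot < lastUse_.size());
        lastUse_[slot] = frame;
    }

private:
    std::vector<std::uint64_t> lastUse_;  // 0 = slot never used
};
```

With a single slot every update lands on the buffer the previous draw used, which is the bad case above. With three slots, frame 4 reuses the slot from frame 1, which has usually been completed by then.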

Related

webgl bufferSubData call cost vs cost of transfering bytes

I have an array buffer. Parts of the buffer need to be changed, and parts do not. If the parts that need to change are contiguous, a single call to bufferSubData covering just that part is more efficient than updating the whole buffer, including bytes that do not need to change. The problem arises when the bytes that need changing are far apart within the buffer, with many unchanged bytes in between. Is it better to make separate bufferSubData calls, one for each chunk that needs updating, or is it better to make one call that unnecessarily updates the bytes in between as well? How costly is a bufferSubData call versus updating one more byte of data?
Assuming bufferSubData is routed to the native glBufferSubData provided by the driver, it is best avoided altogether. It is known to be extremely slow in some mobile GPU drivers. See this page for reference (search for glBufferSubData).
I have run into extreme glBufferSubData slowness in quite recent mobile GPUs used on Android devices (Mali, PowerVR, Adreno). Interestingly, not on PowerVR GPUs used with iOS, which clearly indicates a software issue. The practical approach which seems to run well everywhere is replacing the whole buffer with glBufferData once per frame (or as few times as possible, combining the data for multiple draw calls).
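The trade-off in the question (per-call overhead versus redundant bytes) can be made tunable by merging dirty ranges whose gap is below a threshold and issuing one upload per merged range. This is a minimal sketch that does no GL work itself. It assumes a hypothetical Range struct and a maxGap threshold, which would have to be found by measuring on the target devices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of a modified byte range inside one buffer.
struct Range {
    std::size_t offset;
    std::size_t size;
};

// Merge dirty ranges that overlap, touch, or are separated by at most
// maxGap unchanged bytes. Each returned range would become one upload call.
std::vector<Range> coalesceDirtyRanges(std::vector<Range> dirty, std::size_t maxGap) {
    std::sort(dirty.begin(), dirty.end(),
              [](const Range& a, const Range& b) { return a.offset < b.offset; });

    std::vector<Range> merged;
    for (const Range& r : dirty) {
        if (r.size == 0) continue;  // nothing to upload for an empty range
        if (!merged.empty()) {
            Range& last = merged.back();
            const std::size_t lastEnd = last.offset + last.size;
            if (r.offset <= lastEnd || r.offset - lastEnd <= maxGap) {
                // Close enough: extend the previous range over the gap.
                const std::size_t end = std::max(lastEnd, r.offset + r.size);
                last.size = end - last.offset;
                continue;
            }
        }
        merged.push_back(r);
    }
    assert(merged.size() <= dirty.size());
    return merged;
}
```

With maxGap set to 0 only touching or overlapping ranges merge. With a very large maxGap everything collapses into one range, which approaches the replace-the-whole-buffer strategy described above.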

Rendering in DirectX 11

When frame starts, I do my logical update and render after that.
In my render code I do the usual stuff. I set a few states, buffers, and textures, and end by calling Draw.
m_deviceContext->Draw(
    nbVertices,
    0);
At frame end I call present to show rendered frame.
// Present the back buffer to the screen since rendering is complete.
if(m_vsync_enabled)
{
    // Lock to screen refresh rate.
    m_swapChain->Present(1, 0);
}
else
{
    // Present as fast as possible.
    m_swapChain->Present(0, 0);
}
Usual stuff. Now, when I call Draw, according to MSDN
Draw submits work to the rendering pipeline.
Does it mean that the data is sent to the GPU and the main thread (the one that called Draw) continues? Or does it wait for rendering to finish?
In my opinion, only the Present function should make the main thread wait for rendering to finish.
There are a number of calls which can trigger the GPU to start working, Draw being one. Others include Dispatch, CopyResource, etc. What the MSDN docs are trying to say is that calls like PSSetShader, IASetPrimitiveTopology, etc. don't really do anything until you call Draw.
When you call Present, that is taken as an implicit indicator of 'end of frame', but your program can often continue setting up rendering calls for the next frame well before the first frame is done and showing. By default, Windows will let you queue up to 3 frames ahead before blocking your CPU thread on the Present call to let the GPU catch up; in real-time rendering you usually don't want the latency between input and render to be very high.
The fact is, however, that GPU/CPU synchronization is complicated, and the Direct3D runtime is also batching up requests to minimize kernel-call overhead, so the actual work could be happening after many Draws are submitted to the command queue. This old article gives you the flavor of how this works. On modern GPUs, you can also have various memory operations for paging in memory, setting up physical video memory areas, etc.
BTW, all this 'magic' doesn't exist with Direct3D 12, but that means the application has to do everything at the 'right' time to ensure it is both efficient and functional. The programmer is much more directly building up command queues, triggering work on various pixel and compute GPU engines, and doing all the messy stuff that is handled a little more abstractly and automatically by Direct3D 11's runtime. Even so, ultimately the video driver is the one actually talking to the hardware, so it can do other kinds of optimizations as well.
The general rules of thumb here to keep in mind:
Creating resources is expensive, especially runtime shader compilation (by the HLSL compiler) and runtime shader blob optimization (by the driver).
Copying resources to the GPU (i.e. loading texture data from the CPU memory) requires bus bandwidth that is limited in supply: Prefer to keep textures, VB, and IB data in Static buffers you reuse.
Copying resources from the GPU (i.e. moving GPU memory to CPU memory) uses a backchannel that is slower than going to the GPU: try to avoid the need for readback from the GPU
Submitting larger chunks of geometry per Draw call helps to amortize overhead (i.e. calling Draw once for 10,000 triangles with the same state/shader is much faster than calling Draw 10 times for 1,000 triangles each while changing state/shaders in between).

Delayed reading from generated texture in OpenGL

I use a shader program to generate some data in a texture in OpenGL, and I want to read the data back in from OpenGL to use it on the CPU. Normally, of course, reading texture data involves flushing the pipeline, so that the data is actually there and ready, with obvious consequences for performance.
However, I don't actually need the data immediately, and could just as well wait until it's ready and then read it. Is there any way to do this? I guess I could wait until I'm swapping buffers anyway and read the data then, but would this still cause performance issues (due to having to flush twice or something)? Is there any other way to do it?
Asynchronous image data transfers can be done with Pixel Buffer Objects. The idea is that you create a PBO and initiate the texture readback into it, and the GL will do the transfer asynchronously. It will only have to force a sync if you try to access the PBO before the transfer is completed.
You could further combine this with a fence sync object via glFenceSync() and actually query whether the transfer has completed before trying to map/read back the PBO, and if not, do something else on the CPU instead of wasting time waiting.
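The poll-before-read pattern can be sketched independently of GL. Below, PendingReadback and ReadbackQueue are hypothetical names. The isSignaled callback stands in for a non-blocking fence check (in real code, presumably a zero-timeout glClientWaitSync on the object returned by glFenceSync), and consume stands in for mapping the PBO and copying the data out. This is a sketch of the control flow only, not a complete readback implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// One outstanding readback. Both callbacks are placeholders (assumptions):
//   isSignaled: non-blocking "has the fence been reached?" check
//   consume:    map the PBO and use the data
struct PendingReadback {
    std::function<bool()> isSignaled;
    std::function<void()> consume;
};

class ReadbackQueue {
public:
    void enqueue(PendingReadback r) {
        assert(r.isSignaled && r.consume);
        pending_.push_back(std::move(r));
    }

    // Call once per frame. Consumes finished readbacks in submission order
    // and returns how many were consumed. It never waits on an unfinished
    // one, so the CPU is free to do other work and try again later.
    int poll() {
        int done = 0;
        while (!pending_.empty() && pending_.front().isSignaled()) {
            pending_.front().consume();
            pending_.pop_front();
            ++done;
        }
        return done;
    }

    std::size_t outstanding() const { return pending_.size(); }

private:
    std::deque<PendingReadback> pending_;
};
```

The queue stops at the first unfinished entry because GL commands complete in order, so later fences cannot be signaled before earlier ones.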

DirectX 11 Swap Chain with 7 back buffers

I have a proprietary media player that runs on Windows 8 in desktop mode. The runtime DirectX version is 11, but native graphics driver support is for DirectX 9.
On some computers with the exact same setup, I see that the actual swap chain's back buffers count is 2, and the performance is great, and on some others the back buffer count is 7 and there are frames dropped.
I don't have the source code of that player and wonder what could be the reason for determining the different back buffer count number in runtime.
Can someone please explain why such backbuffer count leads to such change in performance? Or just point me to relevant documentation that explains the implications of backbuffers number?
(More debugging info: Using GPUView I see that when backbuffer count is 2 the hardware works in a synchronized mode, i.e. one packet in the HW queue in each second VSync (Clip frame rate is 30fps), when for the 7 backbuffers the work is done for 5-7 frames together, then some empty VSyncs, then 5-7 frames again and so on).
Thank you in advance!
I don't really see the use of having more than 4 buffers (quad buffering, which is used for stereoscopy). Most applications use 2 buffers (double buffering) so that the application can start drawing the next frame to the second (back) buffer while the first (front) buffer is being presented to the monitor, otherwise, the application will have to wait until the front buffer is finished drawing to the screen before it can start drawing the next frame. Triple buffering just expands on this idea, so that there are two back buffers. This way, if the application is able to finish drawing an entire buffer faster than the front buffer takes to be presented to the screen, then it can start drawing the next frame to the third buffer instead of waiting for the front buffer to finish presenting.
I'm not sure if that really answers your question about other apps using 7 buffers, but again I don't think there's a need, since monitors usually only refresh at a rate of 60 to 75Hz.
If your application is running that fast that it is able to draw 2 buffers before the first buffer is finished presenting, just put the app to sleep until the front buffer is finished to give some other programs a chance to use the cpu, or spend that extra time doing some other processing for your app. If it's a media player, you could spend that extra time doing some more expensive operations to increase the quality of the media's playback.
Here's a link describing buffering, though it doesn't talk about more than 4 buffers, probably because there is no need.
http://en.wikipedia.org/wiki/Multiple_buffering
P.S.
Maybe the reason the application loses some frame rate when using 7 buffers is that it can't keep up writing to all of the buffers before they need to be presented to the screen. This probably wouldn't be the case if multi-threading were being used, because then the next buffer could be presented to the screen before the app finished drawing to all the other back buffers.
Well, I got an answer from Microsoft. This is in order to save power when working on DC (battery) - that way the processor can be awake for processing all available buffers, send them to GPU to work on and move to a deeper power saving mode for a longer time.

why are draw calls expensive?

Assuming the texture, vertex, and shader data are already on the graphics card, you don't need to send much data to the card. There are a few bytes to identify the data, presumably a 4x4 matrix, and some assorted other parameters.
So where is all of the overhead coming from? Do the operations require a handshake of some sort with the GPU?
Why is sending a single mesh containing a bunch of small models, calculated on the CPU, often faster than sending the vertex id and transformation matrices? (The second option looks like it should send less data, unless the models are smaller than a 4x4 matrix.)
First of all, I'm assuming that with "draw calls", you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).
Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.
The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.
Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry to the GPU as you could have.
There are some real costs to making draw calls: each one requires setting up a bunch of state (which set of vertices to use, what shader to use and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating your calls that set state).
But the main cost of draw calls only applies if each call submits too little data, since this will cause you to be CPU-bound and stop you from utilizing the hardware fully.
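To make the CPU-bound argument concrete, here is a toy cost model. It assumes that the CPU and GPU run fully in parallel, that each draw call costs a fixed amount of CPU time, and that each triangle costs a fixed amount of GPU time. Those are simplifications, and FrameCost, estimate, and any per-call or per-triangle figures you plug in are hypothetical, not measurements.

```cpp
#include <algorithm>
#include <cassert>

// Toy model: CPU submission and GPU rendering overlap, so the frame takes
// as long as the slower of the two.
struct FrameCost {
    double cpuMs;  // time the CPU spends submitting draw calls
    double gpuMs;  // time the GPU spends rendering triangles

    double frameMs() const { return std::max(cpuMs, gpuMs); }
    bool cpuBound() const { return cpuMs > gpuMs; }
};

// cpuMsPerCall and gpuMsPerTriangle are made-up inputs for illustration.
FrameCost estimate(int drawCalls, int trianglesPerCall,
                   double cpuMsPerCall, double gpuMsPerTriangle) {
    assert(drawCalls >= 0 && trianglesPerCall >= 0);
    const double triangles = static_cast<double>(drawCalls) * trianglesPerCall;
    return FrameCost{drawCalls * cpuMsPerCall, triangles * gpuMsPerTriangle};
}
```

For the same 20,000 triangles, estimate(10000, 2, 0.01, 0.00001) is CPU-bound under these made-up constants, while estimate(10, 2000, 0.01, 0.00001) is not. The GPU work is identical in both cases and only the submission overhead changes.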
Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.
You should read the nVidia presentation Batch Batch Batch!, it's fairly old but covers exactly this topic.
Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive -- both because it implies the actual work is now being performed, and because it can incur a switch from user to kernel mode on the chip (and back again), which is not that cheap.
Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.
More relevant is this blog post (and the follow-up posts here and here), which provide a good overview of the logical, low-level operational process of the GPU.
But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.
Short answer: The driver buffers some or all of the actual work until you call draw. This will show up as a relatively predictable amount of time spent in the draw call, depending on how much state has changed.
This is done for a few reasons:
to avoid doing unnecessary work: If you (unnecessarily) set the same state multiple times before drawing it can avoid doing expensive work each time this occurs. This actually becomes a fairly common occurrence in a large codebase, say a production game engine.
to be able to reconcile what internally are interdependent states instead of processing them immediately with incomplete information
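The first reason can be sketched as a tiny deferred state cache. State setters only record what was requested, and the expensive work happens at draw time, and only if the value actually differs from what is currently bound. DeferredShaderState and its members are hypothetical names. A real driver tracks far more state than one shader id, so treat this as an illustration of the idea rather than how any particular driver does it.

```cpp
#include <cassert>

// Hypothetical model of deferring state changes until draw time.
class DeferredShaderState {
public:
    // Cheap: just remember the request, do no real work yet.
    void setShader(int shaderId) { requested_ = shaderId; }

    // Applies the pending state only if it differs from what is bound.
    // Returns true if the (expensive) bind actually happened.
    bool draw() {
        if (hasBound_ && bound_ == requested_) {
            return false;  // redundant sets were filtered out for free
        }
        bound_ = requested_;
        hasBound_ = true;
        ++bindCount_;  // stands in for the costly validation/translation work
        return true;
    }

    int bindCount() const { return bindCount_; }

private:
    int requested_ = 0;
    int bound_ = 0;
    bool hasBound_ = false;
    int bindCount_ = 0;
};
```

Setting the same shader three times and then drawing costs one bind. Setting a different shader and switching back before the next draw costs none, because only the state in effect at draw time matters.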
Alternate answer(s):
The buffer the driver uses to store rendering commands is full and the app is effectively waiting for the GPU to process some of the earlier work. This will typically show up as extremely large chunks of time blocking in a random draw call within a frame.
The number of frames that the driver is allowed to buffer up has been reached and the app is waiting on the GPU to process one of them. This will typically show up as a large chunk of time blocking in the first draw call within a frame, or on Present at the end of the previous frame.
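The frame limit in the last case behaves like a bounded queue between the CPU and the GPU. A minimal model follows, in which FrameQueue and its method names are made up for illustration and the limit is just a constructor parameter, not a value taken from any specific driver.

```cpp
#include <cassert>

// Hypothetical model of the "frames in flight" limit.
class FrameQueue {
public:
    explicit FrameQueue(int maxFramesAhead) : max_(maxFramesAhead) {
        assert(maxFramesAhead > 0);
    }

    // True if the next Present (or the first draw call of the next frame)
    // would have to wait for the GPU to finish an earlier frame.
    bool presentWouldBlock() const { return queued_ >= max_; }

    // CPU side: a frame has been submitted and is now waiting on the GPU.
    void present() {
        assert(!presentWouldBlock());
        ++queued_;
    }

    // GPU side: the oldest queued frame has finished rendering.
    void gpuCompletedFrame() {
        assert(queued_ > 0);
        --queued_;
    }

    int queued() const { return queued_; }

private:
    int max_;
    int queued_ = 0;
};
```

While the GPU keeps up, presentWouldBlock() stays false and the CPU runs ahead. Once the queue is full, the CPU's time shows up as a stall at whichever call the driver chooses to block in.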

Resources