glDrawArrays - guaranteed to copy? - opengl-es

Following scenario:
I have a buffer of vertices in system memory whose address I submit to glDrawArrays via glVertexAttribPointer every frame. The rendering API is OpenGL ES 3.0.
Now my question:
Can I assume that glDrawArrays will create a full copy of the buffer on every draw call? Or is it possible that it will draw from the buffer directly if I'm on a shared-memory platform?
Regards

For all practical purposes you need to treat it as "it will draw from the buffer directly".
When setting the pointer you do not tell OpenGL anything about the size of the buffer, so at that point nothing is recorded but the pointer itself, or rather an integer value, since the same call is used with GPU vertex buffers, where the last argument is an offset instead of an address.
The data size is only determined when you call draw and say how many vertices to read from the buffer. Even at that point I would not expect OpenGL to copy the data into its own memory, and even if it does, it is more like a temporary cache, nothing more. The data is not reused across draw calls; it has to be read from your pointer again each time.
So you need to keep the data in your own memory, and it must stay accessible. If it has been freed, the draw call may render garbage, or even crash if you no longer have access to the memory you pointed at. For instance, setting the pointer from a method/function that allocates the data on its stack is generally a no-no.
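A minimal sketch of that lifetime rule, assuming ES 3.0 with the default vertex array object and no buffer bound to GL_ARRAY_BUFFER (all names here are illustrative, not from the question):

    #include <GLES3/gl3.h>

    // Client-side array: GL keeps only the pointer, so this memory must remain
    // valid at least until the draw call that reads it has been issued.
    static const float s_positions[9] = {
         0.0f,  0.5f, 0.0f,
        -0.5f, -0.5f, 0.0f,
         0.5f, -0.5f, 0.0f,
    };

    void DrawTriangleFromClientMemory()
    {
        glBindBuffer(GL_ARRAY_BUFFER, 0);    // no VBO: attribute 0 reads client memory
        glEnableVertexAttribArray(0);
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, s_positions);
        glDrawArrays(GL_TRIANGLES, 0, 3);    // safe: s_positions outlives the call
    }

    // The "no-no" case: the pointer outlives the data it points at.
    void SetPointerToStackData()
    {
        float local[9] = { 0.0f };           // stack storage
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, local);
    }                                        // 'local' dies here; a later draw reads garbage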

Can I assume that glDrawArrays will create a full copy of the buffer on every draw call?
Graphics drivers try very hard not to copy bulk data buffers - it's horribly slow and energy intensive. The entire point of using buffers rather than client-side arrays is that they can be uploaded to the graphics server once and subsequently just referenced by the draw call without needing a copy or (on desktop) a transfer over PCIe into graphics RAM.
Rendering will behave as if the buffer has been copied (e.g. the output must reflect the state of the buffer at the point the draw call was made, even if it is subsequently modified). However, in most cases no copy is actually needed.
A badly written application can force copies to be taken; for example, if you modify a buffer (e.g. calling glBufferSubData) immediately after submitting a draw using it, the driver may need to create a new version of the buffer as the original data is likely still referenced by the draw you just queued. Well written applications try to pipeline their resource updates so this doesn't happen because it is normally fatal for application rendering performance ...
See https://community.arm.com/graphics/b/blog/posts/mali-performance-6-efficiently-updating-dynamic-resources for a longer explanation.
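For contrast, a rough sketch of the upload-once pattern described above (illustrative names; the vertex data is copied into a buffer object at load time, and every subsequent draw just references it by offset):

    #include <GLES3/gl3.h>

    static GLuint s_vbo = 0;

    // One-time copy of the vertex data into GL-owned storage.
    void CreateStaticVertexBuffer(const void* vertices, GLsizeiptr bytes)
    {
        glGenBuffers(1, &s_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, s_vbo);
        glBufferData(GL_ARRAY_BUFFER, bytes, vertices, GL_STATIC_DRAW);
    }

    // Per-frame draw: no re-upload; the last pointer argument is now an offset
    // into the buffer object rather than a CPU address.
    void DrawEveryFrame(GLsizei vertexCount)
    {
        glBindBuffer(GL_ARRAY_BUFFER, s_vbo);
        glEnableVertexAttribArray(0);
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (const void*)0);
        glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    }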

Related

DirectX11 - Buffer size of instanced vertices, with various size

With DirectX/C++, suppose you are drawing the same model many times to the screen. You can do this with DrawIndexedInstanced(). You need to set the size of the instance buffer when you create it:
    D3D11_BUFFER_DESC_instance.ByteWidth = sizeof(struct_with_instance_data) * instance_count;
If the instance_count can vary between a low and high value, is it customary to create the buffer with the max value (max_instance_count)? And only draw what is required.
Wouldn't that permanently use a lot of memory?
Is recreating the buffer a slow solution?
What are good methods?
Thank you.
All methods have pros and cons.
Create with the max count — as you pointed out, you'll consume extra memory.
Create with some initial count and implement exponential growth — you can OOM at runtime, and growing large buffers may cause spikes in the profiler.
There's another way which may or may not work for you, depending on the application. You can create a reasonably large fixed-size buffer and, to render more instances, call DrawIndexedInstanced multiple times in a loop, replacing the data in the buffer between calls. This works well if the source data is generated at runtime from something else; you'll need to rework that part to produce fixed-size batches (except the last one) instead of the complete buffer. It won't work if the data in the buffer needs to persist across frames, e.g. if you update it with a compute shader.
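A hedged sketch of that fixed-size-batch loop, assuming the instance buffer was created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE; InstanceData and kMaxInstances are illustrative names, not from the original question:

    #include <d3d11.h>
    #include <algorithm>
    #include <cstring>

    struct InstanceData { float world[16]; };          // illustrative per-instance payload

    static const UINT kMaxInstances = 1024;            // fixed capacity chosen up front

    void DrawInstancedInBatches(ID3D11DeviceContext* ctx, ID3D11Buffer* instanceBuffer,
                                const InstanceData* instances, UINT total,
                                UINT indexCountPerInstance)
    {
        for (UINT first = 0; first < total; first += kMaxInstances)
        {
            UINT count = std::min(kMaxInstances, total - first);

            // Refill the same buffer; WRITE_DISCARD lets the driver hand back fresh
            // memory instead of stalling on the batch still being drawn.
            D3D11_MAPPED_SUBRESOURCE mapped = {};
            if (SUCCEEDED(ctx->Map(instanceBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
            {
                memcpy(mapped.pData, instances + first, count * sizeof(InstanceData));
                ctx->Unmap(instanceBuffer, 0);
                ctx->DrawIndexedInstanced(indexCountPerInstance, count, 0, 0, 0);
            }
        }
    }

Each batch reuses the same GPU allocation, so memory stays bounded at kMaxInstances regardless of how many instances a frame needs.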

glBufferSubData is very slow on many Android devices

I have allocated about 2 MB of GL buffers to share, and I update the vertex and index data with glBufferSubData. It works fine on my iOS devices, but when I test it on my Android devices it is very, very slow.
I have found some notes from the official website:
https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glBufferSubData.xhtml
what does "that rendering must drain from the pipeline before the data store can be updated" really mean?
The performance difference you're seeing is likely not simply an iOS/Android difference but will be very specific to both your usage of the API and the implementation of glBufferSubData in the driver. Without seeing more code, or knowing what performance metrics you're gathering, it's hard to comment further.
what does "that rendering must drain from the pipeline before the data
store can be updated" really mean?
The idea here is that whilst the OpenGL API gives the illusion that each command is executed to completion before continuing, in fact, drawing is generally batched up and done asynchronously in the background. The problem here is that glBufferSubData is potentially adding a synchronisation point, which will mean that the driver will have to stall until all previous rendering using that buffer has completed before continuing.
Consider the following example. In a good case, we might have something like this:
glBufferSubData into buffer 1 with ABCDE
Draw call using buffer 1
glBufferSubData into buffer 2 with FGHIJ
Draw call using buffer 2
Swap buffers <----- Synchronisation point, the driver must wait for rendering to finish before swapping the buffers
However if you're overwriting the same buffer, you will get this instead.
glBufferSubData into buffer 1 with ABCDE
Draw call using buffer 1
glBufferSubData into buffer 1, overwriting with FGHIJ <----- Synchronisation point, as the driver must ensure that the buffer has finished being used by first draw call before modifying the data
Draw call using updated buffer 1
Swap buffers <----- Synchronisation point, the driver must wait for rendering to finish before swapping the buffers
As you can see, you can potentially end up with a second synchronisation point. However as mentioned before, this is somewhat driver specific. For example some drivers might be able to detect the case where the section of the buffer you're updating isn't in use by the previous draw call, whilst others might not. Something of this nature is probably what's causing the performance difference you're seeing.
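One common way to get back to the "good case" with a single logical stream of data is to ping-pong between two buffers, so each glBufferSubData targets a buffer the GPU is no longer reading. A minimal sketch, assuming both buffers were created up front with glBufferData at full size (names are illustrative):

    #include <GLES3/gl3.h>

    static GLuint s_buffers[2];     // both allocated once with glGenBuffers + glBufferData
    static int    s_current = 0;

    void UpdateAndDraw(const void* vertexData, GLsizeiptr bytes, GLsizei vertexCount)
    {
        glBindBuffer(GL_ARRAY_BUFFER, s_buffers[s_current]);

        // This buffer was last used a frame ago, so the driver is far less likely
        // to have to drain the pipeline before letting us overwrite it.
        glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, vertexData);

        glEnableVertexAttribArray(0);
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (const void*)0);
        glDrawArrays(GL_TRIANGLES, 0, vertexCount);

        s_current = 1 - s_current;  // next update goes to the other buffer
    }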

webgl bufferSubData call cost vs cost of transfering bytes

I have an array buffer. Parts of the buffer need to be changed and parts do not. If the parts that need changing are contiguous, a bufferSubData call covering just that range is more efficient than updating the whole buffer, including bytes that do not need to change. The problem is when the bytes that need changing are far apart within the buffer, with many unchanged bytes between them. Is it better to make two bufferSubData calls, one for each chunk that needs updating, or is it better to make one call that unnecessarily updates the bytes in between as well? How costly is a bufferSubData call versus updating one more byte of data?
Assuming bufferSubData is routed to the native glBufferSubData provided by the driver, it is best avoided altogether. It is known to be extremely slow in some mobile GPU drivers. See this page for reference (search for glBufferSubData).
I have run into extreme glBufferSubData slowness on quite recent mobile GPUs used in Android devices (Mali, PowerVR, Adreno). Interestingly, not on the PowerVR GPUs used with iOS, which clearly indicates a software issue. The practical approach that seems to run well everywhere is replacing the whole buffer with glBufferData once per frame (or as few times as possible, combining the data for multiple draw calls).
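A minimal sketch of that whole-buffer replacement, written against GL ES in C++ (the WebGL equivalent is gl.bufferData); re-specifying the full data store lets the driver orphan the old allocation rather than synchronise with draws that are still in flight:

    #include <GLES3/gl3.h>

    void ReplaceWholeBuffer(GLuint vbo, const void* combinedData, GLsizeiptr bytes)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        // Full re-specification once per frame: the driver can attach a fresh
        // allocation ("orphaning") instead of waiting for pending draw calls.
        glBufferData(GL_ARRAY_BUFFER, bytes, combinedData, GL_DYNAMIC_DRAW);
    }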

Rendering in DirectX 11

When frame starts, I do my logical update and render after that.
In my render code I do the usual stuff. I set a few states, buffers, and textures, and end by calling Draw.
    m_deviceContext->Draw(nbVertices, 0);
At frame end I call present to show rendered frame.
    // Present the back buffer to the screen since rendering is complete.
    if (m_vsync_enabled)
    {
        // Lock to screen refresh rate.
        m_swapChain->Present(1, 0);
    }
    else
    {
        // Present as fast as possible.
        m_swapChain->Present(0, 0);
    }
Usual stuff. Now, when I call Draw, according to MSDN
Draw submits work to the rendering pipeline.
Does that mean the data is sent to the GPU and the main thread (the one that called Draw) continues? Or does it wait for rendering to finish?
In my opinion, only the Present function should make the main thread wait for rendering to finish.
There are a number of calls which can trigger the GPU to start working, Draw being one. Others include Dispatch, CopyResource, etc. What the MSDN docs are trying to say is that stuff like PSSetShader, IASetPrimitiveTopology, etc. doesn't really do anything until you call Draw.
When you call Present that is taken as an implicit indicator of 'end of frame', but your program can often continue on with setting up rendering calls for the next frame well before the first frame is done and showing. By default, Windows will let you queue up to 3 frames ahead before blocking your CPU thread on the Present call to let the GPU catch up; in real-time rendering you usually don't want the latency between input and render to be really high.
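If that queued-frame latency matters for your application, one knob is IDXGIDevice1::SetMaximumFrameLatency, which caps how many frames the driver may buffer before Present blocks. A hedged sketch (error handling trimmed):

    #include <d3d11.h>
    #include <dxgi.h>

    // Limit how many frames Windows may queue before Present() blocks the CPU.
    HRESULT LimitFrameLatency(ID3D11Device* device, UINT maxFrames)
    {
        IDXGIDevice1* dxgiDevice = nullptr;
        HRESULT hr = device->QueryInterface(__uuidof(IDXGIDevice1),
                                            reinterpret_cast<void**>(&dxgiDevice));
        if (SUCCEEDED(hr))
        {
            hr = dxgiDevice->SetMaximumFrameLatency(maxFrames);   // e.g. 1 for lowest latency
            dxgiDevice->Release();
        }
        return hr;
    }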
The fact is, however, that GPU/CPU synchronization is complicated, and the Direct3D runtime is also batching up requests to minimize kernel-call overhead, so the actual work could be happening after many Draws have been submitted to the command queue. This old article gives you the flavor of how this works. On modern GPUs, you can also have various memory operations for paging in memory, setting up physical video memory areas, etc.
BTW, all this 'magic' doesn't exist with Direct3D 12, but that means the application has to do everything at the 'right' time to ensure it is both efficient and functional. The programmer is much more directly building up command queues, triggering work on the various pixel and compute GPU engines, and doing all the messy stuff that is handled a little more abstractly and automatically by Direct3D 11's runtime. Even so, the video driver is ultimately the one actually talking to the hardware, so it can do other kinds of optimizations as well.
The general rules of thumb here to keep in mind:
Creating resources is expensive, especially runtime shader compilation (by the HLSL compiler) and runtime shader blob optimization (by the driver).
Copying resources to the GPU (i.e. loading texture data from CPU memory) requires bus bandwidth, which is limited in supply: prefer to keep texture, VB, and IB data in static buffers you reuse.
Copying resources from the GPU (i.e. moving GPU memory to CPU memory) uses a back-channel that is slower than going to the GPU: try to avoid the need for readback from the GPU.
Submitting larger chunks of geometry per Draw call helps to amortize overhead (i.e. calling draw once for 10,000 triangles with the same state/shader is much faster than calling draw 10 times for 1,000 triangles each with changing state/shaders between).

why are draw calls expensive?

Assuming the texture, vertex, and shader data are already on the graphics card, you don't need to send much data to the card. There are a few bytes to identify the data, presumably a 4x4 matrix, and some assorted other parameters.
So where is all of the overhead coming from? Do the operations require a handshake of some sort with the GPU?
Why is sending a single mesh containing a bunch of small models, calculated on the CPU, often faster than sending the vertex IDs and transformation matrices? (The second option looks like it should send less data, unless the models are smaller than a 4x4 matrix.)
First of all, I'm assuming that with "draw calls", you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).
Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.
The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.
Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry to the GPU as you could have.
There are some real costs to making draw calls: they require setting up a bunch of state (which set of vertices to use, which shader to use and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating your calls that set state).
But the main cost of draw calls only applies if each call submits too little data, since this will cause you to be CPU-bound and stop you from utilizing the hardware fully.
Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.
You should read the nVidia presentation Batch Batch Batch!, it's fairly old but covers exactly this topic.
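To make the "too little data per call" point concrete, here is a hedged GL-flavoured sketch of the two submission patterns contrasted above (SmallModel, modelMatrixLoc, and the pre-merged vertex buffer are illustrative assumptions):

    #include <GLES3/gl3.h>

    struct SmallModel { GLint firstVertex; GLsizei vertexCount; float transform[16]; };

    // CPU-bound pattern: one state change plus one draw call per tiny model.
    void DrawOneByOne(const SmallModel* models, int count, GLint modelMatrixLoc)
    {
        for (int i = 0; i < count; ++i)
        {
            glUniformMatrix4fv(modelMatrixLoc, 1, GL_FALSE, models[i].transform);
            glDrawArrays(GL_TRIANGLES, models[i].firstVertex, models[i].vertexCount);
        }
    }

    // Batched pattern: the models were pre-transformed on the CPU into one large
    // vertex buffer, so a single call submits all of them with one set of state.
    void DrawBatched(GLsizei totalVertexCount)
    {
        glDrawArrays(GL_TRIANGLES, 0, totalVertexCount);
    }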
Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive -- both because it implies the actual work is now being performed, and because it can incur a switch from user to kernel mode on the chip (and back again), which is not that cheap.
Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.
More relevant is this blog post (and the follow-up posts here and here), which provide a good overview of the logical, low-level operational process of the GPU.
But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.
Short answer: The driver buffers some or all of the actual work until you call draw. This will show up as a relatively predictable amount of time spent in the draw call, depending on how much state has changed.
This is done for a few reasons:
to avoid doing unnecessary work: if you (unnecessarily) set the same state multiple times before drawing, the driver can avoid doing the expensive work each time this occurs. This is actually a fairly common occurrence in a large codebase, say a production game engine.
to be able to reconcile what internally are interdependent states instead of processing them immediately with incomplete information
Alternate answer(s):
The buffer the driver uses to store rendering commands is full and the app is effectively waiting for the GPU to process some of the earlier work. This will typically show up as extremely large chunks of time blocking in a random draw call within a frame.
The number of frames that the driver is allowed to buffer up has been reached and the app is waiting on the GPU to process one of them. This will typically show up as a large chunk of time blocking in the first draw call within a frame, or on Present at the end of the previous frame.
