I have an array buffer. Parts of the buffer need to be changed, and parts do not. If the parts that need changing are contiguous, a bufferSubData call covering just that range is more efficient than updating the whole buffer, including bytes that do not need to change. The problem is when the bytes that need changing are far apart within the buffer, with many unchanged bytes between them. Is it better to make a separate bufferSubData call for each chunk that needs updating, or is it better to make one call that needlessly updates the bytes in between as well? How costly is a bufferSubData call compared to updating one more byte of data?
Assuming bufferSubData is routed to the native glBufferSubData provided by the driver, it is best avoided altogether. It is known to be extremely slow in some mobile GPU drivers. See this page for reference (search for glBufferSubData).
I have run into extreme glBufferSubData slowness on quite recent mobile GPUs used in Android devices (Mali, PowerVR, Adreno). Interestingly, not on PowerVR GPUs used with iOS, which clearly points to a software issue. The practical approach that seems to run well everywhere is replacing the whole buffer with glBufferData once per frame (or as few times as possible, combining the data for multiple draw calls).
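For example, a minimal sketch of that whole-buffer replacement, written against the native GL ES entry points that bufferSubData/bufferData map to (the handle and container names here are just for illustration):

    #include <GLES2/gl2.h>
    #include <vector>

    // 'vbo' is an existing buffer handle; 'cpuData' is a CPU-side copy of the
    // whole vertex data that is edited freely between frames.
    void UploadWholeBuffer(GLuint vbo, const std::vector<float>& cpuData)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        // Instead of patching individual dirty ranges with glBufferSubData
        // (which stalls badly on some mobile drivers), re-specify the whole
        // store once per frame. Keeping the same size and usage lets the
        // driver orphan the old storage rather than synchronize with it.
        glBufferData(GL_ARRAY_BUFFER,
                     static_cast<GLsizeiptr>(cpuData.size() * sizeof(float)),
                     cpuData.data(),
                     GL_DYNAMIC_DRAW);
    }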
With DirectX/C++, suppose you are drawing the same model many times to the screen. You can do this with DrawIndexedInstanced(). You need to set the size of the instance buffer when you create it:
D3D11_BUFFER_DESC_instance.ByteWidth = sizeof(struct_with_instance_data) * instance_count;
If instance_count can vary between a low and a high value, is it customary to create the buffer with the max value (max_instance_count) and only draw what is required?
Wouldn't that permanently use a lot of memory?
Is recreating the buffer a slow solution?
What are good methods?
Thank you.
All methods have pros and cons.
Create the buffer with the max count — as you pointed out, you'll consume extra memory.
Create some initial count and implement exponential growth — you can run out of memory at runtime, and growing large buffers may cause spikes in the profiler.
There's another way which may or may not work for you, depending on the application. You can create a reasonably large fixed-size buffer and, to render more instances, call DrawIndexedInstanced multiple times in a loop, replacing the data in the buffer between calls. This works well if the source data is generated at runtime from something else; you'll need to rework that part to produce fixed-size batches (except the last one) instead of the complete buffer. It won't work if the data in the buffer needs to persist across frames, e.g. if you update it with a compute shader.
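A rough sketch of that fixed-size batching approach in D3D11 might look like the following; the instance layout, kMaxInstancesPerBatch and the buffer-creation assumptions are illustrative, not taken from the question:

    #include <d3d11.h>
    #include <DirectXMath.h>
    #include <algorithm>
    #include <cstring>
    #include <vector>

    // Illustrative instance layout and batch size; adjust to your own data.
    struct InstanceData { DirectX::XMFLOAT4X4 world; };
    constexpr UINT kMaxInstancesPerBatch = 1024;

    // 'instanceBuffer' is assumed to have been created once with
    // D3D11_USAGE_DYNAMIC, D3D11_CPU_ACCESS_WRITE and
    // ByteWidth = sizeof(InstanceData) * kMaxInstancesPerBatch.
    void DrawAllInstances(ID3D11DeviceContext* ctx,
                          ID3D11Buffer* instanceBuffer,
                          const std::vector<InstanceData>& instances,
                          UINT indexCountPerInstance)
    {
        for (size_t offset = 0; offset < instances.size(); offset += kMaxInstancesPerBatch)
        {
            const UINT count = static_cast<UINT>(
                std::min<size_t>(kMaxInstancesPerBatch, instances.size() - offset));

            // WRITE_DISCARD hands the CPU fresh storage, so it never waits for
            // the GPU to finish reading the previous batch.
            D3D11_MAPPED_SUBRESOURCE mapped = {};
            if (FAILED(ctx->Map(instanceBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
                return;
            std::memcpy(mapped.pData, instances.data() + offset, count * sizeof(InstanceData));
            ctx->Unmap(instanceBuffer, 0);

            // Same vertex/index buffers every iteration; only the instance data changes.
            ctx->DrawIndexedInstanced(indexCountPerInstance, count, 0, 0, 0);
        }
    }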
I've been told that buffer orphaning (i.e. calling glBufferData() with NULL for the data argument) allows us to avoid stalls that occur if the GPU is trying to read a buffer object while the CPU is trying to write to it.
What I'm not clear on is whether we can use this approach without glMapBuffer*(), or whether glMapBuffer*() is integral to the idea of orphaning and stall avoidance? I ask this purely to avoid making unnecessary changes to an existing codebase, though I understand glMapBuffer*() to be an inherently better choice than repeated glBufferData() in the long run.
(Please respond specific to OpenGL ES 2.0, unless the answer is general across GL versions.)
The answer was in the official OpenGL wiki:
The first way is to call glBufferData with a NULL pointer, and the exact same size and usage hints it had before. This allows the implementation to simply reallocate storage for that buffer object under-the-hood. Since allocating storage is (likely) faster than the implicit synchronization, you gain significant performance advantages over synchronization. And since you passed NULL, if there wasn't a need for synchronization to begin with, this can be reduced to a no-op. The old storage will still be used by the OpenGL commands that have been sent previously. If you continue to use the same size over-and-over, it is likely that the GL driver will not be doing any allocation at all, but will just be pulling an old free block off the unused buffer queue and use it (though of course this isn't guaranteed), so it is likely to be very efficient.
You can do the same thing when using glMapBufferRange with the GL_MAP_INVALIDATE_BUFFER_BIT. You can also use glInvalidateBufferData, where available.
So I suppose it's safe to assume that only glBufferData() calls are needed for orphaning.
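For illustration, a minimal orphaning sketch against the native ES 2.0 entry points (the function and parameter names are made up), assuming a dynamic buffer whose size stays the same every frame:

    #include <GLES2/gl2.h>

    // 'vbo' holds per-frame dynamic vertices; 'bytes' and 'vertices' describe
    // this frame's data.
    void OrphanAndRefill(GLuint vbo, GLsizeiptr bytes, const void* vertices)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        // Orphan: same size, same usage, NULL data. The driver can hand back
        // new storage instead of waiting for the GPU to finish with the old one.
        glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_DYNAMIC_DRAW);

        // Fill the freshly orphaned storage; this does not have to synchronize
        // with draw calls still reading the previous storage.
        glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, vertices);
    }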
I was wondering whether it is a good idea to create a "system" wide rendering server that is responsible for the rendering of all application elements. Currently, applications usually have their own context, meaning that whatever data might be identical across different applications will be duplicated in GPU memory, and the more frequent resource management calls only decrease the number of usable render calls. From what I understand, the OpenGL execution engine/server itself is sequential/single threaded in design. So technically, everything that might be reused across applications, especially heavy stuff like bitmap or geometry caches for text and UI, is just clogging the server with unnecessary transfers and memory usage.
Are there any downsides to having a scenegraph shared across multiple applications? Naturally, assuming the correct handling of clients which accidentally freeze.
I was wondering whether it is a good idea to create a "system" wide rendering server that is responsible for the rendering of all application elements.
That depends on the task at hand. A small detour: take a web browser, for example, where JavaScript performs manipulations on the DOM, and CSS transforms and SVG elements define graphical elements. Each piece of JavaScript called in response to an event may run as a separate thread/lightweight process. In a sense the web browser is a rendering engine (heck, they're internally even called rendering engines) for a whole bunch of applications.
And for that it's a good idea.
And in general display servers are a very good thing. Just have a look at X11, which has an incredible track record. These days Wayland is all the hype, and a lot of people drank the Kool-Aid, but you actually want the abstraction of a display server. However, not for the reasons you think. The main reason to have a display server is to avoid redundant code (not redundant data), to have only a single entity deal with the dirty details (color spaces, physical device properties), and to provide optimized higher-order drawing primitives.
But with regard to the direct use of OpenGL, none of those considerations matter:
Currently, applications usually have their own context, meaning that whatever data might be identical across different applications
So? Memory is cheap. And you don't gain performance by coalescing duplicate data, because the only thing that matters for performance is the memory bandwidth required to process that data. That bandwidth doesn't change, because it depends only on the internal structure of the data, which coalescing leaves unchanged.
In fact, deduplication creates significant overhead: when one application makes changes that are not supposed to affect the other application, a copy-on-write operation has to be invoked, which is not free and usually means a full copy. While that copy is being made, the memory bandwidth is consumed.
However, for a small, selected change in the data of one application, with each application having its own copy, the memory bus is blocked for a much shorter time.
will be duplicated in GPU memory, and the more frequent resource management calls only decrease the number of usable render calls.
Resource management and rendering normally do not interfere with each other. While the GPU is busy turning scalar values into points, lines and triangles, the driver on the CPU can do the housekeeping. In fact, a lot of performance is gained by having the CPU do non-rendering-related work while the GPU is busy rendering.
From what I understand, the OpenGL execution engine/server itself is sequential/single threaded in design
Where did you read that? There's no such constraint/requirement in the OpenGL specifications, and real OpenGL implementations (= drivers) are free to parallelize as much as they want.
just clogging the server with unnecessary transfers and memory usage.
Transfer happens only once, when the data gets loaded. Memory bandwidth consumption is unchanged by deduplication. And memory is so cheap these days that data deduplication simply isn't worth the effort.
Are there any downsides to having a scenegraph shared across multiple applications? Naturally, assuming the correct handling of clients which accidentally freeze.
I think you completely misunderstand the nature of OpenGL. OpenGL is not a scene graph. There's no scene and there are no models in OpenGL. Each application has its own layout of data, and eventually this data gets passed into OpenGL to draw pixels onto the screen.
To OpenGL however there are just drawing commands to turn arrays of scalar values into points, lines and triangles on the screen. There's nothing more to it.
Assuming the texture, vertex, and shader data are already on the graphics card, you don't need to send much data to the card: a few bytes to identify the data, presumably a 4x4 matrix, and some assorted other parameters.
So where is all of the overhead coming from? Do the operations require a handshake of some sort with the GPU?
Why is sending a single mesh containing a bunch of small models, calculated on the CPU, often faster than sending the vertex IDs and transformation matrices? (The second option looks like it should send less data, unless the models are smaller than a 4x4 matrix.)
First of all, I'm assuming that with "draw calls", you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).
Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.
The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.
Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry to the GPU as you could have.
There are some real costs to making draw calls: each one requires setting up a bunch of state (which set of vertices to use, which shader to use and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating your calls that set state).
But the main cost of draw calls only applies if each call submits too little data, since this will cause you to be CPU-bound and stop you from utilizing the hardware fully.
Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.
You should read the nVidia presentation Batch Batch Batch!, it's fairly old but covers exactly this topic.
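To make the batching idea concrete, here is a rough GL-flavoured sketch (the vertex layout and names are assumptions, and attribute setup is omitted): instead of issuing one draw call per small model, the CPU-transformed vertices of all models are appended into one buffer and submitted with a single call.

    #include <GLES2/gl2.h>
    #include <vector>

    struct Vertex { float x, y, z; };  // assumed layout

    // One upload and one draw for many small models, instead of one draw each.
    void DrawBatched(GLuint vbo, const std::vector<std::vector<Vertex>>& models)
    {
        std::vector<Vertex> batch;
        for (const auto& m : models)
            batch.insert(batch.end(), m.begin(), m.end());

        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER,
                     static_cast<GLsizeiptr>(batch.size() * sizeof(Vertex)),
                     batch.data(), GL_STREAM_DRAW);
        glDrawArrays(GL_TRIANGLES, 0, static_cast<GLsizei>(batch.size()));
    }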
Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive -- both because it implies the actual work is now being performed, and because it can incur a switch from user to kernel mode on the CPU (and back again), which is not that cheap.
Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.
More relevant is this blog post (and the follow-up posts here and here), which provide a good overview of the logical, low-level operational process of the GPU.
But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.
Short answer: The driver buffers some or all of the actual work until you call draw. This will show up as a relatively predictable amount of time spent in the draw call, depending on how much state has changed.
This is done for a few reasons:
to avoid doing unnecessary work: if you (unnecessarily) set the same state multiple times before drawing, this lets the driver avoid doing the expensive work each time it occurs. That actually becomes a fairly common occurrence in a large codebase, say a production game engine.
to be able to reconcile what are internally interdependent states, instead of processing them immediately with incomplete information
Alternate answer(s):
The buffer the driver uses to store rendering commands is full and the app is effectively waiting for the GPU to process some of the earlier work. This will typically show up as extremely large chunks of time blocking in a random draw call within a frame.
The number of frames that the driver is allowed to buffer up has been reached and the app is waiting on the GPU to process one of them. This will typically show up as a large chunk of time blocking in the first draw call within a frame, or on Present at the end of the previous frame.
I have a Direct3D 9 application and I would like to monitor the memory usage.
Is there a tool to know how much system and video memory is used by Direct3D?
Ideally, it would also report how much is allocated for textures, vertex buffers, index buffers...
You can use the old DirectDraw interface to query the total and available memory.
The numbers you get that way are not reliable though.
The free memory may change at any instant, and the available memory often takes AGP memory into account (which is strictly not video memory). You can use the numbers to make a good guess about the default texture resolutions and detail level of your application/game, but that's it.
You may wonder why there is no way to get better numbers; after all, it can't be that hard to track resource usage.
From an application point of view that is correct. You may think that video memory just contains surfaces, textures, index and vertex buffers and some shader programs, but that's not true on the low-level side.
There are lots of other resources as well. All of these are created and managed by the Direct3D driver to make rendering as fast as possible. Among others, there are hierarchical z-buffer acceleration structures and pre-compiled command lists (e.g. the data required to render something in the format understood by the GPU). The driver may also queue rendering commands for multiple frames in advance to even out the frame rate and increase parallelism between the GPU and CPU.
The driver also does a lot of work under the hood for you. Heuristics are used to detect draw calls with static geometry and constant rendering settings. A driver may decide to optimize the geometry in these cases for better cache usage. All of this happens in parallel and under the control of the driver, and all of this stuff needs space as well, so the free memory may change at any time.
However, the driver also does caching for your resources, so you don't really need to know the resource usage in the first place.
If you need more space than is available, that's no problem. The driver will move the data between system RAM, AGP memory and video RAM for you. In practice you never have to worry about running out of video memory. Sure, once you need more video memory than is available, performance will suffer, but that's life :-)
Two suggestions:
You can call GetAvailableTextureMem at various times to obtain a (rough) estimate of how overall memory usage progresses.
Assuming you develop on nVidia hardware, PerfHUD includes a graphical representation of consumed AGP/VID memory (separated).
You probably won't be able to obtain a nice clean matrix of memory consumers (vertex buffers etc.) vs. memory location (AGP, VID, system), as -
(1) the driver has a lot of freedom in transferring resources between memory types, and
(2) the actual variety of memory consumers is far greater than the exposed D3D interfaces.
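For the first suggestion above, a rough sketch of polling GetAvailableTextureMem (the helper and its logging are just for illustration):

    #include <d3d9.h>
    #include <cstdio>

    // Call this at interesting points (after loading a level, once per second,
    // ...) to get a rough trend. The value is approximate and rounded by the
    // runtime, so treat it as an estimate, not exact accounting.
    void LogAvailableTextureMem(IDirect3DDevice9* device)
    {
        const UINT availableBytes = device->GetAvailableTextureMem();
        std::printf("Approx. available texture memory: %u MB\n",
                    availableBytes / (1024u * 1024u));
    }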