DirectX11 - Buffer size of instanced vertices with a varying instance count

With DirectX/C++, suppose you are drawing the same model many times to the screen. You can do this with DrawIndexedInstanced(). You need to set the size of the instance buffer when you create it:
D3D11_BUFFER_DESC instance_buffer_desc;
instance_buffer_desc.ByteWidth = sizeof(struct_with_instance_data) * instance_count;
If instance_count can vary between a low and a high value, is it customary to create the buffer with the maximum value (max_instance_count) and only draw what is required?
Wouldn't that permanently use a lot of memory?
Is recreating the buffer each time the count changes too slow to be a good solution?
What are good methods?
Thank you.
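For concreteness, here is a minimal sketch of the "size for the maximum, draw only what you need" approach the question describes. The names struct_with_instance_data and max_instance_count come from the question; everything else (the helper functions, the dynamic-usage choice, the per-instance layout) is an assumption for illustration, not the only way to set this up.

    #include <d3d11.h>
    #include <cstring>

    struct struct_with_instance_data { float world[16]; };   // per-instance payload; layout assumed

    // Create the instance buffer once, sized for the worst case.
    ID3D11Buffer* CreateInstanceBuffer(ID3D11Device* device, UINT max_instance_count)
    {
        D3D11_BUFFER_DESC desc = {};
        desc.ByteWidth      = sizeof(struct_with_instance_data) * max_instance_count;
        desc.Usage          = D3D11_USAGE_DYNAMIC;            // rewritten from the CPU as the count changes
        desc.BindFlags      = D3D11_BIND_VERTEX_BUFFER;
        desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

        ID3D11Buffer* buffer = nullptr;
        device->CreateBuffer(&desc, nullptr, &buffer);
        return buffer;
    }

    // Each frame, upload and draw only the instances actually needed.
    void DrawCurrentInstances(ID3D11DeviceContext* context, ID3D11Buffer* instance_buffer,
                              const struct_with_instance_data* instances, UINT instance_count,
                              UINT index_count_per_instance)
    {
        D3D11_MAPPED_SUBRESOURCE mapped;
        if (SUCCEEDED(context->Map(instance_buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        {
            memcpy(mapped.pData, instances, sizeof(struct_with_instance_data) * instance_count);
            context->Unmap(instance_buffer, 0);
        }
        // The buffer has room for max_instance_count instances, but only instance_count are drawn.
        context->DrawIndexedInstanced(index_count_per_instance, instance_count, 0, 0, 0);
    }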

All methods have pros and cons.
Create with the max count - as you pointed out, you'll permanently consume extra memory.
Create with some initial count and implement exponential growth - you can run out of memory at runtime, and growing large buffers may show up as spikes in the profiler.
There's another way which may or may not work for you, depending on the application. You can create a reasonably large fixed-size buffer and, to render more instances, call DrawIndexedInstanced multiple times in a loop, replacing the data in the buffer between calls. This works well if the source data is generated at runtime from something else; you'll need to rework that part to produce fixed-size batches (except the last one) instead of the complete buffer. It won't work if the data in the buffer needs to persist across frames, e.g. if you update it with a compute shader.
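As a rough illustration of that fixed-size-batch idea, here is a sketch building on the buffer and struct above; BATCH_SIZE and the helper name are made up for the example, and error handling is minimal.

    // Render total_count instances through a fixed-size dynamic instance buffer by
    // looping: refill the buffer (WRITE_DISCARD hands the driver a fresh region, so
    // earlier batches are not stalled on), draw, and repeat until everything is submitted.
    static const UINT BATCH_SIZE = 1024;   // capacity the fixed-size buffer was created with (assumed)

    void DrawInstancedInBatches(ID3D11DeviceContext* context, ID3D11Buffer* instance_buffer,
                                const struct_with_instance_data* instances, UINT total_count,
                                UINT index_count_per_instance)
    {
        for (UINT first = 0; first < total_count; first += BATCH_SIZE)
        {
            UINT count = (total_count - first < BATCH_SIZE) ? (total_count - first) : BATCH_SIZE;

            D3D11_MAPPED_SUBRESOURCE mapped;
            if (FAILED(context->Map(instance_buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
                return;
            memcpy(mapped.pData, instances + first, sizeof(struct_with_instance_data) * count);
            context->Unmap(instance_buffer, 0);

            context->DrawIndexedInstanced(index_count_per_instance, count, 0, 0, 0);
        }
    }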

Related

Memory footprint of VAOs

Can someone tell me how large VAOs are in CPU/GPU memory compared to VBOs? My plan was to allocate a large number of VAOs at program start as a pool, then assign them to certain render calls as needed.
Regards
VAOs are just metadata: buffer bindings and offsets. VBOs are the actual data buffers containing all of your vertex data. VAOs will be orders of magnitude smaller than VBOs.
That said, why do you want to create a pool of them? That implies you'll be recreating them at draw time, which somewhat defeats the point. The only real advantage of VAOs over just doing the binds and offsets directly is that you don't pay the runtime cost of re-specifying them all the time...
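To make the size difference concrete, a small sketch (assuming a GL 3+ context with a loader such as GLAD or GLEW already initialized; vertex_data, vertex_byte_count and vertex_count stand in for your own data):

    // The VAO records only binding/layout state; the actual bytes live in the VBO.
    GLuint vao = 0, vbo = 0;
    glGenVertexArrays(1, &vao);
    glGenBuffers(1, &vbo);

    glBindVertexArray(vao);                      // start capturing state into the VAO
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertex_byte_count, vertex_data, GL_STATIC_DRAW);  // the big allocation
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);   // just metadata
    glBindVertexArray(0);

    // At draw time, one cheap bind replaces redoing the attribute setup:
    glBindVertexArray(vao);
    glDrawArrays(GL_TRIANGLES, 0, vertex_count);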

webgl bufferSubData call cost vs cost of transfering bytes

I have an array buffer. Parts of the buffer need to be changed, and parts do not. If the parts that need changing are contiguous, a bufferSubData call covering just that range is more efficient than updating the whole buffer, including bytes that do not need to change. The problem is when the bytes that need changing are far apart within the buffer, with many unchanged bytes in between. Is it better to make two bufferSubData calls, one for each chunk that needs updating, or is it better to make one call that unnecessarily updates the bytes in between as well? How costly is a bufferSubData call versus updating one more byte of data?
Assuming bufferSubData is routed to the native glBufferSubData provided by the driver, it is best avoided altogether. It is known to be extremely slow in some mobile GPU drivers. See this page for reference (search for glBufferSubData).
I have run into extreme glBufferSubData slowness in quite recent mobile GPUs used on Android devices (Mali, PowerVR, Adreno). Interestingly, not on PowerVR GPUs used with iOS, which clearly indicates a software issue. The practical approach which seems to run well everywhere is replacing the whole buffer with glBufferData once per frame (or as few times as possible, combining the data for multiple draw calls).
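A sketch of that "replace the whole buffer once per frame" pattern in desktop-GL terms (vbo, byte_count and cpu_copy are placeholders for your own handle and staging data):

    // Keep one CPU-side copy of the buffer, edit whichever ranges changed, and
    // re-upload the whole thing once per frame instead of issuing several
    // glBufferSubData calls over scattered ranges. Passing the full size to
    // glBufferData also orphans the old storage, so the driver does not have to
    // stall on draws that still read the previous contents.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, byte_count, cpu_copy, GL_DYNAMIC_DRAW);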

Better to create a large number of static VBOs for static data up front, or stream data into the VBOs?

I need to render meshes that can be very dense. The vertex data is completely static. Apparently large VBOs are not great for performance (source: http://www.opengl.org/wiki/Vertex_Specification_Best_Practices#Size_of_a_VBO.2FIBO, although unfortunately it fails to link to its own source.) And even if large buffers were OK for performance, the total size of my vertex data sometimes exceeds what I can successfully allocate with glBufferData().
So assuming that I need to break my mesh into smaller VBOs of a few MB each, which of these methods is recommended:
1) Allocate enough buffers at startup to hold all of the mesh data. Rendering is then as simple as binding each buffer one at a time and calling glDrawArrays().
2) Allocate a small fixed pool of buffers at startup. Rendering would require filling up a buffer with a block of triangles, calling glDrawArrays(), filling up another buffer with the next block, calling glDrawArrays() again, and so on. So potentially a lot more CPU work.
3) Some other method I'm not thinking of.
Part of my question just comes down to how memory allocation with VBOs works — if I allocate enough small buffers at startup to hold all of the mesh data, am I going to run into the same memory limit that would prevent me from allocating a single buffer large enough to hold all of the data? Or are VBOs integrated with virtual memory such that OpenGL will handle swapping out VBOs when I exceed available graphics memory?
Finally, how much of this is actually implementation-dependent? Are there useful references from AMD/Intel/NVidia that explain best practices for buffer management?
I suggest you continue to allocate as few static buffers, with as many vertices each, as possible. If performance begins to suffer, you need to scale down the quality level (i.e. the number of vertices) of your model.
Actually, using too many SMALL buffers is bad for performance, because each render involves a certain amount of fixed overhead. The more vertices in a single buffer (within reason), the fewer times that initial setup needs to be done.
The reason large (as in very large) VBOs might not be good is that video cards have limited processing power and memory capacity, and swapping large amounts of memory back and forth between system RAM and video RAM can be expensive... NOT because another mechanism is better for rendering a very large model. I can't imagine why you are failing on glBufferData() unless your requirements are incredibly huge (vastly larger than 256 MB), or perhaps you are using the USHORT data type for indices but allocating more than 65,535 of them.
The 1MB - 4MB suggestion has to do with how you should optimize your models, and is really somewhat dependent on the generation of video card you are using. Most video cards support HUGE VBOs. I'm betting your problem is elsewhere.
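For the question's option 1, here is a rough sketch of slicing static data into fixed-size chunks up front (CHUNK_VERTICES, the MeshChunk struct and the flat float array are illustrative assumptions; a GL loader header is assumed to be included):

    #include <algorithm>
    #include <vector>

    struct MeshChunk { GLuint vbo; GLsizei vertex_count; };

    // Upload the static vertex data once at load time, split into chunks of a few MB each.
    std::vector<MeshChunk> UploadChunks(const float* vertices, size_t total_vertices,
                                        size_t floats_per_vertex)
    {
        const size_t CHUNK_VERTICES = 100000;          // tune so each chunk lands at a few MB
        std::vector<MeshChunk> chunks;
        for (size_t first = 0; first < total_vertices; first += CHUNK_VERTICES)
        {
            size_t count = std::min(CHUNK_VERTICES, total_vertices - first);
            MeshChunk c = {};
            glGenBuffers(1, &c.vbo);
            glBindBuffer(GL_ARRAY_BUFFER, c.vbo);
            glBufferData(GL_ARRAY_BUFFER, count * floats_per_vertex * sizeof(float),
                         vertices + first * floats_per_vertex, GL_STATIC_DRAW);
            c.vertex_count = (GLsizei)count;
            chunks.push_back(c);
        }
        return chunks;
    }

Rendering then binds each chunk, sets the attribute pointers, and calls glDrawArrays(GL_TRIANGLES, 0, chunk.vertex_count) per chunk.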

why are draw calls expensive?

Assuming the texture, vertex, and shader data are already on the graphics card, you don't need to send much data to the card: there are a few bytes to identify the data, presumably a 4x4 matrix, and some assorted other parameters.
So where is all of the overhead coming from? Do the operations require a handshake of some sort with the GPU?
Why is sending a single mesh containing a bunch of small models, calculated on the CPU, often faster than sending the vertex IDs and transformation matrices? (The second option looks like it should send less data, unless the models are smaller than a 4x4 matrix.)
First of all, I'm assuming that with "draw calls", you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).
Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.
The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.
Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry to the GPU as you could have.
There are some real costs to making draw calls: each one requires setting up a bunch of state (which set of vertices to use, which shader to use, and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating your calls that set state).
But the main cost of draw calls only applies if each call submits too little data, since this will cause you to be CPU-bound and stop you from utilizing the hardware fully.
Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.
You should read the nVidia presentation Batch Batch Batch!; it's fairly old but covers exactly this topic.
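To make the "too little data per call" point concrete, a sketch in D3D11 terms (the Model struct, its fields and the merged buffer are invented for illustration):

    // CPU-bound: thousands of calls, each submitting only a handful of triangles,
    // so the CPU spends the frame issuing calls while the GPU waits for work.
    for (const Model& m : models)
    {
        context->PSSetShaderResources(0, 1, &m.texture);
        context->DrawIndexed(m.index_count, m.first_index, m.base_vertex);
    }

    // Batched: compatible models pre-merged into one vertex/index buffer at load
    // time (or on the CPU each frame), then submitted with a single call.
    context->DrawIndexed(merged_index_count, 0, 0);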
Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive -- both because it implies the actual work is now being performed, and because it can incur a switch from user to kernel mode on the chip (and back again), which is not that cheap.
Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.
More relevant is this blog post (and the follow-up posts here and here), which provide a good overview of the logical, low-level operational process of the GPU.
But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.
Short answer: The driver buffers some or all of the actual work until you call draw. This will show up as a relatively predictable amount of time spent in the draw call, depending on how much state has changed.
This is done for a few reasons:
to avoid doing unnecessary work: if you (unnecessarily) set the same state multiple times before drawing, the driver can avoid doing the expensive work each time this occurs. This actually becomes a fairly common occurrence in a large codebase, say a production game engine.
to be able to reconcile what internally are interdependent states instead of processing them immediately with incomplete information
Alternate answer(s):
The buffer the driver uses to store rendering commands is full and the app is effectively waiting for the GPU to process some of the earlier work. This will typically show up as extremely large chunks of time blocking in a random draw call within a frame.
The number of frames that the driver is allowed to buffer up has been reached and the app is waiting on the GPU to process one of them. This will typically show up as a large chunk of time blocking in the first draw call within a frame, or on Present at the end of the previous frame.

in-place realloc with gcc/linux

Is there such a thing? I mean some function that would reallocate memory without moving it if possible or do nothing if not possible. In Visual C there is _expand which does what I want. Does anybody know about equivalents for other platforms, gcc/linux in particular? I'm mostly interested in shrinking memory in-place when possible (and standard realloc may move memory even when its size decreases, in case somebody asks).
I know there is no standard way to do this, and I'm explicitly asking for implementation-dependent dirty hackish tricks. List anything you know that works somewhere.
Aside from using mmap and munmap to eliminate the excess you don't need (or mremap, which could do the same but is non-standard), there is no way to reduce the size of an allocated block of memory. And mmap has page granularity (normally 4k) so unless you're dealing with very large objects, using it would be worse than just leaving the over-sized objects and not shrinking them at all.
With that said, shrinking memory in-place is probably not a good idea, since the freed memory will be badly fragmented. A good realloc implementation will want to move blocks when significantly shrinking them as an opportunity to defragment memory.
I would guess your situation is that you have an allocated block of memory with lots of other structures holding pointers into it, and you don't want to invalidate those pointers. If this is the case, here is a possible general solution:
Break your resizable object up into two allocations, a "head" object of fixed size which points to the second variable-sized object.
For other objects which need to point into the variable-size object, store a pointer to the head object and an integer offset (size_t or ptrdiff_t) into the variable-size object.
Now, even if the variable-size object moves to a new address, none of the references to it are invalidated.
If you're using these objects from multiple threads, you should put a read-write lock in the head object, read-locking it whenever you need to access the variable-sized object, and write-locking it whenever resizing the variable-sized object.
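A bare-bones sketch of that head-object/offset layout (all names are invented for the example; the read-write lock is left out):

    #include <cstdlib>
    #include <cstddef>

    // Fixed-size "head" owning the variable-size block; everything else refers to
    // the data only through the head plus an offset, so the block is free to move.
    struct Head {
        char*  data;     // variable-size allocation; realloc may move it
        size_t size;
    };

    // A reference into the variable-size block that stays valid across resizes.
    struct Ref {
        Head*  head;
        size_t offset;
    };

    inline char* resolve(const Ref& r) { return r.head->data + r.offset; }

    bool resize(Head& h, size_t new_size)
    {
        // Plain realloc is fine here: existing Refs store offsets, not raw pointers,
        // so a move of the underlying block does not invalidate them.
        void* p = std::realloc(h.data, new_size);
        if (!p && new_size != 0) return false;
        h.data = static_cast<char*>(p);
        h.size = new_size;
        return true;
    }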
A similar question was asked on another forum. One of the more reasonable answers I saw involved using mmap for the initial allocation (using the MAP_ANONYMOUS flag) and calling mremap without the MREMAP_MAYMOVE flag. A limitation of this approach, though, is that the allocation sizes must be exact multiples of the system's page size.
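For reference, a Linux-only sketch of that approach (error handling is minimal, and glibc typically defines _GNU_SOURCE for C++ already):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // mremap is a GNU extension
    #endif
    #include <sys/mman.h>
    #include <cstddef>

    // Allocate with anonymous mmap; lengths are rounded up to page granularity.
    void* page_alloc(std::size_t size)
    {
        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? nullptr : p;
    }

    // Shrink or grow in place; with flags = 0 (no MREMAP_MAYMOVE) mremap fails
    // instead of moving the mapping, so existing pointers stay valid on success.
    bool page_resize_in_place(void* p, std::size_t old_size, std::size_t new_size)
    {
        void* r = mremap(p, old_size, new_size, 0);
        return r != MAP_FAILED;
    }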
