How to best organize constant buffers - directx-11

I'm having some trouble wrapping my head around how to organize the constant buffers in a very basic D3D11 engine I'm making.
My main question is: Where does the biggest performance hit take place? When using Map/Unmap to update buffer data or when binding the cbuffers themselves?
At the moment, I'm deciding between the following two implementations for a sort of "shader-wrapper" class:
Holding an array of 14 ID3D11Buffer*s
class VertexShader
{
    ...
public:
    void Bind(ID3D11DeviceContext* context)
    {
        // Bind all 14 buffers at once
        context->VSSetConstantBuffers(0, 14, &buffers[0]);
        context->VSSetShader(pVS, nullptr, 0);
    }

    // Set the data for a buffer in a particular slot
    void SetData(ID3D11DeviceContext* context, UINT slot, size_t size, const void* pData)
    {
        D3D11_MAPPED_SUBRESOURCE mappedBuffer = {};
        context->Map(buffers[slot], 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedBuffer);
        memcpy(mappedBuffer.pData, pData, size);
        context->Unmap(buffers[slot], 0);
    }

private:
    ID3D11Buffer* buffers[14];
    ID3D11VertexShader* pVS;
};
This approach would have the shader bind all the cbuffers in a single batch of 14. If the shader has cbuffers registered to b0, b1, and b3, the array would look like [cb|cb|0|cb|0|0|0|0|0|0|0|0|0|0].
Constant Buffer wrapper that knows how to bind itself
class VertexShader
{
    ...
public:
    void Bind(ID3D11DeviceContext* context)
    {
        // all the buffers bind themselves
        for (auto& cb : bufferMap)
            cb.second->Bind(context);
        context->VSSetShader(pVS, nullptr, 0);
    }

    // Set the data for a buffer with a particular ID
    void SetData(ID3D11DeviceContext* context, const std::string& id, size_t size, const void* pData)
    {
        // table lookup into bufferMap, then Map/Unmap
    }

private:
    std::unordered_map<std::string, ConstantBuffer*> bufferMap;
    ID3D11VertexShader* pVS;
};
This approach would hold "ConstantBuffers" in a hash table; each one would know what slot it's bound to and how to bind itself to the pipeline. I would have to call VSSetConstantBuffers() individually for each cbuffer, since the ID3D11Buffer*s wouldn't be contiguous anymore, but the organization is friendlier and there's a bit less wasted space.
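For concreteness, a minimal sketch of what such a self-binding wrapper might look like (the class shape and names are just placeholders):

#include <d3d11.h>
#include <cstring>

// Hypothetical wrapper: knows its slot and binds itself.
class ConstantBuffer
{
public:
    ConstantBuffer(ID3D11Buffer* buffer, UINT slot) : m_buffer(buffer), m_slot(slot) {}

    void Bind(ID3D11DeviceContext* context)
    {
        // One API call per buffer, since the pointers are no longer contiguous.
        context->VSSetConstantBuffers(m_slot, 1, &m_buffer);
    }

    void SetData(ID3D11DeviceContext* context, const void* pData, size_t size)
    {
        D3D11_MAPPED_SUBRESOURCE mapped = {};
        context->Map(m_buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
        memcpy(mapped.pData, pData, size);
        context->Unmap(m_buffer, 0);
    }

private:
    ID3D11Buffer* m_buffer = nullptr;
    UINT          m_slot   = 0;
};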
How would you typically organize the relationship between CBuffers, Shaders, SRVs, etc.? I'm not looking for a do-all solution, just some general advice and things to read more about from people hopefully more experienced than I am.
Also, if @Chuck Walbourn sees this, I'm a fan of your work and am using DXTK/WICTextureLoader for this project!
Thanks.

Constant Buffers were a major feature of Direct3D 10, so one of the best talks on the subject was given way back at Gamefest 2007:
Windows to Reality: Getting the Most out of Direct3D 10 Graphics in Your Games
See also Why Can Updating Constant Buffers be so painfully slow? (NVIDIA)
The original intention was for CBs to be organized by frequency of update: something like one CB for stuff that is set 'per level', another for stuff 'per frame', another for 'per object', another 'per pass', etc. The assumption is therefore that if you change any part of a CB, you are going to upload the whole thing. Bandwidth between the CPU and GPU is the real bottleneck here.
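For illustration, a minimal sketch of what that split can look like on the C++ side (the struct names and fields here are hypothetical; the one hard requirement is that each constant buffer's size is a multiple of 16 bytes):

#include <DirectXMath.h>

// Updated once per frame: small and cheap to re-upload every frame.
struct PerFrameCB
{
    DirectX::XMFLOAT4X4 view;
    DirectX::XMFLOAT4X4 projection;
    DirectX::XMFLOAT4   cameraPosition; // xyz = position, w = padding
};
static_assert(sizeof(PerFrameCB) % 16 == 0, "CB size must be a multiple of 16 bytes");

// Updated once per draw call: kept as small as possible.
struct PerObjectCB
{
    DirectX::XMFLOAT4X4 world;
};
static_assert(sizeof(PerObjectCB) % 16 == 0, "CB size must be a multiple of 16 bytes");

// Rarely updated (e.g. on level load): size matters much less here.
struct PerLevelCB
{
    DirectX::XMFLOAT4 ambientColor;
    DirectX::XMFLOAT4 fogParams;
};
static_assert(sizeof(PerLevelCB) % 16 == 0, "CB size must be a multiple of 16 bytes");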
For this approach to be effective, you basically need to set up all your shaders to use the same scheme. This can be difficult to manage, especially when so many modern material systems are art-driven.
Another approach to CBs is to use them like a dynamic VB for particle submission: you fill one up with short-lived constants, submit work, and then reset the thing each frame. This is basically what people do for DirectX 12 in many cases. The problem is that without the ability to update only parts of CBs, it's too slow. The "partial constant buffer updates and offsets" optional feature in DirectX 11.1 was a way to make this work. That said, this feature is not supported on Windows 7 and is 'optional' on newer versions of Windows, so you have to support two codepaths to use it.
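A minimal sketch of detecting and using that optional feature, assuming device (ID3D11Device*), context (ID3D11DeviceContext*), a large dynamic constant buffer bigPerFrameBuffer, and a running byte offset drawOffsetBytes already exist (all of those are placeholders):

#include <d3d11_1.h>
#include <wrl/client.h>

// Check whether the driver supports D3D11.1 constant buffer offsetting.
D3D11_FEATURE_DATA_D3D11_OPTIONS options = {};
device->CheckFeatureSupport(D3D11_FEATURE_D3D11_OPTIONS, &options, sizeof(options));

Microsoft::WRL::ComPtr<ID3D11DeviceContext1> context1;
context->QueryInterface(IID_PPV_ARGS(&context1));

if (options.ConstantBufferOffsetting && context1)
{
    // Bind a 4 KB window (256 constants) of one large dynamic buffer to slot b0.
    // Offsets and counts are expressed in 16-byte shader constants, and the
    // offset must be a multiple of 16 constants (256 bytes).
    UINT firstConstant = drawOffsetBytes / 16;
    UINT numConstants  = 256;
    context1->VSSetConstantBuffers1(0, 1, &bigPerFrameBuffer, &firstConstant, &numConstants);
}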
TL;DR: you can technically have a lot of CBs bound at once, but the key thing is to keep the individual size small for the ones that change often. Also assume that any change to a CB will require re-uploading the whole thing to the GPU.

Related

Proper way to manipulate registers (PUT32 vs GPIO->ODR)

I'm learning how to use microcontrollers without a bunch of abstractions. I've read somewhere that it's better to use PUT32() and GET32() instead of volatile pointers and stuff. Why is that?
With a basic pin-wiggle "benchmark," GPIO->ODR = 0xFFFFFFFF appears to be about four times faster than PUT32(GPIO_ODR, 0xFFFFFFFF), as shown by the scope:
(The one with lower frequency is PUT32)
This is my code using PUT32
PUT32(0x40021034, 0x00000002); // RCC IOPENR B
PUT32(0x50000400, 0x00555555); // PB MODER
while (1) {
    PUT32(0x50000414, 0x0000FFFF); // PB ODR
    PUT32(0x50000414, 0x00000000);
}
This is my code using the arrow thing
*(volatile uint32_t *)0x40021034 = 0x00000002; // RCC IOPENR B
GPIOB->MODER = 0x00555555; // PB MODER
while (1) {
    GPIOB->ODR = 0x00000000; // PB ODR
    GPIOB->ODR = 0x0000FFFF;
}
I shamelessly adapted the assembly for PUT32 from somewhere
PUT32 PROC
EXPORT PUT32
STR R1,[R0]
BX LR
ENDP
My questions are:
Why is one method slower when it looks like they're doing the same thing?
What's the proper or best way to interact with GPIO? (Or rather what are the pros and cons of different methods?)
Additional information:
Chip is STM32G031G8Ux, using Keil uVision IDE.
I didn't configure the clock to go as fast as it can, but it should be consistent for the two tests.
Here's my hardware setup: (Scope probe connected to the LEDs. The extra wires should have no effect here)
Thank you for your time, sorry for any misunderstandings
PUT32 is a totally non-standard method that the poster in that other question made up. They have done this to avoid the complication and possible mistakes in defining the register access methods.
When you use the standard CMSIS header files and assign to the registers in the standard way, then all the complication has already been taken care of for you by someone who has specific knowledge of the target that you are using. They have designed it in a way that makes it hard for you to make the mistakes that the PUT32 is trying to avoid, and in a way that makes the final syntax look cleaner.
Writing to the registers directly is quicker because a register write can take as little as a single cycle of the processor clock, whereas calling a function, writing to the register, and then returning takes four times longer in the context of your experiment.
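For illustration, a much-simplified sketch of what the CMSIS-style access boils down to (this is not the real STM32 header, just the shape of it; GPIO_Regs and GPIOB_SKETCH are made-up names):

#include <stdint.h>

/* Simplified register overlay in the style of the vendor CMSIS header
   (layout abbreviated; addresses taken from the question). */
typedef struct {
    volatile uint32_t MODER;   /* 0x00: mode register        */
    volatile uint32_t OTYPER;  /* 0x04: output type register */
    volatile uint32_t OSPEEDR; /* 0x08: output speed         */
    volatile uint32_t PUPDR;   /* 0x0C: pull-up/pull-down    */
    volatile uint32_t IDR;     /* 0x10: input data register  */
    volatile uint32_t ODR;     /* 0x14: output data register */
} GPIO_Regs;

#define GPIOB_SKETCH ((GPIO_Regs *)0x50000400u)

/* GPIOB_SKETCH->ODR = 0x0000FFFF; compiles to loading the address and value
   into registers plus one STR, whereas PUT32(0x50000414, 0x0000FFFF) pays
   for the same store plus the branch to the function (BL) and the return (BX LR). */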
By using this generic access method you also risk introducing bugs that are not possible with the manufacturer-provided header files: for example, using a 32-bit access when the register is 16 or 8 bits wide.

Will `wglCreateContext` fail safely if a DirectX context already exists for the given `HDC` (device context)

I'm attempting to contribute to a cross-platform, memory-safe API for creating and using OpenGL contexts in Rust called glutin. There is an effort to redesign the API in a way that will allow users to create a GL context for a pre-existing window.
One of the concerns raised was that this might not be memory safe if a user attempts to create a GL context for a window that has a pre-existing DirectX context.
The documentation for wglCreateContext suggests that it will return NULL upon failure, however it does not go into detail about what conditions might cause this.
Will wglCreateContext fail safely (by returning NULL) if a DirectX context already exists for the given HDC (device context)? Or is the behaviour in this situation undefined?
I do not have access to a Windows machine with OpenGL support and am unable to test this directly myself.
The real issue I see here is that wglCreateContext may fail for any reason, and you have to be able to deal with that.
That being said, the way you formulated your question reeks of a fundamental misunderstanding of the relationship between OpenGL contexts, device contexts, and windows. To put it in three simple words: there is none!
Okay, that begs for clarification. What's the deal here? Why is there an HDC parameter to wglCreateContext if they are not related?
It all boils down to pixel formats (or framebuffer configuration). A window, as you see it on the screen, is a direct 1:1 representation of a block of memory. This block of memory has a certain pixel format. However, as long as only abstract drawing methods are used (as with GDI), the precise pixel format doesn't matter, and the graphics system may silently switch the pixel format as it sees fit. In times long bygone, when graphics memory was scarce, this could mean huge savings.
However, OpenGL assumes it operates on framebuffers with a specific, unchanging pixel format. So to support that, a new API was added that lets you nail down the internal pixel format of a given window. And since the only part of the graphics system that's actually concerned with the framebuffer format is the part that draws stuff, i.e. the GDI, the framebuffer configuration happens through that part. HWNDs are about passing around messages, associating input events with applications, and all that jazz; it's HDCs that relate to everything graphics. That's why you set the pixel format through an HDC.
When creating an OpenGL context, that context has to be configured for the graphics device it's intended to be used on, and this again goes through the GDI and data structures that are addressed through HDC handles. But once an OpenGL context has been created, it can be used with any HDC that refers to a framebuffer whose pixel format is compatible with the HDC the OpenGL context was originally created with. It can be a different HDC of the same window, or an HDC of an entirely different window altogether. And ever since OpenGL 3.3 core, an OpenGL context can be made current with no HDC at all, being completely self-contained and operating on self-managed framebuffers. Last but not least, that binding can be changed at any time.
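To make the pixel format / HDC relationship concrete, here is a bare-bones sketch of the classic WGL setup (error handling omitted; hwnd is assumed to be an already created window):

#include <windows.h>
#include <GL/gl.h>

HDC hdc = GetDC(hwnd);

// The pixel format is fixed per window through its HDC, not per HWND or per GL context.
PIXELFORMATDESCRIPTOR pfd = { sizeof(PIXELFORMATDESCRIPTOR) };
pfd.nVersion   = 1;
pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
pfd.iPixelType = PFD_TYPE_RGBA;
pfd.cColorBits = 32;
pfd.cDepthBits = 24;
SetPixelFormat(hdc, ChoosePixelFormat(hdc, &pfd), &pfd);

// The context is created against that HDC's configuration...
HGLRC hglrc = wglCreateContext(hdc);   // returns NULL on failure; always check

// ...but may later be made current on any HDC with a compatible pixel format.
wglMakeCurrent(hdc, hglrc);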
Every time people who have no clear understanding of this implement some OpenGL binding or abstraction wrapper, they tend to get this part wrong and create unnecessarily tight straitjackets, which other people, like me, then have to fight their way out of, because the way the abstraction works is ill-conceived. The Qt guys made that mistake, the GTK+ guys did so, and now, apparently, so do you. There is this code snippet on your GitHub project page:
let events_loop = glutin::EventsLoop::new();
let window = glutin::WindowBuilder::new()
    .with_title("Hello, world!".to_string())
    .with_dimensions(1024, 768)
    .with_vsync()
    .build(&events_loop)
    .unwrap();

unsafe {
    window.make_current()
}.unwrap();

unsafe {
    gl::load_with(|symbol| window.get_proc_address(symbol) as *const _);
    gl::ClearColor(0.0, 1.0, 0.0, 1.0);
}
Arrrrggggh. Why the heck are the methods make_current and get_proc_address associated with the window? Why?! Who came up with this? Don't do this shit, it makes the lives of the people who have to use it miserable and painful.
Do you want to know where this leads? To horribly messy, unsafe code that does disgusting and dirty things to forcefully and bluntly disable some of the safeguards present, just so that it can get to work. Like this horrible thing I had to do to get Qt4's ill-conceived assumptions about how OpenGL works out of the way.
#ifndef _WIN32
#if OCTPROCESSOR_CREATE_GLWIDGET_IN_THREAD
    QGLFormat glformat(
        QGL::DirectRendering |
        QGL::DoubleBuffer |
        QGL::Rgba |
        QGL::AlphaChannel |
        QGL::DepthBuffer,
        0 );
    glformat.setProfile(QGLFormat::CompatibilityProfile);
    gl_hidden_widget = new QGLWidget(glformat);
    if( !gl_hidden_widget ) {
        qDebug() << "creating worker context failed";
        goto fail_init_glcontext;
    }
    gl_hidden_widget->moveToThread( QThread::currentThread() );
#endif
    if( !gl_hidden_widget->isValid() ) {
        qDebug() << "worker context invalid";
        goto fail_glcontext_valid;
    }
    gl_hidden_widget->makeCurrent();
#else
    if( wglu_create_pbuffer_with_glrc(
        3,3,WGL_CONTEXT_COMPATIBILITY_PROFILE_BIT_ARB,
        &m_hpbuffer,
        &m_hdc,
        &m_hglrc,
        &m_share_hglrc)
    ){
        qDebug() << "failed to create worker PBuffer and OpenGL context";
        goto fail_init_glcontext;
    }
    qDebug()
        << "m_hdc" << m_hdc
        << "m_hglrc" << m_hglrc;
    if( !wglMakeCurrent(m_hdc, m_hglrc) ){
        qDebug() << "failed making OpenGL context current on PBuffer HDC";
        goto fail_glcontext_valid;
    }
#endif

raw input handling (distinguishing secondary mouse)

I was writing some pieces using WinAPI's raw input.
It seems to be working, though I am not sure how reliable (unfailable) it is
(and whether it will work on all systems, machines, etc., which is a bit of a worry).
Many questions also come up; the main one is this:
I would like to keep using my first (i.e. normal/base) mouse the old way,
processing WM_MOUSEMOVE etc. and moving the arrow cursor, and only have the
secondary mouse processed by raw input (the primary can stay untouched by raw input). The problems are:
1) How can I be sure which mouse detected by raw input is the secondary one?
2) The second mouse also moves my arrow cursor; if I disable that
with RIDEV_NOLEGACY then neither mouse moves the cursor (it becomes an hourglass), which is wrong too.
I think maybe I should set it up a bit differently. My raw input setup function looks like this:
void SetupRawInput()
{
    static RAWINPUTDEVICE Rid[1];
    Rid[0].usUsagePage = 0x01;
    Rid[0].usUsage = 0x02;
    Rid[0].dwFlags = 0; // Rid[0].dwFlags = RIDEV_NOLEGACY;
    Rid[0].hwndTarget = NULL;
    int r = RegisterRawInputDevices( Rid, 1, sizeof(Rid[0]) );
    if (!r) ERROR_EXIT("raw input register fail");
}
How do I resolve these issues and make it work? Thanks.
I don't know if my approach is the best one, or not, but this is how I do it for the first item in your question:
When I process WM_INPUT using GetRawInputData(...), I check whether the device handle in the RAWINPUTHEADER structure (contained within the RAWINPUT structure returned from the function) is the same as the device I want to use. If it is not, I simply don't bother sending back data; if it is, I then process the RAWMOUSE data returned in the RAWINPUT struct.
And if you're wondering how to get the list of devices, you can use GetRawInputDeviceList(...), which will return the device handles of the mice you're trying to work with.
As I said, this may not be the best way, but I have confirmed that it works for my purposes. I do the same thing for my keyboard raw input data.
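A minimal sketch of that WM_INPUT check, assuming g_secondaryMouse was picked out earlier from the GetRawInputDeviceList() results (the function and variable names here are placeholders):

#include <windows.h>
#include <vector>

HANDLE g_secondaryMouse = nullptr;  // chosen earlier via GetRawInputDeviceList()

// Called from the window procedure on WM_INPUT; lParam holds the HRAWINPUT handle.
void HandleRawInput(LPARAM lParam)
{
    UINT size = 0;
    GetRawInputData((HRAWINPUT)lParam, RID_INPUT, nullptr, &size, sizeof(RAWINPUTHEADER));

    std::vector<BYTE> buffer(size);
    if (GetRawInputData((HRAWINPUT)lParam, RID_INPUT, buffer.data(), &size,
                        sizeof(RAWINPUTHEADER)) != size)
        return;

    const RAWINPUT* raw = reinterpret_cast<const RAWINPUT*>(buffer.data());
    if (raw->header.dwType != RIM_TYPEMOUSE)
        return;

    // Only react to the device we decided is the "secondary" mouse; every other
    // mouse keeps its normal legacy WM_MOUSEMOVE / cursor behaviour.
    if (raw->header.hDevice != g_secondaryMouse)
        return;

    LONG dx = raw->data.mouse.lLastX;  // relative motion for most mice
    LONG dy = raw->data.mouse.lLastY;
    // ... move your own "second cursor", crosshair, etc. using dx/dy
}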
As for item #2, it seems likely that it affects both mice because Windows has exclusive access to the mice, so you can't register one specific mouse without registering them all with the same flags. But someone with more knowledge than I could probably give a better explanation.

Atomic operations in ARM

I've been working on an embedded OS for ARM. However, there are a few things I didn't understand about the architecture even after referring to the ARM ARM and the Linux source.
Atomic operations.
The ARM ARM says that Load and Store instructions are atomic and that their execution is guaranteed to complete before an interrupt handler executes. I verified this by looking at
arch/arm/include/asm/atomic.h :
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
However, the problem comes in when I want to manipulate this value atomically using the CPU instructions (atomic_inc, atomic_dec, atomic_cmpxchg, etc.), which use LDREX and STREX on ARMv7 (my target).
The ARM ARM doesn't say anything about interrupts being blocked in this section, so I assume an interrupt can occur between the LDREX and STREX. The thing it does mention is locking the memory bus, which I guess is only helpful for MP systems where more CPUs can try to access the same location at the same time. But for UP (and possibly MP), if a timer interrupt (or IPI for SMP) fires in this small window between LDREX and STREX, the exception handler executes, possibly changes the CPU context, and returns to the new task. However, the shocking part comes now: it executes 'CLREX' and hence removes any exclusive lock held by the previous thread. So how is using LDREX and STREX any better than LDR and STR for atomicity on a UP system?
I did read something about an exclusive monitor, so I have a possible theory: when the thread resumes and executes the STREX, the monitor causes this call to fail, which can be detected, and the loop can be re-executed using the new value in the process (branch back to LDREX). Am I right here?
The idea behind the load-linked/store-exclusive paradigm is that if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:
do
{
    new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));
can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value required some significant computation, one should rewrite the loop as:
do
{
    old_value = *dest;
    new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);
... Assuming CompareAndStore is something like:
uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint32_t old_value)
{
    do
    {
        if (__LDREXW(dest) != old_value) return 1; // Failure
    } while(__STREXW(new_value, dest));
    return 0;
}
This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason [which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and __STREXW].
Addendum
An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe/interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 256), one could use something like:
#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)

void do_update(void)
{
    // The foo_pool_alloc() method should return a slot number in the lower bits and
    // some sort of counter value in the upper bits so that once some particular
    // uint32_t value is returned, that same value will not be returned again unless
    // there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
    // the possibility that while one task is performing its update, a second task
    // changes the thing to a new one and releases the old one, and a third task gets
    // given the newly-freed item and changes the thing to that, such that from the
    // point of view of the first task, the thing never changed.)
    uint32_t new_thing = foo_pool_alloc();
    uint32_t old_thing;
    do
    {
        // Capture old reference
        old_thing = foo_current_thing;
        // Compute new thing based on old one
        update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
                     &foo_pool[old_thing & FOO_POOL_SIZE_MASK]);
    } while(CompareAndSwap(&foo_current_thing, new_thing, old_thing) != 0);
    foo_pool_free(old_thing);
}
If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship will exist among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one was using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).
Okay, I got the answer from ARM's website:
If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.

API to get the graphics or video memory

I want to get the adapter RAM or graphics RAM that you can see in Display settings or Device Manager, using an API. I am in a C++ application.
I have tried searching on the net, and as per my R&D I have come to the conclusion that we can get the graphics memory info from:
1. The DirectX SDK structure called DXGI_ADAPTER_DESC. But what if I don't want to use the DirectX API?
2. Win32_VideoController: but this class does not always give you the AdapterRAM info if the availability of the video controller is offline. I have checked it on Vista.
Is there any other way to get the graphics RAM?
There is NO way to directly get at graphics RAM on Windows; Windows prevents you from doing this as it maintains control over what is displayed.
You CAN, however, create a DirectX device, get the back buffer surface, and then lock it. After locking, you can fill it with whatever you want, then unlock it and call Present. This is slow, though, as you have to copy the video memory back across the bus into main memory. Some cards also use "swizzled" formats that have to be un-swizzled as they are copied. This adds further time, and some cards will even ban you from doing it.
In general you want to avoid directly accessing the video card and let Windows/DirectX do the drawing for you. Under D3D1x I'm pretty sure you can do it via an IDXGIOutput, though. It really is something to try and avoid, though ...
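As an aside, if DXGI is acceptable after all (the question already mentions DXGI_ADAPTER_DESC), a minimal sketch of reading the reported memory sizes could look like this; DedicatedVideoMemory is the field that usually matches the number shown in Device Manager:

#include <dxgi.h>
#include <cstdio>
#pragma comment(lib, "dxgi.lib")

int main()
{
    IDXGIFactory* factory = nullptr;
    if (FAILED(CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory)))
        return 1;

    IDXGIAdapter* adapter = nullptr;
    for (UINT i = 0; factory->EnumAdapters(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        DXGI_ADAPTER_DESC desc = {};
        adapter->GetDesc(&desc);
        wprintf(L"%ls: %llu MB dedicated video memory\n",
                desc.Description,
                (unsigned long long)(desc.DedicatedVideoMemory / (1024 * 1024)));
        adapter->Release();
    }
    factory->Release();
    return 0;
}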
You can write to a linear array via standard Win32 (this example assumes C), but it's quite involved.
First you need the linear array:
unsigned int* pBits = malloc( width * height * sizeof( unsigned int ) );
Then you need to create a bitmap and select it into the DC:
HBITMAP hBitmap = ::CreateBitmap( width, height, 1, 32, NULL );
SelectObject( hDC, (HGDIOBJ)hBitmap );
You can then fill the pBits array as you please. When you've finished, you can set the bitmap's bits:
::SetBitmapBits( hBitmap, width * height * 4, (void*)pBits );
When you've finished using your bitmap, don't forget to delete it (using DeleteObject) AND free your linear array!
Edit: There is only one way to reliably get the video ram and that is to go through the DX Diag interfaces. Have a look at IDxDiagProvider and IDxDiagContainer in the DX SDK.
Win32_VideoController is your best course for getting the amount of graphics memory. That's how it's done in the Doom 3 source.
You say "..availability of video controller is offline. I have checked it on vista." Under what circumstances would the video controller be offline?
Incidentally, you can find the Doom3 source here. The function you're looking for is called Sys_GetVideoRam and it's in a file called win_shared.cpp, although a solution-wide search will turn it up for you.
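For reference, a sketch of that Win32_VideoController query following the standard WMI client pattern (proxy security setup and most error handling omitted; note that AdapterRAM is reported as a 32-bit value in bytes):

#define _WIN32_DCOM
#include <comdef.h>
#include <Wbemidl.h>
#include <iostream>
#pragma comment(lib, "wbemuuid.lib")

int main()
{
    CoInitializeEx(0, COINIT_MULTITHREADED);
    CoInitializeSecurity(NULL, -1, NULL, NULL, RPC_C_AUTHN_LEVEL_DEFAULT,
                         RPC_C_IMP_LEVEL_IMPERSONATE, NULL, EOAC_NONE, NULL);

    IWbemLocator* pLoc = NULL;
    CoCreateInstance(CLSID_WbemLocator, 0, CLSCTX_INPROC_SERVER,
                     IID_IWbemLocator, (LPVOID*)&pLoc);

    IWbemServices* pSvc = NULL;
    pLoc->ConnectServer(_bstr_t(L"ROOT\\CIMV2"), NULL, NULL, 0, NULL, 0, 0, &pSvc);

    IEnumWbemClassObject* pEnum = NULL;
    pSvc->ExecQuery(_bstr_t("WQL"),
                    _bstr_t("SELECT Name, AdapterRAM FROM Win32_VideoController"),
                    WBEM_FLAG_FORWARD_ONLY | WBEM_FLAG_RETURN_IMMEDIATELY, NULL, &pEnum);

    IWbemClassObject* pObj = NULL;
    ULONG uReturn = 0;
    while (pEnum && SUCCEEDED(pEnum->Next(WBEM_INFINITE, 1, &pObj, &uReturn)) && uReturn)
    {
        VARIANT vtName, vtRam;
        pObj->Get(L"Name", 0, &vtName, 0, 0);
        pObj->Get(L"AdapterRAM", 0, &vtRam, 0, 0);   // VT_UI4, in bytes (may be empty)
        if (vtRam.vt == VT_UI4)
            std::wcout << vtName.bstrVal << L": "
                       << vtRam.ulVal / (1024 * 1024) << L" MB\n";
        VariantClear(&vtRam);
        VariantClear(&vtName);
        pObj->Release();
    }
    // Release pEnum/pSvc/pLoc properly in real code.
    CoUninitialize();
    return 0;
}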
User-mode threads cannot access memory regions and I/O mapped from hardware devices, including the framebuffer. Anyway, why would you want to do that? Suppose you could access the framebuffer directly: now you must handle a LOT of possible pixel formats. You can't simply assume a 32-bit RGBA or ARGB organization; there is the possibility of 15/16/24-bit displays (RGBA555, RGBA5551, RGBA4444, RGBA565, RGBA888...). And that's if you don't also want to support video-surface formats (overlays) such as YUV-based ones.
So let the display driver and/or the underlying APIs do that work.
If you want to write to a display surface (which is not exactly the same as framebuffer memory, although it's conceptually almost the same), there are a lot of options: DX, Win32, or you may try the SDL library (libsdl).
