I use the combination of DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT, GetFrameLatencyWaitableObject() and SetMaximumFrameLatency(UINT MaxLatency) to control the input-lag vs. smoothness trade-off of my application, as explained at https://learn.microsoft.com/en-us/windows/uwp/gaming/reduce-latency-with-dxgi-1-3-swap-chains. A value of 1 gives the lowest input lag, but sometimes I need a higher value to reduce jitter/stutter/slowdown caused by the CPU and GPU not being able to work in parallel when the value is 1.
I want to be able to dynamically change this value based on the required input lag vs smoothness trade-off.
The problem I have noticed: while it's possible to increase this value between frames by calling SetMaximumFrameLatency with a higher value than before, I see no effect when decreasing it, i.e. when calling the function again with a value lower than the maximum ever set for this swap chain by a previous call. So if I ever set it to 2, it is not possible to set it to 1 later. Is this a bug, an undocumented "feature", or did I do something wrong?
The API itself does not return an error or anything similar; from the API's point of view, it appears to apply the new, lower value correctly.
To test this, I set BufferCount = 16 and then adjust the max latency value between 1 and 16, which makes the current latency obvious to the eye. It's therefore apparent that DXGI does not apply new, lower values.
I've tried calling the functions in different orders, and closing the handle for the waitable object and creating a new one when modifying the latency, but nothing works. The only workaround I'm aware of so far is to fully recreate the swap chain, which is annoying due to the requirement to unbind all context objects, etc.
When initializing the game, I create the swap chain and set an initial latency using SetMaximumFrameLatency.
The game loop is then basically this (sketched in code below):
Call WaitForSingleObject on the waitable object handle.
Process inputs.
Render and present a frame.
If it's decided that the latency should change at this point, call SetMaximumFrameLatency with the new value.
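A minimal sketch of that loop, assuming the swap chain was created with DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT and queried as an IDXGISwapChain2 (swapChain2); running, desiredLatency, ProcessInput and RenderFrame are placeholders, not code from my actual app:

HANDLE waitable = swapChain2->GetFrameLatencyWaitableObject();
swapChain2->SetMaximumFrameLatency(1); // initial latency
UINT currentLatency = 1;

while (running)
{
    // Block until DXGI says presenting another frame stays within
    // the current maximum frame latency.
    WaitForSingleObjectEx(waitable, 1000, TRUE);

    ProcessInput();
    RenderFrame();
    swapChain2->Present(1, 0);

    if (desiredLatency != currentLatency)
    {
        // Increasing works; decreasing is what shows no effect.
        swapChain2->SetMaximumFrameLatency(desiredLatency);
        currentLatency = desiredLatency;
    }
}
CloseHandle(waitable);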
Other info:
Renderer: Direct3D 11
OS: Windows 11 21H2 version 22000.675
Graphics card: Intel UHD Graphics 620 / Nvidia GeForce MX150 (tried with both cards) with latest drivers, supporting WDDM 3.0
App type: Win32 desktop application
I'm using append/consume buffers to reduce shading work in my path tracer (immediately shade empty space and emitters in a pre-pass, and append the remaining pixels for full processing). I've heard that I should be using a UAV when I'm accessing through AppendStructuredBuffer<T> and an SRV when I'm accessing through ConsumeStructuredBuffer<T>. I haven't seen that claim in any of Microsoft's documentation, but it might explain why my calls to Consume() are returning empty data. Is it accurate?
I should have tested myself before asking: shaders with a ConsumeStructuredBuffer declared in an SRV register (tN) fail to compile, with an error saying these objects are only bindable through UAVs.
My AppendStructuredBuffer bound through the UAV registers works fine, so it seems I was just quoting hearsay; unordered access views should be used for both HLSL types.
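For reference, here's a minimal C++ sketch of how both end up bound; Element, device, and context are placeholders rather than code from the original question:

struct Element { float data[4]; };
const UINT elementCount = 1024;

// Structured buffer with UAV binding; the hidden counter used by
// Append()/Consume() comes from the UAV flag below.
D3D11_BUFFER_DESC bd = {};
bd.ByteWidth = elementCount * sizeof(Element);
bd.Usage = D3D11_USAGE_DEFAULT;
bd.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
bd.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bd.StructureByteStride = sizeof(Element);
ID3D11Buffer* buffer = nullptr;
device->CreateBuffer(&bd, nullptr, &buffer);

D3D11_UNORDERED_ACCESS_VIEW_DESC ud = {};
ud.Format = DXGI_FORMAT_UNKNOWN; // required for structured buffers
ud.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
ud.Buffer.FirstElement = 0;
ud.Buffer.NumElements = elementCount;
ud.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_APPEND;
ID3D11UnorderedAccessView* uav = nullptr;
device->CreateUnorderedAccessView(buffer, &ud, &uav);

// Both AppendStructuredBuffer and ConsumeStructuredBuffer live in
// u registers (uN); the initial count below resets the hidden counter.
UINT initialCount = 0;
context->CSSetUnorderedAccessViews(0, 1, &uav, &initialCount);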
I know that glutMainLoop() is used to call display over and over again, maintaining a constant frame rate. At the same time, if I also have glutTimerFunc(), which calls glutPostRedisplay() at the end, it can maintain a different frame rate.
When they work together, what really happens? Does the timer function add to the frame rate of the main loop and make it faster? Or does it change the default refresh rate of the main loop? How do they work in conjunction?
I know that glutMainLoop() is used to call display over and over again, maintaining a constant frame rate.
Nope! That's not what glutMainLoop does. The purpose of glutMainLoop is to poll operating system events, check if timers have elapsed, see if windows have to be redrawn, and then call the respective callback functions registered by the user. This happens in a loop, and usually this loop is started from the main entry point of the program, hence the name "main loop".
When they work together, what really happens? Does the timer function add to the frame rate of the main loop and make it faster? Or does it change the default refresh rate of the main loop? How do they work in conjunction?
As already mentioned, dispatching timers is part of the responsibility of glutMainLoop, so you can't have GLUT timers without it. More importantly, if no events happened, no redisplay was posted, and no idle function is registered, glutMainLoop will "block" the program until something interesting happens (i.e. no CPU cycles are consumed).
Essentially it goes like
void glutMainLoop(void)
{
    for(;;) {
        /* ... */
        foreach(t in timers) {
            if( t.elapsed() ) {
                t.callback(…);
                continue;
            }
        }
        /* ... */
        if( display.posted ) {
            display.callback();
            display.posted = false;
            continue;
        }
        idle.callback();
    }
}
At the same time, if I also have glutTimerFunc(), which calls glutPostRedisplay() at the end, it can maintain a different frame rate.
The timers provided by GLUT make no guarantees about their precision and jitter. Hence they're not particularly well suited for framerate limiting.
Normally the framerate is limited by v-sync (or it should be), but blocking on v-sync means you cannot use that time to do something useful, because the process is blocked. A better approach is to register an idle function in which you poll a high-resolution timer (on POSIX-compliant systems clock_gettime(CLOCK_MONOTONIC, …), on Windows QueryPerformanceCounter) and call glutPostRedisplay once one display refresh interval, minus the time required for rendering the frame, has elapsed.
Of course it's hard to predict exactly how long rendering is going to take, so the usual approach is to keep a sliding-window average and deviation and adjust with that. You also want to align that timer with v-sync.
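A minimal sketch of that idle-function approach on a POSIX system; the 60 Hz refresh interval and the render-time estimate are placeholders you'd measure yourself (on Windows, substitute QueryPerformanceCounter):

#include <GL/glut.h>
#include <time.h>

static const double refresh_interval = 1.0 / 60.0; /* assumed 60 Hz display */
static double last_post = 0.0;
static double render_estimate = 0.0; /* sliding average of frame render time */

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void idle(void)
{
    /* Post a redisplay when one refresh interval, minus the estimated
       render time, has elapsed since the last post. */
    double t = now_seconds();
    if( t - last_post >= refresh_interval - render_estimate ) {
        last_post = t;
        glutPostRedisplay();
    }
}

/* registered once during setup: glutIdleFunc(idle); */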
This is of course a solved problem (at least in electrical engineering), which can be addressed by a phase-locked loop (PLL). Essentially you have a "phase comparator" (something that compares whether your timer runs slower or faster than the thing you want to synchronize to), a "charge pump" (a variable you add the delta from the phase comparator to, or subtract it from), a "loop filter" (a sliding-window average) and an "oscillator" (a timer) controlled by the loop-filtered value of the charge pump.
So you poll the status of the v-sync (not possible with GLUT functions, and not even with core OpenGL or some of the swap-control extensions – you'll have to use OS-specific functions for that) and compare whether your timer lags behind or runs fast relative to it. You add that delta to the "charge pump", filter it, and feed the result back into the timer. The nice thing about this approach is that it automatically adjusts to, and filters, the time spent rendering frames as well.
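In code, the feedback loop might look roughly like this; the gain and window size are invented tuning constants, not something GLUT or the OS provides:

#include <deque>
#include <numeric>

struct FramePll
{
    double period = 1.0 / 60.0;   // "oscillator" period in seconds
    double charge = 0.0;          // "charge pump"
    std::deque<double> window;    // samples for the "loop filter"

    // phase_error: timer fire time minus the nearest v-sync timestamp.
    void update(double phase_error)
    {
        charge += 0.1 * phase_error;           // pump gain (tuning constant)
        window.push_back(charge);
        if (window.size() > 32)                // window size (tuning constant)
            window.pop_front();
        double filtered = std::accumulate(window.begin(), window.end(), 0.0)
                        / window.size();       // sliding-window average
        period = 1.0 / 60.0 - filtered;        // feed back into the timer
    }
};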
From the glutMainLoop doc pages:
glutMainLoop enters the GLUT event processing loop. This routine should be called at most once in a GLUT program. Once called, this routine will never return. It will call as necessary any callbacks that have been registered. (emphasis mine)
That means that the idea of glutMainLoop is just to process events, calling whatever callbacks are installed. Indeed, I do not believe it keeps calling display over and over, but only when there is an event that requests a redisplay.
This is where glutTimerFunc() comes into play. It registers a timer callback to be called by glutMainLoop when the timer event is triggered. Note that this is one of several other possible event callbacks that can be registered. That explains why the docs use the expression at least:
(...) glutTimerFunc registers the timer callback func to be triggered in at least msecs milliseconds. (...)
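As an illustration of that "at least" semantic, the usual pattern for an approximate fixed rate is a timer callback that re-registers itself (a sketch, not code from either post):

#include <GL/glut.h>

static void timer(int value)
{
    glutPostRedisplay();
    /* re-arm: fires again in *at least* 16 ms, so the real rate
       will be at most ~60 Hz, with some jitter */
    glutTimerFunc(16, timer, 0);
}

/* started once during setup: glutTimerFunc(16, timer, 0); */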
I added a new GL renderer to my engine, which uses the core profile. While it runs fine on Windows and/or NVIDIA cards, it is about 10 times slower on OS X (3 fps instead of 30). The weird thing is that my compatibility-profile renderer runs fine.
I collected some traces with Instruments and the GL profiler:
https://www.dropbox.com/sh/311fg9wu0zrarzm/31CGvUcf2q
It shows that the application spends its time in glDrawRangeElements.
I tried the following things:
use glDrawElements instead (no effect)
flip culling (no effect on speed)
disable some GL_DYNAMIC_DRAW buffers (no effect)
bind index buffer after VAO when drawing (no effect)
converted indices to 4 byte (no effect)
use GL_BGRA textures (no effect)
What I didn't try is aligning my vertices to a 16-byte boundary, but seriously, if that were the issue, then why the hell does the standard allow it?
I'm creating the context like this:
NSOpenGLPixelFormatAttribute attributes[] =
{
    NSOpenGLPFAColorSize, 24,
    NSOpenGLPFAAlphaSize, 8,
    NSOpenGLPFADepthSize, 24,
    NSOpenGLPFAStencilSize, 8,
    NSOpenGLPFADoubleBuffer,
    NSOpenGLPFAAccelerated,
    NSOpenGLPFANoRecovery,
    NSOpenGLPFAOpenGLProfile, NSOpenGLProfileVersion3_2Core,
    0
};

NSOpenGLPixelFormat* format = [[NSOpenGLPixelFormat alloc] initWithAttributes:attributes];
NSOpenGLContext* context = [[NSOpenGLContext alloc] initWithFormat:format shareContext:nil];

[self.view setOpenGLContext:context];
[context makeCurrentContext];
Tried on the following specs:
Radeon 6630M, OS X 10.7.5
Radeon 6750M, OS X 10.7.5
GeForce GT 330M, OS X 10.8.3
Do you have any idea what I might be doing wrong? Again, it works fine with the compatibility profile (which doesn't use VAOs, though).
UPDATE: reported to Apple.
UPDATE: Apple doesn't give a damn about the problem... Anyway, I created a small test program, which actually performs well. I then compared the call stacks with Instruments and found out that when using the engine, glDrawRangeElements makes two calls:
gleDrawArraysOrElements_ExecCore
gleDrawArraysOrElements_Entries_Body
while in the test program it calls only the second. The first call does something like an immediate-mode render (gleFlushPrimitivesTCLFunc, gleRunVertexSubmitterImmediate), which obviously causes the slowdown.
Finally, I was able to reproduce the slowdown. This is just crazy... It is clearly caused by glBindAttribLocation being called on the "my_Position" attribute. I did some testing:
1 is the default (as returned by glGetAttribLocation)
if I set it to zero, there's no problem
if I set it to 1, the rendering becomes slow
if I set it to any larger number, it is slow again
Obviously I relink the program after binding (check the code). It is not a problem in the implementation; I tested it with "normal" values too.
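For clarity, the step in question is essentially this (a paraphrase, not the exact repro code):

// bind "my_Position" to a generic attribute index, then relink;
// index 0 works fine, 1 or higher triggers the slow path on OS X
glBindAttribLocation(program, 1, "my_Position");
glLinkProgram(program);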
Test program:
https://www.dropbox.com/s/dgg48g1fwgyc5h0/SLOWDOWN_REPRO.zip
How to repro:
open with Xcode
open common/glext.h (don't be disturbed by the name)
modify the GLDECLUSAGE_POSITION constant from 0 to 1
compile and run => slow
changing back to zero => good
I have managed to get the same problem in the following circumstances under OS X Mavericks:
Instanced rendering using array buffers to give each instance its own modelToWorld and inverseNormal matrices; attribute locations are specified through layout qualifiers rather than glGetAttribLocation
leaving one of these array buffers unused in the shader, where the location is declared but the attribute isn't actually used for anything in the GLSL code
In this case, a call to glDrawElementsInstanced takes up a LOT of CPU time (under normal circumstances, this call uses nearly zero CPU even when drawing several thousand instances).
You can tell that you're getting this specific problem if almost all of the CPU time used within glDrawElementsInstanced is spent in gleDrawArraysOrElements_ExecCore. Making sure that all of the array buffers are actually referenced in your shader code fixes the CPU time back to (nearly) zero.
I suspect that this is one of the situations where leaving a variable out of your main() in GLSL causes the compiler to delete all references to that variable, leaving you with a dangling reference to an attribute or uniform.
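A sketch of the defensive check implied above; the attribute name, stride, and offset are invented for illustration:

// Query the location instead of trusting the declared layout slot;
// attributes the GLSL compiler optimized away come back as -1.
GLint loc = glGetAttribLocation(program, "instanceInverseNormal");
if (loc >= 0) {
    glEnableVertexAttribArray(loc);
    glVertexAttribPointer(loc, 4, GL_FLOAT, GL_FALSE, stride, offset);
    glVertexAttribDivisor(loc, 1); // per-instance data
}
// else: leave the array disabled, so the driver (hopefully) doesn't
// fall back to the slow path.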
I'm writing a rendering app that communicates with an image processor as a sort of virtual camera, and I'm trying to figure out the fastest way to write the texture data from one process to the awaiting image buffer in the other.
Theoretically I think it should be possible with 1 DirectX copy from VRAM directly to the area of memory I want it in, but I can't figure out how to specify a region of memory for a texture to occupy, and thus must perform an additional memcpy. DX9 or DX11 solutions would be welcome.
So far, the docs here: http://msdn.microsoft.com/en-us/library/windows/desktop/bb174363(v=vs.85).aspx have held the most promise.
"In Windows Vista CreateTexture can create a texture from a system memory pointer allowing the application more flexibility over the use, allocation and deletion of the system memory"
I'm running on Windows 7 with the June 2010 DirectX SDK. However, whenever I try to use the function in the way it specifies, it fails with an invalid-arguments error code. Here is the call I tried as a test:
static char s_TextureBuffer[640*480*4]; //larger than needed
void* p = (void*)s_TextureBuffer;
HRESULT res = g_D3D9Device->CreateTexture(640,480,1,0, D3DFORMAT::D3DFMT_L8, D3DPOOL::D3DPOOL_SYSTEMMEM, &g_ReadTexture, (void**)p);
I tried with several different texture formats, but with no luck. I've begun looking into DX11 solutions, but it's going slowly since I'm used to DX9. Thanks!
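For what it's worth, my reading of those docs is that pSharedHandle must be the address of a variable holding the system-memory pointer, not the pointer itself cast to HANDLE*; if that's right, the call would look something like this (untested sketch):

static char s_TextureBuffer[640 * 480]; // D3DFMT_L8: one byte per texel,
                                        // pitch must equal width * bpp
HANDLE hMem = (HANDLE)s_TextureBuffer;  // variable holding the memory pointer
HRESULT res = g_D3D9Device->CreateTexture(
    640, 480,
    1,                  // the docs require exactly one mip level for this path
    0,                  // and zero usage flags
    D3DFMT_L8,
    D3DPOOL_SYSTEMMEM,  // and the system-memory pool
    &g_ReadTexture,
    &hMem);             // address of the variable, i.e. a pointer to the pointer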
I want to get the adapter RAM (graphics RAM) that you can see in Display settings or Device Manager, using an API, from a C++ application.
I have tried searching the net, and from my research I have come to the conclusion that we can get the graphics memory info from:
1. The DirectX SDK structure DXGI_ADAPTER_DESC. But what if I don't want to use the DirectX API?
2. Win32_VideoController: but this class does not always give you the AdapterRAM info if the availability of the video controller is offline. I have checked this on Vista.
Is there any other way to get the graphics RAM?
There is NO way to directly get at graphics RAM on Windows; Windows prevents you from doing this, as it maintains control over what is displayed.
You CAN, however, create a DirectX device, get the back buffer surface, and then lock it. After locking you can fill it with whatever you want, then unlock and call Present. This is slow, though, as you have to copy the video memory back across the bus into main memory. Some cards also use "swizzled" formats that have to be un-swizzled during the copy. This adds further time, and some cards will even ban you from doing it.
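In D3D9 terms that approach looks roughly like this (a sketch; device is a placeholder IDirect3DDevice9*, the back buffer must have been created lockable, and error handling is omitted):

IDirect3DSurface9* backBuffer = NULL;
device->GetBackBuffer(0, 0, D3DBACKBUFFER_TYPE_MONO, &backBuffer);

D3DLOCKED_RECT lr;
if (SUCCEEDED(backBuffer->LockRect(&lr, NULL, 0)))
{
    /* lr.pBits points at the surface memory (possibly swizzled);
       lr.Pitch is the byte stride between rows. Fill pixels here. */
    backBuffer->UnlockRect();
}
backBuffer->Release();
device->Present(NULL, NULL, NULL, NULL);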
In general you want to avoid directly accessing the video card and let Windows/DirectX do the drawing for you. Under D3D1x I'm pretty sure you can do it via an IDXGIOutput. It really is something to try and avoid, though...
You can write to a linear array via standard Win32 (this example assumes C), but it's quite involved.
First you need the linear array.
unsigned int* pBits = malloc( width * height * sizeof(unsigned int) ); /* 4 bytes per pixel */
Then you need to create a bitmap and select it into a memory DC.

HDC hDC = CreateCompatibleDC( NULL );
HBITMAP hBitmap = CreateBitmap( width, height, 1, 32, NULL );
SelectObject( hDC, (HGDIOBJ)hBitmap );
You can then fill the pBits array as you please. When you've finished, you can set the bitmap's bits:

SetBitmapBits( hBitmap, width * height * 4, (void*)pBits );
When you've finished using your bitmap, don't forget to delete it (using DeleteObject) AND to free your linear array!
Edit: There is only one way to reliably get the video RAM, and that is to go through the DxDiag interfaces. Have a look at IDxDiagProvider and IDxDiagContainer in the DX SDK.
Win32_VideoController is your best course for getting the amount of graphics memory. That's how it's done in the Doom 3 source.
You say "...availability of the video controller is offline. I have checked this on Vista." Under what circumstances would the video controller be offline?
Incidentally, you can find the Doom 3 source here. The function you're looking for is called Sys_GetVideoRam, and it's in a file called win_shared.cpp, although if you do a solution-wide search it'll turn up for you.
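For reference, a bare-bones WMI query for that field might look like the sketch below; COM security setup (CoInitializeSecurity/CoSetProxyBlanket) and most error handling are omitted, and note that AdapterRAM is a 32-bit field, so it caps out at 4 GB:

#include <wbemidl.h>
#include <comdef.h>
#pragma comment(lib, "wbemuuid.lib")

unsigned long GetAdapterRamBytes(void)
{
    unsigned long ram = 0;
    CoInitializeEx(NULL, COINIT_MULTITHREADED);

    IWbemLocator* locator = NULL;
    CoCreateInstance(CLSID_WbemLocator, NULL, CLSCTX_INPROC_SERVER,
                     IID_IWbemLocator, (void**)&locator);

    IWbemServices* services = NULL;
    locator->ConnectServer(_bstr_t(L"ROOT\\CIMV2"), NULL, NULL, NULL,
                           0, NULL, NULL, &services);

    IEnumWbemClassObject* results = NULL;
    services->ExecQuery(
        _bstr_t(L"WQL"),
        _bstr_t(L"SELECT AdapterRAM FROM Win32_VideoController"),
        WBEM_FLAG_FORWARD_ONLY | WBEM_FLAG_RETURN_IMMEDIATELY,
        NULL, &results);

    IWbemClassObject* obj = NULL;
    ULONG returned = 0;
    while (results->Next(WBEM_INFINITE, 1, &obj, &returned) == S_OK)
    {
        VARIANT v;
        VariantInit(&v);
        /* CIM uint32 typically arrives as VT_I4 in the VARIANT */
        if (SUCCEEDED(obj->Get(L"AdapterRAM", 0, &v, NULL, NULL))
            && (v.vt == VT_I4 || v.vt == VT_UI4))
            ram = (unsigned long)v.lVal;
        VariantClear(&v);
        obj->Release();
    }

    /* real code should Release() results/services/locator
       and call CoUninitialize() */
    return ram;
}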
User-mode threads cannot access memory regions and I/O mapped from hardware devices, including the framebuffer. Anyway, why would you want to do that? Suppose you could access the framebuffer directly: now you must handle a LOT of possible pixel formats. You can't just assume a 32-bit RGBA or ARGB organization; there is the possibility of 15/16/24-bit displays (RGBA555, RGBA5551, RGBA4444, RGBA565, RGBA888...), and that's if you don't also want to support the video-surface formats (overlays) such as YUV-based ones.
So let the display driver and/or the underlying APIs do that work.
If you want to write to a display surface (which is not exactly the same as framebuffer memory, although conceptually it's almost the same), there are a lot of options: DX, Win32, or you may try the SDL library (libsdl).