DirectX 11 Swap Chain with 7 back buffers - performance

I have a proprietary media player that runs on Windows 8 in desktop mode. The runtime DirectX version is 11, but the graphics driver's native support is for DirectX 9.
On some computers with the exact same setup, the swap chain's actual back buffer count is 2 and performance is great; on others, the back buffer count is 7 and frames are dropped.
I don't have the source code of the player and wonder what determines the different back buffer counts at runtime.
Can someone please explain why the back buffer count leads to such a change in performance? Or just point me to relevant documentation that explains the implications of the number of back buffers?
(More debugging info: using GPUView I see that when the back buffer count is 2, the hardware works in a synchronized mode, i.e. one packet in the HW queue on every second VSync (the clip frame rate is 30 fps), whereas with 7 back buffers the work is done for 5-7 frames together, then some empty VSyncs, then 5-7 frames again, and so on.)
Thank you in advance!

I don't really see the use of having more than 4 buffers (quad buffering, which is used for stereoscopy). Most applications use 2 buffers (double buffering) so that the application can start drawing the next frame into the second (back) buffer while the first (front) buffer is being presented to the monitor; otherwise, the application would have to wait until the front buffer has finished being presented before it could start drawing the next frame. Triple buffering just expands on this idea, so that there are two back buffers. That way, if the application finishes drawing an entire buffer faster than the front buffer takes to be presented, it can start drawing the next frame into the third buffer instead of waiting for the front buffer to finish presenting.
I'm not sure if that really answers your question about other apps using 7 buffers, but again I don't think there's a need, since monitors usually only refresh at 60 to 75 Hz.
If your application is running so fast that it can draw 2 buffers before the first buffer has finished presenting, just put the app to sleep until the front buffer is finished, to give other programs a chance to use the CPU, or spend that extra time doing some other processing for your app. If it's a media player, you could spend that extra time on more expensive operations to increase the quality of the playback.
Here's a link describing buffering, though it doesn't talk about more than 4 buffers, probably because there is no need:
http://en.wikipedia.org/wiki/Multiple_buffering
P.S.
Maybe the reason the application loses some frame rate when using 7 buffers is that it can't keep up writing to all of the buffers before they need to be presented to the screen. This probably wouldn't be the case if multi-threading were being used, because then the next buffer could be presented to the screen before the app finished drawing to all the other back buffers.
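For reference, the back buffer count is simply whatever the application asked for when it created its swap chain; there is no way to change it from outside the player. As a hedged illustration only (this is not the player's actual code; hwnd and the display mode are placeholders, and error handling is omitted), a windowed D3D11 swap chain requesting 2 back buffers is created roughly like this:

#include <d3d11.h>

// Sketch: request a swap chain with 2 back buffers. 'hwnd' is a placeholder for an
// existing window handle; error handling is omitted.
void createSwapChain(HWND hwnd, IDXGISwapChain** outSwapChain,
                     ID3D11Device** outDevice, ID3D11DeviceContext** outContext)
{
    DXGI_SWAP_CHAIN_DESC scd = {};
    scd.BufferCount = 2;                                 // number of back buffers requested
    scd.BufferDesc.Width = 800;
    scd.BufferDesc.Height = 600;
    scd.BufferDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    scd.BufferDesc.RefreshRate.Numerator = 60;
    scd.BufferDesc.RefreshRate.Denominator = 1;
    scd.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    scd.OutputWindow = hwnd;
    scd.SampleDesc.Count = 1;
    scd.Windowed = TRUE;
    scd.SwapEffect = DXGI_SWAP_EFFECT_DISCARD;

    D3D11CreateDeviceAndSwapChain(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                                  nullptr, 0, D3D11_SDK_VERSION, &scd,
                                  outSwapChain, outDevice, nullptr, outContext);
}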

Well, I got an answer from Microsoft. This is done in order to save power when running on DC (battery): that way the processor can wake up, process all available buffers, send them to the GPU to work on, and then move to a deeper power-saving mode for a longer time.

Related

Synchronizing with monitor refreshes without vsync

What is the preferred way of synchronizing with monitor refreshes when vsync is not an option? We enable vsync; however, some users disable it in their driver settings, and those settings override app preferences. We need reliable, predictable frame lengths to simulate the world correctly, do some visual effects, and synchronize audio (more precisely, we need to estimate how long a frame is going to be on screen, and when it will be on screen).
Is there any way to force drivers to enable vsync despite what the user set in the driver? Or to ask Windows when a monitor refresh is going to happen? We have issues with manual sleeping when our frame boundaries line up closely with vblank. It causes occasional missed frames and up to 1 extra frame of input latency.
We mainly use OpenGL, but Direct3D advice is also appreciated.
You should not build your application's timing on the basis of vsync and exact timings of frame presentation. Games don't do that these days and have not done so for quite some time. That is what allows them to keep a consistent speed even if they start dropping frames: their timing, physics computations, AI, etc. aren't based on when a frame gets displayed, but on actual elapsed time.
Game frame timings are typically small enough (less than 50 ms) that human beings cannot detect any audio/video synchronization issues. So if you want to display an image that should have a sound played alongside it, as long as the sound starts within about 30 ms or so of the image, you're fine.
Oh and don't bother trying to switch to Vulkan/D3D12 to resolve this problem. They don't. Vulkan in particular decouples presentation from other tasks, making it basically impossible to know the exact time when an image starts appearing on the screen. You give Vulkan an image, and it presents it... at whatever is the next most opportune moment. You get some control over how that moment gets chosen, but even those choices can be restricted based on factors outside of your control.
Design your program to avoid the need for rigid vsync. Use internal timings instead.
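As a rough illustration of "use internal timings", a simulation can advance the world in fixed steps based on a monotonic clock, independently of when frames actually reach the screen. A minimal sketch, where update() and render() stand in for your own simulation and drawing code:

#include <chrono>

// Sketch of a fixed-timestep loop driven by a monotonic clock instead of vsync.
// update() and render() are placeholders for the game's simulation and drawing code.
void update(double dtSeconds);
void render();

void runLoop(bool& running)
{
    using clock = std::chrono::steady_clock;
    const std::chrono::duration<double> step(1.0 / 60.0);   // fixed simulation step
    std::chrono::duration<double> accumulator(0.0);
    auto previous = clock::now();

    while (running)
    {
        auto now = clock::now();
        accumulator += now - previous;
        previous = now;

        while (accumulator >= step)        // advance the world in fixed steps
        {
            update(step.count());
            accumulator -= step;
        }
        render();                          // present whenever a frame is ready
    }
}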

glBufferSubData is very slow on many Android devices

I have allocated about 2M of GL buffers to share, and I update the vertex and index data with glBufferSubData. It works fine on my iOS devices, but when I test it on my Android devices it is very, very slow.
I have found some notes from the official website:
https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glBufferSubData.xhtml
What does "that rendering must drain from the pipeline before the data store can be updated" really mean?
The performance difference you're seeing is likely not simply an iOS/Android difference but will be very specific to both your usage of the API and the implementation of glBufferSubData in the driver. Without seeing more code, or knowing what performance metrics you're gathering, it's hard to comment further.
What does "that rendering must drain from the pipeline before the data store can be updated" really mean?
The idea here is that whilst the OpenGL API gives the illusion that each command is executed to completion before continuing, in fact drawing is generally batched up and done asynchronously in the background. The problem is that glBufferSubData potentially adds a synchronisation point, which means the driver has to stall until all previous rendering using that buffer has completed before continuing.
Consider the following example. In a good case, we might have something like this:
glBufferSubData into buffer 1 with ABCDE
Draw call using buffer 1
glBufferSubData into buffer 2 with FGHIJ
Draw call using buffer 2
Swap buffers <----- Synchronisation point, the driver must wait for rendering to finish before swapping the buffers
However, if you're overwriting the same buffer, you will get this instead:
glBufferSubData into buffer 1 with ABCDE
Draw call using buffer 1
glBufferSubData into buffer 1, overwriting with FGHIJ <----- Synchronisation point, as the driver must ensure that the buffer has finished being used by first draw call before modifying the data
Draw call using updated buffer 1
Swap buffers <----- Synchronisation point, the driver must wait for rendering to finish before swapping the buffers
As you can see, you can potentially end up with a second synchronisation point. However as mentioned before, this is somewhat driver specific. For example some drivers might be able to detect the case where the section of the buffer you're updating isn't in use by the previous draw call, whilst others might not. Something of this nature is probably what's causing the performance difference you're seeing.
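One common way to sidestep that second synchronisation point is to cycle between two (or more) buffer objects, or to orphan the buffer's storage before refilling it, so the driver never has to wait on an in-flight draw call. A rough sketch, where the buffer size and vertex data are placeholders for your own:

#include <GLES2/gl2.h>   // or the appropriate GL header for your platform

// Sketch: cycle between two VBOs so glBufferSubData never touches a buffer that a
// previous draw call may still be reading. 'vertices', 'vertexCount', and 'bufferSize'
// are placeholders for your own data.
GLuint vbos[2];
int current = 0;

void initBuffers(GLsizeiptr bufferSize)
{
    glGenBuffers(2, vbos);
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_ARRAY_BUFFER, vbos[i]);
        glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW);  // allocate once
    }
}

void drawFrame(const float* vertices, GLsizei vertexCount)
{
    current = (current + 1) % 2;              // use the buffer not touched last frame
    glBindBuffer(GL_ARRAY_BUFFER, vbos[current]);
    // Alternative: orphan the storage first so the driver can hand back fresh memory:
    // glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW);
    glBufferSubData(GL_ARRAY_BUFFER, 0, vertexCount * 3 * sizeof(float), vertices);
    // ... set up vertex attributes as usual ...
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}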

windowed OpenGL first frame delay after idle

I have a windowed WinAPI/OpenGL app. The scene is drawn rarely (compared to games) in WM_PAINT, mostly triggered by user input - WM_MOUSEMOVE, clicks, etc.
I noticed that when the scene is not being moved by the user's mouse (the application is "idle") and some mouse action then starts, the first frame is drawn with an unpleasant delay - around 300 ms. The following frames are fast again.
I implemented a 100 ms timer which only calls InvalidateRect, which is later followed by WM_PAINT and drawing the scene. This "fixed" the problem, but I don't like this solution.
I'd like to know why this is happening, and I would also appreciate some tips on how to tackle it.
Does the OpenGL render context free resources when not in use? Or could this be caused by some system behaviour, like processor underclocking/energy saving, etc.? (Although I noticed that the processor runs underclocked even when the app is under "load".)
This sounds like the Windows virtual memory system at work. The sum of the memory used by all active programs is usually greater than the amount of physical memory installed in the system, so Windows swaps idle processes out to disk according to whatever rules it follows, such as the relative priority of each process and how long it has been idle.
You are preventing the swap-out (and the delay) by artificially making the program active every 100 ms.
If a swapped-out process is reactivated, it takes a little time to retrieve the memory contents from disk and restart the process.
It's unlikely that OpenGL is responsible for this delay.
You can improve the situation by starting your program with a higher priority.
https://superuser.com/questions/699651/start-process-in-high-priority
You can also use the VirtualLock function to prevent Windows from swapping out part of the memory, but this is not advisable unless you REALLY know what you are doing!
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366895(v=vs.85).aspx
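For completeness, a minimal sketch of what locking a region looks like; 'buffer' and 'bufferSize' are placeholders for whatever must stay resident, and as noted above this should be used with great care:

#include <windows.h>

// Sketch: pin a buffer into physical memory so it cannot be paged out.
// 'buffer' and 'bufferSize' are placeholders.
bool pinBuffer(void* buffer, SIZE_T bufferSize)
{
    if (!VirtualLock(buffer, bufferSize))
    {
        // The default working-set quota is small; it may need to be raised with
        // SetProcessWorkingSetSize before VirtualLock will succeed.
        return false;
    }
    return true;
}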
EDIT: You can certainly improve things by adding more memory; 4 GB sounds low for a modern PC, especially if you run Chrome with multiple tabs open.
If you want to be scientific before spending any hard-earned cash :-), open Performance Monitor and look at Cache Faults/sec. This will show the paging activity on your machine. (I have 16 GB on my PC, so this number is mostly very low.) To quantify the difference, check Cache Faults/sec before and after the memory upgrade.
Finally, there is nothing wrong with the solution you already found: kick-starting the graphics app every 100 ms or so.
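That kick-start can be as simple as a WM_TIMER that invalidates the window every 100 ms. A rough sketch (the timer id is arbitrary and the actual drawing code is elided):

#include <windows.h>

// Sketch: keep the app "warm" by forcing a repaint every 100 ms.
const UINT_PTR kKeepAliveTimer = 1;   // arbitrary timer id
// After creating the window: SetTimer(hwnd, kKeepAliveTimer, 100, NULL);

LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg)
    {
    case WM_TIMER:
        if (wParam == kKeepAliveTimer)
            InvalidateRect(hwnd, NULL, FALSE);   // triggers WM_PAINT, which redraws the scene
        return 0;
    case WM_PAINT:
    {
        PAINTSTRUCT ps;
        BeginPaint(hwnd, &ps);
        // ... draw the OpenGL scene and SwapBuffers here ...
        EndPaint(hwnd, &ps);
        return 0;
    }
    default:
        return DefWindowProc(hwnd, msg, wParam, lParam);
    }
}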
The problem was in the NVIDIA driver's global 3D setting "Power management mode".
The options "Optimal Power" and "Adaptive" save power and cause the problem.
Only "Prefer Maximum Performance" does the right thing.

Rendering in DirectX 11

When a frame starts, I do my logical update and render after that.
In my render code I do the usual stuff. I set a few states, buffers, and textures, and end by calling Draw.
m_deviceContext->Draw(nbVertices, 0);
At the end of the frame I call Present to show the rendered frame.
// Present the back buffer to the screen since rendering is complete.
if (m_vsync_enabled)
{
    // Lock to screen refresh rate.
    m_swapChain->Present(1, 0);
}
else
{
    // Present as fast as possible.
    m_swapChain->Present(0, 0);
}
Usual stuff. Now, when I call Draw, according to MSDN:
Draw submits work to the rendering pipeline.
Does that mean the data is sent to the GPU and the main thread (the one that called Draw) continues? Or does it wait for rendering to finish?
In my opinion, only the Present function should make the main thread wait for rendering to finish.
There are a number of calls which can trigger the GPU to start working, Draw being one; others include Dispatch, CopyResource, etc. What the MSDN docs are trying to say is that calls like PSSetShader, IASetPrimitiveTopology, etc. don't really do anything until you call Draw.
When you call Present, that is taken as an implicit indicator of 'end of frame', but your program can often continue setting up rendering calls for the next frame well before the first frame is done and showing. By default, Windows will let you queue up to 3 frames ahead before blocking your CPU thread on the Present call to let the GPU catch up; in real-time rendering you usually don't want the latency between input and render to be very high.
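If that default queue depth is more than you want, DXGI exposes a knob for it. A hedged sketch, assuming you already have an ID3D11Device* at hand (error handling abbreviated):

#include <d3d11.h>
#include <dxgi.h>

// Sketch: lower the number of frames the runtime will queue ahead of the GPU.
// 'device' is assumed to be an existing ID3D11Device*.
void limitFrameLatency(ID3D11Device* device)
{
    IDXGIDevice1* dxgiDevice = nullptr;
    if (SUCCEEDED(device->QueryInterface(__uuidof(IDXGIDevice1),
                                         reinterpret_cast<void**>(&dxgiDevice))))
    {
        dxgiDevice->SetMaximumFrameLatency(1);   // queue at most 1 frame (default is 3)
        dxgiDevice->Release();
    }
}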
The fact is, however, that GPU/CPU synchronization is complicated, and the Direct3D runtime is also batching up requests to minimize kernel-call overhead, so the actual work could be happening after many Draws have been submitted to the command queue. This old article gives you the flavor of how this works. On modern GPUs you can also have various memory operations for paging in memory, setting up physical video memory areas, etc.
BTW, all this 'magic' doesn't exist with Direct3D 12, but that means the application has to do everything at the 'right' time to ensure it is both efficient and functional. The programmer much more directly builds up command queues, triggers work on the various pixel and compute GPU engines, and does all the messy stuff that Direct3D 11's runtime handles a little more abstractly and automatically. Even then, ultimately the video driver is the one actually talking to the hardware, so it can do other kinds of optimizations as well.
The general rules of thumb here to keep in mind:
Creating resources is expensive, especially runtime shader compilation (by the HLSL compiler) and runtime shader blob optimization (by the driver).
Copying resources to the GPU (i.e. loading texture data from CPU memory) requires bus bandwidth, which is limited in supply: prefer to keep texture, VB, and IB data in static buffers you reuse (see the sketch after this list).
Copying resources from the GPU (i.e. moving GPU memory to CPU memory) uses a back channel that is slower than going to the GPU: try to avoid the need for readback from the GPU.
Submitting larger chunks of geometry per Draw call helps amortize overhead (i.e. calling Draw once for 10,000 triangles with the same state/shader is much faster than calling Draw 10 times for 1,000 triangles each, changing state/shaders between calls).
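As an illustration of the static-buffer point above, vertex data that never changes can go into an immutable buffer so it is uploaded once and reused every frame. A sketch only; 'device' and the triangle data are placeholders and error handling is omitted:

#include <d3d11.h>

// Sketch: create an immutable vertex buffer whose data is supplied once at creation.
// 'device' is assumed to be an existing ID3D11Device*.
ID3D11Buffer* createStaticVertexBuffer(ID3D11Device* device)
{
    struct Vertex { float x, y, z; };
    static const Vertex vertices[] = { {0, 0, 0}, {1, 0, 0}, {0, 1, 0} };

    D3D11_BUFFER_DESC desc = {};
    desc.Usage = D3D11_USAGE_IMMUTABLE;          // GPU-read-only, filled at creation time
    desc.ByteWidth = sizeof(vertices);
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

    D3D11_SUBRESOURCE_DATA initData = {};
    initData.pSysMem = vertices;

    ID3D11Buffer* vertexBuffer = nullptr;
    device->CreateBuffer(&desc, &initData, &vertexBuffer);
    return vertexBuffer;
}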

Smooth play of live network audio samples

I am writing a client/server app in which the server captures live audio samples from some external device (a mic, for example) and sends them to the client; the client then wants to play those samples. My app will run on a local network, so I have no problem with bandwidth (my sound is 8 kHz, 8-bit stereo, while my network card is 1000 Mb). On the client I buffer the data for a short time and then start playback, and as data arrives from the server I send it to the sound card. This seems to work fine, but there is a problem:
when the buffer on the client side runs out, I experience gaps in the played sound.
I believe this is because of the difference in sampling clocks between the server and the client; it means that 8 kHz on the server is not the same as 8 kHz on the client.
I can solve this by pausing the client's playback and buffering again, but my boss doesn't accept that, since I have plenty of bandwidth and should be able to play the sound with no gaps or pauses.
So I decided to dynamically change the playback speed on the client, but I don't know how.
I am programming on Windows (native) and currently use waveOutXXX to play the sound. I can use any other native library (DirectX/DirectSound, JACK, or ...) as long as it provides smooth playback on the client.
I have programmed with waveOutXXX many times without any problems and I know it well, but I can't solve this problem of dynamic resampling.
I would suggest that your problem isn't likely due to mismatched sample rates, but rather something to do with your buffering. You should be continuously feeding data to the sound card and continuously filling your buffer. Use a reasonable buffer size; 300 ms should be enough for most applications.
Now, over long periods of time, it is possible for the clock on the recording side and the clock on the playback side to drift apart enough that the 300 ms buffer is no longer sufficient. Rather than resampling for such a small difference, which could introduce artifacts, I would suggest simply adding samples at the encoding end. You still record at 8 kHz, but you might add a sample or two every second, making it 8.001 kHz or so. Simply duplicating one of the existing samples (or even using a simple average of one sample and the next) will not be audible. Adjust this as necessary for your application.
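A rough sketch of that idea, shown for a mono 8 kHz stream of 8-bit samples (for stereo you would duplicate a whole frame, i.e. both channels, so the channels don't get swapped):

#include <cstdint>
#include <vector>

// Sketch: stretch the stream very slightly by duplicating one sample per second
// (8000 samples at 8 kHz), nudging the effective rate to roughly 8.001 kHz.
std::vector<uint8_t> stretchOneSamplePerSecond(const std::vector<uint8_t>& in)
{
    const size_t samplesPerSecond = 8000;
    std::vector<uint8_t> out;
    out.reserve(in.size() + in.size() / samplesPerSecond + 1);
    for (size_t i = 0; i < in.size(); ++i)
    {
        out.push_back(in[i]);
        if (i % samplesPerSecond == samplesPerSecond - 1)
            out.push_back(in[i]);   // duplicate (or average with the next sample)
    }
    return out;
}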
I had a similar problem in an application I worked on. It did not involve network, but it did involve source data being captured in real-time at a certain fixed sampling rate, a large amount of signal processing, and finally output to the sound card at a fixed rate. Like you, I had gaps in the playback at buffer boundaries.
It seemed to me like the problem was that the processing being done caused audio data to make it to the sound card in a very jerky manner. That is, it would get a large chunk, then it would be a long time before it got another chunk. The overall throughput was correct, but this latency caused the sound card to often be starved for data. I suppose you may have the same situation with the network piece in your system.
The way I solved it was to first make the audio buffer longer. Then, every time a new chunk of audio was received, I checked how full the buffer was. If it was less than 20% full, I would write some silence to make it around 60% full.
You may think this goes against reducing gaps in playback since it adds a gap of its own, but it actually helps. The problem I was having was that even though I had a significantly large audio buffer, I was always right on the verge of it being empty. With the other latencies in the system, this resulted in playback gaps on almost every buffer.
Writing the silence when the buffer started to get empty, but before it actually did, ensured that the buffer always had some data to spare if the processing fell behind a little. Also, a single small gap in playback is much harder to notice than many periodic gaps.
I don't know if this will work for you, but it should be easy to implement and try out.
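A minimal sketch of that low-water-mark check; the container, capacity, and thresholds are placeholders, matching 8-bit unsigned audio where silence is 128:

#include <cstdint>
#include <deque>

// Sketch: when the playback buffer drops below ~20% of capacity, pad it with silence
// up to ~60% so the sound card is never starved.
void topUpWithSilence(std::deque<uint8_t>& playbackBuffer, size_t capacity)
{
    if (playbackBuffer.size() < capacity / 5)          // below ~20% full
    {
        const size_t target = capacity * 3 / 5;        // refill to ~60% full
        while (playbackBuffer.size() < target)
            playbackBuffer.push_back(128);             // one sample of silence
    }
}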
