I am builing an interactive application using the latest wxWidgets(3.0.1) and OpenGL(3.3) on Windows. I have gotten to the point where I have a wxGLcanvas rendering onto a wxPanel and it works fine. The rendering is done in the paint event for that GLcanvas. However, I now want to perform simulations with an accurate delta time between updates.
So essentially I would like some sort of functionality that would allow me to have a method like
void Update(float dt)
{
// Update simulation with accurate time-step
}
I have come across timers but I'm not sure how I could incorporate it into my application to get an accurate dt. Some have mentioned creating a separate thread for that panel to update independently from the rest of the application. Is this an option and roughly how is it implemented if that is the case?
Timer precision should be good enough to achieve at least ~50 FPS, so they should be good enough for your purpose. You need to use a timer to get regular calls to your timer event handler which can then use a more accurate method (as timers are not guaranteed to be perfectly regular) to determine whether an updated is needed, e.g. using wxDateTime::UNow(), and queue a Refresh() in this case.
Notice that you still will not be able to guarantee regular window refreshes using normal GUI frameworks.
What operating system are you using? There is no way in Windows to guarantee an accurate time for an event. Unix, however is a real time operating system ( RTOS ) and it is possible to guarantee event times.
What range of delta times do you care about? The distinction between RTOS's and Windows only matters for times less than 2 to 3 milliseconds ( unless windows becomes ridiculously overloaded ) Since any worthwhile 3D scene will take more that 3 milliseconds to render, this is probably a non-issue!
Of most significance in this kind of question is the response time of human vision. Watching a dynamic diagram such as a plot, people are unaware of any refresh rate faster than 300 ms. If you want to attempt a video realistic effect, you might need a refresh rate approaching 25 ms - but then your problem will not be the accuracy of your timer, but the speed of your scene rendering. So, once again, this is a non-issue.
Related
I am developing a GPU-based simulation using OpenGL and GLSL-Shaders and i found that performance increases when I add additional (unnecessary) GL-commands.
The simulation runs entirely on GPU without any transfers and basically consists of a loop performing 2500 algorithmically identical time steps. I carefully implemented caching of GLSL-uniform locations and removed any GL-state requests (glGet* etc) to maximize speed. To measure wall clock time i've put a glFinish after the main loop and take the elapsed time afterwards.
CASE A:
Normal total runtime for all iterations is 490ms.
CASE B:
Now, if i add a single additional glGetUniformLocation(...) command at the end of EACH time step, it requires only 475ms in total, which is 3 percent faster. (Please note that this is relevant to me since later i will perform a lot more timesteps)
I've looked at a timeline captured with Nvidia nsight and found that, in case A, all opengl commands are issued within the first 140ms and the glFinish takes 348ms until completion of all GPU-work. In case B the issuing of opengl commands is spread out over a significantly longer time (410ms) and the glFinish only takes 64ms yielding the faster 475ms in total.
I also noticed, that hardware command queue is much more full of work packets most of the time in case B, whereas in case A there is only one item waiting most of the time (however, there are no visible idle times).
So my questions are:
Why is B faster?
Why are the command packages issued more uniformly to the hardware queue over time in case A?
How can speed be enhanced without adding additional commands?
I am using Visual c++, VS2008 on Win7 x64.
IMHO this question can not be answered definitely. For what it's worth I experimentally determined, that glFinish (and …SwapBuffers for that matter) have weird runtime time behavior. I'm currently developing my own VR rendering library and prior to that I spend some significant time profiling the timelines of OpenGL commands and their interaction with the graphics system. And what I found out was, that the only thing that's consistent is, that glFinish + …SwapBuffers have a very inconsistent timing behavior.
What could happen is, that this glGetUniformLocation call pulls the OpenGL driver into a "busy" state. If you call glFinish immediately afterwards it may use a different method for waiting (for example it may spin in a while loop waiting for a flag) for the GPU than if you just call glFinish (it may for example wait for a signal or a condition variable and is thus subject to the kernels scheduling behavior).
In Mac OS X's OpenGL Profiler app, I can get statistics regarding how long each GL function call takes. However, the results show that a ton of time is spent in flush commands (glFlush, glFlushRenderAPPLE, CGLFlushDrawable) and in glDrawElements, and every other GL function call's time is negligibly small.
I assume this is because OpenGL is enqueueing the commands I submit, and waiting until flushing or drawing to actually execute the commands.
I guess I could do something like this:
glFlush();
startTiming();
glDoSomething();
glFlush();
stopTimingAndRecordDelta();
...and insert that pattern around every GL function call my app makes, but that would be tedious since there are thousands of GL function calls throughout my app, and I'd have to tabulate the results manually (instead of using the already-existent OpenGL Profiler tool).
So, is there a way to disable all OpenGL command queueing, so I can get more accurate profiling results?
So, is there a way to disable all OpenGL command queueing, ...
No, there isn't an OpenGL function that does that.
..., so I can get more accurate profiling results?
You can get more accurate information than you are currently, but you'll never get really precise answers (but you can probably get what you need). While the results of OpenGL rendering are the "same" — OpenGL's not guaranteed to be pixel-accurate across implementations — they're supposed to be very close. However, how the pixels are generated can vary drastically. In particular, tiled-reneders (in mobile and embedded devices) usually don't render pixels during a draw call, but rather queue up the geometry, and generate the pixels at buffer swap.
That said, for profiling OpenGL, you want to use glFinish, instead of glFlush. glFinish will force all pending OpenGL calls to complete and return; glFlush merely requests that commands be sent to the OpenGL "at some time in the future", so it's not deterministic. Be sure to remove your glFinish in your "production" code, since it will really slow down your application. From your example, if you replace the flushes with finishes in your example, you'll get more interesting information.
You are using OpenGL 3, and in particular discussing OS X. Mavericks (10.9) supports Timer Queries, which you can use to time a single GL operation or an entire sequence of operations at the pipeline level. That is, how long they take to execute when GL actually gets around to performing them, rather than timing how long a particular API call takes to return (which is often meaningless). You can only have a single timer query in the pipeline at a given time unfortunately, so you may have to structure your software cleverly to make best use of them if you want command-level granularity.
I use them in my own work to time individual stages of the graphics engine. Things like how long it takes to update shadow maps, build the G-Buffers, perform deferred / forward lighting, individual HDR post-processing effects, etc. It really helps identify bottlenecks if you structure the timer queries this way instead of focusing on individual commands.
For instance on some filtrate limited hardware shadow map generation is the biggest bottleneck, on other shader limited hardware, lighting is. You can even use the results to determine the optimal shadow map resolution or lighting quality to meet a target framerate for a particular host without requiring the user to set these parameters manually. If you simply timed how long the individual operations took you would never get the bigger picture, but if you time entire sequences of commands that actually do some major part of your rendering you get neatly packed information that can be a lot more useful than even the output from profilers.
There are plenty of examples in Windows of applications triggering code at fairly high and stable framerates without spiking the CPU.
WPF/Silverlight/WinRT applications can do this, for example. So can browsers and media players. How exactly do they do this, and what API calls would I make to achieve the same effect from a Win32 application?
Clock polling doesn't work, of course, because that spikes the CPU. Neither does Sleep(), because you only get around 50ms granularity at best.
They are using multimedia timers. You can find information on MSDN here
Only the view is invalidated (f.e. with InvalidateRect)on each multimedia timer event. Drawing happens in the WM_PAINT / OnPaint handler.
Actually, there's nothing wrong with sleep.
You can use a combination of QueryPerformanceCounter/QueryPerformanceFrequency to obtain very accurate timings and on average you can create a loop which ticks forward on average exactly when it's supposed to.
I have never seen a sleep to miss it's deadline by as much as 50 ms however, I've seen plenty of naive timers that drift. i.e. accumalte a small delay and conincedentally updates noticable irregular intervals. This is what causes uneven framerates.
If you play a very short beep on every n:th frame, this is very audiable.
Also, logic and rendering can be run independently of each other. The CPU might not appear to be that busy, but I bet you the GPU is hard at work.
Now, about not hogging the CPU. CPU usage is just a break down of CPU time spent by a process under a given sample (the thread schedulerer actually tracks this). If you have a target of 30 Hz for your game. You're limited to 33ms per frame, otherwise you'll be lagging behind (too slow CPU or too slow code), if you can't hit this target you won't be running at 30 Hz and if you hit it under 33ms then you can yield processor time, effectivly freeing up resources.
This might be an intresting read for you as well.
On a side note, instead of yielding time you could effecivly be doing prepwork for future computations. Some games when they are not under the heaviest of loads actually do things as sorting and memory defragmentation, a little bit here and there, adds up in the end.
Recently I was doing some deep timing checks on a DirectShow application I have in Delphi 6, using the DSPACK components. As part of my diagnostics, I created a Critical Section class that adds a time-out feature to the usual Critical Section object found in most Windows programming languages. If the time duration between the first Acquire() and the last matching Release() is more than X milliseconds, an Exception is thrown.
Initially I set the time-out at 10 milliseconds. The code I have wrapped in Critical Sections is pretty fast using mostly memory moves and fills for most of the operations contained in the protected areas. Much to my surprise I got fairly frequent time-outs in seemingly random parts of the code. Sometimes it happened in a code block that iterates a buffer list and does certain quick operations in sequence, other times in tiny sections of protected code that only did a clearing of a flag between the Acquire() and Release() calls. The only pattern I noticed is that the durations found when the time-out occurred were centered on a median value of about 16 milliseconds. Obviously that's a huge amount of time for a flag to be set in the latter example of an occurrence I mentioned above.
So my questions are:
1) Is it possible for Windows thread management code to, on a fairly frequent basis (about once every few seconds), to switch out an unblocked thread and not return to it for 16 milliseconds or longer?
2) If that is a reasonable scenario, what steps can I take to lessen that occurrence and should I consider elevating my thread priorities?
3) If it is not a reasonable scenario, what else should I look at or try as an analysis technique to diagnose the real problem?
Note: I am running on Windows XP on an Intel i5 Quad Core with 3 GB of memory. Also, the reason why I need to be fast in this code is due to the size of the buffer in milliseconds I have chosen in my DirectShow filter graphs. To keep latency at a minimum audio buffers in my graph are delivered every 50 milliseconds. Therefore, any operation that takes a significant percentage of that time duration is troubling.
Thread priorities determine when ready threads are run. There's, however, a starvation prevention mechanism. There's a so-called Balance Set Manager that wakes up every second and looks for ready threads that haven't been run for about 3 or 4 seconds, and if there's one, it'll boost its priority to 15 and give it a double the normal quantum. It does this for not more than 10 threads at a time (per second) and scans not more than 16 threads at each priority level at a time. At the end of the quantum, the boosted priority drops to its base value. You can find out more in the Windows Internals book(s).
So, it's a pretty normal behavior what you observe, threads may be not run for seconds.
You may need to elevate priorities or otherwise consider other threads that are competing for the CPU time.
sounds like normal windows behaviour with respect to timer resolution unless you explicitly go for some of the high precision timers. Some details in this msdn link
First of all, I am not sure if Delphi's Now is a good choice for millisecond precision measurements. GetTickCount and QueryPerformanceCoutner API would be a better choice.
When there is no collision in critical section locking, everything runs pretty fast, however if you are trying to enter critical section which is currently locked on another thread, eventually you hit a wait operation on an internal kernel object (mutex or event), which involves yielding control on the thread and waiting for scheduler to give control back later.
The "later" above would depend on a few things, including priorities mentioned above, and there is one important things you omitted in your test - what is the overall CPU load at the time of your testing. The more is the load, the less chances to get the thread continue execution soon. 16 ms time looks perhaps a bit still within reasonable tolerance, and all in all it might depends on your actual implementation.
I need to write a screencast, and need to detect when window content has changed, even only text was selected. This window is third party control.
Ther're several methods.
(1) Screen polling.
You can poll the screen (that is, create a DIB, each time period to BitBlt from screen to it), and then send it as-is
Pros:
Very simple to implement
Cons:
High CPU load. Polling the entire screen number of times per second is very heavy (a lot of data should be transferred). Hence it'll be heavy and slow.
High network bandwidth
(2) Same as above, except now you do some analyzing of the polled screen to see the difference. Then you may send only the differences (and obviously don't send anything if no changes), plus you may optionally compress the differences stream.
Pros:
Still not too complex to implement
Significantly lower network bandwidth
Cons:
Even higher CPU usage.
(3) Same as above, except that you don't poll the screen constantly. Instead you do some hooking for your control (like spying for Windows messages that the control receives). Then you try learn when your control is supposed to redraw itself, and do the screen polling only in those scenarios.
Pros:
Significantly lower CPU usage
Still acceptable network bandwidth
Cons:
Implementation becomes complicated. Things like injecting hooks and etc.
Since this is based on some heuristic - you're not guaranteed (generally speaking) to cover all possible scenarios. In some circumstances you may miss the changes.
(4)
Hook at lower level: intercept calls to the drawing functions. Since there's enormous number of such functions in the user mode - the only realistic possibility of doing this is in the kernel mode.
You may write a virtual video driver (either "mirror" video driver, or hook the existing one) to receive all the drawing in the system. Then whenever you receive a drawing request on the specific area - you'll know it's changed.
Pros:
Lower CPU usage.
100% guarantee to intercept all drawings, without heuristics
Somewhat cleaner - no need to inject hooks into apps/controls
Cons:
It's a driver development! Unless you're experienced in it - it's a real nightmare.
More complex installation. Need administrator rights, most probably need restart.
Still considerable CPU load and bandwidth
(5)
Going on with driver development. As long as you know now which drawing functions are called - you may switch the strategy now. Instead of "remembering" dirty areas and polling the screen there - you may just "remember" the drawing function invoked with all the parameters, and then "repeat" it at the host side.
By such you don't have to poll the screen at all. You work in a "vectored" method (as opposed to "raster").
This however is much more complex to implement. Some drawing functions take as parameters another bitmaps, which in turn are drawn using another drawing functions and etc. You'll have to spy for bitmaps as well as screen.
Pros:
Zero CPU load
Best possible network traffic
Guaranteed to work always
Cons:
It's a driver development at its best! Months of development are guaranteed
Requires state-of-the-art programming, deep understanding of 2D drawing
Need to write the code at host which will "draw" all the "Recorded" commands.