WebGL - When to call gl.Flush? - opengl-es

I just noticed today that this method, Flush() is available.
Not able to find detailed documentation on it.
What exactly does this do?
Is this required?

gl.flush in WebGL does have it's uses but it's driver and browser specific.
For example, because Chrome's GPU architecture is multi-process you can do this
function var loadShader = function(gl, shaderSource, shaderType) {
var shader = gl.createShader(shaderType);
gl.shaderSource(shader, shaderSource);
gl.compileShader(shader);
return shader;
}
var vs = loadShader(gl, someVertexShaderSource, gl.VERTEX_SHADER);
var fs = loadShader(gl, someFragmentShaderSource, FRAGMENT_SHADER);
var p = gl.createProgram();
gl.attachShader(p, vs);
gl.attachShader(p, fs);
gl.linkProgram(p);
At this point all of the commands might be sitting in the command
queue with nothing executing them yet. So, issue a flush
gl.flush();
Now, because we know that compiling and linking programs is slow depending on how large and complex they are so we can wait a while before trying using them and do other stuff
setTimeout(continueLater, 1000); // continue 1 second later
now do other things like setup the page or UI or something
1 second later continueLater will get called. It's likely our shaders finished compiling and linking.
function continueLater() {
// check results, get locations, etc.
if (!gl.getShaderParameter(vs, gl.COMPILE_STATUS) ||
!gl.getShaderParameter(fs, gl.COMPILE_STATUS) ||
!gl.getProgramParameter(p, gl.LINK_STATUS)) {
alert("shaders didn't compile or program didn't link");
...etc...
}
var someLoc = gl.getUniformLocation(program, "u_someUniform");
...etc...
}
I believe Google Maps uses this technique as they have to compile many very complex shaders and they'd like the page to stay responsive. If they called gl.compileShader or gl.linkProgram and immediately called one of the query functions like gl.getShaderParameter or gl.getProgramParameter or gl.getUniformLocation the program would freeze while the shader is first validated and then sent to the driver to be compiled. By not doing the querying immediately but waiting a moment they can avoid that pause in the UX.
Unfortunately this only works for Chrome AFAIK because other browsers are not multi-process and I believe all drivers compile/link synchronously.
There maybe be other reasons to call gl.flush but again it's very driver/os/browser specific. As an example let's say you were going to draw 1000 objects and to do that took 5000 webgl calls. It likely would require more than that but just to have a number lets pick 5000. 4 calls to gl.uniformXXX and 1 calls to gl.drawXXX per object.
It's possible all 5000 of those calls fit in the browser's (Chrome) or driver's command buffer. Without a flush they won't start executing until the the browser issues a gl.flush for you (which it does so it can composite your results on the screen). That means the GPU might be sitting idle while you issue 1000, then 2000, then 3000, etc.. commands since they're just sitting in a buffer. gl.flush tells the system "Hey, those commands I added, please make sure to start executing them". So you might decide to call gl.flush after each 1000 commands.
The problem though is gl.flush is not free otherwise you'd call it after every command to make sure it executes as soon as possible. On top of that each driver/browser works in different ways. On some drivers calling gl.flush every few 100 or 1000 WebGL calls might be a win. On others it might be a waste of time.
Sorry, that was probably too much info :p

Assuming it's semantically equivalent to the classic GL glFlush then no, it will almost never be required. OpenGL is an asynchronous API — you queue up work to be done and it is done when it can be. glFlush is still asynchronous but forces any accumulated buffers to be emptied as quickly as they can be, however long that may take; it basically says to the driver "if you were planning to hold anything back for any reason, please don't".
It's usually done only for a platform-specific reason related to the coupling of OpenGL and the other display mechanisms on that system. For example, one platform might need all GL work to be ordered not to queue before the container that it will be drawn into can be moved to the screen. Another might allow piping of resources from a background thread into the main OpenGL context but not guarantee that they're necessarily available for subsequent calls until you've flushed (e.g. if multithreading ends up creating two separate queues where there might otherwise be one then flush might insert a synchronisation barrier across both).
Any platform with a double buffer or with back buffering as per the WebGL model will automatically ensure that buffers proceed in a timely manner. Queueing is to aid performance but should have no negative observable consequences. So you don't have to do anything manually.
If you decline to flush and don't strictly need to even when you semantically perhaps should, but your graphics are predicated on real-time display anyway, then you're probably going to be suffering at worst a fraction of a second of latency.

Related

eglDestroyContext and unfinished drawcalls

Following scenario:
I have an OpenGL ES app that renders frames via drawcalls, the end of a frame is marked by eglSwapBuffers. Now imagine that after the last eglSwapBuffers I immediately call eglMakeCurrent to unbind the context from the surface, then immediately call eglDestroyContext. Assuming the context was the last one to hold references to any resources the drawcalls use (shader, buffer, textures etc.) what happens to the drawcalls that the gpu has not yet finished and that use some of these resources?
Regards.
then immediately call eglDestroyContext().
All this really says is that the application is done with the context, and promises not to use it again. The actual context includes a reference count held by the graphics driver, and that won't drop to zero until all pending rendering has actually completed.
TLDR - despite what the APIs say, nothing actually happens "immediately" when you make an API call - it's an elaborate illusion that is mostly a complete lie.

GetMessage() while the thread is blocked in SwapBuffers()

Vsync blocks SwapBuffers(), which is what I want. My problem is that, since input messages go to the same thread that owns the window, any messages that come in while SwapBuffers() is blocked won't be processed immediately, but only after the vsync triggers the buffer swap and SwapBuffers() returns. So I have all my compute threads sitting idle instead of processing the scene for rendering in the next frame using the most recent input. I'm particularly concerned with having very low latency. I need some way to access all pending input messages to the window from other threads.
Windows API provides a way to wait for either Windows events or input messages using MsgWaitForMultipleObjects(), yet there's no similar way to wait for a buffer swap together with other things. That's very unfortunate.
I considered calling SwapBuffers() in another thread, but that requires glFinish() to be called in the window's thread before signalling another thread to SwapBuffers(), and glFinish() is still a blocking call so it's not a good solution.
I considered hooking, but that also looks like a dead end. Hooking with WH_GETMESSAGE will have the GetMsgProc() be called not asynchronously, but when the window's thread calls GetMessage()/PeekMessage(), so it's no help. Installing a global hook doesn't help me either due to the need of calling RegisterTouchWindow() with a specific window handle to process WM_TOUCH -- and my input is touch. And, while for mouse and keyboard, you can install low level hooks that capture messages as they're posted to the thread's queue, rather than when the thread calls GetMessage()/PeekMessage(), there appears to be no similar option for touch.
I also looked at wglDelayBeforeSwapNV(), but I don't see what's preventing the OS from preempting a thread sometimes after the call to that function but before SwapBuffers(), causing a miss of the next vsync signal.
So what's a good workaround? Can I make a second, invisible window, that will somehow be always the active one and so get all input messages, while the visible one is displaying the rendering? According to another discussion, message-only windows (CreateWindow with HWND_MESSAGE) are not compatible with WM_TOUCH. Is there perhaps some undocumented event that SwapBuffers() is internally waiting on that I could access and feed to MsgWaitForMultipleObjects()? My target is a fixed platform (Windows 8.1 64-bit) so I'm fine with using undocumented functionality, should it exist. I do want to avoid writing my own touchscreen driver, however.
Out of curiosity, why not implement your entire drawing logic in that other thread? It appears the problem you are running into is that the message pump is driven by the same thread that draws. Since Windows does not let you drive the message pump from a different thread than the one that created the window, the easiest solution would just be to push all the GL stuff into a different thread.
SwapBuffers (...) is also not necessarily going to block. As-per requirements of VSYNC an implementation need only block the next command that would modify the backbuffer while all backbuffers are pending a swap. Triple buffering changes things up a little bit by introducing a second backbuffer.
One possible implementation of triple buffering will discard the oldest backbuffer when it comes time to swap, thus SwapBuffers (...) would never cause blocking (this is effectively how modern versions of Windows work in windowed mode with the DWM enabled). Other implementations will eventually present both backbuffers, this reduces (but does not eliminate) blocking but also results in the display of late frames.
Unfortunately WGL does not let you request the number of backbuffers in a swap-chain (beyond 0 single-buffered or 1 double-buffered); the only way to get triple buffering on Windows is using driver settings. Lowest message latency will come from driving GL in a different thread, but triple buffering can help a little bit while requiring no effort on your part.

Switch OpenGL contexts or switch context render target instead, what is preferable?

On MacOS X, you can render OpenGL to any NSView object of your choice, simply by creating an NSOpenGLContext and then calling -setView: on it. However, you can only associate one view with a single OpenGL context at any time. My question is, if I want to render OpenGL to two different views within a single window (or possibly within two different windows), I have two options:
Create one context and always change the view, by calling setView as appropriate each time I want to render to the other view. This will even work if the views are within different windows or on different screens.
Create two NSOpenGLContext objects and associate one view with either one. These two contexts could be shared, which means most resources (like textures, buffers, etc.) will be available in both views without wasting twice the memory. In that case, though, I have to keep switching the current context each time I want to render to the other view, by calling -makeCurrentContext on the right context before making any OpenGL calls.
I have in fact used either option in the past, each of them worked okay for my needs, however, I asked myself, which way is better in terms of performance, compatibility, and so on. I read that context switching is actually horribly slow, or at least it used to be very slow in the past, might have changed meanwhile. It may depend on how many data is associated with a context (e.g. resources), since switching the active context might cause data to be transferred between system memory and GPU memory.
On the other hand switching the view could be very slow as well, especially if this might cause the underlying renderer to change; e.g. if your two views are part of two different windows located on two different screens that are driven by two different graphic adapters. Even if the renderer does not change, I have no idea if the system performs a lot of expensive OpenGL setup/clean-up when switching a view, like creating/destroying render-/framebuffer objects for example.
I investigated context switching between 3 windows on Lion, where I tried to resolve some performance issues with a somewhat misused VTK library, which itself is terribly slow already.
Wether you switch render contexts or the windows doesn't really matter,
because there is always the overhead of making both of them current to the calling thread as a triple. I measured roughly 50ms per switch, where some OS/Window manager overhead charges in aswell. This overhead depends also greatly on the arrangement of other GL calls, because the driver could be forced to wait for commands to be finished, which can be achieved manually by a blocking call to glFinish().
The most efficient setup I got working is similar to your 2nd, but has two dedicated render threads having their render context (shared) and window permanently bound. Aforesaid context switches/bindings are done just once on init.
The threads can be controlled using some threading stuff like a common barrier, which lets both threads render single frames in sync (both get stalled at the barrier before they can be launched again). Data handling must also be interlocked, which can be done in one thread while stalling other render threads.

Total system freezing when using timers in graphical application

I’m really stuck with this issue and will greatly appreciate any advice.
The problem:
Some of our users complain about total system “freezing” when using our product. No matter how we tried, we couldn’t reproduce it in any of systems available for troubleshooting.
The product:
Physically, it’s a 32bit/64bit DLL. The product has a self-refreshing GUI, which draws a realtime spectrogram of an audio signal
Problem details:
What I managed to collect from a number of fragmentary reports makes the following picture:
When GIU is opened, sometimes immediately, sometimes after a few minutes of GIU being visible, the system completely stalls, without possibility to operate with windows, start Task Manager etc. No reactions on keyboard, no mouse cursor seen (or it’s seen but is not responsibe to mouse movements – this I do not know). The user has to hard-reset the system in order to reboot. What is important, I think, is that (in some cases) for some time the GIU is responsive and shows some adequate pictures. Then this freezing happens. One of the reports tells that once the system was frozen, the audio continued to be rendered – i.e. heard by the reporter (but the whole graphic shell of Windows was already frozen). Note: in this sort of apps it’s usually a specialized thread which is responsible for sound processing.
The freezing is more or less confirmed to happen for 2 users on Windows7 x64 using both 32 and 64 bit versions of the DLL, never heard of any other OSs mentioned with connection to this freezing (though there was 1 report without any OS specified).
That’s all that I managed to collect.
The architecture / suspicions:
I strongly suspect that it’s the GUI refreshing cycle that is a culprit.
Basically, it works like this:
There is a timer that triggers callbacks at a frame rate of approx 25 fps.
In this callback audio analysis is performed and GUI updated
Some details about the timer:
It’s based on this call:
CreateTimerQueueTimer(&m_timerHandle, NULL, xPlatformTimerCallbackWrapper,
this, m_firstExpInterval, m_period, WT_EXECUTEINTIMERTHREAD);
We create a timer and m_timerHandle is called periodically.
Some details about the GUI refreshing:
It works like this:
HDC hdc = GetDC (hwnd);
// Some drawing
ReleaseDC(hwnd,hdc);
My intuition tells me that this CreateTimeQueueTimer might be not the right decision. The reference page tells that in case of using WT_EXECUTEINTIMERTHREAD:
The callback function is invoked by the timer thread itself. This flag
should be used only for short tasks or
it could affect other timer
operations. The callback function is
queued as an APC. It should not
perform alertable wait operations.
I don’t remember why this WT_EXECUTEINTIMERTHREAD option was chosen actually, now WT_EXECUTEDEFAULT seems equally suitable for me.
In fact, I don’t see any major difference in using any of the options mentioned in the reference page.
Questions:
Is anything of what was told give anyone any clue on what might be wrong?
Have you faced similar problems, what was the reason?
Thanks for any info!
==========================================
Update: 2010-02-20
Unfortunatelly, the advise given here (which I could check so far) didn't help, namelly:
changing to WT_EXECUTEDEFAULT in CreateTimerQueueTimer(&m_timerHandle,NULL,xPlatformTimerCallbackWrapper,this,m_firstExpInterval,m_period, WT_EXECUTEDEFAULT);
the reenterability guard was already there
I havent' yet checked if updateding the GUI in WM_PAINT hander helps or not
Thanks for the hints anyway.
Now, I've been playing with this for a while, also got a real W7 intallation (I used to use the virtual one) and it seems that the problem can be narrowed down.
On my installation, using of the app really get the GUI far less responsive, although I couldn't manage to reproduce a total system freezing as someone reported.
My assumption now is this responsiveness degradation and reported total freezing have a common origin.
Then I did some primitive profiling and found that at least one of the culprits is BitBlt function that is called approx 50 times a second
BitBlt ((HDC)pContext->getSystemContext (), // hdcDest
destRect.left + pContext->offset.h,
destRect.top + pContext->offset.v,
destRect.right - destRect.left,
destRect.bottom - destRect.top,
(HDC)pSystemContext,
srcOffset.h,
srcOffset.v,
SRCCOPY);
The regions being copied are not really large (approx. 400x200 pixels). It is used for displaying the backbuffer and is executed in the timer callback.
If I comment out this BitBlt call, the problem seems to disappear (at least partly).
On the same machine running WinXP everything works just fine.
Any ideas on this?
Most likely what's happening is that your timer callback is taking more than 25 ms to execute. Then another timer tick comes along and it starts processing, too. And so on, and pretty soon you have a whole bunch of threads sucking down CPU cycles, all trying to do your audio analysis and in short order the system is so busy doing thread context switches that no real work gets done. And all the while, more and more timer ticks are getting placed into the queue.
I would strongly suggest that you use WT_EXECUTEDEFAULT here, rather than WT_EXECUTEINTIMERTHREAD. Also, you need to prevent overlapping timer callbacks. There are several ways to do that.
You can use a critical section in your timer callback. When the callback is triggered it calls TryEnterEnterCriticalSection and if not successful, just returns without doing anything.
You can do something similar using a volatile variable and InterlockedCompareExchange.
Or, you can change your timer to be a one-shot (WT_EXECUTEONLYONCE), and then re-set the timer at the end of every callback. That would make the thing execute 25 ms after the last one completed.
Which you choose is up to you. If your analysis often takes longer than 25 ms but not more than 35 ms, then you'll probably get a smoother update rate using WT_EXECUTEONLYONCE. If it's rare that analysis takes more than 25 ms, or if it often takes more than about 35 ms (but less than 50 ms), then you're probably better off using one of the other techniques.
Of course, if it often takes longer than 25 ms, then you probably want to increase the time (reduce the update rate).
Also, as one of the commenters pointed out, it's possible that the problem also involves accessing the GUI from the timer thread. You should do all of your analysis in the timer thread, store the results somewhere that the main thread can access it, and then send a message to the window proc, telling it to update the display.
Have you asked the users to disable Aero/WDMDWM? With Aero enabled, rendering is implemented quite different. Without Aero, the behaviour will be similar to XP. Not that it solves anything, but it will give you a clue as to what the problem is.

How many layers are between my program and the hardware?

I somehow have the feeling that modern systems, including runtime libraries, this exception handler and that built-in debugger build up more and more layers between my (C++) programs and the CPU/rest of the hardware.
I'm thinking of something like this:
1 + 2 >> OS top layer >> Runtime library/helper/error handler >> a hell lot of DLL modules >> OS kernel layer >> Do you really want to run 1 + 2?-Windows popup (don't take this serious) >> OS kernel layer >> Hardware abstraction >> Hardware >> Go through at least 100 miles of circuits >> Eventually arrive at the CPU >> ADD 1, 2 >> Go all the way back to my program
Nearly all technical things are simply wrong and in some random order, but you get my point right?
How much longer/shorter is this chain when I run a C++ program that calculates 1 + 2 at runtime on Windows?
How about when I do this in an interpreter? (Python|Ruby|PHP)
Is this chain really as dramatic in reality? Does Windows really try "not to stand in the way"? e.g.: Direct connection my binary <> hardware?
"1 + 2" in C++ gets directly translated in an add assembly instruction that is executed directly on the CPU. All the "layers" you refer to really only come into play when you start calling library functions. For example a simple printf("Hello World\n"); would go through a number of layers (using Windows as an example, different OSes would be different):
CRT - the C runtime implements things like %d replacements and creates a single string, then it calls WriteFile in kernel32
kernel32.dll implements WriteFile, notices that the handle is a console and directs the call to the console system
the string is sent to the conhost.exe process (on Windows 7, csrss.exe on earlier versions) which actually hosts the console window
conhost.exe adds the string to an internal buffer that represents the contents of the console window and invalidates the console window
The Window Manager notices that the console window is now invalid and sends it a WM_PAINT message
In response to the WM_PAINT, the console window (inside conhost.exe still) makes a series of DrawString calls inside GDI32.dll (or maybe GDI+?)
The DrawString method loops through each character in the string and:
Looks up the glyph definition in the font file to get the outline of the glyph
Checks it's cache for a rendered version of that glyph at the current font size
If the glyph isn't in the cache, it rasterizes the outline at the current font size and caches the result for later
Copies the pixels from the rasterized glyph to the graphics buffer you specified, pixel-by-pixel
Once all the DrawString calls are complete, the final image for the window is sent to the DWM where it's loaded into the graphics memory of your graphics card, and replaces the old window
When the next frame is drawn, the graphics card now uses the new image to render the console window and your new string is there
Now there's a lot of layers that I've simplified (e.g. the way the graphics card renders stuff is a whole 'nother layer of abstractions). I may have made some errors (I don't know the internals of how Windows is implemented, obviously) but it should give you an idea hopefully.
The important point, though, is that each step along the way adds some kind of value to the system.
As codeka said there's a lot that goes on when you call a library function but what you need to remember is that printing a string or displaying a jpeg or whatever is a very complicated task. Even more so when the method used must work for everyone in every situation; hundreds of edge cases.
What this really means is that when you're writing serious number crunching, chess playing, weather predicting code don't call library functions. Instead use only cheap functions which can and will be executed by the CPU directly. Additionally planning where your expensive functions are can make a huge difference (print everything at the end not each time through the loop).
It doesnt matter how many levels of abstraction there is, as long has the hard work is done in the most efficient way.
In a general sense you suffer from "emulating" your lowest level, e.g. you suffer from emulating a 68K CPU on a x86 CPU running some poorly implemented app, but it wont perform worse than the original hardware would. Otherwise you would not emulate it in the first place. E.g. today most user interface logic is implemented using high level dynamic script languages, because its more productive while the hardcore stuff is handled by optimized low level code.
When it comes to performance, its always the hard work that hits the wall first. The thing in between never suffers from performance issues. E.g. a key handler that processes 2-3 key presses a second can spend a fortune in badly written code without affecting the end user experience, while the motion estimator in an mpeg encoder will fail utterly by just being implemented in software instead of using dedicated hardware.

Resources