EGL/OpenGL ES: switching contexts is slow - opengl-es

I am developing an OpenGL ES 2.0 application (using ANGLE on Windows for development) that is made up of multiple 'frames'.
Each frame is an isolated application that should not interfere with the surrounding frames. The frames are drawn with OpenGL ES 2.0 by the code running inside each frame.
My first attempt was to assign a framebuffer to each frame. But there was a problem: OpenGL's internal state is changed while one frame is drawing, and if the next frame doesn't comprehensively reset every known piece of OpenGL state, there can be side effects. This defeats my requirement that the frames be isolated and not affect one another.
My next attempt was to use one context per frame. I created a unique context for each frame, with resource sharing enabled, so that I can eglMakeCurrent to each frame, render it to its own framebuffer/texture, then eglMakeCurrent back to the global context to compose the textures onto the final screen.
This does a great job of isolating the instances. However, eglMakeCurrent is very slow: as few as four of them can make rendering the screen take a second or more.
What approach can I take? Is there a way I can either speed up context switching, or avoid context switching by somehow saving the OpenGL state per frame?

I have a suggestion that may eliminate the overhead of eglMakeCurrent while allowing you to use your current approach.
The concept of the current EGLContext is thread-local. I suggest creating all contexts on your process's main thread, then creating one thread per context and handing each thread its own context. During initialization, each thread calls eglMakeCurrent on the context it owns and never calls eglMakeCurrent again. Hopefully, in ANGLE's implementation, the thread-local storage for contexts is implemented efficiently and does not carry unnecessary synchronization overhead.
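A minimal sketch of that idea, assuming an std::thread-based setup where the display, surfaces and contexts were all created on the main thread; RenderFrame is a placeholder for the per-frame drawing code:

#include <EGL/egl.h>
#include <thread>

// Each frame thread makes its context current exactly once, then renders forever.
void FrameThread(EGLDisplay display, EGLSurface surface, EGLContext context, int frameIndex) {
    eglMakeCurrent(display, surface, surface, context);   // the only eglMakeCurrent this thread ever does
    for (;;) {
        // RenderFrame(frameIndex);   // draw into this frame's FBO/texture (placeholder)
        // ... signal the compositing thread that this frame's texture is ready ...
    }
}

// Main thread, after creating the shared contexts:
// std::thread(FrameThread, display, surfaces[i], contexts[i], i).detach();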

The problem here is trying to do this in a generic, platform- and OS-independent way. If you target a specific platform, there are good solutions. On Windows, the WGL API and the GLUT library will give you multiple windows with completely independent OpenGL contexts running concurrently; they are called windows there, not frames. You could also use DirectX instead of OpenGL (ANGLE itself is built on DirectX). On Linux, the solution for OpenGL is X11. In either case, it's critical to have quality OpenGL drivers. No Intel Extreme chipset drivers. If you want to do this on Android or iOS, those require different solutions again. There was a recent thread on the Khronos.org OpenGL ES forum about the Android case.

Related

OpenGL rendering & display in different processes [duplicate]

Let's say I have an application A which is responsible for painting stuff on-screen via the OpenGL library. For tight integration purposes I would like to let this application A do its job, but render into an FBO or directly into a renderbuffer, and allow an application B to have read-only access to this buffer to handle the display on-screen (basically rendering it as a 2D texture).
It seems FBOs belong to OpenGL contexts, and contexts are not shareable between processes. I definitely understand that allowing several processes to mess with the same context is evil. But in my particular case, I think it's reasonable to think it could be pretty safe.
EDIT:
Render size is near full screen, I was thinking of a 2048x2048 32bits buffer (I don't use the alpha channel for now but why not later).
Framebuffer objects cannot be shared between OpenGL contexts, whether or not they belong to the same process. But textures can be shared, and textures can be used as color buffer attachments of framebuffer objects.
Sharing OpenGL contexts between processes is actually possible if the graphics system provides an API for this job. In the case of X11/GLX it is possible to share indirect rendering contexts between multiple processes. It may be possible on Windows by employing a few really, really crude hacks. On Mac OS X, I have no idea how to do it.
So what's probably easiest is to use a Pixel Buffer Object to gain performant access to the rendered picture, then send it over to the other application through shared memory and upload it into a texture there (again through a pixel buffer object).
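A rough sketch of the readback side of that idea, assuming a context where GL_PIXEL_PACK_BUFFER is available; how the bytes then travel through shared memory is platform-specific and only hinted at in a comment:

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL, GL_STREAM_READ);
// Start an asynchronous readback of the rendered image into the PBO.
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
// Later (ideally a frame later), map the PBO and hand the pixels to the other process.
void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (pixels) {
    // memcpy(sharedMemoryRegion, pixels, width * height * 4);  // hypothetical shared-memory region
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);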
On macOS, you can use IOSurface to share a framebuffer between two applications.
In my understanding, you won't be able to share objects between processes under Windows unless they are kernel-mode objects. Even shared textures and contexts can create performance hits, and they give you the additional responsibility of syncing the SwapBuffers() calls. The OpenGL implementations on Windows in particular are notorious for this.
In my opinion, you can rely on inter-process communication mechanisms like events, mutexes, window messages, or pipes to sync the rendering, but realize that there's a performance cost to this approach. Kernel-mode objects are good, but each transition to the kernel has a cost, which adds up quickly for a high-performance rendering application. In my opinion you have to reconsider the multi-process rendering design.
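For illustration only, a minimal sketch of one such sync point using a Win32 named event; the event name and the producer/consumer roles are made up for this example:

#include <windows.h>

// Producer process: signal after a frame has been rendered and copied out.
HANDLE frameReady = CreateEventW(NULL, FALSE, FALSE, L"Local\\FrameReadyEvent");
// ... render, copy pixels into shared memory ...
SetEvent(frameReady);

// Consumer process: wait for the producer before touching the shared buffer.
HANDLE frameReadyRemote = OpenEventW(SYNCHRONIZE, FALSE, L"Local\\FrameReadyEvent");
WaitForSingleObject(frameReadyRemote, INFINITE);
// ... read shared memory, upload into a texture ...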
On Linux, a solution is to use DMABUF, as explained in this blog: https://blaztinn.gitlab.io/post/dmabuf-texture-sharing/

Rendering in DirectX 11

When the frame starts, I do my logical update and render after that.
In my render code I do the usual stuff. I set a few states, buffers, and textures, and end by calling Draw.
m_deviceContext->Draw(nbVertices, 0);
At frame end I call Present to show the rendered frame.
// Present the back buffer to the screen since rendering is complete.
if(m_vsync_enabled)
{
// Lock to screen refresh rate.
m_swapChain->Present(1, 0);
}
else
{
// Present as fast as possible.
m_swapChain->Present(0, 0);
}
Usual stuff. Now, when I call Draw, according to MSDN
Draw submits work to the rendering pipeline.
Does it mean that the data is sent to the GPU and the main thread (the one that called Draw) continues? Or does it wait for rendering to finish?
In my opinion, only the Present function should make the main thread wait for rendering to finish.
There are a number of calls which can trigger the GPU to start working, Draw being one. Others include Dispatch, CopyResource, etc. What the MSDN docs are trying to say is that stuff like PSSetShader, IASetPrimitiveTopology, etc. doesn't really do anything until you call Draw.
When you call Present, that is taken as an implicit indicator of 'end of frame', but your program can often continue setting up rendering calls for the next frame well before the first frame is done and showing. By default, Windows will let you queue up to 3 frames ahead before blocking your CPU thread on the Present call to let the GPU catch up; in real-time rendering you usually don't want the latency between input and display to be that high.
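If you want to limit how far ahead the CPU can run, one option is to lower the DXGI frame latency. A sketch, assuming the device object is reachable as m_device (a made-up member name, alongside the questioner's m_deviceContext and m_swapChain):

#include <dxgi.h>

// Allow at most one queued frame, so Present blocks sooner (the default is 3).
IDXGIDevice1* dxgiDevice = nullptr;
if (SUCCEEDED(m_device->QueryInterface(__uuidof(IDXGIDevice1), (void**)&dxgiDevice)))
{
    dxgiDevice->SetMaximumFrameLatency(1);
    dxgiDevice->Release();
}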
The fact is, however, that GPU/CPU synchronization is complicated and the Direct3D runtime is also batching up requests to minimize kernel-call overhead, so the actual work could be happening long after many Draws have been submitted to the command queue. This old article gives you the flavor of how this works. On modern GPUs, you can also have various memory operations going on for paging memory in, setting up physical video memory areas, etc.
BTW, all this 'magic' doesn't exist with Direct3D 12, but that means the application has to do everything at the 'right' time to ensure it is both efficient and functional. The programmer is much more directly building up command queues, triggering work on the various pixel and compute GPU engines, and doing all the messy stuff that is handled a little more abstractly and automatically by Direct3D 11's runtime. Even then, ultimately the video driver is the one actually talking to the hardware, so it can do other kinds of optimizations as well.
The general rules of thumb here to keep in mind:
Creating resources is expensive, especially runtime shader compilation (by the HLSL compiler) and runtime shader blob optimization (by the driver).
Copying resources to the GPU (e.g. loading texture data from CPU memory) requires bus bandwidth, which is limited in supply: prefer to keep texture, VB, and IB data in static buffers you reuse (see the sketch after this list).
Copying resources from the GPU (i.e. moving GPU memory to CPU memory) uses a back-channel that is slower than uploading to the GPU: try to avoid the need for readback from the GPU.
Submitting larger chunks of geometry per Draw call helps to amortize overhead (i.e. calling Draw once for 10,000 triangles with the same state/shader is much faster than calling Draw 10 times for 1,000 triangles each, with changing state/shaders in between).
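As a small illustration of the 'static buffers you reuse' point, a sketch that creates an immutable vertex buffer once and draws from it every frame; Vertex, vertexData, vertexCount and m_device are placeholders standing in for your own types and members:

// Created once, at load time.
D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = sizeof(Vertex) * vertexCount;
desc.Usage = D3D11_USAGE_IMMUTABLE;           // GPU-only; never updated from the CPU
desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
D3D11_SUBRESOURCE_DATA init = {};
init.pSysMem = vertexData;
ID3D11Buffer* vb = nullptr;
m_device->CreateBuffer(&desc, &init, &vb);

// Reused every frame.
UINT stride = sizeof(Vertex), offset = 0;
m_deviceContext->IASetVertexBuffers(0, 1, &vb, &stride, &offset);
m_deviceContext->Draw(vertexCount, 0);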

How to temporarily disable OpenGL command queueing, for more accurate profiling results?

In Mac OS X's OpenGL Profiler app, I can get statistics regarding how long each GL function call takes. However, the results show that a ton of time is spent in flush commands (glFlush, glFlushRenderAPPLE, CGLFlushDrawable) and in glDrawElements, and every other GL function call's time is negligibly small.
I assume this is because OpenGL is enqueueing the commands I submit, and waiting until flushing or drawing to actually execute the commands.
I guess I could do something like this:
glFlush();
startTiming();
glDoSomething();
glFlush();
stopTimingAndRecordDelta();
...and insert that pattern around every GL function call my app makes, but that would be tedious since there are thousands of GL function calls throughout my app, and I'd have to tabulate the results manually (instead of using the existing OpenGL Profiler tool).
So, is there a way to disable all OpenGL command queueing, so I can get more accurate profiling results?
So, is there a way to disable all OpenGL command queueing, ...
No, there isn't an OpenGL function that does that.
..., so I can get more accurate profiling results?
You can get more accurate information than you are currently getting, but you'll never get really precise answers (though you can probably get what you need). While the results of OpenGL rendering are the "same" (OpenGL isn't guaranteed to be pixel-accurate across implementations, but it's supposed to be very close), how the pixels are generated can vary drastically. In particular, tiled renderers (common in mobile and embedded devices) usually don't render pixels during a draw call, but rather queue up the geometry and generate the pixels at buffer swap.
That said, for profiling OpenGL you want to use glFinish instead of glFlush. glFinish forces all pending OpenGL calls to complete before it returns; glFlush merely requests that queued commands be sent to the GPU "at some time in the future", so it's not deterministic. Be sure to remove the glFinish calls from your "production" code, since they will really slow down your application. If you replace the flushes with finishes in your example, you'll get more interesting information.
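Applied to the pattern from the question, a sketch with std::chrono standing in for the hypothetical startTiming/stopTimingAndRecordDelta helpers (glDoSomething is the call being measured):

#include <chrono>

glFinish();                                    // drain everything queued before the measurement
auto start = std::chrono::high_resolution_clock::now();
glDoSomething();                               // the GL call (or block of calls) being measured
glFinish();                                    // wait until that work has actually executed
auto elapsed = std::chrono::high_resolution_clock::now() - start;
// record std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count()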
You are using OpenGL 3, and in particular discussing OS X. Mavericks (10.9) supports Timer Queries, which you can use to time a single GL operation or an entire sequence of operations at the pipeline level. That is, how long they take to execute when GL actually gets around to performing them, rather than timing how long a particular API call takes to return (which is often meaningless). You can only have a single timer query in the pipeline at a given time unfortunately, so you may have to structure your software cleverly to make best use of them if you want command-level granularity.
I use them in my own work to time individual stages of the graphics engine. Things like how long it takes to update shadow maps, build the G-Buffers, perform deferred / forward lighting, individual HDR post-processing effects, etc. It really helps identify bottlenecks if you structure the timer queries this way instead of focusing on individual commands.
For instance, on some fillrate-limited hardware, shadow map generation is the biggest bottleneck; on other, shader-limited hardware, lighting is. You can even use the results to determine the optimal shadow map resolution or lighting quality to meet a target framerate on a particular host, without requiring the user to set these parameters manually. If you simply timed how long the individual operations took you would never get the bigger picture, but if you time entire sequences of commands that do some major part of your rendering, you get neatly packed information that can be a lot more useful than even the output from profilers.
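A minimal sketch of one such pipeline-level measurement with a GL_TIME_ELAPSED query (assumes a context where ARB_timer_query / OpenGL 3.3 timer queries are available):

GLuint query;
glGenQueries(1, &query);
glBeginQuery(GL_TIME_ELAPSED, query);
// ... issue the commands for one stage, e.g. the shadow-map pass ...
glEndQuery(GL_TIME_ELAPSED);
// Later, once the result is available, read the elapsed GPU time in nanoseconds.
GLuint64 nanoseconds = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &nanoseconds);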

Thread effectiveness for graphical applications

Let's say I'm creating lighting for a scene, using my own shaders. It is a good example of a thing that might be divided between many threads, for example by dividing scene into smaller scenes and rendering them in separate threads.
Should I divide it between threads manually, or is the graphics library going to divide such operations automatically? Or is it library-dependent (I'm using libgdx, which appears to use OpenGL)? Or maybe there is another reason why I should leave it alone in one thread?
If I should take care of dividing the workload between threads manually, how many threads should I use? Is the number of threads in such a situation graphics-card dependent or processor dependent?
OpenGL does not support multi-threaded rendering out of the box, since an OpenGL context can only be current on one thread at a time.
What you could do to potentially gain some performance is separate your update logic and your rendering logic into separate threads. However, you cannot leverage multiple threads for the OpenGL rendering itself.
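A sketch of that split, assuming a simple shared SceneState struct of your own; all names here are made up for illustration, and all GL calls stay on the render thread:

#include <mutex>
#include <thread>

struct SceneState { /* positions, light parameters, ... */ };

SceneState shared;             // written by the update thread, copied by the render thread
std::mutex sharedMutex;

void UpdateThread() {
    for (;;) {
        SceneState next;
        // ... run game/lighting logic and fill 'next' ...
        std::lock_guard<std::mutex> lock(sharedMutex);
        shared = next;
    }
}

void RenderThread() {          // owns the OpenGL context; all GL calls happen here
    for (;;) {
        SceneState copy;
        {
            std::lock_guard<std::mutex> lock(sharedMutex);
            copy = shared;
        }
        // ... draw 'copy' with OpenGL, then swap buffers ...
    }
}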

OpenGL multihead vsync with different refresh rates

How do I drive multiple displays (multihead) at different resolutions and refresh rates with OpenGL (on Windows 7) and still be able to share textures between the devices?
I have one multi-head GPU that drives 4 heads. (It happens to be an AMD FirePro V7900, in case it matters.) The heads all share a "scene" (vertex and texture data, etc.), but I want to render this scene each time a vsync occurs on a display (each head is essentially a different viewport). The catch is that the different heads may run at different refresh rates. For example, some displays may be at 60Hz, some at 30Hz, and some at 24Hz.
When I call SwapBuffers the call blocks, so I can't tell which head needs to be rendered to next. I was hoping for something like Direct3D9's IDirect3DSwapChain9::Present with D3DPRESENT_DONOTWAIT flag, and the associated D3DERR_WASSTILLDRAWING return value. Using that approach, I could determine which head to render to next. By round-robin polling the different heads until one succeeded. But I don't know what the equivalent approach is in OpenGL.
I've already discovered wglSwapIntervalEXT(1) to use vsync. And I can switch between HDC's to render to the different windows with a single HGLRC. But the refresh rate difference is messing me up.
I'm not sure what I can do to have a single HGLRC render to all these displays at different refresh rates. I assume it has to be a single HGLRC to make efficient use of shared textures (and other resources); correct me if I'm wrong. Duplicating the resources with multiple HGLRCs is not interesting to me, because I would expect that to cut my usable memory down to 25% (4 heads on 1 GPU, so I don't want 4 copies of any resource).
I'm open to the idea of using multiple threads, if that's what it takes.
Can someone tell me how to structure my main loop so that I can share resources but still drive the displays at their own refresh rates and resolutions?
You can share OpenGL buffers carrying data by sharing the contexts. The call on Windows is named wglShareLists.
Using that, you can give each window its own rendering context running in its own thread, while all of the contexts share their data. Multi-window V-sync is in fact one of the few cases where multithreaded OpenGL makes sense.
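A rough sketch of that setup; error handling is omitted, the arrays and the HeadThread/CreateSharedContexts names are made up for this example, and wglSwapIntervalEXT would still have to be fetched through wglGetProcAddress:

#include <windows.h>
#include <GL/gl.h>

// Main thread: one context per window, resources shared with the first context.
void CreateSharedContexts(HDC* hdc, HGLRC* contexts, int numHeads) {
    contexts[0] = wglCreateContext(hdc[0]);
    for (int i = 1; i < numHeads; ++i) {
        contexts[i] = wglCreateContext(hdc[i]);
        wglShareLists(contexts[0], contexts[i]);  // share textures, buffers, display lists
    }
}

// Per-head thread: makes its own context current once, then swaps at this display's rate.
void HeadThread(HDC dc, HGLRC ctx) {
    wglMakeCurrent(dc, ctx);
    // wglSwapIntervalEXT(1);                     // enable vsync once the extension is loaded
    for (;;) {
        // ... render this head's viewport of the shared scene ...
        SwapBuffers(dc);                          // blocks until this display's vsync
    }
}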
I have not done anything like this before.
Looks like you actually need multiple threads to get independent refresh rates.
An OpenGL render context can be current on only one thread at a time, and one thread can only have one current render context. Therefore, with multiple threads you will need multiple render contexts.
It is possible to share resources between OpenGL contexts. With this it is not necessary to store resources multiple times.

Resources