does wglGetCurrentContext sync GPU and CPU? - performance

When programming with OpenGL, glGet functions should be avoided because they force the GPU and CPU to synchronize. Does this also apply to the WGL function wglGetCurrentContext, which obtains a handle to the current OpenGL context? If not, are there any other performance problems around wglGetCurrentContext?

There is a misconception implied in your question:
glGet functions should be avoided because they force the GPU and CPU to synchronize
The first part is true. The second part is not. Most glGet*() calls do not force a GPU/CPU synchronization. They only get state stored in the driver code, and do not involve the GPU at all.
There are some exceptions, which include the glGet*() calls that actually get data produced by the GPU. Typical examples include:
glGetBufferSubData(): Has to block if data is produced by the GPU (e.g. using transform feedback).
glGetTexImage(): Blocks if texture data is produced by the GPU (e.g. if the texture is used as a render target).
glGetQueryObjectiv(..., GL_QUERY_RESULT, ...): Blocks if query has not finished.
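If you do need a query result, you can avoid the block by polling for availability first and doing other work until the GPU catches up. A minimal sketch, assuming a current GL 3+ context and a query object that has already been issued:

```cpp
// Poll instead of blocking on a query result.
GLint available = 0;
glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
if (available) {
    GLuint result;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &result);  // returns immediately now
    // ...use result...
} else {
    // Do other work this frame and try again later,
    // instead of stalling the CPU on the GPU.
}
```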
Now, it's still true that you should avoid glGet*() calls where possible. Mainly for two reasons:
Most of them are unnecessary, since they query state that you set yourself, so you should already know what it is. And any unnecessary call is a waste.
They may cause synchronization between threads in multi-threaded driver implementations. So they may result in synchronization, but not with the GPU.
There are of course some good uses for glGet*() calls, for example:
To get implementation limits, like maximum texture sizes, etc. Call once during startup.
Calls like glGetUniformLocation() and glGetAttribLocation(). Make sure that you only call them once, after shader linkage. These are also avoidable by using layout qualifiers in the shader code, at least in recent GLSL versions.
glGetError(), during debugging only.
As for wglGetCurrentContext(), I doubt that it would be very expensive. The current context is typically stored in some kind of thread local storage, which can be accessed very efficiently.
Still, I don't see why calling it would be necessary. If you need the context again later, you can store it away when you create it. And if that's not possible for some reason, you can call wglGetCurrentContext() once, and store the result. There definitely shouldn't be a need to call it repeatedly.
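The store-once approach is trivial. A sketch (the caching wrapper here is a hypothetical helper, not part of WGL):

```cpp
// Query the current context once and cache it, instead of calling
// wglGetCurrentContext() repeatedly in hot paths.
class GlContextCache {               // hypothetical helper
public:
    HGLRC get() {
        if (!cached_)                          // first call only
            cached_ = wglGetCurrentContext();  // cheap TLS read, but still avoidable
        return cached_;
    }
private:
    HGLRC cached_ = nullptr;
};
```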

All of the WGL functions vary widely in their effects on performance, depending on the vendor and driver.
I don't expect wglGetCurrentContext to be an especially expensive call (unless you make it a huge number of times), since the WGL functions are generally divorced from the GL context's state vector.
That being said, SETTING the current context will cause all manner of syncing between contexts, often in deeply undocumented ways. I've dealt with a couple of AMD and Intel driver bugs where some things that were supposed to be synchronized via other means could ONLY be synchronized with a redundant MakeCurrent call every frame.

Related

Can I bypass cache in OpenCL?

I have actually never met a case where I would need a value I wrote to global memory to be cached. But I can find no way to stop the GPU from polluting the cache, as I can do on a CPU by using non-temporal writes.
It's a serious problem that can drop the performance by 20% or more.
There is little recent info about this, but what makes you think writes are cached at all? Unless you are using atomic operations, the GPU does not care about coherency. If you read a memory location after you write into it, you get undefined results even within the same work group, unless you put a global memory barrier in between the operations. That means caching the written value is pointless, because at that point all of your shader executions must have already written their data. You can be sure that won't fit in any cache!
GPUs are a completely different beast from CPUs. Concepts found in one don't easily translate to the other.
These are just my assumptions, which could be wrong, but what I'm sure of is that vendors try their best to optimize their GPUs for the currently most common operations done on them, just so they can boast by achieving a little higher FPS in current titles than the competition. Trying to outsmart them is generally not a good idea.

Do I need to use glMapRange with buffer orphaning to avoid stalls?

I've been told that buffer orphaning (i.e. calling glBufferData() with NULL for the final arg) allows us to avoid stalls that occur if GPU is trying to read a buffer object while CPU is trying to write to it.
What I'm not clear on is whether we can use this approach without glMapBuffer*(), or whether glMapBuffer*() is integral to the idea of orphaning and stall avoidance? I ask this purely to avoid making unnecessary changes to an existing codebase, though I understand glMapBuffer*() to be an inherently better choice than repeated glBufferData() in the long run.
(Please respond specific to OpenGL ES 2.0, unless the answer is general across GL versions.)
The answer was in the official location:
The first way is to call glBufferData with a NULL pointer, and the exact same size and usage hints it had before. This allows the implementation to simply reallocate storage for that buffer object under-the-hood. Since allocating storage is (likely) faster than the implicit synchronization, you gain significant performance advantages over synchronization. And since you passed NULL, if there wasn't a need for synchronization to begin with, this can be reduced to a no-op. The old storage will still be used by the OpenGL commands that have been sent previously. If you continue to use the same size over-and-over, it is likely that the GL driver will not be doing any allocation at all, but will just be pulling an old free block off the unused buffer queue and use it (though of course this isn't guaranteed), so it is likely to be very efficient.
You can do the same thing when using glMapBufferRange with the GL_MAP_INVALIDATE_BUFFER_BIT. You can also use glInvalidateBufferData, where available.
So I suppose it's safe to assume that only glBufferData() calls are needed for orphaning.
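The per-frame pattern would look something like this sketch (assumes a streaming VBO already created with glBufferData; in ES 2.0 the upload after the orphan is done with glBufferSubData):

```cpp
// Orphan the buffer, then refill it, without waiting for the GPU.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
// Same size, same usage, NULL data: the driver detaches the old storage
// (still in use by queued draw calls) and gives us fresh storage.
glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_STREAM_DRAW);
// Now write the new frame's data into the fresh storage; no stall needed.
glBufferSubData(GL_ARRAY_BUFFER, 0, bufferSize, vertices);
```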

Atomicity, Volatility and Thread Safety in Windows

It's my understanding of atomicity that it's used to make sure a value will be read/written in whole rather than in parts. For example, a 64-bit value that is really two 32-bit DWORDs (assume x86 here) must be read/written atomically when shared between threads, so that both DWORDs are read/written at the same time. That way one thread can't read a half-updated variable. How do you guarantee atomicity?
Furthermore it's my understanding that volatility does not guarantee thread safety at all. Is that true?
I've seen it implied many places that simply being atomic/volatile is thread-safe. I don't see how that is. Won't I need a memory barrier as well to ensure that any values, atomic or otherwise, are read/written before they can actually be guaranteed to be read/written in the other thread?
So for example let's say I create a thread suspended, do some calculations to change some values to a struct available to the thread and then resume, for example:
HANDLE hThread = CreateThread(NULL, 0, thread_entry, (void *)&data, CREATE_SUSPENDED, NULL);
data->val64 = SomeCalculation();
ResumeThread(hThread);
I suppose this would depend on any memory barriers in ResumeThread? Should I do an interlocked exchange for val64? What if the thread were running, how does that change things?
I'm sure I'm asking a lot here but basically what I'm trying to figure out is what I asked in the title: a good explanation for atomicity, volatility and thread safety in Windows. Thanks
it's used to make sure a value will be read/written in whole
That's just a small part of atomicity. At its core it means "uninterruptible", an instruction on a processor whose side-effects cannot be interleaved with another instruction. By design, a memory update is atomic when it can be executed with a single memory-bus cycle. Which requires the address of the memory location to be aligned so that a single cycle can update it. An unaligned access requires extra work, part of the bytes written by one cycle and part by another. Now it is not uninterruptible anymore.
Getting aligned updates is pretty easy, it is a guarantee provided by the compiler. Or, more broadly, by the memory model implemented by the compiler. Which simply chooses memory addresses that are aligned, sometimes intentionally leaving unused gaps of a few bytes to get the next variable aligned. An update to a variable that's larger than the native word size of the processor cannot be atomic with a plain memory access.
But much more important are the kind of processor instructions you need to make threading work. Every processor implements a variant of the CAS instruction, compare-and-swap. It is the core atomic instruction you need to implement synchronization. Higher level synchronization primitives, like monitors (aka condition variables), mutexes, signals, critical sections and semaphores are all built on top of that core instruction.
That's the minimum; a processor usually provides extra ones to make simple operations atomic. Like incrementing a variable, at its core an interruptible operation since it requires a read-modify-write. Having a need for it to be atomic is very common; almost any C++ program relies on it, for example to implement reference counting.
volatility does not guarantee thread safety at all
It doesn't. It is an attribute that dates from much easier times, back when machines only had a single processor core. It only affects code generation, in particular the way a code optimizer tries to eliminate memory accesses and use a copy of the value in a processor register instead. Makes a big, big difference to code execution speed, reading a value from a register is easily 3 times faster than having to read it from memory.
Applying volatile ensures that the code optimizer does not consider the value in the register to be accurate and forces it to read memory again. It truly only matters on the kind of memory values that are not stable by themselves, devices that expose their registers through memory-mapped I/O. It has been abused heavily since that core meaning to try to put semantics on top of processors with a weak memory model, Itanium being the most egregious example. What you get with volatile today is strongly dependent on the specific compiler and runtime you use. Never use it for thread-safety, always use a synchronization primitive instead.
simply being atomic/volatile is thread-safe
Programming would be much simpler if that was true. Atomic operations only cover the very simple operations, a real program often needs to keep an entire object thread-safe. Having all its members updated atomically and never expose a view of the object that is partially updated. Something as simple as iterating a list is a core example, you can't have another thread modifying the list while you are looking at its elements. That's when you need to reach for the higher-level synchronization primitives, the kind that can block code until it is safe to proceed.
Real programs often suffer from this synchronization need and exhibit Amdahl's law behavior. In other words, adding an extra thread does not actually make the program faster, and sometimes actually makes it slower. Whoever finds a better mouse-trap for this is guaranteed a Nobel; we're still waiting.
In general, C and C++ don't give any guarantees about how reading or writing a 'volatile' object behaves in multithreaded programs. (The 'new' C++11 probably does, since it now includes threads as part of the standard, but traditionally threads have not been part of standard C or C++.) Using volatile and making assumptions about atomicity and cache-coherence in code that's meant to be portable is a problem. It's a crap-shoot as to whether a particular compiler and platform will treat accesses to 'volatile' objects in a thread-safe way.
The general rule is: 'volatile' is not enough to ensure thread safe access. You should use some platform-provided mechanism (usually some functions or synchronisation objects) to access thread-shared values safely.
Now, specifically on Windows, specifically with the VC++ 2005+ compiler, and specifically on x86 and x64 systems, accessing a primitive object (like an int) can be made thread-safe if:
On 64- and 32-bit Windows, the object has to be a 32-bit type, and it has to be 32-bit aligned.
On 64-bit Windows, the object may also be a 64-bit type, and it has to be 64-bit aligned.
It must be declared volatile.
If those are true, then accesses to the object will be volatile, atomic and be surrounded by instructions that ensure cache-coherency. The size and alignment conditions must be met so that the compiler makes code that performs atomic operations when accessing the object. Declaring the object volatile ensures that the compiler doesn't make code optimisations related to caching previous values it may have read into a register and ensures that code generated includes appropriate memory barrier instructions when it's accessed.
Even so, you're probably still better off using something like the Interlocked* functions for accessing small things, and bog-standard synchronisation objects like Mutexes or CriticalSections for larger objects and data structures. Ideally, use libraries and data structures that already include appropriate locks. Let your libraries & OS do the hard work as much as possible!
In your example, I expect you do need to use a thread-safe access to update val64 whether the thread is started yet or not.
If the thread was already running, then you would definitely need some kind of thread-safe write to val64, either using InterlockedExchange64 or similar, or by acquiring and releasing some kind of synchronisation object which will perform appropriate memory barrier instructions. Similarly, the thread would need to use a thread-safe accessor to read it as well.
In the case where the thread hasn't been resumed yet, it's a bit less clear. It's possible that ResumeThread might use or act like a synchronisation function and do the memory barrier operations, but the documentation doesn't specify that it does, so it is better to assume that it doesn't.
References:
On atomicity of 32- and 64- bit aligned types... https://msdn.microsoft.com/en-us/library/windows/desktop/ms684122%28v=vs.85%29.aspx
On 'volatile' including memory fences... https://msdn.microsoft.com/en-us/library/windows/desktop/ms686355%28v=vs.85%29.aspx

Multiple contexts per application vs multiple applications per context

I was wondering whether it is a good idea to create a "system" wide rendering server that is responsible for the rendering of all application elements. Currently, applications usually have their own context, meaning whatever data might be identical across different applications, it will be duplicated in GPU memory and the more frequent resource management calls only decrease the count of usable render calls. From what I understand, the OpenGL execution engine/server itself is sequential/single threaded in design. So technically, everything that might be reused across applications, and especially heavy stuff like bitmap or geometry caches for text and UI, is just clogging the server with unnecessary transfers and memory usage.
Are there any downsides to having a scenegraph shared across multiple applications? Naturally, assuming the correct handling of clients which accidentally freeze.
I was wondering whether it is a good idea to create a "system" wide rendering server that is responsible for the rendering of all application elements.
That depends on the task at hand. A small detour: take a web browser, for example, where JavaScript performs manipulations on the DOM, and CSS transforms and SVG elements define graphical elements. Each JavaScript handler called in response to an event may run as a separate thread/lightweight process. In a manner of speaking, the web browser is a rendering engine (heck, they're internally even called rendering engines) for a whole bunch of applications.
And for that it's a good idea.
And in general display servers are a very good thing. Just have a look at X11, which has an incredible track record. These days Wayland is all the hype, and a lot of people drank the Kool-Aid, but you actually want the abstraction of a display server. However not for the reasons you thought. The main reason to have a display server is to avoid redundant code (not redundant data) and to have only a single entity to deal with the dirty details (color spaces, device physical properties) and provide optimized higher order drawing primitives.
But in regard with the direct use of OpenGL none of those considerations matter:
Currently, applications usually have their own context, meaning whatever data might be identical across different applications,
So? Memory is cheap. And you don't gain performance by coalescing duplicate data, because the only thing that matters for performance is the memory bandwidth required to process this data. But that bandwidth doesn't change because it only depends on the internal structure of the data, which however is unchanged by coalescing.
In fact, deduplication creates significant overhead: when one application makes changes that must not affect the other application, a copy-on-write operation has to be invoked, which is not free and usually means a full copy. And while making that copy, memory bandwidth is consumed.
However for a small, selected change in the data of one application, with each application having its own copy the memory bus is blocked for much shorter time.
it will be duplicated in GPU memory and the more frequent resource management calls only decrease the count of usable render calls.
Resource management and rendering normally do not interfere with each other. While the GPU is busy turning scalar values into points, lines and triangles, the driver on the CPU can do the housekeeping. In fact, a lot of performance is gained by making the CPU do non-rendering work while the GPU is busy rendering.
From what I understand, the OpenGL execution engine/server itself is sequential/single threaded in design
Where did you read that? There's no such constraint/requirement on this in the OpenGL specifications and real OpenGL implementations (=drivers) are free to parallelize as much as they want.
just clogging the server with unnecessary transfers and memory usage.
Transfer happens only once, when the data gets loaded. Memory bandwidth consumption is unchanged by deduplication. And memory is so cheap these days, that data deduplication simply isn't worth the effort.
Are there any downsides to having a scenegraph shared across multiple applications? Naturally, assuming the correct handling of clients which accidentally freeze.
I think you completely misunderstand the nature of OpenGL. OpenGL is not a scene graph. There's no scene, and there are no models in OpenGL. Each application has its own layout of data, and eventually this data gets passed into OpenGL to draw pixels onto the screen.
To OpenGL however there are just drawing commands to turn arrays of scalar values into points, lines and triangles on the screen. There's nothing more to it.

How to temporarily disable OpenGL command queueing, for more accurate profiling results?

In Mac OS X's OpenGL Profiler app, I can get statistics regarding how long each GL function call takes. However, the results show that a ton of time is spent in flush commands (glFlush, glFlushRenderAPPLE, CGLFlushDrawable) and in glDrawElements, and every other GL function call's time is negligibly small.
I assume this is because OpenGL is enqueueing the commands I submit, and waiting until flushing or drawing to actually execute the commands.
I guess I could do something like this:
glFlush();
startTiming();
glDoSomething();
glFlush();
stopTimingAndRecordDelta();
...and insert that pattern around every GL function call my app makes, but that would be tedious since there are thousands of GL function calls throughout my app, and I'd have to tabulate the results manually (instead of using the already-existent OpenGL Profiler tool).
So, is there a way to disable all OpenGL command queueing, so I can get more accurate profiling results?
So, is there a way to disable all OpenGL command queueing, ...
No, there isn't an OpenGL function that does that.
..., so I can get more accurate profiling results?
You can get more accurate information than you are currently, but you'll never get really precise answers (though you can probably get what you need). While the results of OpenGL rendering are the "same" — OpenGL's not guaranteed to be pixel-accurate across implementations — they're supposed to be very close. However, how the pixels are generated can vary drastically. In particular, tiled renderers (in mobile and embedded devices) usually don't render pixels during a draw call, but rather queue up the geometry and generate the pixels at buffer swap.
That said, for profiling OpenGL, you want to use glFinish instead of glFlush. glFinish will force all pending OpenGL calls to complete before returning; glFlush merely requests that commands be sent to the GPU "at some time in the future", so it's not deterministic. Be sure to remove the glFinish calls from your "production" code, since they will really slow down your application. If you replace the flushes with finishes in your example, you'll get more interesting information.
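The question's pattern then becomes the following sketch (glDoSomething and recordDelta are placeholders from the question, and a CPU clock such as std::chrono::steady_clock stands in for startTiming/stopTiming):

```cpp
glFinish();                                    // drain everything queued so far
auto t0 = std::chrono::steady_clock::now();
glDoSomething();                               // the call being measured
glFinish();                                    // wait until the GPU has executed it
auto t1 = std::chrono::steady_clock::now();
recordDelta(t1 - t0);                          // hypothetical bookkeeping
```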
You are using OpenGL 3, and in particular discussing OS X. Mavericks (10.9) supports Timer Queries, which you can use to time a single GL operation or an entire sequence of operations at the pipeline level. That is, how long they take to execute when GL actually gets around to performing them, rather than timing how long a particular API call takes to return (which is often meaningless). You can only have a single timer query in the pipeline at a given time unfortunately, so you may have to structure your software cleverly to make best use of them if you want command-level granularity.
I use them in my own work to time individual stages of the graphics engine. Things like how long it takes to update shadow maps, build the G-Buffers, perform deferred / forward lighting, individual HDR post-processing effects, etc. It really helps identify bottlenecks if you structure the timer queries this way instead of focusing on individual commands.
For instance, on some fill-rate-limited hardware, shadow map generation is the biggest bottleneck; on other, shader-limited hardware, lighting is. You can even use the results to determine the optimal shadow map resolution or lighting quality to meet a target framerate for a particular host, without requiring the user to set these parameters manually. If you simply timed how long the individual operations took, you would never get the bigger picture, but if you time entire sequences of commands that actually do some major part of your rendering, you get neatly packed information that can be a lot more useful than even the output from profilers.
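A sketch of timing one engine stage with a timer query (assumes a GL 3.3+ context with ARB_timer_query; renderShadowMaps is a hypothetical engine stage):

```cpp
// Time a whole pass with GL_TIME_ELAPSED instead of per-call CPU timing.
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
renderShadowMaps();                        // hypothetical engine stage
glEndQuery(GL_TIME_ELAPSED);

// Later (e.g. next frame), read it back without stalling the pipeline:
GLint available = 0;
glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
if (available) {
    GLuint64 ns = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns);  // GPU time in nanoseconds
    // ...feed ns into the engine's profiling display...
}
```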
