Thread effectiveness for graphical applications - performance

Let's say I'm creating lighting for a scene, using my own shaders. It seems like a good example of work that could be divided between many threads, for example by splitting the scene into smaller sub-scenes and rendering them in separate threads.
Should I divide it into threads manually, or is the graphics library going to divide such operations automatically? Or is it library dependent (I'm using libGDX, which appears to be using OpenGL)? Or maybe there is another reason why I should leave it all in one thread?
If I should divide the workload between threads manually, how many threads should I use? Is the number of threads in such a situation graphics-card dependent or processor dependent?

OpenGL does not support multi-threaded rendering in the sense you describe: an OpenGL context can be current on only one thread at a time, so you cannot issue draw calls for the same context from several threads at once.
What you could do to potentially gain some performance is separate your update logic and your rendering logic into separate threads. However, you cannot leverage multiple threads for issuing OpenGL rendering commands.
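For example, here is a minimal sketch of that split in plain C++ with std::thread (not libGDX-specific; SceneState, the update step and the render call are placeholder names): one worker thread advances the simulation while the thread that owns the OpenGL context does all the drawing.

    #include <atomic>
    #include <mutex>
    #include <thread>

    struct SceneState {
        float lightAngle = 0.0f;          // stand-in for whatever the update step produces
    };

    std::mutex        stateMutex;
    SceneState        sharedState;        // written by the update thread, read by the render thread
    std::atomic<bool> running{true};

    void updateLoop() {                   // no OpenGL calls in here
        SceneState local;                 // private working copy
        while (running) {
            local.lightAngle += 0.01f;    // stand-in for the real update work
            std::lock_guard<std::mutex> lock(stateMutex);
            sharedState = local;          // publish the newest state
        }
    }

    int main() {
        std::thread updater(updateLoop);
        for (int frame = 0; frame < 1000; ++frame) {    // this thread owns the GL context
            SceneState snapshot;
            {
                std::lock_guard<std::mutex> lock(stateMutex);
                snapshot = sharedState;   // copy, then render without holding the lock
            }
            // render(snapshot);          // all glDraw*/SwapBuffers calls stay on this thread
        }
        running = false;
        updater.join();
    }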

Related

OpenGL rendering & display in different processes [duplicate]

Let's say I have an application A which is responsible for painting stuff on-screen via the OpenGL library. For tight integration purposes I would like to let this application A do its job, but render into an FBO or directly into a renderbuffer, and allow an application B to have read-only access to this buffer to handle the on-screen display (basically rendering it as a 2D texture).
It seems FBOs belong to OpenGL contexts, and contexts are not shareable between processes. I definitely understand that allowing several processes to mess with the same context is evil. But in my particular case, I think it's reasonable to expect it could be pretty safe.
EDIT:
Render size is near full screen; I was thinking of a 2048x2048 32-bit buffer (I don't use the alpha channel for now, but why not later).
Framebuffer objects cannot be shared between OpenGL contexts, whether or not those contexts belong to the same process. But textures can be shared, and a texture can be used as the color buffer attachment of a framebuffer object.
Sharing OpenGL contexts between processes is actually possible if the graphics system provides an API for this job. In the case of X11/GLX it is possible to share indirect rendering contexts between multiple processes. It may be possible on Windows by employing a few really, really crude hacks. On macOS, I have no idea how to do this.
So what's probably easiest is to use a pixel buffer object to get performant access to the rendered picture, send it over to the other application through shared memory, and upload it into a texture there (again through a pixel buffer object).
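A rough sketch of the readback side of that (context creation, function loading and the rendering itself are omitted; width, height and sharedMemoryRegion are assumed to be defined elsewhere, and the shared-memory transport is only hinted at with a memcpy):

    // Asynchronous readback of the rendered frame through a pixel buffer object.
    GLuint pbo;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);

    // With a PBO bound to GL_PIXEL_PACK_BUFFER, the last argument is an offset
    // into the buffer, not a client pointer, so this call returns without stalling.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

    // Later (ideally a frame later, to avoid a sync point), map and hand off the pixels.
    if (void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
        memcpy(sharedMemoryRegion, pixels, width * height * 4);  // your IPC buffer
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);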
On macOS, you can use IOSurface to share a framebuffer between two applications.
In my understanding, you won't be able to share objects between processes under Windows unless they are kernel-mode objects. Even shared textures and contexts can create performance hits, and they also give you the additional responsibility of syncing the SwapBuffers() calls. The OpenGL implementation is notorious on the Windows platform in particular.
In my opinion, you can rely on inter-process communication mechanisms like events, mutexes, window messages, or pipes to sync the rendering, but realize that there is a performance cost to this approach. Kernel-mode objects are fine, but each transition to kernel mode has a non-trivial cost, which adds up quickly in a high-performance rendering application. In my opinion, you should reconsider the multi-process rendering design.
On Linux, a solution is to use DMABUF, as explained in this blog: https://blaztinn.gitlab.io/post/dmabuf-texture-sharing/

Multiple contexts per application vs multiple applications per context

I was wondering whether it is a good idea to create a "system" wide rendering server that is responsible for the rendering of all application elements. Currently, applications usually have their own contexts, meaning that whatever data might be identical across different applications is duplicated in GPU memory, and the more frequent resource-management calls only reduce the number of render calls that can be issued. From what I understand, the OpenGL execution engine/server itself is sequential/single-threaded in design. So technically, everything that might be reused across applications, especially heavy stuff like bitmap or geometry caches for text and UI, is just clogging the server with unnecessary transfers and memory usage.
Are there any downsides to having a scene graph shared across multiple applications? Naturally, assuming correct handling of clients that accidentally freeze.
I was wondering whether it is a good idea to create a "system" wide rendering server that is responsible for the rendering of all application elements.
That depends on the task at hand. A small detour: take a web browser, for example, where JavaScript performs manipulations on the DOM, and CSS transforms and SVG elements define graphical elements. Each piece of JavaScript called in response to an event may run as a separate thread/lightweight process. In a manner of speaking, the web browser is a rendering engine (heck, they're internally even called rendering engines) for a whole bunch of applications.
And for that it's a good idea.
And in general display servers are a very good thing. Just have a look at X11, which has an incredible track record. These days Wayland is all the hype, and a lot of people drank the Kool-Aid, but you actually want the abstraction of a display server. However, not for the reasons you might think. The main reason to have a display server is to avoid redundant code (not redundant data), to have a single entity deal with the dirty details (color spaces, device physical properties), and to provide optimized higher-order drawing primitives.
But with regard to the direct use of OpenGL, none of those considerations matter:
Currently, applications usually have their own context, meaning whatever data might be identical across different applications,
So? Memory is cheap. And you don't gain performance by coalescing duplicate data, because the only thing that matters for performance is the memory bandwidth required to process that data, and that bandwidth depends only on the internal structure of the data, which is unchanged by coalescing.
In fact, deduplication creates significant overhead: when one application makes changes that are not supposed to affect the other application, a copy-on-write operation has to be invoked. That is not free; it usually means a full copy, and while that copy is being made, the memory bandwidth is consumed.
With each application having its own copy, by contrast, a small, selected change in one application's data blocks the memory bus for a much shorter time.
it will be duplicated in GPU memory and the more frequent resource management calls only decrease the count of usable render calls.
Resource management and rendering normally do not interfere with each other. While the GPU is busy turning scalar values into points, lines and triangles, the driver on the CPU can do the housekeeping. In fact, a lot of performance is gained by keeping the CPU busy with non-rendering work while the GPU is busy rendering.
From what I understand, the OpenGL execution engine/server itself is sequential/single threaded in design
Where did you read that? There's no such constraint/requirement in the OpenGL specification, and real OpenGL implementations (i.e. drivers) are free to parallelize as much as they want.
just clogging the server with unnecessary transfers and memory usage.
Transfer happens only once, when the data gets loaded. Memory bandwidth consumption is unchanged by deduplication. And memory is so cheap these days that data deduplication simply isn't worth the effort.
Are there any downsides to having a scene graph shared across multiple applications? Naturally, assuming correct handling of clients that accidentally freeze.
I think you completely misunderstand the nature of OpenGL. OpenGL is not a scene graph. There is no scene and there are no models in OpenGL. Each application has its own layout of data, and eventually that data gets passed to OpenGL to draw pixels onto the screen.
To OpenGL there are just drawing commands that turn arrays of scalar values into points, lines and triangles on the screen. There's nothing more to it.
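To make that concrete, here is a minimal sketch of what an application actually hands to OpenGL (context creation and shader compilation omitted; the attribute layout is an illustrative assumption): a flat array of floats and a draw command.

    // A triangle is nothing more than six floats and a draw command to OpenGL.
    float vertices[] = {
        -0.5f, -0.5f,
         0.5f, -0.5f,
         0.0f,  0.5f,
    };

    GLuint vao, vbo;
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);

    glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 2 * sizeof(float), (void*)0);
    glEnableVertexAttribArray(0);

    // glUseProgram(shaderProgram);      // shader setup omitted
    glDrawArrays(GL_TRIANGLES, 0, 3);    // scalar values in, one triangle out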

OpenCL - resources for multiple command queues

I'm working on an application where I process a video feed in real time on my GPU, and once in a while I need to do some resource-intensive calculations on the GPU besides that. My problem is that I want to keep the video processing running at real-time speed while doing the extra work in parallel once it comes up.
The way I think this should be done is with two command queues, one for the real-time video processing and one for the expensive calculations. However, I have no idea how this will play out with the GPU's compute resources: will equally many workers be assigned to each command queue during parallel execution (so I could expect a slowdown of about 50% in my real-time computations), or is it device dependent?
The OpenCL specification leaves it up to the vendor to decide how to balance execution resources between multiple command queues. So a vendor could implement OpenCL in such a way that the GPU works on only one kernel at a time, and that would still be a legal implementation, in my opinion.
If you really want to solve your problem in a device-independent way, I think you need to figure out how to break up your large non-real-time computation into smaller computations.
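One way to do that, sketched against the standard OpenCL host API (backgroundQueue, bigKernel and the sizes are placeholders assumed to be set up elsewhere), is to launch the big job as a series of smaller NDRanges using the global work offset, so the real-time queue gets a chance to run in between:

    #include <algorithm>
    #include <CL/cl.h>

    // Launch one large 1D job as smaller chunks so the device can interleave
    // work from the real-time queue between them.
    void enqueueInChunks(cl_command_queue backgroundQueue, cl_kernel bigKernel,
                         size_t totalWork) {
        const size_t chunk = 1 << 18;          // tune so a single chunk stays short
        for (size_t offset = 0; offset < totalWork; offset += chunk) {
            size_t globalSize = std::min(chunk, totalWork - offset);
            clEnqueueNDRangeKernel(
                backgroundQueue,               // the non-real-time command queue
                bigKernel,                     // the expensive kernel
                1,                             // work_dim
                &offset,                       // global_work_offset: where this chunk starts
                &globalSize,                   // global_work_size for this chunk
                nullptr,                       // let the runtime pick the work-group size
                0, nullptr, nullptr);
            clFlush(backgroundQueue);          // hand the chunk to the device now
        }
    }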
AMD has some extensions (some of which I think were adopted into OpenCL 1.2) for device fission, which means you can reserve some portion of the device for one context and use the rest for other contexts.
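For reference, the OpenCL 1.2 entry point for that is clCreateSubDevices. A hedged sketch follows (device is assumed to exist; whether a given GPU supports partitioning at all is implementation-dependent, and many GPUs do not):

    #include <vector>
    #include <CL/cl.h>

    // Try to partition a device into sub-devices with a fixed number of
    // compute units each; returns an empty vector if unsupported.
    std::vector<cl_device_id> trySplitDevice(cl_device_id device, cl_uint cusPerPart) {
        cl_device_partition_property props[] = {
            CL_DEVICE_PARTITION_EQUALLY,
            static_cast<cl_device_partition_property>(cusPerPart),
            0                                    // list terminator
        };

        cl_uint count = 0;
        if (clCreateSubDevices(device, props, 0, nullptr, &count) != CL_SUCCESS || count == 0)
            return {};                           // partitioning not supported on this device

        std::vector<cl_device_id> subDevices(count);
        clCreateSubDevices(device, props, count, subDevices.data(), nullptr);
        // Build one context/queue on subDevices[0] for the real-time work and
        // another on subDevices[1] for the background work.
        return subDevices;
    }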

OpenGL multihead vsync with different refresh rates

How do I drive multiple displays (multihead) at different resolutions and refresh rates with OpenGL (on Windows 7) and still be able to share textures between the devices?
I have one multi-head GPU that drives 4 heads (it happens to be an AMD FirePro V7900, in case it matters). The heads all share a "scene" (vertex and texture data, etc.), but I want to render this scene each time a vsync occurs on a display (each head is essentially a different viewport). The catch is that the different heads may run at different refresh rates: for example, some displays may be at 60Hz, some at 30Hz, and some at 24Hz.
When I call SwapBuffers, the call blocks, so I can't tell which head needs to be rendered to next. I was hoping for something like Direct3D 9's IDirect3DSwapChain9::Present with the D3DPRESENT_DONOTWAIT flag and its associated D3DERR_WASSTILLDRAWING return value. Using that approach I could determine which head to render to next by round-robin polling the different heads until one succeeded. But I don't know what the equivalent approach is in OpenGL.
I've already discovered wglSwapIntervalEXT(1) to enable vsync, and I can switch between HDCs to render to the different windows with a single HGLRC. But the refresh-rate difference is what's tripping me up.
I'm not sure what I can do to have a single HGLRC render to all these displays at their different refresh rates. I assume it has to be a single HGLRC to make efficient use of shared textures (and other resources); correct me if I'm wrong. Duplicating the resources with multiple HGLRCs is not interesting to me, because I would expect that to cut my usable memory down to 25% (4 heads on 1 GPU, so I don't want 4 copies of any resource).
I'm open to the idea of using multiple threads, if that's what it takes.
Can someone tell me how to structure my main loop so that I can share resources but still drive the displays at their own refresh rates and resolutions?
You can share the OpenGL objects that carry data (buffers, textures and so on) by sharing the contexts. On Windows the call is named wglShareLists.
Using that, you can give each window its own rendering context running in its own thread, while all of the contexts share their data. Multiple-window V-sync is in fact one of the few cases where multithreaded OpenGL makes sense.
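A rough sketch of that structure on Win32/WGL (window and pixel-format setup omitted; wglSwapIntervalEXT must be loaded through wglGetProcAddress; drawViewportForThisHead is a placeholder for your per-head rendering):

    #include <windows.h>
    #include <thread>
    #include <vector>

    // One rendering thread per display/window, all contexts sharing resources.
    void displayThread(HDC hdc, HGLRC hglrc) {
        wglMakeCurrent(hdc, hglrc);     // bind this context to this thread only
        // wglSwapIntervalEXT(1);       // load via wglGetProcAddress, then enable vsync
        for (;;) {
            // drawViewportForThisHead();  // shared textures/VBOs, per-head viewport
            SwapBuffers(hdc);           // blocks until *this* head's vsync
        }
    }

    void startHeads(const std::vector<HDC>& hdcs) {
        std::vector<HGLRC> contexts;
        for (HDC hdc : hdcs)
            contexts.push_back(wglCreateContext(hdc));

        // Share textures, buffers, etc. between the first context and the others.
        // Do this before the contexts are made current and before resources exist.
        for (size_t i = 1; i < contexts.size(); ++i)
            wglShareLists(contexts[0], contexts[i]);

        for (size_t i = 0; i < hdcs.size(); ++i)
            std::thread(displayThread, hdcs[i], contexts[i]).detach();
    }

With vsync enabled, each thread's SwapBuffers blocks on its own display, so every head naturally runs at its own refresh rate.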
I have not done anything like this before.
Looks like you actually need multiple threads to get independent refresh rates.
An OpenGL render context can be active in only one thread at a time, and one thread can have only one active render context. Therefore, with multiple threads you will need multiple render contexts.
It is possible to share resources between OpenGL contexts, so it is not necessary to store the resources multiple times.

Drawing triangles with CUDA

I'm writing my own graphics library (yep, it's homework :) and using CUDA to do all the rendering and calculations fast.
I have a problem with drawing filled triangles. I wrote it in such a way that one process draws one triangle. It works pretty well when there are a lot of small triangles in the scene, but it completely breaks performance when the triangles are big.
My idea is to do two passes. In the first, compute only a table with information about the scanlines (draw from here to there); this would be a triangle-per-process calculation, like in the current algorithm. In the second pass, actually draw the scanlines, with more than one process per triangle.
But will it be fast enough? Maybe there is some better solution?
You can check this blog: A Software Rendering Pipeline in CUDA. I don't think it's the optimal way to do it, but at least the author shares some useful sources.
Second, read this paper: A Programmable, Parallel Rendering Architecture. I think it's one of the most recent papers on the topic, and it's also CUDA-based.
If I had to do this, I would go with a data-parallel rasterization pipeline like the one in Larrabee (which is tile-based) or even REYES, and adapt it to CUDA:
http://www.ddj.com/architect/217200602
http://home.comcast.net/~tom_forsyth/larrabee/Standford%20Forsyth%20Larrabee%202010.zip (see the second part of the presentation)
http://graphics.stanford.edu/papers/mprast/
I suspect that you have some misconceptions about CUDA and how to use it, especially since you refer to a "process" when, in CUDA terminology, there is no such thing.
For most CUDA applications, two things are important for getting good performance: optimizing memory access, and making sure that each 'active' CUDA thread in a warp performs the same operation at the same time as the other active threads in the warp. Both of these sound important for your application.
To optimize your memory access, you want to make sure that your reads from global memory and your writes to global memory are coalesced. You can read more about this in the CUDA programming guide, but essentially it means that adjacent threads in a half-warp must read from or write to adjacent memory locations. Also, each thread should read or write 4, 8 or 16 bytes at a time.
If your memory access pattern is random, then you might need to consider using texture memory. When you need to refer to memory that has been read by other threads in a block, then you should make use of shared memory.
In your case, I'm not sure what your input data is, but you should at least make sure that your writes are coalesced. You will probably have to invest some non-trivial amount of effort to get your reads to work efficiently.
For the second part, I would recommend that each CUDA thread process one pixel in your output image. With this strategy, you should watch out for loops in your kernels that will execute longer or shorter depending on the per-thread data. Each thread in your warps should perform the same number of steps in the same order. The only exception to this is that there is no real performance penalty for having some threads in a warp perform no operation while the remaining threads perform the same operation together.
Thus, I would recommend having each thread check if its pixel is inside a given triangle. If not, it should do nothing. If it is, it should compute the output color for that pixel.
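A minimal sketch of that per-pixel test in CUDA C++ (using edge functions for the inside test; the triangle struct, framebuffer format and launch configuration are illustrative assumptions, not the asker's code):

    #include <cuda_runtime.h>
    #include <cstdint>

    struct Tri { float2 a, b, c; uint32_t color; };

    // Signed area of edge (p0 -> p1) against point p; same sign on the same side.
    __device__ float edge(float2 p0, float2 p1, float2 p) {
        return (p1.x - p0.x) * (p.y - p0.y) - (p1.y - p0.y) * (p.x - p0.x);
    }

    // One thread per pixel: test the pixel against one triangle and shade it.
    __global__ void rasterizeTriangle(uint32_t* framebuffer, int width, int height,
                                      Tri tri) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float2 p = make_float2(x + 0.5f, y + 0.5f);   // pixel center
        float e0 = edge(tri.a, tri.b, p);
        float e1 = edge(tri.b, tri.c, p);
        float e2 = edge(tri.c, tri.a, p);

        // Inside if the pixel is on the same side of all three edges.
        bool inside = (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
                      (e0 <= 0 && e1 <= 0 && e2 <= 0);
        if (inside)
            framebuffer[y * width + x] = tri.color;   // coalesced row-major write
    }

    // Example launch: one 16x16 block per tile of the framebuffer.
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // rasterizeTriangle<<<grid, block>>>(devFramebuffer, width, height, tri);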
Also, I'd strongly recommend reading more about CUDA as it seems like you are jumping into the deep end without having a good understanding of some of the basic fundamentals.
Not to be rude, but isn't this what graphics cards are designed to do anyway? It seems like using the standard OpenGL and Direct3D APIs would make more sense.
Why not use those APIs for your basic rendering, rather than CUDA, which is much lower level? Then, if you want to do additional operations that they don't support, you can use CUDA to apply them on top, or maybe implement them as shaders.
