OpenGL Performance - performance

First let me explain the application a little bit. This is video security software that can display up to 48 cameras at once. Each video stream gets its own Windows HDC but they all use a shared OpenGL context. I get pretty good performance with OpenGL and it runs on Windows/Linux/Mac. Under the hood the contexts are created using wxWidgets 2.8 wxGLCanvas, but I don't think that has anything to do with the issue.
Now here's the issue. Say I take the same camera and display it in all 48 of my windows. This basically means I'm only decoding 30fps (which is done on a different thread anyway) but displaying up to 1440fps to take decoding out of the picture. I'm using PBOs to transfer the images over. Depending on whether pixel shaders and multitexturing are supported, I may use those to do YUV->RGB conversion on the GPU. Then I use a quad to position the texture and call SwapBuffers. All the OpenGL calls come from the UI thread. I've also tried doing YUV->RGB conversion on the CPU and experimented with GL_RGBA and GL_BGRA textures, but all formats still yield roughly the same performance. The problem is that I'm only getting around 1000fps out of the possible 1440fps (I know I shouldn't be measuring in fps, but it's easier in this scenario). The above scenario is using 320x240 (YUV420) video, which is roughly only 110MB/sec. If I use a 1280x720 camera then I get roughly the same framerate, which is nearly 1.3GB/sec. This tells me that it certainly isn't the texture upload speed. If I do the YUV->RGB conversion and scaling on the CPU and paint using a Windows DC then I can easily get the full 1440fps.
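As a sanity check on the bandwidth figures above, here is a minimal C++ sketch of the arithmetic. It assumes planar YUV 4:2:0 at 8 bits per sample (1.5 bytes per pixel); the function names are made up for illustration. At the measured rate of about 1000 frames per second it gives about 115 million bytes/sec for 320x240 and about 1.38 billion bytes/sec for 1280x720, which seems to line up with the quoted 110MB/sec and 1.3GB/sec if those were computed in binary units.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one planar YUV 4:2:0 frame: a full-resolution Y plane plus
// U and V planes subsampled by 2 in both directions.
inline std::uint64_t yuv420FrameBytes(std::uint64_t width, std::uint64_t height) {
    assert(width % 2 == 0 && height % 2 == 0);  // 4:2:0 needs even dimensions here
    return width * height + 2 * ((width / 2) * (height / 2));
}

// Total bytes uploaded per second when `framesPerSecond` frames are drawn
// across all windows combined.
inline std::uint64_t uploadBytesPerSecond(std::uint64_t width,
                                          std::uint64_t height,
                                          std::uint64_t framesPerSecond) {
    return yuv420FrameBytes(width, height) * framesPerSecond;
}
```

For example, `uploadBytesPerSecond(1280, 720, 1000)` is twelve times `uploadBytesPerSecond(320, 240, 1000)`, which is why a similar frame rate at both resolutions points away from upload bandwidth.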
The other thing to mention is that I've disabled vsync both on my video card and through OpenGL using wglSwapIntervalEXT. Also, there are no OpenGL errors being reported. However, when I profile the application with Very Sleepy, it seems to be spending most of its time in SwapBuffers. I'm assuming the issue is somehow related to my use of multiple HDCs or to SwapBuffers, but I'm not sure how else to do what I'm doing.
I'm no expert on OpenGL so if anyone has any suggestions or anything I would love to hear them. If there is anything that I'm doing that sounds wrong or any way I could achieve the same thing more efficiently I'd love to hear it.
Here are some links to glIntercept logs for a better understanding of all the OpenGL calls being made:
Simple RGB: https://docs.google.com/open?id=0BzGMib6CGH4TdUdlcTBYMHNTRnM
Shaders YUV: https://docs.google.com/open?id=0BzGMib6CGH4TSDJTZGxDanBwS2M
Profiling Information:
So after profiling it reported several redundant state changes which I'm not surprised by. I eliminated all of them and saw no noticeable performance difference which I kind of expected. I have 34 state changes per render loop and I am using several deprecated functions. I'll look into using vertex arrays which would solve these. However, I'm just doing one quad per render loop so I don't really expect much performance impact from this. Also keep in mind I don't want to rip everything out and go all VBOs because I still need to support some fairly old Intel chipset drivers that I believe are only OpenGL 1.4.
The thing that really interested me and it hadn't occurred to me before was that each context has its own front and back buffer. Since I'm only using one context the previous HDCs render call must finish writing to the back buffer before the swap can occur and then the next one can start writing to the back buffer again. Would it really be more efficient to use more than one context? Or should I look into rendering to textures (FBOs I think) instead and continue using one context?
EDIT: The original description mentioned using multiple OpenGL contexts, but I was wrong I'm only using one OpenGL context and multiple HDCs.
EDIT2: Added some information after profiling with gDEBugger.

Here is what I would try to make your application faster. Create one OpenGL render thread (or more if you have two or more video cards). A video card cannot process several contexts at the same time, so multiple OpenGL contexts end up waiting on one another. This thread does only OpenGL work, such as the YUV->RGB conversion (using an FBO to render to a texture). The camera threads send images to this thread, and the UI thread picks up the results to show them in the window.
You then have a queue of frames to process in the OpenGL context, and you can combine several frames into one texture and convert them in a single pass. This may be useful because you have up to 48 cameras. As another option, if the OpenGL thread is busy, you can convert some frames on the CPU.
From the log I see you often call the same methods:
glEnable(GL_TEXTURE_2D)
glMatrixMode(GL_TEXTURE)
glLoadIdentity()
glColor4f(1.000000,1.000000,1.000000,1.000000)
You can call these once per context instead of calling them for every render.
If I understand correctly, you use three textures, one for each YUV plane:
glTexSubImage2D(GL_TEXTURE_2D,0,0,0,352,240,GL_LUMINANCE,GL_UNSIGNED_BYTE,00000000)
glTexSubImage2D(GL_TEXTURE_2D,0,0,0,176,120,GL_LUMINANCE,GL_UNSIGNED_BYTE,000000)
glTexSubImage2D(GL_TEXTURE_2D,0,0,0,176,120,GL_LUMINANCE,GL_UNSIGNED_BYTE,00000000)
Try to use a single texture and do the calculation in the shader to pick the correct Y, U and V values for each pixel. It is possible; I did it in my application.
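To illustrate the single-texture idea, here is a CPU-side C++ sketch of the address calculation a shader would need. The layout is an assumption, not something taken from the logs: the Y plane is stored first, followed by the U plane and then the V plane, all in one buffer. `PackedYuv420` and `fetchYuv` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout: Y plane (w*h bytes), then U plane, then V plane
// (each (w/2)*(h/2) bytes), stored back to back in a single buffer.
struct PackedYuv420 {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> data;

    PackedYuv420(std::size_t w, std::size_t h)
        : width(w), height(h), data(w * h + 2 * ((w / 2) * (h / 2))) {
        assert(w % 2 == 0 && h % 2 == 0);
    }

    std::size_t uOffset() const { return width * height; }
    std::size_t vOffset() const { return uOffset() + (width / 2) * (height / 2); }
};

struct Yuv {
    std::uint8_t y, u, v;
};

// Mirrors the lookup a fragment shader would perform: one luma sample per
// pixel, one chroma sample shared by each 2x2 block of pixels.
inline Yuv fetchYuv(const PackedYuv420& frame, std::size_t x, std::size_t y) {
    assert(x < frame.width && y < frame.height);
    const std::size_t chromaWidth = frame.width / 2;
    const std::size_t chromaIndex = (y / 2) * chromaWidth + (x / 2);
    return Yuv{frame.data[y * frame.width + x],
               frame.data[frame.uOffset() + chromaIndex],
               frame.data[frame.vOffset() + chromaIndex]};
}
```

With this layout the whole frame can go up in one upload instead of three, at the cost of a few extra coordinate calculations per fragment.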

Related

Vulkan/OpenGL subpasses that fetch more than single fragment

So, Vulkan introduced subpasses, and OpenGL implements similar behaviour with ARM_framebuffer_fetch.
In the past, I have used framebuffer_fetch successfully for tonemapping post-effect shaders.
Back then the limitation was that one could only read the contents of the framebuffer at the location of the currently rendered fragment.
Now, what I wonder is whether there is any way by now in Vulkan (or even OpenGL ES) to read from multiple locations (for example to implement a blur kernel) without having a tiled hardware to store/load to RAM.
In theory I guess it should be possible: the first pass would just need to render slightly larger than the blur subpass, based on kernel size (so for example if the kernel size was 4 pixels then the resolved tile would need to be 4 pixels smaller than the in-tile buffer sizes), and some pixels would have to be rendered redundantly (on the overlaps of tiles).
Now, is there a way to do that?
I seem to recall having seen some Vulkan instruction related to subpasses that would allow to define the support size (which sounded like what I’m looking for now) but I can’t recall where I saw that.
So my questions:
1. With Vulkan on a mobile tiled renderer architecture, is it possible to forward-render some geometry and then render a full-screen blur over it, all within a single in-tile pass (without the hardware having to store the result of the intermediate pass to RAM first and then load the texture from RAM when blurring)? If so, how?
2. If the answer to 1 is yes, can it also be done in OpenGL ES?
Short answer, no. Vulkan subpasses still have the 1:1 fragment-to-pixel association requirements.

What is the WEBGL version/level required for the smaa post-processing example?

The SMAA post-processing example is by far the best antialiasing method in my tests, but it's extremely complex and I'm worried that it's most likely not WebGL 1.0, so it won't run on older PCs and devices at all.
Does anyone know what version it is?
And what is the actual load on GPU, is there a tool to inspect the milliseconds per frame? Relying just on dropped framerate is next to useless.
You can use the SMAAPass with WebGL 1. WebGL 2 is not even supported by three.js. Looking at the respective shader code, I would say it should also compile on older hardware.
I don't have any concrete performance measurements but I can guarantee that this pass will add noticeable overhead to your application, especially on mobile devices.
three.js R92

glBufferData very slow with big textures (sprite sheets) in Cocos2d-x 3.3

I'm working with Cocos2d-x to port my PC game to Android.
For the sprites part, I wanted to optimize the rendering process, so I decided to dynamically create sprite sheets that contain the frames for all the sprites.
Unfortunately, this makes the rendering process about 10-15 times slower than using small textures containing only the frames for the current sprite (on mobile device, on Windows everything runs smoothly).
I initially thought it could be related to the switching between the sheets (big textures like 4096*4096) when the rendering process would display one sprite from one sheet, then another from another sheet and so on... making a lot of switches between huge textures.
So I sorted the sprites before "putting" their frames in the sprite sheets, and I can confirm that the switches are now non-existent.
After a long investigation (profiling, tests, etc.) I finally found that one OpenGL function takes all the time:
glBufferData(GL_ARRAY_BUFFER, sizeof(_quadVerts[0]) * _numberQuads * 4, _quadVerts, GL_DYNAMIC_DRAW);
Calling this function takes a long time (profiler says more than 20 ms per call) if I use the big texture, quite fast if I use small ones (about 2 ms).
I don't really know OpenGL; I'm using it because Cocos2d-x uses it, and I'm not comfortable trying to debug/optimize the engine because I really think its authors are far better at that than I am :)
I might be misunderstanding something. I've been stuck on this for several days and I have no idea what I can do now.
Any clues ?
Note: I'm talking about glBufferData but I have the same issue with glBindFramebuffer, very slow with big textures. I assume this is all the same topic.
Thanks
glBufferData is normally a costly call because it involves a CPU-to-GPU transfer.
The logic behind Renderer::drawBatchedQuads is to flush the quads that have been buffered in a temporary array. The more quads you have to render, the more data has to be transferred.
Since the quads' properties (positions, texture, colors) are likely to change each frame, a CPU-to-GPU transfer is required every frame, as hinted by the GL_DYNAMIC_DRAW flag.
According to specs:
GL_DYNAMIC_DRAW: The data store contents will be modified repeatedly and used many times as the source for GL drawing commands.
There are possible alternatives to glBufferData such as glMapBuffer or glBufferSubData that could be used for comparison.
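One cheap thing to check before comparing alternatives is how many bytes each call actually uploads. The sketch below is only an illustration: `QuadVertex` is a hypothetical vertex with a 3-float position, a 4-byte color and a 2-float texture coordinate, and it assumes each element of `_quadVerts` is a single vertex, which is what the `* 4` in the call from the question suggests. It is not taken from the Cocos2d-x sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-vertex layout: position, color, texture coordinate.
struct QuadVertex {
    float x, y, z;             // 12 bytes
    std::uint8_t r, g, b, a;   //  4 bytes
    float u, v;                //  8 bytes
};
static_assert(sizeof(QuadVertex) == 24, "expected a tightly packed 24-byte vertex");

// Same formula as the size argument in the question:
// sizeof(vertex) * numberOfQuads * 4 vertices per quad.
inline std::size_t quadUploadBytes(std::size_t numberOfQuads) {
    assert(numberOfQuads > 0);
    return sizeof(QuadVertex) * numberOfQuads * 4;
}
```

Under these assumptions the upload size depends only on the number of quads, not on the dimensions of the bound texture, so logging this value for the big-sheet and small-texture cases would show whether the two cases really send different amounts of vertex data.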

Cocos2d 2.0 and OpenGL analyzer suggestions

I have analyzed my game running OpenGL Analyzer in Xcode. I am using Cocos2d 2.0 as a static library in my game and wonder whether any of the following suggestions will improve my performance. I have read some posts in other forums saying that I should not worry about this, but as I do have some performance issues I would like to understand whether those suggestions are likely to improve them.
Suggestions:
Overview:
Thinking:
In particular I refer to the suggestion where it says:
"recommended using VAO and VBO"
I also wonder why there are "Many small batch draw calls". I am using a spritebatch node, and this should avoid the issue.
The other suggestions also seem to make sense, but those are the most "frequent" ones, so I would like to start by analyzing those.
A "small batch draw call" is anything with fewer than n-many vertices. I am not sure the exact threshold used, but it is probably on the order of 100-200. What spritebatches really do is eliminate the need to split your draw calls up multiple times in order to switch bound textures, this does not automatically imply that each draw call is going to have more than 100 (or whatever n is defined as in this context) vertices; it is a strong possibility, but not necessary.
I would be far more concerned about non-VBO draw calls and not using VAOs to be honest, especially if you want your code to be forward-compatible.
The "Logical Buffer Load" and "Mipmapping Usage" warnings are very likely related; probably both having to do with FBOs. One of them is related to not using glClear (...) properly and the other is related to using a texture that does not have mipmaps.
Regarding logical buffer loads, you should look into GL_EXT_discard_framebuffer. Clearing the framebuffer this way is a really healthy optimization strategy for Tile-Based Deferred Rendering GPUs (such as the ones used by all iOS devices).
As for the mipmap usage warning, I believe this is being triggered because you are drawing into an FBO texture and then applying that texture using a mipmap filter mode. The mip-chain/pyramid for drawn FBOs has to be built manually using glGenerateMipmap (...).
If you can point me to some individual lines that trigger these warnings, I would be happy to explain them in further detail for you.

How to work with pixels using Direct2D

Could somebody provide an example of an efficient way to work with pixels using Direct2D?
For example, how can I swap all green pixels (RGB = 0x00FF00) with red pixels (RGB = 0xFF0000) on a render target? What is the standard approach? Is it possible to use ID2D1HwndRenderTarget for that? Here I assume using some kind of hardware acceleration. Should I create a different object for direct pixels manipulations?
Using DirectDraw I would use BltFast method on the IDirectDrawSurface7 with logical operation. Is there something similar with Direct2D?
Another task is to generate complex images dynamically where each point location and color is a result of a mathematical function. For the sake of an example let's simplify everything and draw Y = X ^ 2. How to do that with Direct2D? Ultimately I'm going to need to draw complex functions but if somebody could give me a simple example for Y = X ^ 2.
First, it helps to think of ID2D1Bitmap as a "device bitmap". It may or may not live in local, CPU-addressable memory, and it doesn't give you any convenient (or at least fast) way to read/write the pixels from the CPU side of the bus. So approaching from that angle is probably the wrong approach.
What I think you want is a regular WIC bitmap, IWICBitmap, which you can create with IWICImagingFactory::CreateBitmap(). From there you can call Lock() to get at the buffer, and then read/write using pointers and do whatever you want. Then, when you need to draw it on-screen with Direct2D, use ID2D1RenderTarget::CreateBitmap() to create a new device bitmap, or ID2D1Bitmap::CopyFromMemory() to update an existing device bitmap. You can also render into an IWICBitmap by making use of ID2D1Factory::CreateWicBitmapRenderTarget() (not hardware accelerated).
You will not get hardware acceleration for these types of operations. The updated Direct2D in Win8 (should also be available for Win7 eventually) has some spiffy stuff for this but it's rather complex looking.
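Once you have a pointer to the pixels, both tasks from the question are plain loops. The following C++ sketch works on an ordinary `std::vector` standing in for the memory you would get from Lock(); it makes no WIC or Direct2D calls. It assumes 32-bit pixels laid out as 0xAARRGGBB, and `replaceRgb` and `plotParabola` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Pixel = std::uint32_t;  // assumed layout: 0xAARRGGBB

// Replaces every pixel whose RGB part equals fromRgb with toRgb, keeping
// the original alpha. Returns the number of pixels changed.
inline std::size_t replaceRgb(std::vector<Pixel>& pixels, Pixel fromRgb, Pixel toRgb) {
    std::size_t changed = 0;
    for (Pixel& p : pixels) {
        if ((p & 0x00FFFFFFu) == (fromRgb & 0x00FFFFFFu)) {
            p = (p & 0xFF000000u) | (toRgb & 0x00FFFFFFu);
            ++changed;
        }
    }
    return changed;
}

// Plots y = x^2 with the origin at the bottom-left corner, one pixel per
// column, stopping once the curve leaves the image.
inline void plotParabola(std::vector<Pixel>& pixels, std::size_t width,
                         std::size_t height, Pixel color) {
    assert(pixels.size() == width * height);
    for (std::size_t x = 0; x < width; ++x) {
        const std::size_t y = x * x;
        if (y >= height) break;
        pixels[(height - 1 - y) * width + x] = color;  // row 0 is the top row
    }
}
```

Swapping green for red is then `replaceRgb(pixels, 0x00FF00, 0xFF0000)`. After the loop you would hand the buffer back to Direct2D with CreateBitmap() or CopyFromMemory() as described above.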
Rick's answer talks about the methods you can use if you don't care about losing hardware acceleration. I'm focusing on how to accomplish this using a substantial amount of GPU acceleration.
In order to keep your rendering hardware accelerated and to get the best performance, you are going to want to switch from ID2D1HwndRenderTarget to the newer ID2D1Device and ID2D1DeviceContext interfaces. It honestly doesn't add that much more logic to your code and the performance benefits are substantial. It also works on Windows 7 with the Platform Update. To summarize the process:
1. Create a DXGI factory when you create your D2D factory.
2. Create a D3D11 device and a D2D device to match.
3. Create a swap chain using your DXGI factory and the D3D device.
4. Ask the swap chain for its back buffer and wrap it in a D2D bitmap.
5. Render like before, between calls to BeginDraw() and EndDraw(). Remember to unbind the back buffer and destroy the D2D bitmap wrapping it!
6. Call Present() on the swap chain to see the results.
7. Repeat from 4.
Once you've done that, you have unlocked a number of possible solutions. Probably the simplest and most performant way to solve your exact problem (swapping color channels) is to use the color matrix effect, as one of the other answers mentioned. It's important to recognize, however, that you need the newer ID2D1DeviceContext interface rather than ID2D1HwndRenderTarget to get this. There are lots of other effects that can do more complicated operations if you so choose. Here are some of the most useful ones for simple pixel manipulation:
Color matrix effect
Arithmetic operation
Blend operation
For generally solving the problem of manipulating the pixels directly without dropping hardware acceleration or doing tons of copying, there are two options. The first is to write a pixel shader and wrap it in a completely custom D2D effect. It's more work than just getting the pixel buffer on the CPU and doing old-fashioned bit mashing, but doing it all on the GPU is substantially faster. The D2D effects framework also makes it super simple to reuse your effect for other purposes, combine it with other effects, etc.
For those times when you absolutely have to do CPU pixel manipulation but still want a substantial degree of acceleration, you can manage your own mappable D3D11 textures. For example, you can use staging textures if you want to asynchronously manipulate your texture resources from the CPU. There is another answer that goes into more detail. See ID3D11Texture2D for more information.
The specific issue of swapping all green pixels with red pixels can be addressed via ID2D1Effect as of Windows 8 and Platform Update for Windows 7.
More specifically, Color matrix effect.
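To show what such a matrix would contain, here is a CPU-side C++ sketch of the arithmetic. It is not Direct2D code. It assumes the row-vector convention documented for the color matrix effect, result = [R G B A 1] x M with a 5x4 matrix, and it operates on straight (non-premultiplied) normalized values. `applyColorMatrix` and `swapRedGreenMatrix` are hypothetical names.

```cpp
#include <array>
#include <cassert>

using Rgba = std::array<float, 4>;                       // normalized R, G, B, A
using Matrix5x4 = std::array<std::array<float, 4>, 5>;   // rows: R, G, B, A, offset

// Computes [R G B A 1] x M. Row i says how input channel i contributes to
// each output channel; the fifth row is a constant offset.
inline Rgba applyColorMatrix(const Rgba& in, const Matrix5x4& m) {
    assert(in[3] >= 0.0f && in[3] <= 1.0f);  // expects normalized alpha
    Rgba out{};
    for (int c = 0; c < 4; ++c) {
        out[c] = in[0] * m[0][c] + in[1] * m[1][c] + in[2] * m[2][c] +
                 in[3] * m[3][c] + m[4][c];
    }
    return out;
}

// Matrix that exchanges the red and green channels and leaves blue and
// alpha untouched.
inline Matrix5x4 swapRedGreenMatrix() {
    Matrix5x4 m{};     // all zeros
    m[0][1] = 1.0f;    // input red   -> output green
    m[1][0] = 1.0f;    // input green -> output red
    m[2][2] = 1.0f;    // blue passes through
    m[3][3] = 1.0f;    // alpha passes through
    return m;
}
```

Note that this swaps the two channels for every pixel, so pure red also becomes pure green. If only exact 0x00FF00 pixels should change, a per-pixel test is needed instead.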
