glReadPixels() slow on reading GL_DEPTH_COMPONENT - performance

My application depends on reading depth information back from the framebuffer. I've implemented this with glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data).
However, this runs unreasonably slowly: it brings my application from a smooth 30 fps down to a laggy 3 fps. If I read back other dimensions or other kinds of data, it runs at an acceptable speed.
To give an overview:
No glReadPixels -> 30 frames per second
glReadPixels(0, 0, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data); -> 20 frames per second, acceptable
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, &depth_data); -> 20 frames per second, acceptable
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data); -> 3 frames per second, not acceptable
Why should the last one be so slow compared to the other calls? Is there any way to remedy it?
width x height is approximately 100 x 1000; the call gets increasingly slower as I increase the dimensions.
I've also tried to use pixel buffer objects, but this has no significant effect on performance; it only delays the slowness until the glMapBuffer() call.
(I've tested this on a MacBook Air with NVIDIA 320M graphics on OS X 10.6; strangely enough, my old MacBook with an Intel GMA X3100 got ~15 fps while reading the depth buffer.)
UPDATE: leaving GLUT_MULTISAMPLE out of the glutInitDisplayMode options made a world of difference, bringing the application back to a smooth 20 fps. I don't know what the option does in the first place; can anyone explain?

If your main framebuffer is MSAA-enabled (GLUT_MULTISAMPLE is present), then two actual framebuffers are created: one with MSAA and one regular.
The first one is the one you render into. It contains front and back color surfaces, plus depth and stencil. The second one only has to contain a color surface, which is produced by resolving the corresponding MSAA surface.
However, when you try to read depth using glReadPixels, the driver is forced to resolve the MSAA-enabled depth surface as well, which probably causes your slowdown.
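That is exactly why dropping GLUT_MULTISAMPLE helped. As a minimal sketch (standard GLUT flags, nothing specific to your code), requesting a plain, non-multisampled framebuffer looks like this:
glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA | GLUT_DEPTH);   // no MSAA: depth reads stay cheap
// glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA | GLUT_DEPTH | GLUT_MULTISAMPLE);   // MSAA: every depth read forces a resolve first
With the second variant, every glReadPixels(..., GL_DEPTH_COMPONENT, ...) has to wait for that resolve before anything can be copied back.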

What storage format did you choose for your depth buffer?
If it is not GLfloat, then you're asking GL to convert every single depth value in the depth buffer to float when reading it. (The same goes for your third bullet with GL_RED: was your color buffer a float buffer?)
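If you want the read type to match the storage exactly, one option, sketched here under the assumption that GL 3.0 / ARB_framebuffer_object is available (width and height are the question's variables), is to render into an FBO whose depth format you control and read it back without any conversion:
GLuint fbo, depthRb;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glGenRenderbuffers(1, &depthRb);
glBindRenderbuffer(GL_RENDERBUFFER, depthRb);
glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, width, height);   // fixed-point depth storage
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_RENDERBUFFER, depthRb);
// attach a color buffer as well (or call glDrawBuffer(GL_NONE) for a depth-only pass), draw the scene, then:
std::vector<GLuint> depth(width * height);   // #include <vector>
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_UNSIGNED_INT, depth.data());   // matches the 24-bit storage, no float conversion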

No matter whether it is GL_FLOAT or GL_UNSIGNED_BYTE, glReadPixels is still very slow. If you use a PBO to get RGB values, it is very fast.
When using a PBO to handle RGB values, the CPU usage is 4%, but it increases to 50% when handling depth values. I've tried GL_FLOAT, GL_UNSIGNED_BYTE, GL_UNSIGNED_INT, and GL_UNSIGNED_INT_24_8, so I can only conclude that a PBO is useless for reading back depth values.
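For what it's worth, a PBO only helps if the map is actually deferred; the usual pattern is double buffering, sketched below under the assumption that this runs once per frame (width, height and depth_data are the question's variables, frame is a running frame counter, memcpy needs <cstring>). It still won't help if the driver has to do an expensive MSAA depth resolve first:
GLuint pbo[2];   // created once at startup
glGenBuffers(2, pbo);
for (int i = 0; i < 2; ++i) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * sizeof(GLfloat), NULL, GL_STREAM_READ);
}
// each frame: start this frame's transfer, map the one started last frame
int writeIdx = frame % 2;
int readIdx  = (frame + 1) % 2;
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[writeIdx]);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, 0);   // copies into the bound PBO and returns
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[readIdx]);
if (const void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
    memcpy(&depth_data, src, width * height * sizeof(GLfloat));   // last frame's depth, one frame of latency
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
++frame;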

Related

OpenGL slow on Windows

I am currently writing a small game engine using OpenGL.
The mesh data is uploaded to VBOs using GL_STATIC_DRAW.
Since I read that glBindBuffer is rather slow, I tried to minimize its use by accumulating the information needed for rendering and then rendering each VBO multiple times with only one glBindBuffer per VBO (a sort of batch rendering?).
Here is the code I use for the actual rendering:
int lastID = -1;
for (list<R_job>::iterator it = jobs.begin(); it != jobs.end(); ++it) {
    if (lastID == -1) {
        lastID = *it->vboID;
        glBindTexture(GL_TEXTURE_2D, it->texID);
        glBindBuffer(GL_ARRAY_BUFFER, *it->vboID);
        glVertexPointer(3, GL_FLOAT, 4*(3+2+3), 0);
        glTexCoordPointer(2, GL_FLOAT, 4*(3+2+3), (void*)(4*3));
        glNormalPointer(GL_FLOAT, 4*(3+2+3), (void*)(4*(3+2)));
    }
    if (lastID != *it->vboID) {
        glBindTexture(GL_TEXTURE_2D, it->texID);
        glBindBuffer(GL_ARRAY_BUFFER, *it->vboID);
        glVertexPointer(3, GL_FLOAT, 4*(3+2+3), 0);
        glTexCoordPointer(2, GL_FLOAT, 4*(3+2+3), (void*)(4*3));
        glNormalPointer(GL_FLOAT, 4*(3+2+3), (void*)(4*(3+2)));
        lastID = *it->vboID;
    }
    glPushMatrix();
    glMultMatrixf(value_ptr(*it->mat));       // the model matrix
    glDrawArrays(GL_TRIANGLES, 0, it->size);  // render
    glPopMatrix();
}
The list is sorted by the IDs of the VBOs, and the vertex data is interleaved.
My question is about speed. This code can render about 800 VBOs (all the same VBO; only glDrawArrays is called multiple times) at 30 fps on my 2010 MacBook. On my PC (Phenom II X4 955 / HD 5700), the fps drops below 30 at only 400 calls. Could someone explain this to me? I had hoped for a speedup on my PC. I am using GLFW and GLEW on both machines, with Xcode on the Mac and VS2012 on the PC.
EDIT:
The mesh I am rendering has about 600 verts.
I'm sure you were running in Release mode, but you might also consider running in Release mode without the debugger attached. I think you will find that doing so will solve your performance issues with list::sort. In my experience the VS debugger can make a significant performance impact when it's attached - far more so than gdb.
2000 entities is a reasonable place to start seeing some FPS drop. At that point you are making nearly 10,000 API calls per frame. To comfortably get higher than that, you will need to start doing something like instancing, so that you are drawing multiple entities with one call.
Finally, I would like to say that glBindBuffer is not an expensive operation, and not really something that you should be batching to avoid. If you are going to batch, batch to avoid changing shaders and shader state (uniforms). But don't batch just to avoid changing buffer objects.
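As an illustration of what instancing looks like, here is a sketch only: it assumes GL 3.1+ (or ARB_draw_instanced) and a shader-based pipeline, and the names program, batchMatrices (a contiguous array of 4x4 float matrices), batchCount and vertexCount are made up for the example.
// GLSL vertex shader side (shown as a comment):
//   uniform mat4 viewProj;
//   uniform mat4 models[256];
//   ...
//   gl_Position = viewProj * models[gl_InstanceID] * vec4(position, 1.0);
GLint loc = glGetUniformLocation(program, "models");
glUniformMatrix4fv(loc, batchCount, GL_FALSE, &batchMatrices[0][0][0]);   // upload the whole batch of model matrices at once
glDrawArraysInstanced(GL_TRIANGLES, 0, vertexCount, batchCount);          // one call draws batchCount entities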
So I guess I found the answer.
Enabling VSync in the AMD Control Center increased the fps greatly (probably because of GLFW's swapBuffers), though it was still below the values on my MacBook. After some debugging I found that list::sort is somehow VERY slow on Windows and was pulling my fps down...
So right now (without list::sort), the fps drops below 60 at around 2000 entities: 600 * 2000 = 1,200,000 vertices.
Should that be my graphics card's limit?
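If the sorting is still needed, a sketch of an alternative (R_job is the struct from the question; the comparator just groups draws by VBO, then by texture) is to keep the jobs in a contiguous std::vector and use std::sort, which tends to behave much better than std::list::sort, particularly in MSVC debug builds:
#include <algorithm>
#include <vector>

std::vector<R_job> jobs;
// ... fill jobs as before ...
std::sort(jobs.begin(), jobs.end(),
          [](const R_job& a, const R_job& b) {
              if (*a.vboID != *b.vboID) return *a.vboID < *b.vboID;   // group by VBO first
              return a.texID < b.texID;                               // then by texture
          });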

OpenGL rendering performance - default shader or no shader?

I have some code for rendering a video, so the OpenGL side of it (once the rendered frame is available in the target texture) is very simple: Just render it to the target rectangle.
What complicates things a bit is that I am using a third-party SDK to render the UI, so I cannot know what state changes it makes, and therefore every time I am rendering a frame I have to make sure all the states I need are set correctly.
I am using a vertex and a texture coordinate buffer to draw my rectangle like this:
glActiveTexture(GL_TEXTURE0);
glEnable(GL_TEXTURE_RECTANGLE_ARB);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texHandle);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);
glPushClientAttrib( GL_CLIENT_VERTEX_ARRAY_BIT );
glEnableClientState( GL_VERTEX_ARRAY );
glEnableClientState( GL_TEXTURE_COORD_ARRAY );
glBindBuffer(GL_ARRAY_BUFFER, m_vertexBuffer);
glVertexPointer(4, GL_FLOAT, 0, 0);
glBindBuffer(GL_ARRAY_BUFFER, m_texCoordBuffer);
glTexCoordPointer(2, GL_FLOAT, 0, 0);
glDrawArrays(GL_QUADS, 0, 4);
glPopClientAttrib();
(Is there anything I can skip, even without knowing what happens inside the UI library?)
Now I wonder (and this is more theoretical, as I suppose there won't be much difference when drawing just one quad) whether it is theoretically faster to render like the above, or to instead write a simple default vertex and fragment shader that does nothing more than return ftransform() for the position and uses the default path for the fragment color as well.
I wonder whether by using a shader I can skip certain state changes or generally speed things up, or whether with the above code OpenGL internally does just that and the outcome will be exactly the same.
If you are worried about clobbering the UI SDK state, you should wrap the code with glPushAttrib(GL_ENABLE_BIT | GL_TEXTURE_BIT) ... glPopAttrib() as well.
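A sketch of that wrap around the code from the question (the draw calls themselves are unchanged):
glPushAttrib(GL_ENABLE_BIT | GL_TEXTURE_BIT);    // saves enables, texture binding and texture environment
glPushClientAttrib(GL_CLIENT_VERTEX_ARRAY_BIT);
// ... the glEnable / glBindTexture / glTexEnvi / buffer and draw calls from above ...
glPopClientAttrib();
glPopAttrib();                                   // the UI SDK sees the same state it left behind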
You could simplify the state management code a bit by using a vertex array object.
As to using a shader, for this simple program I wouldn't bother. It would be one more bit of state you'd have to save & restore, and you're right that internally OpenGL is probably doing just that for the same outcome.
On speeding things up: performance is going to be dominated by the cost of sending tens or hundreds of kilobytes of video frame data to the GPU, and adding or removing OpenGL calls is very unlikely to make a difference. I'd look first at possible differences in frame rate between the UI and the video stream: for example, if the UI redraws faster than the video updates, arrange for the video data to be copied to the texture once per video frame and re-used, instead of copying it each time the UI is redrawn.
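A sketch of that idea (newFrameReady, frameWidth, frameHeight and framePixels are hypothetical names for whatever your decoder provides; texHandle is the texture from your code, and the GL_BGRA / GL_UNSIGNED_INT_8_8_8_8_REV pair is just a commonly fast upload format on Macs, so use whatever matches the decoder output):
if (newFrameReady) {
    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texHandle);
    glTexSubImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0, frameWidth, frameHeight,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, framePixels);   // upload only when the video frame actually changed
    newFrameReady = false;
}
// ... then draw the textured quad exactly as in the question ...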
Hope this helps.

OpenGL ES rendering to user-space memory

I need to implement off-screen rendering to a texture on an ARM device with PowerVR SGX hardware.
Everything is done (pixel buffers and the OpenGL ES 2.0 API were used). The only unsolved problem is the very slow glReadPixels function.
I'm not an expert in OpenGL ES, so I'm asking the community: is it possible to render textures directly into user-space memory? Or maybe there is some way to get the hardware address of a texture's memory region? Some other technique (EGL extensions)?
I don't need a universal solution, just a working one for PowerVR hardware.
Update: a little more information on the 'slow glReadPixels function'. Copying 512x512 texture data to CPU memory:
glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, &arr) takes 210 ms,
glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, &arr) takes 24 ms (GL_BGRA is not standard for glReadPixels; it's a PowerVR extension),
memcpy(&arr, &arr2, WIDTH * HEIGHT * 4) takes 5 ms.
With bigger textures, the differences are bigger too.
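For reference, OpenGL ES 2.0 lets you query the one extra format/type combination the implementation can read back, which presumably corresponds to the fast GL_BGRA path above, although the driver decides what it returns (a sketch reusing the question's WIDTH, HEIGHT and arr):
GLint readFormat = 0, readType = 0;
glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_FORMAT, &readFormat);   // e.g. GL_BGRA on some PowerVR drivers
glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_TYPE, &readType);
glReadPixels(0, 0, WIDTH, HEIGHT, (GLenum)readFormat, (GLenum)readType, &arr);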
Solved.
Here is how to force PowerVR hardware to render into user-allocated memory:
http://processors.wiki.ti.com/index.php/Render_to_Texture_with_OpenGL_ES#Pixmaps
An example of how to use it:
https://gforge.ti.com/gf/project/gleslayer/
After all of this, I can get the rendered image in as little as 5 ms.
When you call OpenGL functions, you're queuing commands into a render queue. Those commands are executed by the GPU asynchronously. When you call glReadPixels, the CPU must wait for the GPU to finish its rendering, so the call might simply be waiting for that draw to finish. On most hardware (at least the hardware I work on), memory is shared by the CPU and the GPU, so the pixel read should not be that slow once the rendering is done.
If you can wait for the result, or defer it to the next frame, you might not see that delay anymore.
Framebuffer objects are what you are looking for. They are supported in OpenGL ES and on PowerVR SGX.
EDIT:
Keep in mind that GPU/CPU hardware is heavily optimized for moving data in one direction, from the CPU side to the GPU side. The path back from the GPU to the CPU is often much slower (it's just not a priority to spend hardware resources on). So whatever technique you use (e.g. FBO / getTexImage), you're going to run up against this limit.
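For completeness, a minimal OpenGL ES 2.0 render-to-texture setup looks roughly like this (a sketch only; getting the pixels back to the CPU still needs glReadPixels or the pixmap approach linked above):
GLuint tex, fbo;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 512, 512, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);

if (glCheckFramebufferStatus(GL_FRAMEBUFFER) == GL_FRAMEBUFFER_COMPLETE) {
    // ... render the off-screen pass here; the result stays on the GPU in 'tex' ...
}
glBindFramebuffer(GL_FRAMEBUFFER, 0);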

glTexImage2D Segfault related to width/height

I got a segfault when I tried to load a 771x768 image.
I tried with a 24x24 and a 768x768 image, and they worked with no problem.
Is this expected? Why wouldn't it just fail gracefully with a GL error?
The segmentation fault occurs in the glTexImage2D call. I am loading a binary PPM file, so the data is tightly packed at 24 bits per pixel. This odd pixel size combined with an odd width probably produces rows that are not 4-byte (or even 2-byte) aligned, and reading outside of my exactly-large-enough buffer may be the cause of the error, but gdb does not show me a memory address (which I could use to find out whether this is what causes it).
glTexImage2D(GL_TEXTURE_2D, 0, 3, width, height, 0, GL_RGB, GL_UNSIGNED_BYTE, dataptr);
// in this specific case of failure, width = 771, height = 768,
// dataptr contains 1776384 bytes of binary RGB image data (771*768*3 = 1776384)
This odd pixel size combined with an odd width probably produces rows that are not 4-byte (or even 2-byte) aligned, and reading outside of my exactly-large-enough buffer may be the cause of the error
This is likely the cause. Luckily, you can set the alignment OpenGL uses when reading pixel data. Right before calling glTexImage…(…), do:
glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);
glPixelStorei(GL_UNPACK_SKIP_PIXELS, 0);
glPixelStorei(GL_UNPACK_SKIP_ROWS, 0);
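To see why only the 771-wide image crashes, the row arithmetic (numbers specific to this example) works out like this:
int width = 771, height = 768, bytesPerPixel = 3, alignment = 4;        // the default GL_UNPACK_ALIGNMENT is 4
int rowBytes  = width * bytesPerPixel;                                  // 2313 bytes, not a multiple of 4
int paddedRow = ((rowBytes + alignment - 1) / alignment) * alignment;   // 2316 bytes assumed per row
int glExpects = paddedRow * (height - 1) + rowBytes;                    // 1,778,685 bytes
int allocated = rowBytes * height;                                      // 1,776,384 bytes actually supplied
// GL walks more than 2 KB past the end of the buffer, hence the segfault.
// With GL_UNPACK_ALIGNMENT set to 1, paddedRow == rowBytes and everything fits.
// (24 * 3 and 768 * 3 are already multiples of 4, so those images never hit the padding.)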
I've read this in the OpenGL forums:
width must be 2^m + 2(border) for some integer m.
height must be 2^n + 2(border) for some integer n.
(source)
I found this, which I believe clarifies what's happening:
1. What should this extension be called?
STATUS: RESOLVED
RESOLUTION: ARB_texture_non_power_of_two. Conventional OpenGL
textures are restricted to size dimensions that are powers of two.
from GL_ARB_texture_non_power_of_two

glFlush() takes very long time on window with transparent background

I used the code from "How to make an OpenGL rendering context with transparent background?" to create a window with a transparent background. My problem is that the frame rate is very low: I get around 20 frames/sec even when I draw a single quad (made from 2 triangles). I tried to find out why, and glFlush() takes around 0.047 seconds. Do you have any idea why? The same thing rendered in a window without a transparent background runs at 6000 fps (when I remove the 60 fps limitation). It also pushes one core to 100%. I tested it on a Q9450 @ 2.66 GHz with an ATI Radeon 4800 on Win7.
I don't think you can get good performance this way. In the linked example there is the following code:
void draw(HDC pdcDest)
{
    assert(pdcDIB);
    verify(BitBlt(pdcDest, 0, 0, w, h, pdcDIB, 0, 0, SRCCOPY));
}
BitBlt is a function executed on the processor, whereas the OpenGL calls are executed by the GPU. So the rendered data has to crawl back from the GPU to main memory, and the bandwidth from the GPU to the CPU is somewhat limited (even more so because the data has to go back again once it is BitBlt'ed).
If you really want a transparent window with rendered content, you might want to look at Direct2D and/or Direct3D; maybe there is a way to do it without the performance penalty of moving the data around.