StretchBlt is too slow, any way to do it faster? - performance

I'm using StretchBlt to draw a resized real-time video.
::SetStretchBltMode(hDC, HALFTONE);
::StretchBlt(hDC, 0, 0, 1225, 689, hSrcDC, 0, 0, 1364, 768, SRCCOPY); // hSrcDC: source device context
However, the StretchBlt API is too slow: it takes about 100 ms on my computer each time StretchBlt is executed. Is there another API, or any other way, to improve the speed?

Yes, by using hardware-accelerated video processing:
Read more on IDirectXVideoProcessor::VideoProcessBlt.
Unfortunately, this is a wide topic, but you can read up online and find samples showing how to use it.
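For illustration, here is a rough sketch of the core DXVA2 call, assuming the video processor and the Direct3D 9 surfaces have already been created through IDirectXVideoProcessorService (device-manager setup, format negotiation and error handling are omitted; ScaleFrame and the surface names are placeholders, and the rectangles use the sizes from the question):

#include <windows.h>
#include <d3d9.h>
#include <dxva2api.h>

// Scales one decoded frame on the GPU instead of with StretchBlt.
HRESULT ScaleFrame(IDirectXVideoProcessor* pProcessor,
                   IDirect3DSurface9* pSrcSurface,    // 1364x768 source frame
                   IDirect3DSurface9* pTargetSurface) // 1225x689 render target
{
    DXVA2_VideoSample sample = {};
    sample.Start = 0;
    sample.End = 1;
    sample.SampleFormat.SampleFormat = DXVA2_SampleProgressiveFrame;
    sample.SrcSurface = pSrcSurface;
    SetRect(&sample.SrcRect, 0, 0, 1364, 768);
    SetRect(&sample.DstRect, 0, 0, 1225, 689);
    sample.PlanarAlpha = DXVA2FloatToFixed(1.0f);

    DXVA2_VideoProcessBltParams blt = {};
    blt.TargetFrame = sample.Start;
    blt.TargetRect  = sample.DstRect;
    blt.DestFormat.SampleFormat = DXVA2_SampleProgressiveFrame;
    blt.Alpha = DXVA2FloatToFixed(1.0f);

    // The driver performs the stretch (and any filtering) on the GPU.
    return pProcessor->VideoProcessBlt(pTargetSurface, &blt, &sample, 1, NULL);
}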

Related

OpenGL slow on Windows

I am currently writing a small game engine using OpenGL.
The mesh data is uploaded to VBOs using GL_STATIC_DRAW.
Since I read that glBindBuffer is rather slow, I tried to minimize its use by accumulating the information needed for rendering and then rendering each VBO multiple times using only one glBindBuffer per VBO (sort of like batch rendering?).
Here is the code I use for the actual rendering:
int lastID = -1;
for (list<R_job>::iterator it = jobs.begin(); it != jobs.end(); ++it) {
    // Rebind texture, VBO and attribute pointers only when the VBO changes
    // (lastID starts at -1, so the first job always triggers a bind).
    if (lastID != *it->vboID) {
        glBindTexture(GL_TEXTURE_2D, it->texID);
        glBindBuffer(GL_ARRAY_BUFFER, *it->vboID);
        // Interleaved layout: 3 floats position, 2 floats texcoord, 3 floats normal
        glVertexPointer(3, GL_FLOAT, 4 * (3 + 2 + 3), 0);
        glTexCoordPointer(2, GL_FLOAT, 4 * (3 + 2 + 3), (void*)(4 * 3));
        glNormalPointer(GL_FLOAT, 4 * (3 + 2 + 3), (void*)(4 * (3 + 2)));
        lastID = *it->vboID;
    }
    glPushMatrix();
    glMultMatrixf(value_ptr(*it->mat)); // the model matrix
    glDrawArrays(GL_TRIANGLES, 0, it->size); // render
    glPopMatrix();
}
The list is sorted by the id of the vbos. The data is interleaved.
My question is about speed. This code can render about 800 VBOs (all the same one; only glDrawArrays is called multiple times) at 30 fps on my 2010 MacBook. On my PC (Phenom II X4 955 / HD 5700) the fps drops below 30 at only 400 calls. Could someone explain this to me? I had hoped for a speedup on my PC. I am using GLFW and GLEW on both machines, with Xcode on the Mac and VS2012 on the PC.
EDIT:
The mesh I am rendering has about 600 verts.
I'm sure you were running in Release mode, but you might also consider running in Release mode without the debugger attached. I think you will find that doing so solves your performance issues with list::sort. In my experience the VS debugger can have a significant performance impact when it's attached, far more so than gdb.
2000 entities is a reasonable place to start seeing some FPS drop. At that point you are making nearly 10,000 API calls per frame. To comfortably get higher than that, you will need to start doing something like instancing, so that you are drawing multiple entities with one call.
Finally, I would like to say that glBindBuffer is not an expensive operation, and not really something that you should be batching to avoid. If you are going to batch, batch to avoid changing shaders and shader state (uniforms). But don't batch just to avoid changing buffer objects.
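To illustrate the instancing suggestion, here is a rough sketch (not the poster's code) of drawing every entity that shares one mesh and texture with a single call via glDrawArraysInstanced. It assumes a GL 3.3+ context with shaders, since the fixed-function matrix stack used above cannot be instanced, and the choice of attribute locations 3-6 for the per-instance matrix is arbitrary:

#include <GL/glew.h>
#include <vector>
#include <glm/glm.hpp>

void drawInstanced(GLuint meshVbo, GLuint texID, GLsizei vertCount,
                   const std::vector<glm::mat4>& modelMatrices)
{
    // Per-instance model matrices live in their own VBO; each mat4 is fed to the
    // vertex shader as four vec4 attributes (locations 3..6).
    GLuint matVbo;
    glGenBuffers(1, &matVbo);
    glBindBuffer(GL_ARRAY_BUFFER, matVbo);
    glBufferData(GL_ARRAY_BUFFER, modelMatrices.size() * sizeof(glm::mat4),
                 modelMatrices.data(), GL_STREAM_DRAW);
    for (int col = 0; col < 4; ++col) {
        glEnableVertexAttribArray(3 + col);
        glVertexAttribPointer(3 + col, 4, GL_FLOAT, GL_FALSE, sizeof(glm::mat4),
                              (void*)(sizeof(glm::vec4) * col));
        glVertexAttribDivisor(3 + col, 1); // advance once per instance, not per vertex
    }

    glBindTexture(GL_TEXTURE_2D, texID);
    glBindBuffer(GL_ARRAY_BUFFER, meshVbo); // mesh data: bound once for all instances
    // ... set up the position/texcoord/normal attributes here, as before ...

    // One draw call renders every instance; the per-entity glPushMatrix/glDrawArrays
    // loop disappears.
    glDrawArraysInstanced(GL_TRIANGLES, 0, vertCount, (GLsizei)modelMatrices.size());
}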
So I guess I found the answer.
Enabling VSync in AMD Control Center increased the fps greatly (probably because of GLFW's swapBuffers), though it was still below my MacBook's numbers. After some debugging I found that list::sort is somehow VERY slow on Windows and was pulling my fps down...
So right now (without list::sort), the fps drops below 60 at around 2000 entities: 600 * 2000 = 1,200,000 vertices.
Should that be my graphics card's limit?
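As an aside, if list::sort ever shows up again outside the debugger, a common alternative is to keep the render jobs in a std::vector and sort them with std::sort (contiguous memory, usually faster in practice). A minimal sketch, assuming the R_job fields used above:

#include <vector>
#include <algorithm>

std::vector<R_job> jobs;
// ... fill jobs ...
std::sort(jobs.begin(), jobs.end(),
          [](const R_job& a, const R_job& b) { return *a.vboID < *b.vboID; });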

OpenGL ES rendering to user-space memory

I need to implement off-screen rendering to texture on an ARM device with PowerVR SGX hardware.
Everything is already implemented (pbuffers and the OpenGL ES 2.0 API were used). The only unsolved problem is the very slow glReadPixels function.
I'm not an expert in OpenGL ES, so I'm asking the community: is it possible to render textures directly into user-space memory? Or maybe there is some way to get the hardware address of a texture's memory region? Some other technique (EGL extensions)?
I don't need a universal solution, just a working one for PowerVR hardware.
Update: a little more information on the 'slow glReadPixels' issue. Copying 512x512 RGB texture data to CPU memory:
glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, &arr) takes 210 ms,
glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, &arr) takes 24 ms (GL_BGRA is not standard for glReadPixels, it's a PowerVR extension),
memcpy(&arr, &arr2, WIDTH * HEIGHT * 4) takes 5 ms.
For bigger textures, the differences are bigger too.
Solved.
Here is how to force PowerVR hardware to render into user-allocated memory:
http://processors.wiki.ti.com/index.php/Render_to_Texture_with_OpenGL_ES#Pixmaps
An example of how to use it:
https://gforge.ti.com/gf/project/gleslayer/
After all of this I can get the rendered image in as little as 5 ms.
When you call OpenGL functions, you're queuing commands into a render queue. Those commands are executed by the GPU asynchronously. When you call glReadPixels, the CPU must wait for the GPU to finish its rendering, so the call may be stalling on that draw to finish. On most hardware (at least the hardware I work on) the memory is shared by the CPU and the GPU, so the readback itself should not be that slow once rendering is done.
If you can wait for the result, or defer it to the next frame, you might not see that delay anymore.
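A sketch of the 'defer it to the next frame' idea using two pixel buffer objects, so each glReadPixels result is only consumed one frame later. Note this is the general pattern only: GL_PIXEL_PACK_BUFFER needs desktop GL 2.1+ (or OpenGL ES 3.0) and is not part of the ES 2.0 core API on SGX; WIDTH and HEIGHT are the constants from the question.

// One-time setup: two PBOs, each sized for one RGBA frame.
static GLuint pbo[2];
static int frame = 0;

void initReadback()
{
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, WIDTH * HEIGHT * 4, NULL, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

// Called once per frame, after rendering.
void readbackDeferred()
{
    int current = frame % 2, previous = (frame + 1) % 2;

    // Start an asynchronous copy of this frame into the current PBO
    // (the last argument is an offset into the bound PBO, not a CPU pointer).
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[current]);
    glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, 0);

    // Consume the previous frame's pixels; by now the GPU has usually finished them.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[previous]);
    void* ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    if (ptr) {
        // ... use WIDTH * HEIGHT * 4 bytes of pixel data ...
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    ++frame;
}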
Frame buffer objects are what you are looking for. They are supported on OpenGL ES, and on the PowerVR SGX.
EDIT:
Keep in mind that GPU/CPU hardware is heavily optimized for moving data in one direction, from the CPU side to the GPU side. The path back from the GPU to the CPU is often much slower (it's just not a priority to spend hardware resources on). So whatever technique you use (e.g. FBO/glGetTexImage), you're going to run up against this limit.
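For reference, a minimal FBO-to-texture setup for OpenGL ES 2.0 (error handling reduced to a completeness check; WIDTH, HEIGHT and arr are the names from the question). The readback itself still goes through glReadPixels, so this changes where you render, not the cost of getting the data back:

GLuint fbo, tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, WIDTH, HEIGHT, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, tex, 0);
if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
    // handle an incomplete framebuffer
}

// ... render ...
glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, &arr);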

glReadPixels() slow on reading GL_DEPTH_COMPONENT

My application is dependent on reading depth information back from the framebuffer. I've implemented this with glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data)
However, this runs unreasonably slowly; it brings my application from a smooth 30 fps down to a laggy 3 fps. If I try other dimensions or other data to read back, it runs at an acceptable level.
To give an overview:
No glReadPixels -> 30 frames per second
glReadPixels(0, 0, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data); -> 20 frames per second, acceptable
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, &depth_data); -> 20 frames per second, acceptable
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data); -> 3 frames per second, not acceptable
Why should the last one be so slow compared to the other calls? Is there any way to remedy it?
width x height is approximately 100 x 1000, and the call gets increasingly slower as I increase the dimensions.
I've also tried using pixel buffer objects, but this has no significant effect on performance; it only delays the slowness until the glMapBuffer() call.
(I've tested this on a MacBook Air with nVidia 320m graphics on OS X 10.6; strangely enough, my old MacBook with an Intel GMA X3100 got ~15 fps reading the depth buffer.)
UPDATE: leaving GLUT_MULTISAMPLE out of the glutInitDisplayMode options made a world of difference, bringing the application back to a smooth 20 fps again. I don't know what that option does in the first place; can anyone explain?
If your main framebuffer is MSAA-enabled (GLUT_MULTISAMPLE is present), then two actual framebuffers are created: one with MSAA and one regular.
The first one is the one you render into. It contains front and back color surfaces, plus depth and stencil. The second one only has to contain the color produced by resolving the corresponding MSAA surface.
However, when you try to read depth using glReadPixels, the driver is forced to resolve the MSAA-enabled depth surface too, which probably causes your slowdown.
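Two possible workarounds, sketched under assumptions. Option 1 is what the update above found: simply don't request a multisampled default framebuffer. Option 2 keeps MSAA for rendering but resolves into a single-sampled FBO once per frame and reads depth from there; it assumes glBlitFramebuffer is available (GL 3.0 or EXT_framebuffer_blit) and that resolveFbo is a hypothetical single-sampled FBO, created elsewhere, whose depth attachment format matches the default framebuffer's.

// Option 1: no GLUT_MULTISAMPLE in the display mode.
glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA | GLUT_DEPTH);

// Option 2: resolve once, then read from the single-sampled FBO.
glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);          // multisampled default framebuffer
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolveFbo); // single-sampled FBO (setup omitted)
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                  GL_DEPTH_BUFFER_BIT, GL_NEAREST); // depth blits must use GL_NEAREST

glBindFramebuffer(GL_READ_FRAMEBUFFER, resolveFbo);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data);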
What storage format did you choose for your depth buffer?
If it is not GLfloat, then you're asking GL to convert every single value in the depth buffer to float when reading it. (And it's the same for your third bullet with GL_RED: was your color buffer a float buffer?)
Whether it is GL_FLOAT or GL_UNSIGNED_BYTE, glReadPixels is still very slow. If you use a PBO to get RGB values, it is very fast.
When using a PBO to handle RGB values, the CPU usage is 4%, but it rises to 50% when handling depth values. I've tried GL_FLOAT, GL_UNSIGNED_BYTE, GL_UNSIGNED_INT and GL_UNSIGNED_INT_24_8. So I conclude that PBOs are useless for reading back depth values.

glFlush() takes very long time on window with transparent background

I used the code from How to make an OpenGL rendering context with transparent background? to create a window with a transparent background. My problem is that the frame rate is very low: I get around 20 frames/sec even when I draw just one quad (made from 2 triangles). I tried to find out why, and glFlush() takes around 0.047 seconds. Do you have any idea why? The same scene renders at 6000 fps (when I remove the 60 fps limitation) in a window that does not have a transparent background. It also pushes one core to 100%. I'm testing on a Q9450 @ 2.66GHz with an ATI Radeon 4800, running Win7.
I don't think you can get good performance this way. In the linked example there is the following code:
void draw(HDC pdcDest)
{
    assert(pdcDIB);
    verify(BitBlt(pdcDest, 0, 0, w, h, pdcDIB, 0, 0, SRCCOPY));
}
BitBlt is a function executed on the CPU, whereas the OpenGL functions are executed by the GPU. So the rendered data has to crawl from the GPU back to main memory, and the bandwidth from the GPU to the CPU is rather limited (even more so because the data has to go back again once it's BitBlt'ed).
If you really want a transparent window with rendered content, you might want to look at Direct2D and/or Direct3D; maybe there is a way to do it without the performance penalty of moving the data around.

How to improve the copy speed from D3D surface back to system memory

I'm using the following code to copy a D3D surface back to system memory, but the performance is bad: the LockRect call takes a lot of time. Is there a way to improve it? Thanks in advance.
Below is the sample code.
D3DDev->GetRenderTargetData(renderTarget, offscreenSurface);
// Lock the surface to read pixels
offscreenSurface->LockRect( &lr, &rect, D3DLOCK_READONLY );
What D3D version?
You can create a render target with HDC support, get its surface and use surface->GetDC() afterwards. I used this trick instead of LockRect; it gave acceptable performance for capturing D3D data to use with regular GDI or I/O.
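A rough sketch of that trick for Direct3D 9, assuming a GetDC-compatible format such as D3DFMT_X8R8G8B8 (d3dDevice, width, height and hdcDest are placeholders):

IDirect3DSurface9* rt = NULL;
d3dDevice->CreateRenderTarget(width, height, D3DFMT_X8R8G8B8,
                              D3DMULTISAMPLE_NONE, 0,
                              TRUE /* lockable */, &rt, NULL);

// ... render into rt, or StretchRect() the real render target into it ...

HDC hdcSurface = NULL;
if (SUCCEEDED(rt->GetDC(&hdcSurface))) {
    // Regular GDI works on this DC: copy it, save it, draw on it, etc.
    BitBlt(hdcDest, 0, 0, width, height, hdcSurface, 0, 0, SRCCOPY);
    rt->ReleaseDC(hdcSurface); // release before using the surface with D3D again
}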
