My video card is Mobile Intel 4 Series. I'm updating a texture with changing data every frame, here's my main loop:
for(;;) {
    Timer timer;
    glBindTexture(GL_TEXTURE_2D, tex);
    glBegin(GL_QUADS); ... /* draw textured quad */ ... glEnd();
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);
    swapBuffers();
    cout << timer.Elapsed();
}
Every iteration takes 120ms. However, inserting glFlush before glTexSubImage2D brings the iteration time to 2ms.
The issue is not in the pixel format. I've tried the pixel formats BGRA, RGBA and ABGR_EXT together with the pixel types UNSIGNED_BYTE, BYTE, UNSIGNED_INT_8_8_8_8 and UNSIGNED_INT_8_8_8_8_EXT. The texture's internal pixel format is RGBA.
The order of calls matters. Moving the texture upload before the quad drawing, for example, fixes the slowness.
I also tried this on a GeForce GT 420M card, and it works fast there. My real app does have performance problems on non-Intel cards that are fixed by glFlush calls, but I haven't distilled those to a test case yet.
Any ideas on how to debug this?
One issue is that glTexImage2D performs a full reinitialization of the texture object. If only the data changes, but the format remains the same, use glTexSubImage2D to speed things up (just a reminder).
The other issue is that, despite its name, immediate mode (i.e. glBegin(…) … glEnd()) is not synchronous: the drawing calls return long before the GPU is done drawing. Adding a glFinish() will synchronize, but so will any call that modifies data still required by queued operations. So in your case glTexImage2D (and glTexSubImage2D) must wait for the drawing to finish.
Usually it's best to do all volatile resource uploads either at the beginning of the drawing function, or during the SwapBuffers block from a separate thread, going through buffer objects. Buffer objects were introduced for that very reason: to allow asynchronous yet tight operation.
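For illustration, a minimal sketch of streaming the texture through a pixel buffer object; the PBO name and the GL_STREAM_DRAW usage hint are assumptions, not something from the question, and the size/format are taken from the loop above:

// One-time setup: a PBO big enough for one 512x512 BGRA frame.
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, 512 * 512 * 4, NULL, GL_STREAM_DRAW);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

// Per frame: copy the new pixels into the PBO, then let the GL pull from it asynchronously.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, 512 * 512 * 4, NULL, GL_STREAM_DRAW);   // orphan the old storage
glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, 512 * 512 * 4, data);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, (const GLvoid*)0);     // source is now the bound PBO
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);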
I assume you're actually using that texture for one or more of your quads?
Uploading textures is one of the most expensive operations possible. Since your texture data changes every frame, the upload is unavoidable, but you should try to do it when the texture isn't in use by shaders. Remember that glBegin(GL_QUADS); ... glEnd(); doesn't actually draw quads, it requests that the GPU render the quads. Until the rendering completes, the texture will be locked. Depending on the implementation, this might cause the texture upload to wait (à la glFlush), but it could also cause the upload to fail, in which case you've wasted megabytes of PCIe bandwidth and the driver has to retry.
It sounds like you already have a solution: upload all new textures at the beginning of the frame. So what's your question?
NOTE: Intel integrated graphics are horribly slow anyway.
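For reference, a sketch of the reordering the asker already found to work (upload first, then draw), based on the loop in the question:

for(;;) {
    Timer timer;
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);   // upload before any queued draw references the texture
    glBegin(GL_QUADS); ... /* draw textured quad */ ... glEnd();   // now draw with the fresh data
    swapBuffers();
    cout << timer.Elapsed();
}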
When you make a draw call (glDrawElements, or others), the driver simply adds this call to a buffer and lets the GPU consume these commands when it can.
If this buffer had to be consumed entirely at glSwapBuffers, this would mean that the GPU would be idle after that, waiting for you to send new commands.
Drivers solve this by letting the GPU lag one frame behind. This is the first reason why glTexSubImage2D blocks: the driver waits until the GPU no longer uses the texture (from the previous frame) before starting the transfer, so that you never get half-updated data.
The other reason is that glTexSubImage2D is synchronous: it will also block for the whole duration of the transfer.
You can solve the first issue by keeping 2 textures : one for the current frame, one for the previous frame. Upload the texture in the former, but draw with the latter.
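A minimal sketch of that double-buffering idea (the names texA/texB/texCurrent/texPrevious are assumed; the 512x512 BGRA format is taken from the question):

GLuint texCurrent = texA, texPrevious = texB;   // two textures with identical storage
for(;;) {
    glBindTexture(GL_TEXTURE_2D, texCurrent);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);   // upload this frame's data
    glBindTexture(GL_TEXTURE_2D, texPrevious);                     // but draw with last frame's texture
    glBegin(GL_QUADS); ... /* draw textured quad */ ... glEnd();
    swapBuffers();
    std::swap(texCurrent, texPrevious);                            // swap roles (std::swap lives in <utility>)
}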
You can solve the second issue by using a Pixel Buffer Object (a buffer object bound to GL_PIXEL_UNPACK_BUFFER), which allows asynchronous transfers.
In your case, I suspect that calling glTexSubImage2D just before SwapBuffers adds an extra synchronization in the driver, whereas drawing the quad just before SwapBuffers simply appends the command to the buffer. 120 ms is probably a driver bug, though: even an Intel GMA doesn't need 120 ms to upload a 512x512 texture.
I have a question about the general use of glInvalidateFramebuffer:
As far as I know, the purpose of glInvalidateFramebuffer is to "skip the store of framebuffer contents that are no longer needed". Its main purpose on tile-based GPUs is to get rid of depth and stencil contents if only the color is needed after rendering. I do not understand why this is necessary. As far as I know, if I render to an FBO then all of this data is stored in that FBO. Now if a subsequent draw uses only the color contents, or nothing from that FBO at all, why is the depth/stencil data accessed at all? It is supposedly stored somewhere and that eats bandwidth, but as far as I can tell it is already in the FBO's GPU memory as the result of the render, so when does that supposedly expensive additional store operation happen?
There are supposedly expensive preservation steps for FBO attachments, but why are those necessary if the data is already in GPU memory as a result of the render?
Regards
Framebuffers in a tile-based GPU exist in two places - the version stored in main memory (which is persistent), and the working copy inside the GPU tile buffer (which only exists for the duration of that tile's fragment shading). The content of the tile buffer is written back to the buffer in main memory at the end of shading for each tile.
The objective of tile-based shading is to keep as much of the state as possible inside that tile buffer, and to avoid writing it back to main memory if it's not needed. This is important because main-memory DRAM accesses are phenomenally power hungry. Invalidation at the end of each render pass tells the graphics stack that those buffers don't need to be persisted, which means the write-back from the tile buffer to main memory can be avoided.
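For example (a sketch, assuming a complete FBO with color, depth and stencil attachments is currently bound), invalidation at the end of the pass could look like this on OpenGL ES 3.0 / desktop GL 4.3:

// After the last draw call of the pass, before switching framebuffers:
const GLenum discard[] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, discard);   // depth/stencil never get written back from the tile buffer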
I've written a longer article on it here if you want more detail:
https://developer.arm.com/solutions/graphics/developer-guides/understanding-render-passes/single-view
For non-tile-based GPUs the main use case seems to be using it as a lower cost version of a clear at the start of a render pass if you don't actually care about the starting color. It's likely there is little benefit to using it at the end of the render pass (but it should not hurt either).
I'm working with Cocos2d-x to port my PC game to Android.
For the sprites part, I wanted to optimize the rendering process so I decided to dynamically create sprites sheets that contain the frames for all the sprites.
Unfortunately, this makes the rendering process about 10-15 times slower than using small textures containing only the frames for the current sprite (on mobile device, on Windows everything runs smoothly).
I initially thought it could be related to the switching between the sheets (big textures like 4096*4096) when the rendering process would display one sprite from one sheet, then another from another sheet and so on... making a lot of switches between huge textures.
So I sorted the sprites before "putting" their frames in the sprites sheets, and I can confirm that the switches are now non-existent.
After a long investigation, profiling, tests, etc., I finally found that one OpenGL function takes all the time:
glBufferData(GL_ARRAY_BUFFER, sizeof(_quadVerts[0]) * _numberQuads * 4, _quadVerts, GL_DYNAMIC_DRAW);
Calling this function takes a long time (the profiler says more than 20 ms per call) if I use the big textures, and it is quite fast if I use small ones (about 2 ms).
I don't really know OpenGL; I'm using it because Cocos2d-x uses it, and I'm not at ease trying to debug/optimize the engine because I really think its authors are far better at that than I am :)
I might be misunderstanding something. I've been stuck on this for several days and I have no idea what to do now.
Any clues?
Note: I'm talking about glBufferData but I have the same issue with glBindFramebuffer, very slow with big textures. I assume this is all the same topic.
Thanks
glBufferData is normally a costly call to make, as it involves a CPU-to-GPU transfer.
But the logic behind Renderer::drawBatchedQuads is to flush the quads that have been buffered in a temporary array. The more quads you have to render, the more data has to be transferred.
Since the quad properties (positions, texture coordinates, colors) are likely to change each frame, a CPU-to-GPU transfer is required every frame, as hinted by the flag GL_DYNAMIC_DRAW.
According to the spec:
GL_DYNAMIC_DRAW: The data store contents will be modified repeatedly and used many times as the source for GL drawing commands.
There are possible alternatives to glBufferData, such as glMapBuffer or glBufferSubData, that could be used for comparison.
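As a rough sketch of those alternatives (the buffer name vbo is assumed; the size expression comes from the call quoted above; on ES 2.0 mapping requires the OES_mapbuffer extension):

GLsizeiptr bufSize = sizeof(_quadVerts[0]) * _numberQuads * 4;

// Alternative A: orphan the old storage, then refill it.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, bufSize, NULL, GL_DYNAMIC_DRAW);   // orphan: the driver can hand out fresh storage without stalling
glBufferSubData(GL_ARRAY_BUFFER, 0, bufSize, _quadVerts);

// Alternative B: map the buffer and write the vertex data directly.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
void* dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);         // glMapBufferOES / GL_WRITE_ONLY_OES on ES 2.0
if (dst) {
    memcpy(dst, _quadVerts, bufSize);
    glUnmapBuffer(GL_ARRAY_BUFFER);                              // glUnmapBufferOES on ES 2.0
}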
I have a program which renders first to a texture, then pass the texture to the compute shader for processing, then renders the output result to the screen via a textured full screen quad.
I've read in NVIDIA's programming guide for compute shaders that every time you dispatch a compute shader, it initiates a GPU device context switch, which should not be done too often.
I'm very confused right now. The way I see it, in my rendering pipeline the GPU switches contexts twice. Right? Once during the first dispatch call, the next time when I render my full screen quad normally.
If this is correct, then I can avoid one switch by reorganizing my code like this: first, render to a texture; second, do the processing in the compute shader; then, IN THE NEXT FRAME, render the result, then (still in the next frame) render all updates to the texture, do the processing in the compute shader... So basically at the start of every frame I render the results of the previous frame (the first frame being an exception). Then there will only be one context switch, right?
But then the GPU will still have to do a context switch between frames, right? So both versions of my rendering pipeline have two context switches, and there would be no difference in performance. Am I correct?
Any help would be appreciated.
A context switch introduces a small hit, but in your case it would be pretty negligible, so you can safely switch between the compute and render pipelines several times in the same frame without having to worry about it.
In a lot of modern games there are more than two switches in the same frame (the graphics pipeline for rendering, compute shaders for lighting, pixel shaders for FXAA...), and they still run fine.
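For what it's worth, the per-frame sequence described in the question could be written schematically like this (GL 4.3 compute; all names such as sceneFbo, sceneTex, resultTex, the programs and the helper draw functions are placeholders, not from the question):

glBindFramebuffer(GL_FRAMEBUFFER, sceneFbo);                 // 1. render the scene into a texture
drawScene();
glBindFramebuffer(GL_FRAMEBUFFER, 0);

glUseProgram(computeProgram);                                // 2. process that texture in a compute shader
glBindImageTexture(0, sceneTex,  0, GL_FALSE, 0, GL_READ_ONLY,  GL_RGBA8);
glBindImageTexture(1, resultTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute(width / 16, height / 16, 1);
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);               // make the compute writes visible to texture sampling

glUseProgram(quadProgram);                                   // 3. draw the result on a full-screen quad
drawFullscreenQuad(resultTex);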
I need to implement off-screen rendering to texture on an ARM device with PowerVR SGX hardware.
Everything is done (pixel buffers and the OpenGL ES 2.0 API were used). The only problem left unsolved is the very slow glReadPixels function.
I'm not an expert in OpenGL ES, so I'm asking the community: is it possible to render textures directly into user-space memory? Or maybe there is some way to get the hardware address of the texture's memory region? Some other technique (EGL extensions)?
I don't need a universal solution, just a working one for PowerVR hardware.
Update: A little more information on 'slow function glReadPixels'. Copy 512x512 RGB texture data to CPU's memory:
glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, &arr) takes 210 ms,
glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, &arr) takes 24 ms (GL_BGRA is not standard for glReadPixels, it's a PowerVR extension),
memcpy(&arr, &arr2, WIDTH * HEIGHT * 4) takes 5 ms
In case of bigger textures, differences are bigger too.
Solved.
The way to force PowerVR hardware to render into user-allocated memory:
http://processors.wiki.ti.com/index.php/Render_to_Texture_with_OpenGL_ES#Pixmaps
An example, how to use it:
https://gforge.ti.com/gf/project/gleslayer/
After all of this, I can get the rendered image in as little as 5 ms.
When you call OpenGL functions, you're queuing commands into a render queue. Those commands are executed by the GPU asynchronously. When you call glReadPixels, the CPU must wait for the GPU to finish its rendering, so the call might be stalling on that draw. On most hardware (at least the hardware I work on), the memory is shared between the CPU and the GPU, so the pixel read should not be that slow once the rendering is done.
If you can wait for the result, or defer it to the next frame, you might not see that delay anymore.
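If the platform exposes pixel buffer objects (OpenGL ES 3.0 or desktop GL; core ES 2.0 does not have them), a deferred readback along those lines might look like this sketch (readPbo is an assumed, pre-allocated buffer of WIDTH * HEIGHT * 4 bytes):

// Frame N: start an asynchronous copy into the pack PBO instead of reading straight into CPU memory.
glBindBuffer(GL_PIXEL_PACK_BUFFER, readPbo);
glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, 0);   // returns quickly; the GPU fills the PBO
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// Frame N+1: the copy has usually finished by now, so mapping does not stall.
glBindBuffer(GL_PIXEL_PACK_BUFFER, readPbo);
const void* pixels = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, WIDTH * HEIGHT * 4, GL_MAP_READ_BIT);
if (pixels) { /* use the pixels on the CPU */ glUnmapBuffer(GL_PIXEL_PACK_BUFFER); }
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);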
Frame buffer objects are what you are looking for. They are supported in OpenGL ES and on PowerVR SGX.
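A minimal render-to-texture setup with an FBO on ES 2.0 could look like this sketch (the color texture tex is assumed to already exist with a suitable size and format):

GLuint fbo;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);
if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
    /* handle the error */
}
// ... draw the off-screen content ...
glBindFramebuffer(GL_FRAMEBUFFER, 0);   // back to the default framebuffer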
EDIT:
Keep in mind that GPU/CPU hardware is heavily optimized towards moving data in one direction, from the CPU side to the GPU side. The path back from the GPU to the CPU is often much slower (it's just not a priority to spend hardware resources on). So whatever technique you use (e.g. FBO/glGetTexImage), you're going to run up against this limit.
I used the code from How to make an OpenGL rendering context with transparent background? to create a window with a transparent background. My problem is that the frame rate is very low: I get around 20 frames/sec even when I draw one quad (made from 2 triangles). I tried to find out why, and glFlush() takes around 0.047 seconds. Do you have any idea why? The same scene rendered in a window without a transparent background runs at 6000 fps (when I remove the 60 fps limitation). It also pushes one core to 100%. I'm testing on a Q9450 @ 2.66 GHz with an ATI Radeon 4800, on Win7.
I don't think you can get good performance this way. In the linked example there is the following code:
void draw(HDC pdcDest)
{
    assert(pdcDIB);
    verify(BitBlt(pdcDest, 0, 0, w, h, pdcDIB, 0, 0, SRCCOPY));
}
BitBlt is a function executed on the CPU, whereas the OpenGL functions are executed by the GPU. So the rendered data has to crawl from the GPU back to main memory, and the bandwidth from the GPU to the CPU is effectively rather limited (even more so because the data has to go back again once BitBlt'ed).
If you really want a transparent window with rendered content, you might want to look at Direct2D and/or Direct3D; maybe there is some way to do that without the performance penalty of moving the data around.