I need to implement off-screen rendering to texture on an ARM device with PowerVR SGX hardware.
Everything is in place (pixel buffers and the OpenGL ES 2.0 API were used). The only unsolved problem is the very slow glReadPixels function.
I'm not an expert in OpenGL ES, so I'm asking the community: is it possible to render textures directly into user-space memory? Or maybe there is some way to get the hardware address of a texture's memory region? Some other technique (EGL extensions)?
I don't need a universal solution, just a working one for PowerVR hardware.
Update: a little more information on the 'slow glReadPixels' claim. Copying 512x512 RGB texture data to CPU memory:
glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, &arr) takes 210 ms,
glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, &arr) takes 24 ms (GL_BGRA is not standard for glReadPixels, it's a PowerVR extension),
memcpy(&arr, &arr2, WIDTH * HEIGHT * 4) takes 5 ms.
With bigger textures, the differences are bigger too.
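For reference, ES 2.0 lets you query the implementation's fast readback format at runtime instead of hard-coding the GL_BGRA extension value. A minimal sketch, assuming a current ES 2.0 context and that arr is sized for 4 bytes per pixel:

/* Ask the driver which format/type pair glReadPixels can return without
 * a per-pixel conversion (GL_RGBA/GL_UNSIGNED_BYTE is always accepted,
 * but the queried pair is the implementation's fast path). */
GLint fastFormat = 0, fastType = 0;
glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_FORMAT, &fastFormat);
glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_TYPE, &fastType);

glReadPixels(0, 0, WIDTH, HEIGHT, (GLenum)fastFormat, (GLenum)fastType, &arr);

On this hardware that presumably comes back as the BGRA/UNSIGNED_BYTE pair measured above, but it keeps the code portable to drivers with a different fast path.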
Solved.
The way to force the PowerVR hardware to render into user-allocated memory:
http://processors.wiki.ti.com/index.php/Render_to_Texture_with_OpenGL_ES#Pixmaps
An example of how to use it:
https://gforge.ti.com/gf/project/gleslayer/
After all of this I can get the rendered image in as little as 5 ms.
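For completeness, the pixmap route described in the TI wiki boils down to wrapping a CPU-visible native pixmap in an EGLImage and attaching it to an FBO. The sketch below shows only the general shape: allocate_native_pixmap() is a hypothetical placeholder for whatever the SGX driver/window system provides, and the eglCreateImageKHR / glEGLImageTargetTexture2DOES entry points (EGL_KHR_image_pixmap, GL_OES_EGL_image) must be fetched with eglGetProcAddress.

/* Sketch: render into memory the CPU can read directly, avoiding
 * glReadPixels altogether.  'display' is the current EGLDisplay. */
EGLNativePixmapType pixmap = allocate_native_pixmap(WIDTH, HEIGHT); /* hypothetical, platform-specific */

EGLImageKHR image = eglCreateImageKHR(display, EGL_NO_CONTEXT,
                                      EGL_NATIVE_PIXMAP_KHR,
                                      (EGLClientBuffer)pixmap, NULL);

GLuint tex, fbo;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, image);  /* texture now aliases the pixmap */

glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, tex, 0);

/* ... render, glFinish(), then read the pixmap's memory directly ... */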
When you call OpenGL functions, you're queuing commands in a render queue. Those commands are executed by the GPU asynchronously. When you call glReadPixels, the CPU must wait for the GPU to finish its rendering, so the call might be stalling on that draw to finish. On most hardware (at least the hardware I work on), the memory is shared by the CPU and the GPU, so the read itself should not be that slow once the rendering is done.
If you can wait for the result, or defer the readback to the next frame, you might not see that delay anymore.
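If a one-frame latency is acceptable, a simple way to hide the stall is to read back the frame rendered previously while the current one is still being queued. A rough sketch (ES 2.0 has no pixel buffer objects, so this only overlaps the wait with useful work, it doesn't make the copy itself cheaper; draw_scene() is a placeholder):

/* Two complete FBOs, set up elsewhere: render into one, read the other. */
GLuint fbo[2];
int cur = 0;

for (;;) {
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[cur]);
    draw_scene();                               /* queue this frame's commands */

    /* Read the frame submitted last iteration - its rendering has had a
     * whole frame to finish, so glReadPixels should stall far less. */
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[1 - cur]);
    glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, arr);

    cur = 1 - cur;
}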
Frame buffer objects are what you are looking for. They are supported in OpenGL ES and on PowerVR SGX.
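A minimal ES 2.0 render-to-texture setup, as a sketch (texture parameters and error handling trimmed):

GLuint tex, fbo;

/* Color target: an ordinary RGBA texture with no initial data. */
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, WIDTH, HEIGHT, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

/* FBO with the texture attached as its color buffer. */
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, tex, 0);

if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
    /* handle incomplete framebuffer */
}

/* Draw here to render into tex; bind framebuffer 0 to go back to the screen. */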
EDIT:
Keep in mind that GPU/CPU hardware is heavily optimized for moving data in one direction, from the CPU side to the GPU side. The path back from GPU to CPU is usually much slower (it's simply not a priority to spend hardware resources on it). So whatever technique you use (e.g. an FBO or glGetTexImage), you're going to run up against this limit.
function render(time, scene) {
    if (useFramebuffer) {
        gl.bindFramebuffer(gl.FRAMEBUFFER, scene.fb);
    }

    gl.viewport(0, 0, canvas.width, canvas.height);
    gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);

    gl.enable(gl.DEPTH_TEST);
    renderScene(scene);
    gl.disable(gl.DEPTH_TEST);

    if (useFramebuffer) {
        gl.bindFramebuffer(gl.FRAMEBUFFER, null);
        copyFBtoBackBuffer(scene.fb);
    }

    window.requestAnimationFrame(function(time) {
        render(time, scene);
    });
}
I'm not able to share the exact code I use, but the mockup above illustrates my point.
I'm rendering a fairly complex scene and am also doing some ray tracing in WebGL. I've noticed two very strange performance issues.
1) Inconsistent frame rate between runs.
Sometimes, when the page starts, the first ~100 frames render in 25 ms, then the frame time suddenly jumps to 45 ms, without any user input or changes to the scene. I'm not updating any buffer or texture data within a frame, only shader uniforms. When this happens, GPU memory usage stays constant.
2) Rendering to the default framebuffer is slower than using an extra pass.
If I render to a created framebuffer and then blit to the HTML canvas (the default framebuffer), I get a 10% performance increase. So in the code snippet, performance is gained when useFramebuffer == true, which seems very counterintuitive.
Edit 1:
Due to changes in requirements, the scene will always be rendered to a framebuffer and then copied to the canvas. This makes question 2) a non-issue.
Edit 2:
System specs of the PC this was tested on:
OS: Windows 10
CPU: Intel i7-7700
GPU: Nvidia GTX 1080
RAM: 16 GB
Edit 3:
I profiled the scene using chrome://tracing. The first ~100-200 frames render in 16.6 ms.
Then it starts dropping frames.
I'll try profiling everything with timer queries, but I'm afraid each render actually takes the same amount of time and the buffer swap randomly takes twice as long.
Another thing I noticed is that this starts happening after I've used Chrome for a while. When the problems start, clearing the browser cache or killing the Chrome process doesn't help; only a system reboot does.
Is it possible that Chrome is throttling the GPU on a whim?
P.S.
The frame times changed because of some optimizations, but the core problem persists.
Assuming the device supports the GL_OES_depth_texture extension, is there any difference in terms of performance or memory consumption between attaching a renderbuffer (storage) or a texture to an FBO?
Your post is tagged with OpenGL ES 2.0, which most likely means you're talking about mobile.
Many Android mobile GPUs and all iOS GPUs are tile-based deferred renderers (TBDR): the rendering is done into small (e.g. 32x32) tiles of special fast on-chip memory. In a typical render pass, with correct calls to glClear and glDiscardFramebufferEXT, the device never needs to copy the depth buffer out of that on-chip memory into storage.
However, if you're using a depth texture, that copy is unavoidable, and the cost of transferring a screen-sized depth buffer from on-chip memory into a texture is significant. I'd still expect the rendering cost of your draw calls within the render pass to be unaffected.
In terms of memory usage, it's more speculative. A clever driver might not need to allocate any memory at all for the depth buffer on a TBDR GPU if you're not using a depth texture and you're using glClear and glDiscardFramebufferEXT correctly, because at no point does the depth buffer have to be backed by storage. Whether drivers actually do that is internal to the driver's implementation; you would have to ask the driver authors (Apple, Imagination Technologies, ARM, etc.).
Finally, it may be that the depth buffer format has to undergo some reconfiguration to be usable as a depth texture, which could mean it uses more memory and hurts efficiency. I think that's unlikely, though.
TL;DR: Don't use a depth texture unless you actually need to, but if you do need one, I don't think it will hurt your rendering performance much. The main cost is the bandwidth spent copying the depth data around.
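To make the comparison concrete, here is a sketch of the two depth-attachment options plus the discard call (it assumes GL_OES_depth_texture for option B and EXT_discard_framebuffer for the last step; WIDTH/HEIGHT are the render target size, and the FBO is already bound):

/* Option A: plain depth renderbuffer - on a TBDR the driver may never
 * have to write this out of on-chip tile memory at all. */
GLuint depthRb;
glGenRenderbuffers(1, &depthRb);
glBindRenderbuffer(GL_RENDERBUFFER, depthRb);
glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT16, WIDTH, HEIGHT);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                          GL_RENDERBUFFER, depthRb);

/* Option B: depth texture (GL_OES_depth_texture) - sampleable in a later
 * pass, but forces the per-tile depth to be resolved out to memory. */
GLuint depthTex;
glGenTextures(1, &depthTex);
glBindTexture(GL_TEXTURE_2D, depthTex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT, WIDTH, HEIGHT, 0,
             GL_DEPTH_COMPONENT, GL_UNSIGNED_SHORT, NULL);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                       GL_TEXTURE_2D, depthTex, 0);

/* With option A, tell the driver at the end of the pass that the depth
 * contents are disposable, so they never need to leave on-chip memory. */
const GLenum discards[] = { GL_DEPTH_ATTACHMENT };
glDiscardFramebufferEXT(GL_FRAMEBUFFER, 1, discards);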
My video card is a Mobile Intel 4 Series. I'm updating a texture with changing data every frame; here's my main loop:
for (;;) {
    Timer timer;
    glBindTexture(GL_TEXTURE_2D, tex);
    glBegin(GL_QUADS); ... /* draw textured quad */ ... glEnd();
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);
    swapBuffers();
    cout << timer.Elapsed();
}
Every iteration takes 120 ms. However, inserting a glFlush before glTexSubImage2D brings the iteration time down to 2 ms.
The issue is not the pixel format. I've tried the pixel formats BGRA, RGBA and ABGR_EXT together with the pixel types UNSIGNED_BYTE, BYTE, UNSIGNED_INT_8_8_8_8 and UNSIGNED_INT_8_8_8_8_EXT. The texture's internal pixel format is RGBA.
The order of calls matters. Moving the texture upload before the quad drawing, for example, fixes the slowness.
I also tried this on a GeForce GT 420M card, and it works fast there. My real app does have performance problems on non-Intel cards that are fixed by glFlush calls, but I haven't distilled those into a test case yet.
Any ideas on how to debug this?
One issue is that glTexImage2D performs a full reinitialization of the texture object. If only the data changes but the format stays the same, use glTexSubImage2D to speed things up (just a reminder).
The other issue is that, despite its name, immediate mode, i.e. glBegin(…) … glEnd(), is not synchronous: the drawing calls return long before the GPU is done drawing. Adding a glFinish() will synchronize, but so will any call that modifies data still required by queued operations. So in your case glTexImage2D (and glTexSubImage2D) must wait for the drawing to finish.
Usually it's best to do all volatile resource uploads either at the beginning of the drawing function, or during the SwapBuffers block in a separate thread, through buffer objects. Buffer objects were introduced for exactly that reason: to allow asynchronous yet tight operation.
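As a sketch, the buffer-object route on desktop GL looks roughly like this (it assumes GL 2.1 / ARB_pixel_buffer_object and reuses tex and data from the question); the point is that glTexSubImage2D then sources from driver-owned memory and can return without waiting on your array:

/* One-time setup: a pixel unpack buffer big enough for one frame. */
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, 512 * 512 * 4, NULL, GL_STREAM_DRAW);

/* Per frame: fill the buffer, then upload from it. */
void *dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(dst, data, 512 * 512 * 4);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV,
                (const GLvoid *)0);          /* offset into the PBO, not a pointer */

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);     /* back to client-memory uploads */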
I assume you're actually using that texture for one or more of your quads?
Uploading textures is one of the most expensive operations possible. Since your texture data changes every frame, the upload is unavoidable, but you should try to do it while the texture isn't in use by the shaders. Remember that glBegin(GL_QUADS); ... glEnd(); doesn't actually draw quads; it requests that the GPU render the quads. Until that rendering completes, the texture will be locked. Depending on the implementation, this might cause the texture upload to wait (as with a glFlush), but it could also cause the upload to fail, in which case you've wasted megabytes of PCIe bandwidth and the driver has to retry.
It sounds like you already have a solution: upload all new textures at the beginning of the frame. So what's your question?
NOTE: Intel integrated graphics are horribly slow anyway.
When you make a draw call (glDrawElements, etc.), the driver simply adds it to a command buffer and lets the GPU consume those commands when it can.
If this buffer had to be consumed entirely at SwapBuffers, the GPU would be idle afterwards, waiting for you to send new commands.
Drivers solve this by letting the GPU lag one frame behind. This is the first reason why glTexSubImage2D blocks: the driver waits until the GPU is no longer using the texture (in the previous frame) before beginning the transfer, so that you never get half-updated data.
The other reason is that glTexSubImage2D is synchronous: it will also block for the whole duration of the transfer.
You can solve the first issue by keeping two textures: one for the current frame, one for the previous frame. Upload into the former, but draw with the latter.
You can solve the second issue by using a pixel buffer object (a buffer bound to GL_PIXEL_UNPACK_BUFFER), which allows asynchronous transfers.
In your case, I suspect that calling glTexSubImage2D just before SwapBuffers adds an extra synchronization in the driver, whereas drawing the quad just before SwapBuffers simply appends the command to the buffer. 120 ms is probably a driver bug, though: even an Intel GMA doesn't need 120 ms to upload a 512x512 texture.
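The two-texture scheme from the first point, sketched with the names from the question:

/* Upload into one texture while the GPU is still drawing with the other,
 * so the driver never has to stall on a texture that is in use. */
GLuint tex[2];      /* both allocated and sized with glTexImage2D up front */
int upload = 0;     /* which texture receives this frame's data            */

for (;;) {
    int draw = 1 - upload;

    glBindTexture(GL_TEXTURE_2D, tex[upload]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);

    glBindTexture(GL_TEXTURE_2D, tex[draw]);   /* draw last frame's upload */
    glBegin(GL_QUADS); /* ... textured quad ... */ glEnd();

    swapBuffers();
    upload = draw;     /* swap roles for the next frame */
}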
I used the code from "How to make an OpenGL rendering context with transparent background?" to create a window with a transparent background. My problem is that the frame rate is very low: I get around 20 frames/sec even when I draw one quad (made from 2 triangles). I tried to find out why, and glFlush() takes around 0.047 seconds. Do you have any idea why? The same scene renders at 6000 fps (with the 60 fps limit removed) in a window that does not have a transparent background. It also takes one core to 100%. I'm testing on a Q9450 @ 2.66 GHz with an ATI Radeon 4800, running Win7.
I don't think you can get good performance this way. In the linked example there is the following code:
void draw(HDC pdcDest)
{
    assert(pdcDIB);
    verify(BitBlt(pdcDest, 0, 0, w, h, pdcDIB, 0, 0, SRCCOPY));
}
BitBlt is a function executed on the CPU, whereas the OpenGL functions are executed by the GPU. So the rendered data has to crawl from the GPU back to main memory, and the bandwidth from the GPU to the CPU is fairly limited (even more so because the data has to go back to the GPU once it has been BitBlt'ed).
If you really want a transparent window with rendered content, you might want to look at Direct2D and/or Direct3D; maybe there is a way to do it there without the performance penalty of moving the data around.
I need to speed up some particle-system eye candy I'm working on. The eye candy involves additive blending, accumulation, and trails and glow on the particles. At the moment I'm rendering by hand into a floating-point image buffer, converting to unsigned chars at the last minute, then uploading to an OpenGL texture. To simulate glow I'm rendering the same texture multiple times at different resolutions and different offsets. This is proving to be too slow, so I'm looking at changing something. The problem is, my dev hardware is an Intel GMA950, but the target machine has an Nvidia GeForce 8800, so it is difficult to profile the OpenGL side of things at this stage.
I did some very unscientific profiling and found that most of the slowdown comes from dealing with the float image: scaling all the pixels by a constant to fade them out, and converting the float image to unsigned chars before uploading to the graphics hardware. So, I'm looking at the following options for optimization:
Replace the floats with uint32s in a 16.16 fixed-point configuration
Optimize the float operations using SSE2 assembly (the image buffer is a 1024*768*3 array of floats)
Use the OpenGL accumulation buffer instead of the float array
Use OpenGL floating-point FBOs instead of the float array
Use OpenGL pixel/vertex shaders
Have you any experience with any of these possibilities? Any thoughts, advice? Something else I haven't thought of?
The problem is simply the sheer amount of data you have to process.
Your float buffer is 9 megabytes in size, and you touch the data more than once. Most likely your rendering loop looks somewhat like this:
Clear the buffer
Render something on it (uses reads and writes)
Convert to unsigned bytes
Upload to OpenGL
That's a lot of data to move around, and the cache can't help you much because the image is much larger than the cache. Let's assume you touch every pixel five times; if so, you move 45 MB of data in and out of slow main memory. 45 MB doesn't sound like much, but consider that almost every one of those memory accesses will be a cache miss, and the CPU will spend most of its time waiting for data to arrive.
If you want to stay on the CPU for the rendering, there's not much you can do. Some ideas:
Using SSE non-temporal loads and stores may help, but they complicate the task quite a bit (you have to align your reads and writes); see the sketch after this list.
Try breaking your rendering up into tiles, e.g. do everything on smaller rectangles (256*256 or so). The idea is that you actually get a benefit from the cache: once you've cleared such a rectangle, the entire tile is in the cache, and rendering and converting to bytes become a lot faster because there's no need to fetch the data from the relatively slow main memory anymore.
Last resort: reduce the resolution of your particle effect. This gives you good bang for the buck at the cost of visual quality.
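For the SSE idea above, the fade pass (multiply every float by a constant) could look like this sketch; it assumes the buffer is 16-byte aligned and its length is a multiple of four floats:

#include <stddef.h>
#include <xmmintrin.h>

/* Scale every float in the image by 'fade'.  _mm_stream_ps writes around
 * the cache (non-temporal), which helps when the buffer is far larger than
 * the cache - but the written data will not be cached for the next pass. */
void fade_buffer(float *buf, size_t count, float fade)
{
    __m128 k = _mm_set1_ps(fade);
    for (size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(buf + i);            /* aligned load        */
        _mm_stream_ps(buf + i, _mm_mul_ps(v, k));   /* non-temporal store  */
    }
    _mm_sfence();   /* make the streaming stores visible before reuse */
}

/* e.g. fade_buffer(image, 1024 * 768 * 3, 0.9f); */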
The best solution is to move the rendering onto the graphics card. Render-to-texture functionality is standard these days. It's a bit tricky to get working with OpenGL because you have to decide which extension to use, but once it works, performance is no longer an issue.
By the way, do you really need floating-point render targets? If you can get away with 3 bytes per pixel you will see a nice performance improvement.
It's best to move the rendering of massive particle systems like this over to the GPU, which has hardware optimized to do exactly this job as fast as possible.
Aaron is right: represent each individual particle with a sprite. You can calculate the movement of the sprites in space (e.g. accumulate their positions per frame) on the CPU using SSE2, but do all the additive blending and accumulation on the GPU via OpenGL. (Drawing sprites additively is easy enough.) You can handle the trails and blur either in shaders (the "pro" way), by rendering to an accumulation buffer and back, or by simply generating a bunch of additional sprites on the CPU to represent the trail and throwing them at the rasterizer.
Try replacing the manual code with sprites: an OpenGL texture with an alpha of, say, 10%. Then draw lots of them on the screen (ten in the same place to get the full glow).
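In fixed-function OpenGL that boils down to a blend-state change; a sketch (spriteTex and draw_textured_quad() are hypothetical stand-ins for your particle texture and quad-drawing code):

/* Additive blending: each overlapping sprite adds src * alpha to what is
 * already in the framebuffer, so ten 10%-alpha copies reach full glow. */
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, spriteTex);
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE);

for (int i = 0; i < 10; ++i) {
    draw_textured_quad(x, y, size);    /* hypothetical helper */
}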
If by "manual" you mean that you are using the CPU to poke pixels, then pretty much anything where you draw textured polygons with OpenGL instead will be a huge speedup.